Suspicious near-instant results with NWChem long t4

Message boards : Number crunching : Suspicious near-instant results with NWChem long t4

Alien Seeker
Message 753 - Posted: 12 Apr 2020, 3:10:41 UTC

Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly.

The wingmates are still to return their results but the instant execution looks suspicious.

fzs600
Message 754 - Posted: 12 Apr 2020, 7:28:54 UTC - in response to Message 753.  

Alien Seeker wrote:
Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly.

The wingmates are still to return their results but the instant execution looks suspicious.

Looking at my own results, I also find several such WUs:

Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
7 Apr 2020, 10:21:20 UTC | 7 Apr 2020, 15:17:09 UTC | Completed, waiting for validation | 4.13 | 1.83 | pending | NWChem long v0.19 (t8) x86_64-pc-linux-gnu
7 Apr 2020, 10:21:20 UTC | 7 Apr 2020, 15:17:09 UTC | Completed, waiting for validation | 4.20 | 1.77 | pending | NWChem long v0.19 (t8) x86_64-pc-linux-gnu
5 Apr 2020, 8:40:17 UTC | 5 Apr 2020, 8:40:33 UTC | Completed, waiting for validation | 4.07 | 1.21 | pending | NWChem long v0.19 (t8) x86_64-pc-linux-gnu
3 Apr 2020, 21:56:21 UTC | 4 Apr 2020, 12:05:17 UTC | Completed, waiting for validation | 3.14 | 1.15 | pending | NWChem long v0.19 (t8) x86_64-pc-linux-gnu

Zalster
Message 756 - Posted: 12 Apr 2020, 13:49:19 UTC - in response to Message 753.  
Last modified: 12 Apr 2020, 13:55:44 UTC

Alien Seeker wrote:
Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly.

The wingmates are still to return their results but the instant execution looks suspicious.


Either the work units are faulty or there is a lack of resources. I noticed that both are 4-thread work units and that the machine only has a 4-thread CPU. Unfortunately, the wingman for the first unit is hidden, so there is no telling when it will get done (or what machine it might be until after it's processed). For the second work unit, we will have to wait and see. We might have to wait for a 3rd wingman in both cases.

As for fzs600: all his computers are hidden, and therefore his work units are too.

Edit...

The other issue with hidden hosts: if you are paired with one for multiple work units and find that their machine is faulty, there is no way to contact them so they can look at their machine and figure out the issue. So their machines continue to fill the database with faulty results. I've been unlucky enough to have that happen for a lot of work units, so I'm forced to wait for a 3rd wingman to process them.

Alien Seeker
Message 757 - Posted: 12 Apr 2020, 16:27:25 UTC - in response to Message 756.  

Zalster wrote:
Either the work units are faulty or there is a lack of resources. I noticed that both are 4-thread work units and that the machine only has a 4-thread CPU.


I can now confirm the problem came from the execution and not the WU itself: result 2278138 failed to validate now that the wingmate has returned their result. It should have appeared as a computing error; there must be a sanity check missing somewhere in the app.

I agree the 4-thread version of the app is the likely culprit; a t2 is currently running successfully on the same host. If it happens again with more t4 tasks, I'll limit max_cpus; I assume that will stop the server from sending me t4 tasks?

Zalster
Message 758 - Posted: 12 Apr 2020, 16:32:49 UTC - in response to Message 757.  

Alien Seeker wrote:
If it happens again with more t4 tasks, I'll limit max_cpus; I assume that will stop the server from sending me t4 tasks?


We will have to see. This is the first time I've seen a project with a max CPUs setting on the preferences page. It does seem to work. I've limited my machines to 90% of all CPUs, but you could set it to 75% so it only uses 3 out of 4 threads, or however you want to set it. I'm not sure how it would react to 90% of 4 threads. Would it round down to the nearest thread? It would be interesting to see how it responds. Good luck with the rest of your work units.

Alien Seeker
Message 759 - Posted: 12 Apr 2020, 16:58:16 UTC - in response to Message 758.  

Now that you mention it, I've played a lot with max_ncpus_pct from the global preferences over the last few days while trying to find a setting that worked for me. (To answer your question, BOINC rounds down to the largest whole number of CPUs that fits under the threshold, so 90% of 4 CPUs would be 3 threads running.) It may be that the t4 tasks failed when I allowed fewer than 100% of CPUs. Now that I have a configuration I'm happy with, I'll keep an eye out for more t4 tasks.

The failed tasks should still have appeared as errors, though; it would make debugging easier.
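
The rounding described in this post can be checked with a line of integer arithmetic. This is only a sketch of the behaviour as described here (round down), with the added assumption that at least one CPU always stays usable; `effective_cpus` is a hypothetical helper name, not a BOINC command, and it only handles whole-number percentages.

```shell
#!/usr/bin/env bash
# effective_cpus TOTAL PCT
# Hypothetical helper: CPUs left usable when max_ncpus_pct is PCT,
# assuming the client rounds down and never drops below 1 (assumption).
effective_cpus() {
    local total=$1 pct=$2
    local n=$(( total * pct / 100 ))   # shell integer division rounds down
    if (( n < 1 )); then
        n=1
    fi
    echo "$n"
}

effective_cpus 4 90    # prints 3
effective_cpus 4 75    # prints 3
effective_cpus 4 100   # prints 4
```

With 4 threads, both 90% and 75% leave 3 usable CPUs, so under these assumptions a t4 task would no longer fit in either case.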

damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Message 769 - Posted: 16 Apr 2020, 8:59:11 UTC - in response to Message 759.  

Yes, I noticed this problem some time ago, but it comes from third-party software (it returns success even though it crashes). I'm looking for a workaround to detect this.
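
One generic way to catch a program that exits with success after crashing is to stop trusting the exit status alone and also require a completion marker in its output. The sketch below is not the project's actual workaround; `run_checked`, the marker string and the log file name are all hypothetical, and the right marker would depend on what the third-party software prints when it really finishes.

```shell
#!/usr/bin/env bash
# run_checked MARKER LOGFILE COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND with its output captured in LOGFILE and
# reports failure if the exit status is non-zero OR MARKER never appears.
run_checked() {
    local marker=$1 log=$2
    shift 2
    "$@" >"$log" 2>&1
    local rc=$?
    if (( rc != 0 )); then
        return "$rc"
    fi
    # A zero exit status is not enough: the completion marker must be there.
    grep -qF -- "$marker" "$log" || return 1
}

# Usage (marker and file names are placeholders):
#   run_checked "JOB FINISHED" task.log ./some_app input.nw || echo "task failed"
```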

Luigi R.
Message 802 - Posted: 23 Apr 2020, 9:25:07 UTC

Why has the server still not sent them to a 3rd wingman after 10 days? I have the same problem.

damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Message 808 - Posted: 25 Apr 2020, 9:41:29 UTC - in response to Message 802.  

I don't know... It's the official code that manages this part.

Alien Seeker
Message 815 - Posted: 26 Apr 2020, 17:02:36 UTC

I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations.

I think tasks should clean up after themselves when they end; even if each directory is rather small, they pile up after a while and the /tmp partition isn't meant to be very big.
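
Until tasks clean up after themselves, the leftovers can be listed by hand. This is a cautious sketch: `list_stale_ompi_dirs` is a hypothetical helper, the 2-day default age is an arbitrary assumption, and it only prints candidates. Deleting them is best done while no task is running, since a live task uses a directory of the same kind.

```shell
#!/usr/bin/env bash
# list_stale_ompi_dirs [BASE] [MIN_AGE_DAYS]
# Hypothetical helper: prints top-level ompi.* directories under BASE
# (default /tmp) whose modification time is more than MIN_AGE_DAYS days old.
list_stale_ompi_dirs() {
    local base=${1:-/tmp} days=${2:-2}
    find "$base" -maxdepth 1 -type d -name 'ompi.*' -mtime +"$days" -print
}

# Usage: review the list first, then remove by hand if it looks right.
#   list_stale_ompi_dirs /tmp 2
```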

crashtech
Message 884 - Posted: 13 Jun 2020, 20:10:25 UTC

Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete.

https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227

Jim1348
Message 885 - Posted: 13 Jun 2020, 21:26:14 UTC

I run all of my work units as t1, by setting "Max # CPUs 1" in the preferences.

I have never seen the problem of the short runs that I can remember.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=2356

crashtech
Message 886 - Posted: 14 Jun 2020, 5:17:06 UTC - in response to Message 885.  
Last modified: 14 Jun 2020, 5:20:04 UTC

Jim1348 wrote:
I run all of my work units as t1, by setting "Max # CPUs 1" in the preferences.

I have never seen the problem of the short runs that I can remember.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=2356


Hi, based on your post, I set up a location in the preferences page here to only allow one CPU, but all the WUs still end prematurely. For now I can't run the project on it, but would like to figure out why the work is failing. It runs other projects without issues, so I don't think it's hardware related.

Jim1348
Message 887 - Posted: 14 Jun 2020, 9:49:27 UTC - in response to Message 886.  
Last modified: 14 Jun 2020, 9:55:43 UTC

Very strange. I don't see anything wrong with your machines.
Maybe memory? Overclocking? It must be something different about that one.

Sometimes files get corrupted though. I would detach from the project, and then re-attach.

crashtech
Message 888 - Posted: 14 Jun 2020, 15:44:58 UTC - in response to Message 887.  

Jim1348 wrote:
Very strange. I don't see anything wrong with your machines.
Maybe memory? Overclocking? It must be something different about that one.

Sometimes files get corrupted though. I would detach from the project, and then re-attach.

Thanks, I have done so more than once, checking the second time to be sure that the project directory was actually removed. There seems to be something about that particular host's configuration that causes QuChemPedIA to fail.

Jim1348
Message 889 - Posted: 14 Jun 2020, 16:11:54 UTC - in response to Message 888.  

Possibly there is a problem with the BOINC installation itself.
It would probably be easier just to upgrade to the latest version, which you can do with this PPA:

sudo add-apt-repository ppa:costamagnagianfranco/boinc
sudo apt-get update

https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc

crashtech
Message 890 - Posted: 15 Jun 2020, 14:50:51 UTC - in response to Message 889.  
Last modified: 15 Jun 2020, 15:01:19 UTC

Jim1348 wrote:
Possibly there is a problem with the BOINC installation itself.
It would probably be easier just to upgrade to the latest version, which you can do with this PPA:

sudo add-apt-repository ppa:costamagnagianfranco/boinc
sudo apt-get update

https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc


Thanks, I have done so, and verified it in the Event Log:
Mon 15 Jun 2020 08:42:37 AM MDT |  | Starting BOINC client version 7.17.0 for x86_64-pc-linux-gnu


Alas, the tasks still error out immediately. There don't seem to be any clues in the stderr output of the failed tasks, either. I wonder if there aren't some installed libraries that this project relies on that I might check and/or re-install.

xii5ku
Message 894 - Posted: 21 Jun 2020, 7:37:50 UTC - in response to Message 815.  

Alien Seeker wrote:
I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations.

I think tasks should clean up after themselves when they end; even if each directory is rather small, they pile up after a while and the /tmp partition isn't meant to be very big.

crashtech wrote:
Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete.

https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227

@crashtech, maybe this host has a full /tmp (as Alien Seeker suspected on their own host). Check with "df -h /tmp", for example.

Or the boinc-client service on this host is set up in a way which does not permit it to create files outside of its data directory, or at least not in /tmp. What does /lib/systemd/system/boinc-client.service contain on this host?
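
Both checks can be done in one go. The sketch below assumes a systemd-based host; `tmp_related_directives` is a hypothetical helper that filters unit-file text for the standard systemd sandboxing directives (PrivateTmp, ProtectSystem, ReadWritePaths, ReadOnlyPaths, InaccessiblePaths) that can hide /tmp or make it read-only for the service. It makes no claim about what this particular host's unit file contains.

```shell
#!/usr/bin/env bash
# tmp_related_directives
# Hypothetical helper: reads unit-file text on stdin and prints only the
# sandboxing directives that can affect access to /tmp.
# Returns non-zero when none of them is present.
tmp_related_directives() {
    grep -E '^(PrivateTmp|ProtectSystem|ReadWritePaths|ReadOnlyPaths|InaccessiblePaths)='
}

# Usage on the affected host:
#   df -h /tmp        # free space
#   df -i /tmp        # free inodes (many small directories can exhaust these)
#   systemctl cat boinc-client.service | tmp_related_directives
```

`systemctl cat` also prints drop-in overrides, so it shows the unit as systemd actually applies it, not only the packaged file.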

Luigi R.
Message 895 - Posted: 21 Jun 2020, 9:15:12 UTC

This happened to me for the first time too.
My computer usually runs QuChemPedIA successfully.
Yesterday I increased the work cache to 10 days. I had ~80 in-progress tasks, downloaded at ~19:45.
This morning at 4 AM they all failed (they went to pending/invalid).

Use this result to see my host: https://quchempedia.univ-angers.fr/athome/result.php?resultid=2386836


P.S. Please ignore the errors. They were caused by bash crashes, which I solved with an OS restart. ;)

xii5ku
Message 896 - Posted: 21 Jun 2020, 9:47:54 UTC
Last modified: 21 Jun 2020, 9:48:58 UTC

Besides a full /tmp, or missing access permissions to /tmp, another potential source of problems could be the TCP port which MPI (Open MPI?) uses.

I have one nwchem_long task running so far, and it occupies port 38253, for example.
This may show you which ports are (or were) in use:
cat /tmp/ompi.*/pid.*/contact.txt
So maybe those who had failures after a few seconds of run time had some conflict which prevented the use of the TCP port?
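
To test the port-conflict idea, a port number taken from contact.txt can be compared against the sockets currently listening. This is a rough sketch: `port_is_listening` is a hypothetical helper, and it assumes the column layout printed by `ss -ltn` (state in the first column, local address:port in the fourth).

```shell
#!/usr/bin/env bash
# port_is_listening PORT
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds if some
# socket is in LISTEN state on the given TCP port (assumes ss column layout).
port_is_listening() {
    local port=$1
    awk -v p="$port" '
        $1 == "LISTEN" && $4 ~ (":" p "$") { found = 1 }
        END { exit !found }
    '
}

# Usage:
#   ss -ltn | port_is_listening 38253 && echo "port 38253 is taken"
```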


Luigi R. wrote:
P.S. Please ignore the errors. They were caused by bash crashes, which I solved with an OS restart. ;)
But maybe those bash crashes were caused by nwchem_long not cleaning up properly.

©2024 Benoit DA MOTA - LERIA, University of Angers, France