Message boards : Number crunching : Suspicious near-instant results with NWChem long t4
Alien Seeker | Joined: 5 Mar 20 | Posts: 13 | Credit: 805,400 | RAC: 0
Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly. The wingmates have yet to return their results, but the near-instant execution looks suspicious.
fzs600 | Joined: 31 Jul 19 | Posts: 3 | Credit: 3,006,937 | RAC: 0
> Alien Seeker wrote:
> Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly.

Looking at my own results, I find several such WUs too:

Sent | Reported | Status | Run time (s) | CPU time (s) | Credit | Application
---|---|---|---|---|---|---
7 Apr 2020, 10:21:20 UTC | 7 Apr 2020, 15:17:09 UTC | Completed, waiting for validation | 4.13 | 1.83 | pending | NWChem long v0.19 (t8)
7 Apr 2020, 10:21:20 UTC | 7 Apr 2020, 15:17:09 UTC | Completed, waiting for validation | 4.20 | 1.77 | pending | NWChem long v0.19 (t8)
5 Apr 2020, 8:40:17 UTC | 5 Apr 2020, 8:40:33 UTC | Completed, waiting for validation | 4.07 | 1.21 | pending | NWChem long v0.19 (t8)
3 Apr 2020, 21:56:21 UTC | 4 Apr 2020, 12:05:17 UTC | Completed, waiting for validation | 3.14 | 1.15 | pending | NWChem long v0.19 (t8)
Joined: 16 Dec 19 | Posts: 25 | Credit: 11,938,843 | RAC: 0
> Alien Seeker wrote:
> Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly.

Either the work units are faulty, or it's a lack of resources. I noticed that both are 4-thread work units and that machine is only a 4-thread CPU. Unfortunately, the wingman of the first unit is hidden, so there's no telling when it will get done (or what machine it might be, until after it's processed). As for the second work unit, we will have to see. You might have to wait for a 3rd wingman in both cases. As for fzs600: all his computers are hidden, and therefore his work units are too.

Edit: The other issue with hidden hosts is that if you are paired with one for multiple work units and find their machine is faulty, there is no way to contact the owner so they can look at their machine and figure out the issue. So their machines keep filling the database with faulty results. Unfortunately, that has happened to me for a lot of work units, so I'm forced to wait for a 3rd wingman to process them.
Alien Seeker | Joined: 5 Mar 20 | Posts: 13 | Credit: 805,400 | RAC: 0
> Either the work units are faulty, or it's a lack of resources. I noticed that both are 4-thread work units and that machine is only a 4-thread CPU.

I can now confirm the problem came from the execution and not the WU itself: result 2278138 failed to validate now that the wingmate has returned their result. It should have appeared as a computing error; there must be a sanity check missing somewhere in the app. I agree the 4-thread version of the app is the likely culprit; a t2 is currently running successfully on the same host. If it happens again with more t4 work units, I'll limit max_cpus. I assume that will stop the server from sending me t4 tasks?
Joined: 16 Dec 19 | Posts: 25 | Credit: 11,938,843 | RAC: 0
> Alien Seeker wrote:
> If it happens again with more t4 work units, I'll limit max_cpus. I assume that will stop the server from sending me t4 tasks?

We will have to see. This is the first time I've seen a project with a settable max CPUs in the preferences page. It does seem to work. I've limited my machines to 90% of all CPUs, but you could set it to 75% so it only uses 3 out of 4 threads, or however you want to set it. I'm not sure how it would react to 90% of 4 threads; would it round down to the nearest thread? It would be interesting to see how it responds. Good luck with the rest of your work units.
Alien Seeker | Joined: 5 Mar 20 | Posts: 13 | Credit: 805,400 | RAC: 0
Now that you mention it, I've played a lot with max_ncpus_pct from the global preferences over the last few days while trying to find a setting that worked for me. (To answer your question: BOINC rounds the CPU count down, so 90% of 4 CPUs means 3 threads running.) It may be that the t4 tasks failed when I allowed fewer than 100% of CPUs. Now that I have a configuration I'm happy with, I'll keep an eye out for more t4 tasks. The failed tasks should still have appeared as errors, though; it would make debugging easier.
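The rounding is just a floor; a quick sketch of the arithmetic (4 CPUs at 90% being the example above):

```bash
# Floor of ncpus * pct / 100; with 4 CPUs and a 90% limit this prints 3.
awk -v ncpus=4 -v pct=90 'BEGIN { print int(ncpus * pct / 100) }'
```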
Joined: 23 Jul 19 | Posts: 289 | Credit: 464,119,561 | RAC: 0
Yes, I noticed this problem some time ago, but the problem comes from third-party software (it returns a success code even though it crashed). I'm looking for a workaround to detect this.
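One conceivable workaround, as a sketch only (the output file name is an assumption, and this is not the project's actual wrapper code): a completed NWChem run normally ends with a timing summary, so the wrapper could grep the output for it instead of trusting the exit code:

```bash
# Sketch of a post-run sanity check. "nwchem.out" is an assumed file name,
# not necessarily what this project's wrapper uses. A normally completed
# NWChem run prints a "Total times" summary near the end of its output.
if ! grep -q 'Total times' nwchem.out; then
    echo "NWChem output looks truncated; treating the run as failed" >&2
    exit 1   # non-zero exit so the task is reported as failed, not successful
fi
```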
Luigi R. | Joined: 7 Nov 19 | Posts: 31 | Credit: 4,245,903 | RAC: 0
Why hasn't the server sent them to a 3rd wingman after 10 days? I have the same problem.
Joined: 23 Jul 19 | Posts: 289 | Credit: 464,119,561 | RAC: 0
I don't know... It's the official server code that manages that part.
Alien Seeker | Joined: 5 Mar 20 | Posts: 13 | Credit: 805,400 | RAC: 0
I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations. I think tasks should clean up after themselves when they end; even if each directory is rather small, they pile up after a while, and the /tmp partition isn't meant to be very big.
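Until then, the cleanup can be scripted; a minimal sketch, assuming (as the directory names suggest) that the pid.NNNN suffix is the PID of the owning process, and run as the user that owns the BOINC tasks:

```bash
# Remove Open MPI session directories in /tmp whose owning process is gone.
# Assumes the pid.NNNN suffix really is the creating process's PID, and that
# this runs as the same user (kill -0 on another user's PID reports failure
# even when the process is alive).
for d in /tmp/ompi.*/pid.*; do
    [ -d "$d" ] || continue
    pid=${d##*/pid.}                      # PID taken from the directory name
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "removing stale $d"
        rm -rf -- "$d"
    fi
done
```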
crashtech | Joined: 9 Dec 19 | Posts: 11 | Credit: 19,162,966 | RAC: 0
Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete:

https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227
Joined: 3 Oct 19 | Posts: 153 | Credit: 32,412,973 | RAC: 0
I run all of my work units as t1, by setting "Max # CPUs" to 1 in the preferences. I have never seen the short-run problem, as far as I can remember:

https://quchempedia.univ-angers.fr/athome/results.php?hostid=2356
crashtech | Joined: 9 Dec 19 | Posts: 11 | Credit: 19,162,966 | RAC: 0
> I run all of my work units as t1, by setting "Max # CPUs" to 1 in the preferences.

Hi, based on your post, I set up a location in the preferences page here to only allow one CPU, but all the WUs still end prematurely. For now I can't run the project on that machine, but I would like to figure out why the work is failing. It runs other projects without issues, so I don't think it's hardware-related.
Joined: 3 Oct 19 | Posts: 153 | Credit: 32,412,973 | RAC: 0
Very strange, I don't see anything wrong with your machines. Maybe memory? Overclocking? It must be something different about that one. Sometimes files get corrupted, though. I would detach from the project and then re-attach.
crashtech | Joined: 9 Dec 19 | Posts: 11 | Credit: 19,162,966 | RAC: 0
> Very strange, I don't see anything wrong with your machines.

Thanks, I have done so more than once, checking the second time to be sure that the project directory was actually removed. There seems to be something about that particular host's configuration that causes QuChemPedIA to fail.
Joined: 3 Oct 19 | Posts: 153 | Credit: 32,412,973 | RAC: 0
Possibly there is a problem with the BOINC installation itself. It would probably be easier just to upgrade to the latest version, which you can do with this PPA:

sudo add-apt-repository ppa:costamagnagianfranco/boinc
sudo apt-get update
sudo apt-get install boinc-client boinc-manager

https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc
crashtech | Joined: 9 Dec 19 | Posts: 11 | Credit: 19,162,966 | RAC: 0
> Possibly there is a problem with the BOINC installation itself.

Thanks, I have done so, and verified it in the Event Log:

Mon 15 Jun 2020 08:42:37 AM MDT | | Starting BOINC client version 7.17.0 for x86_64-pc-linux-gnu

Alas, the tasks still error out immediately, and there don't seem to be any clues in the stderr output of the failed tasks, either. I wonder if there are some installed libraries that this project relies on that I could check and/or re-install.
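One way to look for missing shared libraries is to run ldd over the project's binaries; a sketch, assuming the default Ubuntu BOINC data directory and a guessed project folder name (adjust both for your install):

```bash
# List unresolved shared-library dependencies of the project's binaries.
# Both the data directory and the project folder name below are assumptions;
# check your own BOINC data directory for the exact path.
cd /var/lib/boinc-client/projects/quchempedia.univ-angers.fr_athome || exit 1
for f in *; do
    if file "$f" | grep -q 'ELF'; then
        missing=$(ldd "$f" 2>/dev/null | grep 'not found')
        [ -n "$missing" ] && printf '%s:\n%s\n' "$f" "$missing"
    fi
done
```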
Joined: 21 Jun 20 | Posts: 24 | Credit: 68,559,000 | RAC: 0
> Alien Seeker wrote:
> I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations.

> crashtech wrote:
> Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete.

@crashtech, maybe this host has a full /tmp (as Alien Seeker suspected on their own host); check with "df -h /tmp", for example. Or the boinc-client service on this host is set up in a way which does not permit it to create files outside of its data directory, or at least not in /tmp. What does /lib/systemd/system/boinc-client.service contain on this host?
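In particular, systemd sandboxing directives in the unit file can restrict access to /tmp; a quick way to check (the grep pattern below is just a suggestion of directives worth looking at):

```bash
# Show the full unit file:
systemctl cat boinc-client.service

# Or filter for sandboxing directives that could restrict /tmp access
# (PrivateTmp=true, for example, gives the service its own private /tmp,
# invisible to other processes):
systemctl cat boinc-client.service | grep -Ei 'PrivateTmp|ProtectSystem|ProtectHome|ReadWritePaths'
```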
Luigi R. | Joined: 7 Nov 19 | Posts: 31 | Credit: 4,245,903 | RAC: 0
This happened to me for the first time too. My computer usually runs QuChemPedIA successfully. Yesterday I increased my work cache to 10 days, and by ~19:45 I had ~80 in-progress tasks downloaded. This morning at 4 AM, all of them had failed (they went to pending/invalid). Use this result to see my host:

https://quchempedia.univ-angers.fr/athome/result.php?resultid=2386836

P.S. Please don't mind the errors; they were caused by bash crashes, and I fixed it with an OS restart. ;)
Joined: 21 Jun 20 | Posts: 24 | Credit: 68,559,000 | RAC: 0
Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be issues with the TCP port which MPI (Open MPI?) uses. I have one nwchem_long task running so far, and it occupies port 38253, for example. This may show you what ports are (or were) in use:

cat /tmp/ompi.*/pid.*/contact.txt

So, maybe those who had failures after a few seconds of run time had some conflict which prevented the use of the TCP port?

> Luigi R. wrote:
> P.S. Please don't mind the errors; they were caused by bash crashes, and I fixed it with an OS restart. ;)

But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
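To test the port-conflict theory on a live host, one could compare the advertised port against the sockets actually in use; a sketch (38253 is only the example above, and the tcp:// pattern assumes the usual Open MPI contact-file format):

```bash
# Ports advertised in the Open MPI contact files; assumes the usual
# ORTE URI format with entries like ...;tcp://<ip>:<port>
grep -ho 'tcp://[0-9.]*:[0-9]*' /tmp/ompi.*/pid.*/contact.txt

# Is a given port already taken by another process?
# (38253 is only the example mentioned above.)
ss -tlnp | grep ':38253'
```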