Suspicious near-instant results with NWChem long t4

Alien Seeker Joined: 5 Mar 20 Posts: 13 Credit: 805,400 RAC: 0	Message 753 - Posted: 12 Apr 2020, 3:10:41 UTC
Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly. The wingmates are still to return their results but the instant execution looks suspicious.

fzs600 Joined: 31 Jul 19 Posts: 3 Credit: 3,006,937 RAC: 0	Message 754 - Posted: 12 Apr 2020, 7:28:54 UTC - in response to Message 753.
Alien Seeker wrote: Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly. The wingmates are still to return their results but the instant execution looks suspicious.
Looking at my own results, I find several such WUs too:
Sent 7 Apr 2020, 10:21:20 UTC; reported 7 Apr 2020, 15:17:09 UTC; completed, pending validation; run time 4.13 s; CPU time 1.83 s; credit pending; NWChem long v0.19 (t8) x86_64-pc-linux-gnu
Sent 7 Apr 2020, 10:21:20 UTC; reported 7 Apr 2020, 15:17:09 UTC; completed, pending validation; run time 4.20 s; CPU time 1.77 s; credit pending; NWChem long v0.19 (t8) x86_64-pc-linux-gnu
Sent 5 Apr 2020, 8:40:17 UTC; reported 5 Apr 2020, 8:40:33 UTC; completed, pending validation; run time 4.07 s; CPU time 1.21 s; credit pending; NWChem long v0.19 (t8) x86_64-pc-linux-gnu
Sent 3 Apr 2020, 21:56:21 UTC; reported 4 Apr 2020, 12:05:17 UTC; completed, pending validation; run time 3.14 s; CPU time 1.15 s; credit pending; NWChem long v0.19 (t8) x86_64-pc-linux-gnu

Zalster Joined: 16 Dec 19 Posts: 25 Credit: 11,938,843 RAC: 0	Message 756 - Posted: 12 Apr 2020, 13:49:19 UTC - in response to Message 753. Last modified: 12 Apr 2020, 13:55:44 UTC
Alien Seeker wrote: Two tasks supposedly terminated "successfully" after only a few seconds on one of my computers: 2276475 and 2278138. Both tasks were using the t4 version of NWChem long, and I suspect there's an error somewhere which wasn't detected properly. The wingmates are still to return their results but the instant execution looks suspicious.
Either the work units are faulty or there's a lack of resources. I noticed that both are 4-thread work units and that machine has only a 4-thread CPU. Unfortunately, the wingman of the first unit is hidden, so there's no telling when it will get done (or what machine it might be until after it's processed). For the second work unit we will have to wait and see. We might have to wait for a 3rd wingman in both cases. As for fzs600: all his computers are hidden, and therefore his work units are too.
Edit: The other issue with hidden hosts is that if you are paired with one for multiple work units and find their machine is faulty, there is no way to contact them so they can look at their machine and figure out the issue. So their machines continue to fill the database with faulty results. I've been unlucky enough to have that happen for a lot of work units, so I'm forced to wait for a 3rd wingman to process them.

Alien Seeker Joined: 5 Mar 20 Posts: 13 Credit: 805,400 RAC: 0	Message 757 - Posted: 12 Apr 2020, 16:27:25 UTC - in response to Message 756.
Zalster wrote: Either the work units are faulty or there's a lack of resources. I noticed that both are 4-thread work units and that machine has only a 4-thread CPU.
I can now confirm the problem came from the execution and not the WU itself: result 2278138 failed to validate now that the wingmate has returned their result. It should have appeared as a computing error; there must be a sanity check missing somewhere in the app. I agree the 4-thread version of the app is the likely culprit; a t2 task is currently running successfully on the same host. If it happens again with more t4 tasks, I'll limit max_cpus. I assume that will stop the server from sending me t4 tasks?

Zalster Joined: 16 Dec 19 Posts: 25 Credit: 11,938,843 RAC: 0	Message 758 - Posted: 12 Apr 2020, 16:32:49 UTC - in response to Message 757.
Alien Seeker wrote: If it happens again with more t4 tasks, I'll limit max_cpus. I assume that will stop the server from sending me t4 tasks?
We will have to see. This is the first time I've seen a project with a max CPUs setting on the preferences page. It does seem to work. I've limited my machines to 90% of all CPUs, but you could set it to 75% so it only uses 3 out of 4 threads, or however you want to set it. I'm not sure how it would react to 90% of 4 threads. Would it round down to the nearest thread? It would be interesting to see how it responds. Good luck with the rest of your work units.

Alien Seeker Joined: 5 Mar 20 Posts: 13 Credit: 805,400 RAC: 0	Message 759 - Posted: 12 Apr 2020, 16:58:16 UTC - in response to Message 758.
Now you mention it, I've played a lot with max_ncpus_pct from the global preferences over the last few days while trying to find a setting that worked for me. (To answer your question, BOINC rounds down to the highest whole number below the threshold, so 90% of 4 CPUs would be 3 threads running.) It may be that the t4 tasks failed when I allowed fewer than 100% of CPUs. Now that I have a configuration I'm happy with, I'll keep an eye out for more t4 tasks. The failed tasks should still have appeared as errors though; that would make debugging easier.
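
The rounding described in this post can be checked with a few lines of arithmetic. This is a minimal sketch and not BOINC's actual code: the function name `usable_cpus` is hypothetical, and the floor of one usable CPU is an assumption added so that very small percentages still leave something to run on.

```python
import math


def usable_cpus(ncpus: int, max_ncpus_pct: float) -> int:
    """Number of CPUs left after applying a percentage cap, rounded down."""
    # Round down to a whole CPU, as described in the post above.
    capped = math.floor(ncpus * max_ncpus_pct / 100)
    # Assumption: never report fewer than one usable CPU.
    return max(1, capped)


if __name__ == "__main__":
    for pct in (100, 90, 75, 50):
        print(f"{pct}% of 4 CPUs -> {usable_cpus(4, pct)} thread(s)")
```

Under this rule a 4-thread host capped at anything below 100% has at most 3 threads available, which would fit the suspicion that t4 tasks ran into trouble while the cap was in place.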

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 769 - Posted: 16 Apr 2020, 8:59:11 UTC - in response to Message 759.
Yes, I noticed this problem some time ago, but it comes from third-party software (it returns success even when it crashes). I'm looking for a workaround to detect this.

Luigi R. Joined: 7 Nov 19 Posts: 31 Credit: 4,245,903 RAC: 0	Message 802 - Posted: 23 Apr 2020, 9:25:07 UTC
Why has the server still not sent them to a 3rd wingman after 10 days? I have the same problem.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 808 - Posted: 25 Apr 2020, 9:41:29 UTC - in response to Message 802.
I don't know... It's the official code that manages this part.

Alien Seeker Joined: 5 Mar 20 Posts: 13 Credit: 805,400 RAC: 0	Message 815 - Posted: 26 Apr 2020, 17:02:36 UTC
I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations. I think tasks should clean up after themselves when they end; even if each directory is rather small, they pile up after a while and the /tmp partition isn't meant to be very big.
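
For anyone who wants to check for the same build-up before deleting anything, the following is a rough sketch that only lists candidates and reports free space. The helper names `find_ompi_leftovers` and `free_fraction` are hypothetical, and the `ompi.*` pattern is taken from the directory name quoted in the post; whether a given directory is really stale still has to be judged by hand.

```python
import glob
import os
import shutil


def find_ompi_leftovers(tmp_root: str = "/tmp") -> list:
    """Return directories directly under tmp_root whose names start with 'ompi.'."""
    pattern = os.path.join(tmp_root, "ompi.*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def free_fraction(path: str = "/tmp") -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


if __name__ == "__main__":
    leftovers = find_ompi_leftovers()
    print(f"/tmp is {free_fraction():.0%} free")
    print(f"{len(leftovers)} ompi.* directories found:")
    for path in leftovers:
        print("  ", path)
```

The sketch deliberately does not remove anything, since a directory may belong to a task that is still running.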

crashtech Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0	Message 884 - Posted: 13 Jun 2020, 20:10:25 UTC
Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227

Jim1348 Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0	Message 885 - Posted: 13 Jun 2020, 21:26:14 UTC
I run all of my work units as t1, by setting "Max # CPUs 1" in the preferences. As far as I can remember, I have never seen the problem of the short runs. https://quchempedia.univ-angers.fr/athome/results.php?hostid=2356

crashtech Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0	Message 886 - Posted: 14 Jun 2020, 5:17:06 UTC - in response to Message 885. Last modified: 14 Jun 2020, 5:20:04 UTC
Jim1348 wrote: I run all of my work units as t1, by setting "Max # CPUs 1" in the preferences. I have never seen the problem of the short runs that I can remember. https://quchempedia.univ-angers.fr/athome/results.php?hostid=2356
Hi, based on your post, I set up a location in the preferences page here to only allow one CPU, but all the WUs still end prematurely. For now I can't run the project on it, but would like to figure out why the work is failing. It runs other projects without issues, so I don't think it's hardware related.

Jim1348 Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0	Message 887 - Posted: 14 Jun 2020, 9:49:27 UTC - in response to Message 886. Last modified: 14 Jun 2020, 9:55:43 UTC
Very strange. I don't see anything wrong with your machines. Maybe memory? Overclocking? It must be something different about that one. Sometimes files get corrupted though. I would detach from the project, and then re-attach.

crashtech Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0	Message 888 - Posted: 14 Jun 2020, 15:44:58 UTC - in response to Message 887.
Jim1348 wrote: Very strange. I don't see anything wrong with your machines. Maybe memory? Overclocking? It must be something different about that one. Sometimes files get corrupted though. I would detach from the project, and then re-attach.
Thanks, I have done so more than once, checking the second time to be sure that the project directory was actually removed. There seems to be something about that particular host's configuration that causes QuChemPedIA to fail.

Jim1348 Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0	Message 889 - Posted: 14 Jun 2020, 16:11:54 UTC - in response to Message 888.
Possibly there is a problem with the BOINC installation itself. It would probably be easier just to upgrade to the latest version, which you can do with this PPA:
sudo add-apt-repository ppa:costamagnagianfranco/boinc
sudo apt-get update
https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc

crashtech Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0	Message 890 - Posted: 15 Jun 2020, 14:50:51 UTC - in response to Message 889. Last modified: 15 Jun 2020, 15:01:19 UTC
Jim1348 wrote: Possibly there is a problem with the BOINC installation itself. It would probably be easier just to upgrade to the latest version, which you can do with this PPA: sudo add-apt-repository ppa:costamagnagianfranco/boinc sudo apt-get update https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc
Thanks, I have done so, and verified it in the Event Log:
Mon 15 Jun 2020 08:42:37 AM MDT | | Starting BOINC client version 7.17.0 for x86_64-pc-linux-gnu
Alas, the tasks still error out immediately. There don't seem to be any clues in the stderr output of the failed tasks, either. I wonder if there aren't some installed libraries that this project relies on that I might check and/or re-install.

xii5ku Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0	Message 894 - Posted: 21 Jun 2020, 7:37:50 UTC - in response to Message 815.
Alien Seeker wrote: I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations. I think tasks should clean up after themselves when they end; even if each directory is rather small, they pile up after a while and the /tmp partition isn't meant to be very big.
crashtech wrote: Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227
@crashtech, maybe this host has a full /tmp (as Alien Seeker suspected on their own host). Check with "df -h /tmp", for example. Or the boinc-client service on this host may be set up in a way that does not permit it to create files outside of its data directory, or at least not in /tmp. What does /lib/systemd/system/boinc-client.service contain on this host?
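
As a rough aid for reading that service file, the sketch below pulls out the lines that could restrict file access. The directive names in `SANDBOX_KEYS` are real systemd sandboxing options, but whether any of them appear in a particular boinc-client.service is an assumption to verify on the host; the function name `sandbox_settings` is hypothetical.

```python
# Real systemd sandboxing directives; which of them (if any) a given
# boinc-client.service sets is an assumption to check on the host itself.
SANDBOX_KEYS = (
    "PrivateTmp",
    "ProtectSystem",
    "ReadWritePaths",
    "ReadOnlyPaths",
    "InaccessiblePaths",
)


def sandbox_settings(unit_text: str) -> dict:
    """Collect sandboxing directives found in the text of a systemd unit file."""
    found = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as [Service].
        if not line or line[0] in "#;[" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in SANDBOX_KEYS:
            found[key.strip()] = value.strip()
    return found


if __name__ == "__main__":
    path = "/lib/systemd/system/boinc-client.service"  # path from the post above
    try:
        with open(path) as handle:
            print(sandbox_settings(handle.read()))
    except OSError as err:
        print(f"cannot read {path}: {err}")
```

If the output includes `PrivateTmp=true`, the service gets its own private /tmp, so files the tasks write there would not show up in the /tmp seen from a normal shell.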

Luigi R. Joined: 7 Nov 19 Posts: 31 Credit: 4,245,903 RAC: 0	Message 895 - Posted: 21 Jun 2020, 9:15:12 UTC
This happened to me for the first time too. My computer usually runs QuChemPedIA successfully. Yesterday I increased the work cache to 10 days. I had ~80 in-progress tasks downloaded at ~19:45. This morning at 4 AM all of them failed (they went to pending/invalid). Use this result to see my host: https://quchempedia.univ-angers.fr/athome/result.php?resultid=2386836
P.S. Please ignore the errors. They were caused by bash crashes, which I solved with an OS restart. ;)

xii5ku Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0	Message 896 - Posted: 21 Jun 2020, 9:47:54 UTC Last modified: 21 Jun 2020, 9:48:58 UTC
Besides a full /tmp, or missing access permissions to /tmp, another potential source of problems could be the TCP port which MPI (Open MPI?) uses. I have one nwchem_long task running so far, and it currently occupies port 38253, for example. This may show you what ports are (or were) in use:
cat /tmp/ompi./pid./contact.txt
So, maybe those who had failures after a few seconds of run time had some conflict which prevented the use of the TCP port?
Luigi R. wrote: P.S. Please ignore the errors. They were caused by bash crashes, which I solved with an OS restart. ;)
But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
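
One way to test the port-conflict idea is to see whether a given TCP port can be bound at all. The sketch below is only an illustration under stated assumptions: the helper `port_is_free` is hypothetical, it checks IPv4 on the loopback address only, and it says nothing about how Open MPI actually chooses its ports.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


if __name__ == "__main__":
    port = 38253  # the port mentioned in the post above
    print(f"port {port} free on localhost: {port_is_free(port)}")
```

A result of False while no task is running would hint that something else holds the port; True only shows that the port was free at the moment of the check.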