Posts by xii5ku

21) Message boards : Number crunching : "Multithreading" in prefs (Message 900)
Posted 21 Jun 2020 by xii5ku
Post:
Sorry for reviving an old thread, but the issue is still relevant, so here we go:

dannyridel wrote:
I've set the multithread settings to a max of 4 CPUS. Somehow I keep on getting work that runs on 1 CPU core only.
Check the apps.php page, i.e. "Computing -> Applications" at the top of the web page. At this time, it indicates that
    the "NWChem" application is single-threaded,
    the "NWChem long" application is single-threaded on Windows,
    the "NWChem long" application is currently single-threaded on Linux unless "Run test applications?" is switched to Yes in the project preferences, until the 2t/4t/8t versions are promoted out of beta status.

At least that's how I understand it.

22) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 899)
Posted 21 Jun 2020 by xii5ku
Post:
Luigi R. wrote:
xii5ku wrote:
Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be [...]
What do you mean by a full /tmp? 0 bytes?
This morning I had 600MB free space.
I deleted some log files and now it is 3.2GB.
On my host, each nwchem_long task takes 8.2 MBytes in /tmp. (BTW, I completed three tasks by now, and out of these three, one did not remove its "pid.*" subdirectory in /tmp/ompi.*/.)

Obviously, 8.2 MBytes is not much. If /tmp no longer has room for even this small amount, the host may show other serious problems outside of boinc as well.
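To put numbers on this, here is a minimal sketch that reports the free space of the filesystem holding a directory and the combined size of the ompi.* session directories in it. The function names free_kb and ompi_usage_kb are made up for this example, and the ompi.* naming is taken from the paths mentioned in this thread.

```shell
# Print the free space (in KiB) of the filesystem that holds the given directory.
free_kb() {
    # -P forces the POSIX one-line-per-filesystem format; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print the combined size (in KiB) of all ompi.* directories below the given directory.
ompi_usage_kb() {
    total=0
    for d in "$1"/ompi.*; do
        [ -d "$d" ] || continue      # skips the unexpanded pattern when nothing matches
        size=$(du -sk "$d" | awk '{ print $1 }')
        total=$((total + size))
    done
    echo "$total"
}
```

Typical use would be "free_kb /tmp" and "ompi_usage_kb /tmp"; directories owned by other users may produce permission warnings from du.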

23) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 896)
Posted 21 Jun 2020 by xii5ku
Post:
Besides a full /tmp, or missing access permissions to /tmp, another potential source of problems could be the TCP port which MPI (Open MPI?) uses.

I have one nwchem_long task running so far, and it occupies port 38253, for example.
This may show you which ports are (or were) in use:
cat /tmp/ompi.*/pid.*/contact.txt
So, maybe those who had failures after a few seconds of run time had some conflict which prevented the use of the TCP port?
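To pull just the port numbers out of those files, a small filter like the following could help. This is only a sketch: it assumes that contact.txt holds Open MPI contact URIs of the form "<jobid>;tcp://<address>[,<address>...]:<port>", which is not confirmed anywhere in this thread, and the name contact_ports is made up for the example.

```shell
# Read text on stdin and print the unique TCP port numbers found in tcp:// URIs.
# Assumed (unverified) line format: "<jobid>;tcp://<addr>[,<addr>...]:<port>"
contact_ports() {
    grep -oE 'tcp6?://[^;[:space:]]*:[0-9]+' |   # keep only the URI parts
        sed -E 's/.*:([0-9]+)$/\1/' |            # strip everything up to the last colon
        sort -un                                 # numeric sort, duplicates removed
}
```

It would be fed with "cat /tmp/ompi.*/pid.*/contact.txt | contact_ports". The result can then be compared by hand with the listening sockets shown by "ss -tln".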


Luigi R. wrote:
P.S. Please don't mind the errors. They were caused by bash crashes, and I solved that with an OS restart. ;)
But maybe those bash crashes were caused by nwchem_long not cleaning up properly.

24) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 894)
Posted 21 Jun 2020 by xii5ku
Post:
Alien Seeker wrote:
I've had the problem again, this time on the other computer and with only 1 core per task. I suspect the reason this time was a full /tmp; although I didn't check the size, the problem vanished when I removed the many leftover /tmp/ompi.hostname.123/pid.1234 directories from previous computations.

I think tasks should clean up after themselves when they end; even if each directory is rather small, they pile up after a while and the /tmp partition isn't meant to be very big.
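Until tasks clean up after themselves, leftover directories can at least be listed before removing them by hand. The sketch below assumes that the number in "pid.<N>" is the process ID of the run that created the directory (suggested by the naming, but not confirmed here), and it relies on Linux's /proc. The function name stale_ompi_dirs is hypothetical. It only prints candidates and deletes nothing.

```shell
# Print pid.<N> directories below <base>/ompi.* for which no process <N> exists.
# Assumption: <N> is the PID of the creating process; Linux-only because of /proc.
stale_ompi_dirs() {
    for d in "$1"/ompi.*/pid.*; do
        [ -d "$d" ] || continue
        pid=${d##*/pid.}                 # text after the last "/pid."
        case $pid in
            ''|*[!0-9]*) continue ;;     # skip names that are not pid.<number>
        esac
        [ -e "/proc/$pid" ] || echo "$d" # no such process: likely a leftover
    done
}
```

Run it as "stale_ompi_dirs /tmp" and review the output before deleting anything. PIDs get reused, so a leftover directory can be missed if an unrelated process currently has the same number.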

crashtech wrote:
Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete

https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227

@crashtech, maybe this host has a full /tmp (as Alien Seeker suspected for their own host). Check with "df -h /tmp" for example.

Alternatively, the boinc-client service on this host may be set up in a way that does not permit it to create files outside of its data directory, or at least not in /tmp. What does /lib/systemd/system/boinc-client.service contain on this host?
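To narrow down the unit file to the relevant lines, one could filter for the systemd sandboxing directives that can restrict or redirect access to /tmp. This is a sketch only: the function name tmp_related_directives is made up, and the list of directives is a selection rather than a complete one.

```shell
# Print the lines of a systemd unit file that set sandboxing options
# which can affect whether the service may write to /tmp.
tmp_related_directives() {
    grep -E '^[[:space:]]*(PrivateTmp|ProtectSystem|ReadWritePaths|ReadOnlyPaths|InaccessiblePaths)=' "$1"
}
```

Usage would be "tmp_related_directives /lib/systemd/system/boinc-client.service". Since drop-in files can override the packaged unit, the output of "systemctl cat boinc-client.service" is worth a look as well.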



©2024 Benoit DA MOTA - LERIA, University of Angers, France