Suspicious near-instant results with NWChem long t4

Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 897 - Posted: 21 Jun 2020, 11:13:29 UTC - in response to Message 896.  
Last modified: 21 Jun 2020, 11:22:56 UTC

Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be issues with the TCP port which MPI (Open MPI?) uses.
What do you mean by a full /tmp? 0 bytes free?
This morning I had 600MB free space.
I deleted some log files and now it is 3.2GB.


I have one nwchem_long task running so far, and this for example occupies the port 38253.
This may show you what ports are (or were) in use:
cat /tmp/ompi.*/pid.*/contact.txt
So, maybe those who had failures after a few seconds run time had some conflict which prevented the use of the TCP port?
2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553
2084

2383937536.0;tcp://192.168.1.6,192.168.1.15,172.17.0.1:47451
10730

3311796224.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:39214
25236

3311337472.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:55077
25261

Random ports, I guess.
I will check the ports of failed tasks if it happens again and if that is still possible after a failure. Otherwise we need to log the ports in use.
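
Logging the ports in use could be sketched with a small shell function like the one below. It assumes the contact.txt layout shown in the listing above (first line ending in ":<port>"); the function name list_ompi_ports is made up for illustration and is not part of Open MPI or BOINC.

```shell
# Print "<contact file> <port>" for each Open MPI session found under a base directory.
# Assumption: line 1 of contact.txt ends in ":<port>", as in the listing above.
list_ompi_ports() {
    base="${1:-/tmp}"
    for f in "$base"/ompi.*/pid.*/contact.txt; do
        [ -f "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        port=$(sed -n '1s/.*:\([0-9][0-9]*\)$/\1/p' "$f")
        printf '%s %s\n' "$f" "$port"
    done
}
```

Calling "list_ompi_ports /tmp" periodically and appending the output to a file would leave a record that can be compared against tasks which fail after a few seconds.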


But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
I don't know. I thought it was not related to this problem.
ID: 897 · Rating: 0
Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 898 - Posted: 21 Jun 2020, 11:19:47 UTC - in response to Message 897.  

me wrote:
2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553
2084

2383937536.0;tcp://192.168.1.6,192.168.1.15,172.17.0.1:47451
10730

3311796224.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:39214
25236

3311337472.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:55077
25261

Note that 192.168.1.6 is the eth0 IP and 192.168.1.4 is the wlan0 IP.
192.168.1.15 is a wlan0 IP too, but that task is the oldest one and has been running for 11 hours.
ID: 898 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 899 - Posted: 21 Jun 2020, 16:38:33 UTC - in response to Message 897.  
Last modified: 21 Jun 2020, 16:39:23 UTC

Luigi R. wrote:
xii5ku wrote:
Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be [...]
What do you mean by a full /tmp? 0 bytes free?
This morning I had 600MB free space.
I deleted some log files and now it is 3.2GB.
On my host, each nwchem_long task takes 8.2 MBytes in /tmp. (BTW, I completed three tasks by now, and out of these three, one did not remove its "pid.*" subdirectory in /tmp/ompi.*/.)

Obviously, 8.2 MBytes is not much. If /tmp has no room left even for this small amount, the host is likely to show other serious problems outside of BOINC as well.
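
Leftover directories of the kind mentioned above could be listed with a sketch like the following. It assumes that the number in "pid.<N>" is the process ID of the Open MPI job (the thread does not confirm this) and that the host exposes /proc; list_stale_ompi_dirs is a hypothetical name.

```shell
# Print every pid.<N> session directory whose process <N> no longer exists.
# Assumptions: <N> is the job's PID, and /proc/<N> exists while the job runs.
list_stale_ompi_dirs() {
    base="${1:-/tmp}"
    for d in "$base"/ompi.*/pid.*; do
        [ -d "$d" ] || continue
        pid=${d##*/pid.}
        case "$pid" in
            ''|*[!0-9]*) continue ;;   # not a plain number, leave it alone
        esac
        [ -d "/proc/$pid" ] || printf '%s\n' "$d"
    done
}
```

A PID can be reused by an unrelated process, so a directory that is not listed is not guaranteed to belong to a running task. Review the output by hand before deleting anything.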
ID: 899 · Rating: 0
crashtech

Joined: 9 Dec 19
Posts: 11
Credit: 19,162,966
RAC: 0
Message 902 - Posted: 21 Jun 2020, 18:40:33 UTC - in response to Message 894.  


crashtech wrote:
Has there been a resolution to this issue? One of my computers only runs WUs for a few seconds, then marks them as complete

https://quchempedia.univ-angers.fr/athome/results.php?hostid=1227

@crashtech, maybe this host has a full /tmp (as Alien Seeker suspected on their own host). Check with "df -h /tmp" for example.

Taking your suggestions one at a time: it looks as if "df -h /tmp" is not doing what it is intended to do in this case, which is to give the size of /tmp. What the command does do, as further experimentation showed, is give the total usage of /dev/sda5, at least when executed on this particular host. It does this no matter which directory is given as a target:

ga7pxsl@GAX570UD_test:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       228G   37G  180G  17% /
ga7pxsl@GAX570UD_test:~$ df -h /home/ga7pxsl
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       228G   37G  180G  17% /
ga7pxsl@GAX570UD_test:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       228G   37G  180G  17% /

It does do something different if no target directory is given, which might provide a clue to someone who knows something:

ga7pxsl@GAX570UD_test:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/sda5       228G   37G  180G  17% /
tmpfs            16G  208K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/sda1       511M  6.1M  505M   2% /boot/efi
tmpfs           3.2G   32K  3.2G   1% /run/user/1000

But, looking at /tmp in the graphical file manager (the thing I sort of know how to use) as root, the Properties tab tells me there is less than 100KB in /tmp.

Or the boinc-client service on this host is set up in a way which does not permit it to create files outside of its data directory, or at least not in /tmp. What does /lib/systemd/system/boinc-client.service contain on this host?


[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target
ID: 902 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 903 - Posted: 21 Jun 2020, 20:41:43 UTC - in response to Message 902.  
Last modified: 21 Jun 2020, 20:48:06 UTC

@crashtech, "df" reports "file system disk space usage", i.e. the used space and available space in the filesystem in which the optionally given file or directory resides. My main intention was to verify how much free space is left in your /tmp. We now know that there is plenty of space left in it. (There are 180 GBytes available in /tmp.)
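
The difference between the two figures can be shown with a minimal sketch. Both function names are made up for illustration; the commands themselves are plain POSIX df, du and awk.

```shell
# Free space (KiB) in the filesystem that holds a directory: what df reports.
fs_avail_kb() { df -Pk "${1:-/tmp}" | awk 'NR==2 { print $4 }'; }

# Space (KiB) occupied by the directory's own contents: what du reports.
# Errors from unreadable subdirectories are suppressed.
dir_used_kb() { du -sk "${1:-/tmp}" 2>/dev/null | awk '{ print $1 }'; }
```

On a host where /tmp is part of the root filesystem, "fs_avail_kb /tmp" and "fs_avail_kb /" should give the same number, while "dir_used_kb /tmp" only counts what is stored under /tmp itself. Note that a service started with PrivateTmp=true sees its own private /tmp, not the one visible from a login shell.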

As for the boinc-client.service unit file: Compared with the boinc-client.service file on my computers, yours has several extra lines. The following four, explained in "man systemd.exec", stick out to me:

ProtectHome=true
    Most likely harmless to the NWChem (...long) application.


PrivateTmp=true

    In theory this should be OK for NWChem long.


ProtectSystem=strict

    This is probably the culprit! As I understand the documentation, this will make /tmp read-only.
    Either relax this from strict to full, or append
      -/tmp

    to the ReadWritePaths line.
    Then restart the boinc-client service. Or maybe you even need to reboot, I don't know.
    Then fetch one QuChemPedIA task and see if it runs normally.


ProtectControlGroups=true

    In theory this should be OK.


(Documentation of the systemd service file format is spread over "man systemd.unit", "man systemd.service", and "man systemd.exec".)
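
The ProtectSystem change suggested above could also be made through a drop-in file instead of editing the packaged unit, which is what the "systemctl edit" comment in the unit file points to. The sketch below only writes the drop-in text to a path of your choice; the function name is hypothetical, and whether this setting is really the cause is unconfirmed.

```shell
# Write a drop-in that relaxes ProtectSystem from strict to full.
# Usage: write_boinc_override /path/to/override.conf
write_boinc_override() {
    out="$1"
    cat > "$out" <<'EOF'
[Service]
ProtectSystem=full
EOF
}
```

To apply it, the file would go to /etc/systemd/system/boinc-client.service.d/override.conf, followed by "sudo systemctl daemon-reload" and "sudo systemctl restart boinc-client". The effective values can then be checked with "systemctl show boinc-client -p ProtectSystem -p PrivateTmp".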

ID: 903 · Rating: 0
crashtech

Joined: 9 Dec 19
Posts: 11
Credit: 19,162,966
RAC: 0
Message 904 - Posted: 22 Jun 2020, 17:20:03 UTC - in response to Message 903.  

@crashtech, "df" reports "file system disk space usage", i.e. the used space and available space in the filesystem in which the optionally given file or directory resides. My main intention was to verify how much free space is left in your /tmp. We now know that there is plenty of space left in it. (There are 180 GBytes available in /tmp.)

As for the boinc-client.service unit file: Compared with the boinc-client.service file on my computers, yours has several extra lines. The following four, explained in "man systemd.exec", stick out to me:

ProtectHome=true
    Most likely harmless to the NWChem (...long) application.


PrivateTmp=true

    In theory this should be OK for NWChem long.


ProtectSystem=strict

    This is probably the culprit! As I understand the documentation, this will make /tmp read-only.
    Either relax this from strict to full, or append
      -/tmp

    to the ReadWritePaths line.
    Then restart the boinc-client service. Or maybe you even need to reboot, I don't know.
    Then fetch one QuChemPedIA task and see if it runs normally.


ProtectControlGroups=true

    In theory this should be OK.


(Documentation of the systemd service file format is spread over "man systemd.unit", "man systemd.service", and "man systemd.exec".)


Thank you xii5ku! First I appended -/tmp to the ReadWritePaths line and rebooted, but QuChemPedIA would not run. Then I changed "strict" to "full" and rebooted, but it still won't run! It's a real puzzle.
ID: 904 · Rating: 0
Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 905 - Posted: 22 Jun 2020, 18:38:01 UTC - in response to Message 896.  

Luigi R. wrote:
P.S. Please don't worry about the errors. They are caused by bash crashes, and I solved them with an OS restart. ;)
But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
Maybe it's off-topic, but I found _bin_bash.1000.crash in /var/crash from the last bash crash.
https://pastebin.com/j70fnPxW
ID: 905 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 906 - Posted: 22 Jun 2020, 20:30:14 UTC - in response to Message 904.  

@crashtech, in addition to ProtectSystem=full, you could try: PrivateTmp=false
ID: 906 · Rating: 0
crashtech

Joined: 9 Dec 19
Posts: 11
Credit: 19,162,966
RAC: 0
Message 908 - Posted: 23 Jun 2020, 5:05:12 UTC - in response to Message 906.  
Last modified: 23 Jun 2020, 5:05:46 UTC

@crashtech, in addition to ProtectSystem=full, you could try: PrivateTmp=false

Done, still nothing! One of the other things I tried was comparing boinc-client.service on a working host with the one on the non-working host, and commenting out all of the extra lines that are found in the non-working one. That also did not work. The temptation for me is to move my BOINC data directories to temporary storage, then "nuke and pave" the installation and start fresh. I realize that is more something out of the Windows noob playbook and is possibly offensive to a Linux pro.
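
The comparison of the two unit files could be done mechanically with a sketch like this one. The function names are hypothetical; it ignores comments and blank lines and prints only the directives that appear in the second file but not in the first.

```shell
# Active (non-comment, non-blank) lines of a unit file, sorted.
active_lines() { grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | sort; }

# Directives present in unit file $2 but missing from unit file $1.
extra_directives() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    active_lines "$1" > "$tmp_a"
    active_lines "$2" > "$tmp_b"
    comm -13 "$tmp_a" "$tmp_b"   # keep only lines unique to the second file
    rm -f "$tmp_a" "$tmp_b"
}
```

For example, "extra_directives good-host.service bad-host.service", with copies of the file taken from each host (the file names here are placeholders). "systemctl cat boinc-client.service" prints the unit together with any drop-ins, which helps to confirm that an edit is the one systemd actually loads.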
ID: 908 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 909 - Posted: 24 Jun 2020, 18:04:07 UTC - in response to Message 908.  
Last modified: 24 Jun 2020, 18:07:47 UTC

@crashtech:
It looks like you have three "good" hosts with Mint 19.3 and boinc version 7.9.3,
and two "bad" hosts with Mint 19.3 and boinc version 7.17.0.
Right?

(On the other hand, when I look at wingmen of my own results, there are circa two hosts which are persistently spamming the project recently with bogus few-seconds results, and these two hosts have Mint 19.3 and boinc version 7.9.3. Their owner is anonymous, hence we have no way to wake up the pilot.)
ID: 909 · Rating: 0
crashtech

Joined: 9 Dec 19
Posts: 11
Credit: 19,162,966
RAC: 0
Message 910 - Posted: 26 Jun 2020, 20:00:35 UTC - in response to Message 909.  

@crashtech:
It looks like you have three "good" hosts with Mint 19.3 and boinc version 7.9.3,
and two "bad" hosts with Mint 19.3 and boinc version 7.17.0.
Right?

(On the other hand, when I look at wingmen of my own results, there are circa two hosts which are persistently spamming the project recently with bogus few-seconds results, and these two hosts have Mint 19.3 and boinc version 7.9.3. Their owner is anonymous, hence we have no way to wake up the pilot.)

I'm pretty sure those are two client instances on the same host.
ID: 910 · Rating: 0
crashtech

Joined: 9 Dec 19
Posts: 11
Credit: 19,162,966
RAC: 0
Message 919 - Posted: 2 Jul 2020, 16:06:01 UTC

@xii5ku , I'm out of ideas on this one. Thanks for your help, though.
ID: 919 · Rating: 0

©2024 Benoit DA MOTA - LERIA, University of Angers, France