Suspicious near-instant results with NWChem long t4
Author | Message |
---|---|
Joined: 7 Nov 19 Posts: 31 Credit: 4,245,903 RAC: 0 |
Quote: Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be issues with the TCP port which MPI (Open MPI?) uses.
What do you mean by a full /tmp? 0 bytes? This morning I had 600 MB of free space. I deleted some log files and now it is 3.2 GB.
Quote: I have one nwchem_long task running so far, and this for example occupies the port 38253.
2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553 2084
2383937536.0;tcp://192.168.1.6,192.168.1.15,172.17.0.1:47451 10730
3311796224.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:39214 25236
3311337472.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:55077 25261
Random ports, I guess. I will check the ports of failed tasks if it happens again and if that is still possible after a failure. Otherwise we need to log the ports in use.
Quote: But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
I don't know. I thought it was unrelated to this problem. |
Joined: 7 Nov 19 Posts: 31 Credit: 4,245,903 RAC: 0 |
me wrote:
2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553 2084
2383937536.0;tcp://192.168.1.6,192.168.1.15,172.17.0.1:47451 10730
3311796224.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:39214 25236
3311337472.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:55077 25261
Note that 192.168.1.6 is the eth0 IP and 192.168.1.4 is the wlan0 IP. 192.168.1.15 is a wlan0 IP too, but that task is the oldest one and has been running for 11 hours. |
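For anyone who wants to log the ports as suggested above, here is a minimal sketch that pulls the port number out of entries shaped like the quoted ones. The helper name `extract_port` is made up for illustration, and the assumed layout (`<id>;tcp://<ip>[,<ip>...]:<port> <number>`) is inferred only from the four lines in this thread.

```shell
#!/bin/sh
# Sketch: extract the TCP port from one entry of the form
#   <id>;tcp://<ip>[,<ip>...]:<port> <number>
# extract_port is a hypothetical helper name, not part of BOINC or Open MPI.
extract_port() {
    # Greedy ".*:" skips to the last colon; keep the digits up to the next space.
    printf '%s\n' "$1" | sed -n 's/.*:\([0-9][0-9]*\) .*/\1/p'
}

# Example with one of the entries quoted in this thread.
line='2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553 2084'
echo "port=$(extract_port "$line")"
```

To compare against what is actually listening on the host, `ss -tlnp` lists listening TCP sockets together with the owning processes (process names need sufficient privileges).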
Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
Luigi R. wrote: xii5ku wrote: On my host, each nwchem_long task takes 8.2 MBytes in /tmp. (BTW, I completed three tasks by now, and out of these three, one did not remove its "pid.*" subdirectory in /tmp/ompi.*/.) Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be [...]
What do you mean for full /tmp? 0byte?
8.2 MBytes is obviously not much. If there is no space left in /tmp even for this small amount, the host may exhibit other serious problems outside of BOINC as well. |
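A quick way to check for leftover `ompi.*` directories like the one mentioned above is sketched below. The helper name `list_ompi_leftovers` is hypothetical, and the sketch assumes the directories sit directly under the given directory, as the `/tmp/ompi.*/` path in the post suggests.

```shell
#!/bin/sh
# list_ompi_leftovers DIR: print size (KiB) and path of every ompi.* directory
# found directly under DIR (default: /tmp). Hypothetical helper for illustration.
list_ompi_leftovers() {
    dir="${1:-/tmp}"
    for d in "$dir"/ompi.*; do
        # An unmatched glob stays literal, so skip anything that is not a directory.
        [ -d "$d" ] || continue
        du -sk "$d"
    done
}
```

Running `list_ompi_leftovers /tmp` while no task is active would then show whatever was not cleaned up. If the client runs as a systemd service with PrivateTmp=true, the service gets its own private /tmp, so the host's /tmp may not contain these directories at all.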
Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0 |
Taking your suggestions one at a time, it looks as if "df -h /tmp" is not doing what it is intended to do in this case, which is to give the size of /tmp. What the command does do, after further experimentation, is give the total usage of /dev/sda5, at least when executed on this particular host. It does this no matter which directory is given as a target:
ga7pxsl@GAX570UD_test:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       228G   37G  180G  17% /
ga7pxsl@GAX570UD_test:~$ df -h /home/ga7pxsl
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       228G   37G  180G  17% /
ga7pxsl@GAX570UD_test:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       228G   37G  180G  17% /
It does do something different if no target directory is given, which might provide a clue to someone who knows something:
ga7pxsl@GAX570UD_test:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  2.0M  3.2G   1% /run
/dev/sda5       228G   37G  180G  17% /
tmpfs            16G  208K   16G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/sda1       511M  6.1M  505M   2% /boot/efi
tmpfs           3.2G   32K  3.2G   1% /run/user/1000
But, looking at /tmp in the graphical file manager (the thing I sort of know how to use) as root, the Properties tab tells me there is less than 100KB in /tmp.
Quote: Or the boinc-client service on this host is set up in a way which does not permit it to create files outside of its data directory, or at least not in /tmp. What does /lib/systemd/system/boinc-client.service contain on this host?
[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target
[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
[Install]
WantedBy=multi-user.target
|
Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
@crashtech, "df" reports "file system disk space usage", i.e. the used space and available space in the filesystem in which the optionally given file or directory resides. My main intention was to verify how much free space is left in your /tmp. We now know that there is plenty of space left in it. (There are 180 GBytes available in /tmp.)
As for the boinc-client.service unit file: Compared with the boinc-client.service file on my computers, yours has several extra lines. The following four, explained in "man systemd.exec", stick out to me:
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
For ProtectSystem: either relax this from strict to full, or append
-/tmp
to the ReadWritePaths line.
|
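The difference between the two questions ("which filesystem holds this path, and how full is it?" versus "how much does this directory itself contain?") can be sketched as follows. The helper names `mount_point_of` and `dir_usage_kb` are made up for illustration; `df -P` and `du -sk` are the standard POSIX forms.

```shell
#!/bin/sh
# mount_point_of PATH: print the mount point of the filesystem that holds PATH.
# (Hypothetical helper; assumes the mount point contains no spaces.)
mount_point_of() {
    # -P forces one line per filesystem; the mount point is the last field.
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# dir_usage_kb PATH: print the space used by PATH and its contents, in KiB.
dir_usage_kb() {
    # Unreadable subdirectories are skipped silently, so the number is a lower bound.
    du -sk "$1" 2>/dev/null | awk '{ print $1 }'
}

echo "/tmp lives on the filesystem mounted at: $(mount_point_of /tmp)"
echo "/tmp itself uses: $(dir_usage_kb /tmp) KiB"
```

On a host where /tmp is not a separate mount, the first line prints `/`, which matches the observation that `df -h /tmp`, `df -h /home/...` and `df -h /` all report the same /dev/sda5 figures.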
Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0 |
xii5ku wrote: @crashtech, "df" reports "file system disk space usage", i.e. the used space and available space in the filesystem in which the optionally given file or directory resides. My main intention was to verify how much free space is left in your /tmp. We now know that there is plenty of space left in it. (There are 180 GBytes available in /tmp.)
Thank you xii5ku! First I appended -/tmp to the ReadWritePaths line and rebooted, but QuChemPedIA would not run. Then I changed "strict" to "full" and rebooted, but it still won't run! It's a real puzzle. |
Joined: 7 Nov 19 Posts: 31 Credit: 4,245,903 RAC: 0 |
Luigi R. wrote: Maybe it's OT, but I found _bin_bash.1000.crash in /var/crash about the last bash crash.
P.S. please, don't care about errors. They are caused by bash crashes and I solved it with os restart. ;)
But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
https://pastebin.com/j70fnPxW |
Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
@crashtech, in addition to ProtectSystem=full, you could try: PrivateTmp=false |
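One way to try these two settings without editing the packaged unit file is a systemd drop-in. The sketch below only writes the drop-in into a scratch directory so that nothing on the system changes. The helper name `write_boinc_override` is hypothetical; the two directives are the ones suggested in this thread, and whether they resolve the problem on the affected host is not established here.

```shell
#!/bin/sh
# write_boinc_override DIR: write a drop-in with the two relaxations discussed
# in this thread into DIR/override.conf. Hypothetical helper for illustration.
write_boinc_override() {
    mkdir -p "$1" || return 1
    cat > "$1/override.conf" <<'EOF'
[Service]
ProtectSystem=full
PrivateTmp=false
EOF
}

# Dry run into a scratch directory, then show the result.
scratch=$(mktemp -d)
write_boinc_override "$scratch"
cat "$scratch/override.conf"

# To apply for real (as root), the standard drop-in directory would be
#   /etc/systemd/system/boinc-client.service.d
# followed by:
#   systemctl daemon-reload
#   systemctl restart boinc-client.service
# and the effective values can be checked with:
#   systemctl show boinc-client.service -p ProtectSystem -p PrivateTmp -p ReadWritePaths
```

`systemctl cat boinc-client.service` prints the unit file together with any drop-ins, which helps confirm which copy of the unit is actually in effect after an edit.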
Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0 |
xii5ku wrote: @crashtech, in addition to ProtectSystem=full, you could try: PrivateTmp=false
Done, still nothing! One of the other things I tried was comparing boinc-client.service on a working host with the one on the non-working host, and commenting out all of the extra lines that are found in the non-working one. That also did not work. The temptation for me is to move my BOINC data directories to temporary storage, then "nuke and pave" the installation and start fresh. I realize that is more something out of the Windows noob playbook and is possibly offensive to a Linux pro. |
Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
@crashtech: It looks like you have three "good" hosts with Mint 19.3 and boinc version 7.9.3, and two "bad" hosts with Mint 19.3 and boinc version 7.17.0. Right? (On the other hand, when I look at wingmen of my own results, there are circa two hosts which are persistently spamming the project recently with bogus few-seconds results, and these two hosts have Mint 19.3 and boinc version 7.9.3. Their owner is anonymous, hence we have no way to wake up the pilot.) |
Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0 |
@crashtech: I'm pretty sure those are two client instances on the same host. |
Joined: 9 Dec 19 Posts: 11 Credit: 19,162,966 RAC: 0 |
@xii5ku , I'm out of ideas on this one. Thanks for your help, though. |
©2024 Benoit DA MOTA - LERIA, University of Angers, France