Posts by Luigi R.

21) Message boards : Number crunching : Please Don't Abort WUs (Message 933)
Posted 13 Jul 2020 by Luigi R.
Post:
Finally, the server has started resending my old inconclusive workunits. :)
22) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 905)
Posted 22 Jun 2020 by Luigi R.
Post:
Luigi R. wrote:
P.S. Please ignore the errors. They were caused by bash crashes, which I solved by restarting the OS. ;)
But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
This may be off-topic, but I found _bin_bash.1000.crash in /var/crash from the last bash crash:
https://pastebin.com/j70fnPxW
23) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 898)
Posted 21 Jun 2020 by Luigi R.
Post:
me wrote:
2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553
2084

2383937536.0;tcp://192.168.1.6,192.168.1.15,172.17.0.1:47451
10730

3311796224.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:39214
25236

3311337472.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:55077
25261

Note that 192.168.1.6 is the eth0 IP and 192.168.1.4 is the wlan0 IP.
192.168.1.15 was also a wlan0 IP, but that task is the oldest one and has been running for 11 hours.
24) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 897)
Posted 21 Jun 2020 by Luigi R.
Post:
Besides a full /tmp, or lacking access permissions to /tmp, another potential problem source could be issues with the TCP port which MPI (Open MPI?) uses.
What do you mean by a full /tmp? 0 bytes free?
This morning I had 600 MB of free space.
I deleted some log files and now it is 3.2 GB.


I have one nwchem_long task running so far, and this for example occupies the port 38253.
This may show you what ports are (or were) in use:
cat /tmp/ompi.*/pid.*/contact.txt
So, maybe those who had failures after a few seconds run time had some conflict which prevented the use of the TCP port?
2950037504.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:60553
2084

2383937536.0;tcp://192.168.1.6,192.168.1.15,172.17.0.1:47451
10730

3311796224.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:39214
25236

3311337472.0;tcp://192.168.1.6,192.168.1.4,172.17.0.1:55077
25261

Random ports, I guess.
If it happens again, I will check the ports of the failed tasks, if that is still possible after the failure. Otherwise we will need to log the ports in use.
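A minimal Python sketch of such logging, assuming the contact.txt layout visible in the output above (first line `<job id>;tcp://<address>,...:<port>`, second line a PID). The glob pattern is the one from the quoted `cat` command; the function names and the log file name `ompi_ports.log` are hypothetical.

```python
import glob
import time


def parse_contact(text):
    """Parse one Open MPI contact.txt in the layout seen above (assumed format):
    line 1: '<job id>;tcp://<addr1>,<addr2>,...:<port>'
    line 2: '<pid>'
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    job_id, uri = lines[0].split(";", 1)
    if uri.startswith("tcp://"):
        uri = uri[len("tcp://"):]
    # The port follows the last colon; the addresses before it are comma-separated.
    addrs, port = uri.rsplit(":", 1)
    pid = int(lines[1]) if len(lines) > 1 else None
    return {"job_id": job_id, "addrs": addrs.split(","), "port": int(port), "pid": pid}


def log_ports(pattern="/tmp/ompi.*/pid.*/contact.txt", logfile="ompi_ports.log"):
    """Append one timestamped line per contact.txt currently found (hypothetical helper)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(logfile, "a") as log:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as f:
                    info = parse_contact(f.read())
            except (OSError, ValueError, IndexError):
                continue  # file vanished or was only partially written
            log.write(f"{stamp} {path} pid={info['pid']} port={info['port']} "
                      f"addrs={','.join(info['addrs'])}\n")
```

Calling `log_ports()` periodically (for example from cron) would record each task's port while it is still running, so it can be checked later if that task fails and its /tmp directory is cleaned up.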


But maybe those bash crashes were caused by nwchem_long not cleaning up properly.
I don't know. I thought it was unrelated to this problem.
25) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 895)
Posted 21 Jun 2020 by Luigi R.
Post:
This happened to me for the first time too.
My computer usually runs QuChemPedIA tasks successfully.
Yesterday I increased my work cache to 10 days, and by ~19:45 I had ~80 in-progress tasks downloaded.
This morning at 4 AM they all failed (they went to pending/invalid).

Use this result to see my host: https://quchempedia.univ-angers.fr/athome/result.php?resultid=2386836


P.S. Please ignore the errors. They were caused by bash crashes, which I solved by restarting the OS. ;)
26) Message boards : Number crunching : 2,5 days long and counting... (Message 874)
Posted 11 Jun 2020 by Luigi R.
Post:
So, since I use the official wrapper, the only explanation is that the error comes from the wrapper written by a David somewhere (BOINC team?)
Yeah, I did a Google search and there are plenty of such errors from different BOINC projects that use VirtualBox.

https://www.google.com/search?client=ubuntu&channel=fs&q=c%3A%5Cusers%5Cdavid%5Cdocuments%5Cboinc_git%5Cboinc%5Csamples%5Cvboxwrapper%5Cvbox_mscom_impl&ie=utf-8&oe=utf-8
27) Message boards : Number crunching : Please Don't Abort WUs (Message 865)
Posted 10 Jun 2020 by Luigi R.
Post:
I don't believe it is a real problem. The project server will resend aborted/errored WUs at the end of this batch.

@damotbe could answer better than me.


Anyway

Luigi R. wrote:
Why hasn't the server sent them to a third wingman yet, after 10 days? I have the same problem.
damotbe wrote:

I don't know... It's the official code that manages this part.
https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=85&postid=802#802
28) Message boards : Number crunching : Credits for t8=t4=t2=5.000 seriously? (Message 833)
Posted 6 May 2020 by Luigi R.
Post:
Hi Diplomat, I don't understand what (if anything) is wrong with your host.

So try editing your project preferences to get t1 tasks only, as ProDigit suggested.



If your RAC improves, you will know your problem is solved, or at least bypassed. ;)
29) Message boards : Number crunching : Credits for t8=t4=t2=5.000 seriously? (Message 820)
Posted 27 Apr 2020 by Luigi R.
Post:
Have you tried monitoring the process list? Are there 20 (8+8+4) nwchem processes running at 100%?
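On Linux, one way to count them is to read /proc directly. A minimal sketch, assuming the task processes have a name containing `nwchem` (that match string is an assumption; `top` or `ps` show the same information interactively). It lists each matching process with its thread count from the `Threads:` field of /proc/<pid>/status; it does not measure CPU usage, so `top` is still needed to see whether each one runs at 100%.

```python
import os


def nwchem_processes(proc_root="/proc", name="nwchem"):
    """Return (pid, command name, thread count) for each process whose name contains `name`."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not Linux, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process directories
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
            if name not in comm:
                continue
            threads = 0
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Threads:"):
                        threads = int(line.split()[1])
                        break
            found.append((int(entry), comm, threads))
        except OSError:
            continue  # process exited while we were reading it
    return found


if __name__ == "__main__":
    procs = nwchem_processes()
    print(f"{len(procs)} nwchem processes")
    for pid, comm, threads in procs:
        print(pid, comm, threads)
```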
30) Message boards : Number crunching : Credits for t8=t4=t2=5.000 seriously? (Message 805)
Posted 24 Apr 2020 by Luigi R.
Post:
It doesn't look like your CPU used 8 threads for the task circled in red, because its runtime equals its CPU time.
With several threads, runtime should be much lower than CPU time, as you can see on the t2 tasks.
One reason could be that your CPU is running too many processes/tasks, so each one claims 8 threads but effectively uses only 1.
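The runtime/CPU-time comparison can be made concrete: for a multithreaded task, CPU time divided by runtime approximates the average number of busy threads. A minimal sketch; the function name and the sample numbers are illustrative, not taken from the tasks discussed here.

```python
def effective_threads(cpu_time_s, run_time_s):
    """Average number of busy threads: total CPU time / wall-clock runtime."""
    if run_time_s <= 0:
        raise ValueError("runtime must be positive")
    return cpu_time_s / run_time_s


# A t8 task whose CPU time equals its runtime effectively used one thread.
print(effective_threads(3600, 3600))  # 1.0
# A t2 task with twice as much CPU time as runtime kept both threads busy.
print(effective_threads(7200, 3600))  # 2.0
```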
31) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 802)
Posted 23 Apr 2020 by Luigi R.
Post:
Why hasn't the server sent them to a third wingman yet, after 10 days? I have the same problem.



©2024 Benoit DA MOTA - LERIA, University of Angers, France