2,5 days long and counting...

Message boards : Number crunching : 2,5 days long and counting...
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 746 - Posted: 10 Apr 2020, 13:10:28 UTC

Current WU has been running for 2 days and 15 hours and counting.
It's at 99.070%, and the counter is moving very, very slowly.
Should I cancel it, or is this OK for a long WU on a 3.5-4 GHz CPU?

Application: NWChem long 0.19 (t1)
Name: BTXv2_athome_b3lyp-321gd_long,batch02,000001835,nwchem_long,1586073408
State: Running
Received: Tue 07 Apr 2020 01:41:44 PM EDT
Report deadline: Mon 06 Jul 2020 01:41:43 PM EDT
Estimated computation size: 500,000 GFLOPs
CPU time: 2d 14:00:57
CPU time since checkpoint: 00:00:03
Elapsed time: 2d 15:13:59
Estimated time remaining: 00:35:37
Fraction done: 99.070%
Virtual memory size: 1.28 GB
Working set size: 133.29 MB
Directory: slots/17
Process ID: 46850
Progress rate: 1.440% per hour
Executable: wrapper_26014_x86_64-pc-linux-gnu
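
As a rough illustration of why the "Estimated time remaining" figure can be misleading here, the sketch below extrapolates linearly from the elapsed time and fraction done listed above. The helper names `parse_duration` and `linear_remaining` are made up for this example, and the assumption that the client's estimate behaves like a simple linear extrapolation is only inferred from these numbers, not confirmed by the project.

```python
def parse_duration(text: str) -> int:
    """Convert a duration such as '2d 15:13:59' or '00:35:37' to seconds."""
    days = 0
    if "d" in text:
        # Split off the optional day count in front of hh:mm:ss.
        day_part, text = text.split("d")
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def linear_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Remaining seconds if the rest of the task ran at the average rate so far."""
    return elapsed_s * (1.0 - fraction_done) / fraction_done


elapsed = parse_duration("2d 15:13:59")          # "Elapsed time" from the post
remaining = linear_remaining(elapsed, 0.99070)   # "Fraction done" from the post
print(round(remaining), "seconds")
```

This comes out at about 2137 seconds, i.e. roughly the 00:35:37 shown in the task properties. If the fraction done barely moves near 100%, an estimate of this kind stays small no matter how long the task actually keeps running.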
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 748 - Posted: 10 Apr 2020, 21:43:47 UTC

I have the same questions as I am seeing something similar on my Windows system. My resource monitor says I am using 83% of my CPU, consistent with crunching on 5 of 6 cores.
A couple have been aborted but I hate to keep doing that.
Zalster

Joined: 16 Dec 19
Posts: 25
Credit: 11,938,843
RAC: 0
Message 749 - Posted: 10 Apr 2020, 21:53:00 UTC - in response to Message 746.  

Could be. I had a similar one that ran 15.5 hours on 8 threads. How many threads is it running on? I'm on Linux, so that helps some.
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 750 - Posted: 11 Apr 2020, 2:34:47 UTC - in response to Message 749.  

Thread count shouldn't matter, as each WU runs in its own thread.
But to answer the question: 24.
My other PC with 32 threads currently also has 2 WUs at 1+ days.
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 751 - Posted: 12 Apr 2020, 2:06:48 UTC

With no official representation here from QuChemPedIA, my question unanswered,
and no higher PPD allocated for WUs running longer than the usual 'long' ones,

I will abort any WU I see running past 1 day, unless someone official can assure me there's nothing wrong with these long WUs where the percentage counter doesn't seem to work.
We both shoot ourselves in the foot this way, but most 'long' WUs on a 3.5 GHz CPU run no longer than 17 hours.
Even on a 3 GHz CPU you'd finish a WU in 20 hours.

So cancelling it is!
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 755 - Posted: 12 Apr 2020, 12:42:10 UTC

I am going to let my "long" long tasks run, they are completing and validating.
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 763 - Posted: 13 Apr 2020, 23:03:31 UTC - in response to Message 755.  

I am going to let my "long" long tasks run, they are completing and validating.

Yes, but at 2,5 days, you could have crunched and validated 3 long WUs, and got triple the score.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 772 - Posted: 16 Apr 2020, 9:11:11 UTC - in response to Message 763.  

(copy/paste from other thread)

Since the lockdown, we have had to adapt all our teaching... I'm trying to catch up on the forum.

The problem of task duration has already been addressed many times! Computation times are unpredictable and highly variable. The project credits well to compensate for this inconvenience... Some rare short workunits may require more than 100 hours of calculation, but we cannot know this in advance. Cancelling units because they last too long is a behaviour that starts to impact the quality of our results! It is these borderline cases that will allow us in the future to train an artificial intelligence capable of predicting these cases and also capable of identifying stable chemical boundaries.

For long WUs, computation time can really be enormous! Maybe two weeks or more in extreme cases.
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 794 - Posted: 19 Apr 2020, 21:36:21 UTC
Last modified: 19 Apr 2020, 21:37:13 UTC

Thanks, but the percentage bar is not working correctly on those.
It behaves as it does on normal long units: it starts slowing down at around 80-90% and gets ever slower the closer it gets to 100%. At 99% or over, it doesn't appear to move at all.
I have had 2 units running for 3+ days now!
More and more of these are hogging my CPU queue: while they are processing, other WUs can't be processed.
You must find a solution so that, when this happens, the result is uploaded before these WUs stretch out indefinitely.
My electricity is valuable, and I don't want to spend a week running a task that in the end gets aborted!
js

Joined: 3 May 20
Posts: 2
Credit: 400
RAC: 0
Message 834 - Posted: 7 May 2020, 21:15:59 UTC - in response to Message 772.  

(copy/paste from other thread)

The problem of task duration has already been addressed many times! Computation times are unpredictable and highly variable. ... Some rare short workunits may require more than 100 hours of calculation, but we cannot know this in advance.


I'm aware of this, but I currently have a WU (od9_athome_b3lyp_321gd,batch78,000788954,nwchem,1581806840_6) that has been running for over 3 days (CPU time) and hardly uses any CPU at all any more! It just sits there, seemingly idling. When I pause all other jobs in BOINC, I see about 3% CPU usage from that WU, and no noticeable network or disk activity. This is on BOINC 7.14.4, VirtualBox 6.1.6, macOS 10.15.4 and an Intel i5-9600K CPU.

Please clarify whether (and why) it is still worthwhile keeping such a WU running. It doesn't look like any further progress can be expected from the WU.

Are there any logs that one could check?
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 835 - Posted: 7 May 2020, 23:54:50 UTC - in response to Message 834.  
Last modified: 8 May 2020, 0:06:25 UTC

When a WU shows as running for a long time but using no CPU, it should be aborted.
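
One way to turn that rule of thumb into something checkable is to take two readings of the task some minutes apart and compare CPU time and fraction done. The sketch below is only an illustration: the `Sample` class, the function names, the 5% threshold, and the sample numbers are all assumptions made for this example, not anything provided by BOINC or the project.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    wall_s: float         # wall-clock time of the reading, in seconds
    cpu_s: float          # "CPU time" shown for the task, in seconds
    fraction_done: float  # "Fraction done" as a value from 0.0 to 1.0


def cpu_utilisation(first: Sample, second: Sample) -> float:
    """Share of one core used by the task between the two readings."""
    wall = second.wall_s - first.wall_s
    if wall <= 0:
        raise ValueError("second sample must be taken after the first")
    return (second.cpu_s - first.cpu_s) / wall


def looks_stalled(first: Sample, second: Sample, min_cpu: float = 0.05) -> bool:
    """Flag a task only if it is both nearly idle and showing no progress.

    min_cpu is an assumed threshold; pick whatever suits your machine.
    """
    no_progress = second.fraction_done <= first.fraction_done
    return cpu_utilisation(first, second) < min_cpu and no_progress


# Illustrative readings ten minutes apart: about 3% CPU, no progress.
idle = looks_stalled(Sample(0, 0, 0.990), Sample(600, 18, 0.990))
# A slow task that is still using a full core and creeping forward.
busy = looks_stalled(Sample(0, 0, 0.990), Sample(600, 590, 0.991))
print(idle, busy)
```

Requiring both conditions is deliberate: a progress bar that sits at 99% while the task still uses a full core would not be flagged, since such tasks were reported in this thread to complete and validate.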
js

Joined: 3 May 20
Posts: 2
Credit: 400
RAC: 0
Message 838 - Posted: 12 May 2020, 10:45:45 UTC

Alright, thanks for the info. This is a bit problematic for the project, though: If people don't bother to actively and regularly check for issues like this, the project will fall behind in productivity since the allocated time slots on users' machines won't be used productively.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 839 - Posted: 12 May 2020, 13:16:05 UTC

I am not seeing these problems, maybe because I don't look too hard for them. But I don't need to.
I have zero errors at the moment. My longest "long" task is only about 2 days on a Ryzen 2600 (Ubuntu 18.04.4).
https://quchempedia.univ-angers.fr/athome/results.php?hostid=1814&offset=0&show_names=0&state=4&appid=3

But I run only t1, which I have found to help in the past.

If you have errors, there may be machine problems, such as those due to overclocking, memory, etc.
They might cause hang-ups too?
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 840 - Posted: 14 May 2020, 15:17:58 UTC - in response to Message 839.  

I have now switched over to a Ryzen 3900X.
We will see how it goes.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=2137
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 845 - Posted: 19 May 2020, 12:01:28 UTC - in response to Message 840.  

I did get a long runner. It ran for 4 days 10 hours.
I was not sure it would complete, but since it was still making progress (slowly), I let it run.
It is now PV.
https://quchempedia.univ-angers.fr/athome/result.php?resultid=2335263

I am now on an i7-8700, and will see if the CPU type makes a difference.
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 846 - Posted: 21 May 2020, 3:43:18 UTC

I got about 7 WUs of over 3.5 days on a Ryzen 3900X, and several on my 3950X.
I would plead for such long runners to be given a higher PPD.
They're worth more than the 5000 credit assigned to them!
damienh

Joined: 5 Jan 20
Posts: 7
Credit: 100,435,425
RAC: 0
Message 857 - Posted: 8 Jun 2020, 17:38:04 UTC
Last modified: 8 Jun 2020, 17:41:08 UTC

My record is a currently-running WU that is registering 13d 16h run time so far. Based on damotbe's feedback, this is probably very unusual but also not unexpected ... ?
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 866 - Posted: 10 Jun 2020, 13:18:18 UTC - in response to Message 857.  

Indeed, it can happen and be normal.
Henk Haneveld

Joined: 6 Nov 19
Posts: 8
Credit: 156,845
RAC: 0
Message 871 - Posted: 11 Jun 2020, 10:32:16 UTC

I don't mind the long runtime. However, what is really bad is the lack of checkpoints.

Yesterday my running result was at a runtime of over 2 days. Because of a system reboot it jumped back to a runtime of about 8 hrs and started again from that point.
Henk Haneveld

Joined: 6 Nov 19
Posts: 8
Credit: 156,845
RAC: 0
Message 872 - Posted: 11 Jun 2020, 12:20:20 UTC - in response to Message 871.  
Last modified: 11 Jun 2020, 12:36:02 UTC

I don't mind the long runtime. However, what is really bad is the lack of checkpoints.

Yesterday my running result was at a runtime of over 2 days. Because of a system reboot it jumped back to a runtime of about 8 hrs and started again from that point.

Edit to post:

After some checking, I found this in the stderr.txt file:

2020-06-10 12:45:10 (224): Restore from previously saved snapshot.
2020-06-10 12:45:10 (224): Error 0x80010105 in vbox52::VBOX_VM::restore_snapshot (c:\users\david\documents\boinc_git\boinc\samples\vboxwrapper\vbox_mscom_impl.cpp:1835)
2020-06-10 12:45:10 (224): Error: Getting Error Info! hr = 0x1

User david does not exist on my system. Shouldn't it be the user ID of the person running the BOINC program on the local system?

Why is the snapshot saved in a documents directory? It should be in a slot directory under the Boinc_data directory.
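
For anyone wanting to scan a stderr.txt for lines like these, here is a minimal sketch that pulls out the error code, function name, and source location. The regular expressions and the `find_errors` helper are assumptions written against the three lines quoted above only; they are not an official log format. Note that the parenthesised path ends in a .cpp file name and a line number, so it is most likely the source location recorded when the wrapper was built rather than a folder on the local system, though that reading is an assumption as well.

```python
import re

# Layout inferred from the quoted lines: "<date> <time> (<pid>): <message>"
LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): (?P<msg>.*)$"
)
# "Error <hex code> in <function> (<source file>:<line number>)"
ERROR = re.compile(
    r"Error (?P<code>0x[0-9A-Fa-f]+) in (?P<func>\S+) \((?P<src>.+):(?P<line>\d+)\)"
)

SAMPLE = r"""
2020-06-10 12:45:10 (224): Restore from previously saved snapshot.
2020-06-10 12:45:10 (224): Error 0x80010105 in vbox52::VBOX_VM::restore_snapshot (c:\users\david\documents\boinc_git\boinc\samples\vboxwrapper\vbox_mscom_impl.cpp:1835)
2020-06-10 12:45:10 (224): Error: Getting Error Info! hr = 0x1
"""


def find_errors(text: str) -> list:
    """Return one dict per log line that carries an error code and source location."""
    errors = []
    for raw in text.splitlines():
        line = LINE.match(raw.strip())
        if line is None:
            continue  # blank or differently formatted line
        hit = ERROR.search(line.group("msg"))
        if hit is not None:
            errors.append({
                "time": line.group("stamp"),
                "code": hit.group("code"),
                "function": hit.group("func"),
                "source_file": hit.group("src"),
                "source_line": int(hit.group("line")),
            })
    return errors


errors = find_errors(SAMPLE)
for entry in errors:
    print(entry["time"], entry["code"], entry["function"], entry["source_line"])
```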

©2024 Benoit DA MOTA - LERIA, University of Angers, France