Short Tasks run for 14 hours and counting.

Message boards : Number crunching : Short Tasks run for 14 hours and counting.

Dayle Diamond

Joined: 6 Apr 20
Posts: 5
Credit: 0
RAC: 0
Message 725 - Posted: 7 Apr 2020, 13:17:41 UTC
Last modified: 7 Apr 2020, 13:28:10 UTC

Hello,

I recently joined QuChemPedIA@Home.
I'm running a 32-thread Threadripper 1950X with 32 GB of RAM, and I've set my priority to 1%, just to work out the bugs before committing more cores.

So naturally BOINC downloaded 28 tasks and eventually tried to run them all at once.
Everything looked okay when I went to take a nap yesterday, with several scheduled to wrap up within fifteen minutes of my departure.

I awoke to a mess. Nothing was finished.
Some tasks were paused, waiting for more RAM, even though they only used 30 MB apiece.
Most had been running for many, many hours. The two that were nearest completion were still nearest completion, with a "minute" left.
The seconds count down more slowly as they reach the finish line, as if algorithmically getting "nearly there" but never actually reaching 100%.

I've written this whole post while od9_athome_b3lyp-321gd,batch82,000822664,nwchem,1582561504_0 has eleven seconds left.
I'd just restart it, but when I checked when it last saved a checkpoint, it said:

CPU time since checkpoint 00:00:00
Elapsed time 14:36:51
Estimated time remaining 00:00:11


Which is nonsense. Certainly they're not idle, as my CPU usage is hovering around 95%.
I'd just scrap the "broken" unit but this appears to be happening to all of them? Or at least certainly to the "short" tasks.

I've got 3 days, 12 hours and 54 minutes of CPU time racked up on these suckers, and if they're VALID but unintentionally large files, I'd hate to spend all this time and energy only to have the next person do it too.

And if they're errors, from what I gather, it's better they fail than that the data I'm holding be discarded.

Please advise as soon as you can!
ID: 725 · Rating: 0

ProDigit
Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 726 - Posted: 7 Apr 2020, 18:35:16 UTC - in response to Message 725.  
Last modified: 7 Apr 2020, 18:51:36 UTC

You say 'short tasks', but I only know of standard and 'long' tasks they're sending out.
The standard tasks take about 1 hour to finish on a 3.5-4 GHz processor.
Then there are 'long' tasks, which the manager displays as 'NWChem long 0.19 (t1)'.
On a 1950X CPU those should run an average of about 20 hours.
If your CPU is 100% occupied, it means the tasks are still crunching.

There are two things I would look at.
First, can you check your CPU temperature? If it's above 80C you may have a cooling issue that causes the CPU to lower its frequency. Above 90-95C the CPU would run at a very low speed (around 700 MHz or so).
In that case, either increase cooling capacity or decrease the CPU voltage/frequency.

Second, I would check whether you're running any GPUs. If you're using 32 out of 32 threads for QuChemPedIA but have a misconfigured GPU project running (e.g. one that claims 0.2 CPU but uses much more than that), you may overload your CPU. BOINC tells me you're running Win 10 with 1 GPU; in that case, you could either reconfigure the GPU project or just set your CPU usage to 99% instead of 100%.
That should give the CPU enough breathing room, provided the overload comes from a misconfigured GPU project.
ID: 726 · Rating: 0

Dayle Diamond
Joined: 6 Apr 20
Posts: 5
Credit: 0
RAC: 0
Message 727 - Posted: 7 Apr 2020, 19:48:32 UTC - in response to Message 726.  
Last modified: 7 Apr 2020, 19:57:54 UTC

My CPU is a steady 64C.
To limit operational costs, I have not had a GPU project attached on this machine in a while.

I say short because to me, an hourlong task is short, and to contrast them with the longer tasks.
The worst of the 'short' tasks is now at zero seconds remaining, and "waiting to run" after 19 hours, 44 minutes. It is listed as 99.998 percent complete.
Computer is doing some WCG tasks to reflect my project priorities.

Edit: If I could make it finish ASAP and get a result file, I would.
I'm temporarily suspending network activity to drain the queue so this task is re-prioritized ASAP.

If you have any other ideas in the meantime, I'm open to hearing them.
ID: 727 · Rating: 0

swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 728 - Posted: 7 Apr 2020, 20:47:16 UTC - in response to Message 727.  
Last modified: 7 Apr 2020, 21:23:51 UTC

Regular tasks that run up to 99.999% and then sit for hours should be aborted. Look through the other threads in the number crunching board and you'll find many discussions about why.
If you look at how others have fared with these WUs, they've already been aborted once or will not validate.
Personally, on my system, I never let an od9 work unit run longer than 6 hours and I haven't been burned yet.
ID: 728 · Rating: 0

Dayle Diamond
Joined: 6 Apr 20
Posts: 5
Credit: 0
RAC: 0
Message 731 - Posted: 8 Apr 2020, 9:56:39 UTC

Well, the long work units started to run longer than average too.

I'm disconnecting from the project and aborting my units.
Days and days of work with 0 results is too much to ask.

I hope QuChemPedIA works out, but if glitched files with infinite runtimes are normal, the person sending them should be responding to errors rather than letting fellow volunteers guess at the problem.
ID: 731 · Rating: 0

Zalster
Joined: 16 Dec 19
Posts: 25
Credit: 11,938,843
RAC: 0
Message 732 - Posted: 8 Apr 2020, 14:38:19 UTC
Last modified: 8 Apr 2020, 14:46:55 UTC

I have to wonder if it's your machine and not the long task.
ID: 732 · Rating: 0

Dayle Diamond
Joined: 6 Apr 20
Posts: 5
Credit: 0
RAC: 0
Message 733 - Posted: 8 Apr 2020, 17:04:28 UTC - in response to Message 732.  

You "have to wonder" because the project administrator is staying silent.

My machine has been reliable with quite a few different BOINC tasks over the years.
The only constant I can tell is that in every project forum there's always somebody who appoints themselves defender of the status quo.
ID: 733 · Rating: 0

Alien Seeker
Joined: 5 Mar 20
Posts: 13
Credit: 805,400
RAC: 0
Message 734 - Posted: 8 Apr 2020, 17:18:52 UTC - in response to Message 733.  

"My machine has been reliable with quite a few different BOINC tasks over the years."


It may not be your machine as such but a combination of your machine, VirtualBox and QuChemPedIA tasks. If you have a look at the forums, you'll see VirtualBox has caused a lot of headaches.
ID: 734 · Rating: 0

Zalster
Joined: 16 Dec 19
Posts: 25
Credit: 11,938,843
RAC: 0
Message 735 - Posted: 8 Apr 2020, 19:46:24 UTC - in response to Message 734.  

Plus, that is a first-gen Ryzen, which has issues with memory access. I agree VirtualBox is also highly flawed.
ID: 735 · Rating: 0

ProDigit
Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 740 - Posted: 9 Apr 2020, 8:34:38 UTC - in response to Message 735.  
Last modified: 9 Apr 2020, 8:47:16 UTC

"I agree VirtualBox is also highly flawed."

And slow!
I prefer to just install Ubuntu on a USB drive and install BOINC there.

As another user asked, did you verify that VT (for VirtualBox) and XMP are enabled in the BIOS, and that you have enough RAM to run that many threads? (You'll need about 24-32 GB of RAM if you want to run 32 threads of QuChemPedIA tasks.)

One of the reasons mine was running so slowly was that I had only 16 GB of RAM for 24 threads, and it was doing a lot of disk access (about 2 GB of swap).
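
To put rough numbers on that, here is a minimal Python sketch of the arithmetic. It assumes the 24-32 GB figure for 32 threads scales linearly, i.e. about 0.75 to 1 GB per task; that scaling and the function names are only an illustration, not anything published by the project.

```python
from __future__ import annotations

# Rough RAM planning for QuChemPedIA tasks.
# Assumption: the 24-32 GB quoted for 32 threads scales linearly per task.
GB_PER_TASK_LOW = 24 / 32   # 0.75 GB per task
GB_PER_TASK_HIGH = 32 / 32  # 1.0 GB per task


def ram_range_gb(threads: int) -> tuple[float, float]:
    """Return the (low, high) RAM estimate in GB for a number of concurrent tasks."""
    return threads * GB_PER_TASK_LOW, threads * GB_PER_TASK_HIGH


def max_tasks_for_ram(ram_gb: float) -> int:
    """Conservative number of concurrent tasks, using the high per-task estimate."""
    return int(ram_gb // GB_PER_TASK_HIGH)


if __name__ == "__main__":
    for threads in (24, 32):
        low, high = ram_range_gb(threads)
        print(f"{threads} threads: about {low:.0f}-{high:.0f} GB")
    print("16 GB fits about", max_tasks_for_ram(16), "tasks")
```

By this estimate 24 threads would want roughly 18-24 GB, which is consistent with a 16 GB machine ending up in swap. The operating system and other projects need memory too, so treat the result as an upper limit on tasks rather than a target.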
ID: 740 · Rating: 0

ProDigit
Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 752 - Posted: 12 Apr 2020, 2:13:35 UTC
Last modified: 12 Apr 2020, 2:13:50 UTC

After facing a few of my own (not VirtualBox, just native Linux),
it appears that some WUs created by QCP (including long ones) slow down in the last 10% or so, to the point where they take 'days' to complete.

Without any official word from QCP:
If a short WU runs for more than 2 hours on a 3.5 GHz CPU, or more than 4 hours on a 2 GHz CPU,
I personally would cancel the WU.
The normal runtimes for short WUs are 1 hour on a 3.5-4 GHz CPU and 2 hours on a 2 GHz CPU.
I'd give them the benefit of no more than twice the runtime.

If any long WU runs for more than 1 day on a 3.5 GHz CPU, or 1.5 days on a 2 GHz CPU, I would do the same.
A typical long WU runs about 17 hours on a 3.5-4 GHz CPU and should run about 20 hours on a 2 GHz CPU.
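
For anyone who wants to apply that rule of thumb consistently, it could be sketched in Python roughly like this. The thresholds are the ones given in this post; the 3.5 GHz cutoff between the two CPU classes, and all the names, are assumptions made for the example. This is a personal heuristic, not an official QCP policy.

```python
# Abort thresholds in hours, taken from the rule of thumb in this post.
# Keys are (task kind, CPU class); the two CPU classes are an assumption.
ABORT_AFTER_HOURS = {
    ("short", "fast"): 2.0,   # about twice the 1 hour normal runtime
    ("short", "slow"): 4.0,   # about twice the 2 hour normal runtime
    ("long", "fast"): 24.0,   # 1 day
    ("long", "slow"): 36.0,   # 1.5 days
}


def cpu_class(clock_ghz: float) -> str:
    """Treat 3.5 GHz and above as 'fast', anything below as 'slow' (assumed cutoff)."""
    return "fast" if clock_ghz >= 3.5 else "slow"


def past_abort_threshold(kind: str, elapsed_hours: float, clock_ghz: float) -> bool:
    """Return True if a task has run longer than the rule of thumb allows."""
    return elapsed_hours > ABORT_AFTER_HOURS[(kind, cpu_class(clock_ghz))]


if __name__ == "__main__":
    # A short task that has been running for 14.6 hours on a 3.5 GHz CPU.
    print(past_abort_threshold("short", 14.6, 3.5))
    # A long task that has been running for 20 hours on a 2 GHz CPU.
    print(past_abort_threshold("long", 20.0, 2.0))
```

The lookup table keeps the four limits in one place, so they are easy to adjust if the project ever publishes its own guidance.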
ID: 752 · Rating: 0

damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 768 - Posted: 16 Apr 2020, 8:54:53 UTC - in response to Message 752.  

Since the lockdown, we have had to adapt all our teaching... I'm trying to catch up on the forum.

The problem of task duration has already been addressed many times! Computation times are unpredictable and highly variable. The project awards generous credit to compensate for this inconvenience... Some rare short workunits may require more than 100 hours of calculation, but we cannot know this in advance. Cancelling units because they run too long is a behaviour that is starting to affect the quality of our results! It is these borderline cases that will allow us in the future to train an artificial intelligence capable of predicting them and also capable of identifying stable chemical boundaries.
ID: 768 · Rating: 0


©2024 Benoit DA MOTA - LERIA, University of Angers, France