Short Tasks run for 14 hours and counting.
Author | Message |
---|---|
Joined: 6 Apr 20 Posts: 5 Credit: 0 RAC: 0 |
Hello, I recently joined QuChemPedIA@Home. I'm running a 32-thread Threadripper 1950X with 32 GB of RAM, and I've set my priority to 1%, just to work out the bugs before committing more cores. So naturally BOINC downloaded 28 tasks and eventually tried to run them all at once. Everything looked okay when I went to take a nap yesterday, with several scheduled to wrap up within fifteen minutes of my departure. I awoke to a mess. Nothing was finished. Some tasks were paused, waiting for more RAM, even though they only used 30 MB apiece. Most had been running for many, many hours. The two that were nearest completion were still nearest completion, with a "minute" left. The seconds count down more slowly as they reach the finish line, as if algorithmically getting "nearly there" but never actually reaching 100%. I've written this whole post while od9_athome_b3lyp-321gd,batch82,000822664,nwchem,1582561504_0 has had eleven seconds left. I'd just restart, but when I checked when it last saved, it says: CPU time since checkpoint 00:00:00; Elapsed time 14:36:51; Estimated time remaining 00:00:11. Which is nonsense. Certainly they're not idle, as my CPU usage is hovering around 95%. I'd just scrap the "broken" unit, but this appears to be happening to all of them? Or at least certainly to the "short" tasks. I've got 3 days, 12 hours and 54 minutes of CPU time racked up on these suckers, and if they're VALID but unintentionally large files, I'd hate to spend all this time and energy only to have the next person do it too. And if they're errors, from what I gather, it's better they fail than that the data I'm holding be discarded. Please advise as soon as you can! |
Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
You say 'short tasks', but I only know of the standard and 'long' tasks they're sending out. The standard tasks take about 1 hour to finish on a 3.5-4 GHz processor. Then there are 'long' tasks, which the manager displays as 'NWChem long 0.19 (t1)' and which should average about 20 hours on a 1950X CPU. If your CPU threads are 100% occupied, it means they're still crunching. There are 2 things I would look at. First, can you check your CPU temperature? If it's above 80C, you may have a cooling issue that causes the CPU to lower its frequency. Above 90-95C the CPU would run at a very low speed (like 700 MHz or so). In that case, either increase cooling capability or decrease CPU voltage/frequency. Second, I would check whether you're running any GPU projects. If you're using 32 out of 32 threads for QuChemPedIA but have a misconfigured GPU project running (e.g. one that claims 0.2 CPU but uses much more than that), you may overload your CPU. BOINC tells me you're running Win 10 and 1 GPU; in that case, you could either reconfigure the GPU project or just set your CPU usage to 99% instead of 100%. That should give the CPU enough breathing room, provided the cause is a misconfigured GPU project overloading the CPU. |
Joined: 6 Apr 20 Posts: 5 Credit: 0 RAC: 0 |
My CPU is a steady 64C. To limit operational costs, I have not had a GPU project attached on this machine in a while. I say short because to me, an hourlong task is short, and to contrast them with the longer tasks. The worst of the 'short' tasks is now at zero seconds remaining, and "waiting to run" after 19 hours, 44 minutes. It is listed as 99.998 percent complete. Computer is doing some WCG tasks to reflect my project priorities. Edit: If I could make it finish ASAP and get a result file, I would. I'm temporarily suspending network activity to drain the queue so this task is re-prioritized ASAP. If you have any other ideas in the meantime, I'm open to hearing them. |
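One way to see why a near-zero "time remaining" can coexist with a task that never finishes is to project the remaining time from elapsed time and fraction done. The sketch below is an assumption for illustration, not BOINC's actual estimator, and the function name is hypothetical. It uses the figures quoted in the post above.

```python
def projected_remaining_seconds(elapsed_seconds: float, fraction_done: float) -> float:
    """Project remaining time, assuming progress continues at the average rate so far.

    Illustrative assumption only; the real client may blend several estimates.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds * (1.0 - fraction_done) / fraction_done


# Figures from the post above: 19 h 44 min elapsed, 99.998 percent complete.
elapsed = 19 * 3600 + 44 * 60
remaining = projected_remaining_seconds(elapsed, 0.99998)
print(f"{remaining:.1f} s projected remaining")
```

Under this assumption the projection stays within a second or two however long the task has really been running, because the reported fraction done is pinned just below 100%. The estimate reflects the progress figure the application reports, not the work actually left.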
Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
Regular tasks that run up to 99.999% and then sit for hours should be aborted. Look through the other threads in the number crunching board and you'll find many discussions about why. If you look at how others have fared with these WUs, they've already been aborted once or will not validate. Personally, on my system, I never let an od9 work unit run longer than 6 hours and I haven't been burned yet. |
Joined: 6 Apr 20 Posts: 5 Credit: 0 RAC: 0 |
Well, the long work units started to run longer than average too. I'm disconnecting from the project and aborting my units. Days and days of work with 0 results is too much to ask. I hope QuChemPedIA works out, but if glitched files with infinite runtimes are normal, the person sending them should be responding to errors rather than letting fellow volunteers guess at the problem. |
Joined: 16 Dec 19 Posts: 25 Credit: 11,938,843 RAC: 0 |
I have to wonder if it's your machine and not the long task. |
Joined: 6 Apr 20 Posts: 5 Credit: 0 RAC: 0 |
You "have to wonder" because the project administrator is staying silent. My machine has been reliable with quite a few different BOINC tasks over the years. The only constant I can tell is that in every project forum there's always somebody who appoints themselves defender of the status quo. |
Joined: 5 Mar 20 Posts: 13 Credit: 805,400 RAC: 0 |
"My machine has been reliable with quite a few different BOINC tasks over the years." It may not be your machine as such, but a combination of your machine, VirtualBox and QuChemPedIA tasks. If you have a look at the forums, you'll see VirtualBox has caused a lot of headaches. |
Joined: 16 Dec 19 Posts: 25 Credit: 11,938,843 RAC: 0 |
Plus, that is a first-gen Ryzen, which has issues with memory access. I agree VirtualBox is also highly flawed. |
Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
"I agree VirtualBox is also highly flawed." And slow! I prefer to just install Ubuntu on a USB drive and run BOINC from there. As another user asked, did you verify that you have VT (for VirtualBox) and XMP enabled in the BIOS, and enough RAM to run that many threads? (You'll need about 24-32 GB of RAM if you want to run 32 threads of QuChemPedIA tasks.) One of the reasons mine was running so slowly was that I had only 16 GB of RAM for 24 threads, and it was doing a lot of disk access (about 2 GB of swap). |
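The RAM figures in the post above work out to roughly 0.75-1 GB per running task (24-32 GB for 32 threads). As a minimal sketch under that assumption, the hypothetical helper below estimates how many tasks fit in memory without swapping. The per-task figure is derived from the poster's numbers, not from anything the project has published.

```python
def max_concurrent_tasks(total_ram_gb: float,
                         gb_per_task: float = 1.0,
                         reserved_gb: float = 0.0) -> int:
    """Estimate how many tasks fit in RAM.

    gb_per_task is an assumption taken from the post above (about 0.75-1 GB);
    reserved_gb is memory set aside for the OS and other programs.
    """
    if gb_per_task <= 0:
        raise ValueError("gb_per_task must be positive")
    usable_gb = max(total_ram_gb - reserved_gb, 0.0)
    return int(usable_gb // gb_per_task)


# 16 GB at the optimistic 0.75 GB per task still falls short of 24 threads.
print(max_concurrent_tasks(16.0, gb_per_task=0.75))
# 32 GB at the conservative 1 GB per task covers 32 threads with nothing to spare.
print(max_concurrent_tasks(32.0))
```

If the estimate comes out below the number of threads in use, lowering the CPU percentage in the BOINC computing preferences until the two match should keep the machine out of swap.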
Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
After running into a few of these myself (not VirtualBox, just native Linux), it appears that some WUs created by QCP (including long ones) slow down in roughly the last 10%, to the point where they take 'days' to complete. Without any official word from QCP, if a short WU runs for more than 2 hours on a 3.5 GHz CPU, or more than 4 hours on a 2 GHz CPU, I personally would cancel it. The normal runtimes for short WUs are 1 hour on a 3.5-4 GHz CPU and 2 hours on a 2 GHz CPU, so I'd give them no more than twice the normal runtime. If any long WU runs for more than 1 day on a 3.5 GHz CPU, or 1.5 days on a 2 GHz CPU, I would do the same. Long WUs run about 17 hours on a 3.5-4 GHz CPU and should run about 20 hours on a 2 GHz CPU. |
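The rule of thumb in the post above can be written down as a small lookup. This is only a sketch of one volunteer's personal cut-offs, not project policy. The thresholds are the poster's numbers, while the 3.0 GHz boundary between "fast" and "slow" CPUs and all function names are assumptions added for illustration.

```python
# Abort thresholds in hours, taken from the post above.
ABORT_AFTER_HOURS = {
    ("short", "fast"): 2.0,   # about 3.5 GHz
    ("short", "slow"): 4.0,   # about 2 GHz
    ("long", "fast"): 24.0,   # 1 day
    ("long", "slow"): 36.0,   # 1.5 days
}


def cpu_class(clock_ghz: float) -> str:
    """Classify a CPU; the 3.0 GHz cut-off is an assumption, the post names only 2 and 3.5 GHz."""
    return "fast" if clock_ghz >= 3.0 else "slow"


def exceeds_rule_of_thumb(kind: str, clock_ghz: float, elapsed_hours: float) -> bool:
    """Return True when a work unit has run past the poster's personal limit."""
    return elapsed_hours > ABORT_AFTER_HOURS[(kind, cpu_class(clock_ghz))]


# The task from the opening post: a short WU at 14 h 36 min on a fast CPU.
print(exceeds_rule_of_thumb("short", 3.5, 14.6))
```

Keeping the limits in a table rather than computing "twice the runtime" mirrors the post, where the long-WU limits are not exactly double the quoted runtimes.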
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Since the lockdown, we have had to adapt all our teaching... I'm trying to catch up on the forum. The problem of task duration has already been addressed many times! Computation times are unpredictable and highly variable. The project grants generous credit to compensate for this inconvenience... Some rare short workunits may require more than 100 hours of calculation, but we cannot know this in advance. Cancelling units because they run too long is a behaviour that is starting to impact the quality of our results! It is these borderline cases that will allow us in the future to train an artificial intelligence capable of predicting these cases and also capable of identifying stable chemical boundaries. |
©2024 Benoit DA MOTA - LERIA, University of Angers, France