2,5 days long and counting...
Author | Message |
---|---|
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
My current WU has been running for 2 days and 15 hours and counting. It's at 99.070%, and the counter advances very, very slowly. Should I cancel it, or is this OK for a long WU on a 3.5-4 GHz CPU? Task properties: Application: NWChem long 0.19 (t1); Name: BTXv2_athome_b3lyp-321gd_long,batch02,000001835,nwchem_long,1586073408; State: Running; Received: Tue 07 Apr 2020 01:41:44 PM EDT; Report deadline: Mon 06 Jul 2020 01:41:43 PM EDT; Estimated computation size: 500,000 GFLOPs; CPU time: 2d 14:00:57; CPU time since checkpoint: 00:00:03; Elapsed time: 2d 15:13:59; Estimated time remaining: 00:35:37; Fraction done: 99.070%; Virtual memory size: 1.28 GB; Working set size: 133.29 MB; Directory: slots/17; Process ID: 46850; Progress rate: 1.440% per hour; Executable: wrapper_26014_x86_64-pc-linux-gnu |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
I have the same questions as I am seeing something similar on my Windows system. My resource monitor says I am using 83% of my CPU, consistent with crunching on 5 of 6 cores. A couple have been aborted but I hate to keep doing that. |
Send message Joined: 16 Dec 19 Posts: 25 Credit: 11,938,843 RAC: 0 |
Could be. I had a similar one that ran 15.5 hours on 8 threads. How many threads is it running on? I'm on Linux, so that helps some. |
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
Thread count shouldn't matter, as each WU runs in its own thread. But to answer the question: 24. My other PC, with 32 threads, currently also has 2 WUs that have been running for 1+ days. |
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
With no official representation here from QuChemPedIA, my question unanswered, and no higher PPD allocated for tasks that run longer than the usual 'long' ones, I will abort any WU I see running past 1 day, unless someone official can assure me there's nothing wrong with these long WUs where the percentage counter doesn't seem to work. We both shoot ourselves in the foot this way, but most 'long' WUs on a 3.5 GHz CPU run no longer than 17 hours. Even on a 3 GHz CPU you'd finish a WU in 20 hours. So cancelling it is! |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
I am going to let my "long" long tasks run, they are completing and validating. |
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
Quote: "I am going to let my "long" long tasks run, they are completing and validating." Yes, but in 2.5 days you could have crunched and validated 3 long WUs and gotten triple the score. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
(copy/paste from other thread) Since the lockdown, we have had to adapt all our teaching... I'm trying to catch up on the forum. The problem of task duration has already been addressed many times! Computation times are unpredictable and highly variable. The project credits well to compensate for this inconvenience... Some rare short workunits may require more than 100 hours of calculation, but we cannot know this in advance. Cancelling units because they last too long is a behaviour that is starting to impact the quality of our results! It is these borderline cases that will allow us in the future to train an artificial intelligence capable of predicting these cases and also capable of identifying stable chemical boundaries. For long WUs, computation time can really be enormous! Maybe two weeks or more in extreme cases. |
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
Thanks, but the percentage bar is not working correctly on those. It behaves like it does on normal long units: it starts slowing down at around 80-90% and gets ever slower the closer it gets to 100%. At 99% or over, it doesn't appear to move at all. I've been running 2 units for 3+ days now! More and more of these are hogging my CPU queue, meaning that while they are processing, other WUs cannot be processed. You must find a solution for when this happens, so that the result is uploaded before these WUs stretch out indefinitely. My electricity is valuable, and I don't want to spend a week running a task that in the end gets aborted! |
Send message Joined: 3 May 20 Posts: 2 Credit: 400 RAC: 0 |
(copy/paste from other thread) I'm aware of this, but I currently have a WU (od9_athome_b3lyp_321gd,batch78,000788954,nwchem,1581806840_6) that has been running for over 3 days (CPU time) and hardly uses any CPU at all any more! It just sits there, seemingly idling. When I pause all other jobs in BOINC, I see about 3% CPU usage from that WU, and no noticeable network or disk activity. This is on BOINC 7.14.4, VirtualBox 6.1.6, macOS 10.15.4 and an Intel i5-9600K CPU. Please clarify whether (and why) it is still worthwhile keeping such a WU running. It doesn't look like any further progress can be expected from the WU. Are there any logs that one could check? |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
When a WU shows as running for a long time but using no CPU, it should be aborted. |
Send message Joined: 3 May 20 Posts: 2 Credit: 400 RAC: 0 |
Alright, thanks for the info. This is a bit problematic for the project, though: If people don't bother to actively and regularly check for issues like this, the project will fall behind in productivity since the allocated time slots on users' machines won't be used productively. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I am not seeing these problems, maybe because I don't look too hard for them. But I don't need to. I have zero errors at the moment. My longest "long" task is only about 2 days on a Ryzen 2600 (Ubuntu 18.04.4). https://quchempedia.univ-angers.fr/athome/results.php?hostid=1814&offset=0&show_names=0&state=4&appid=3 But I run only t1, which I have found to help in the past. If you have errors, there may be machine problems, such as those due to overclocking, memory, etc. They might cause hangups too? |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I have now switched over to a Ryzen 3900X. We will see how it goes. https://quchempedia.univ-angers.fr/athome/results.php?hostid=2137 |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I did get a long runner. It ran for 4 days 10 hours. I was not sure it would complete, but since it was still making progress (slowly), I let it run. It is now PV. https://quchempedia.univ-angers.fr/athome/result.php?resultid=2335263 I am now on an i7-8700, and will see if the CPU type makes a difference. |
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
I got about 7 WUs of over 3.5 days on a Ryzen 3900X, and several on my 3950X. I would plead for such long runners to be given a higher PPD. They're worth more than the 5000 credits assigned to them! |
Send message Joined: 5 Jan 20 Posts: 7 Credit: 100,435,425 RAC: 0 |
My record is a currently-running WU that is registering 13d 16h run time so far. Based on damotbe's feedback, this is probably very unusual but also not unexpected ... ? |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Indeed, it can happen and be normal. |
Send message Joined: 6 Nov 19 Posts: 8 Credit: 156,845 RAC: 0 |
I don't mind the long runtime. However, what is really bad is the lack of checkpoints. Yesterday my running result was at a runtime of over 2 days. Because of a system reboot, it jumped back to a runtime of about 8 hrs and started again from that point. |
Send message Joined: 6 Nov 19 Posts: 8 Credit: 156,845 RAC: 0 |
Quote: "I don't mind the long runtime. however what is really bad is the lack of checkpoints." Edit to post: After some checking, I found this in the stderr.txt file: `2020-06-10 12:45:10 (224): Restore from previously saved snapshot.` `2020-06-10 12:45:10 (224): Error 0x80010105 in vbox52::VBOX_VM::restore_snapshot (c:\users\david\documents\boinc_git\boinc\samples\vboxwrapper\vbox_mscom_impl.cpp:1835)` `2020-06-10 12:45:10 (224): Error: Getting Error Info! hr = 0x1` The user "david" does not exist on my system. Should it not be the user ID of the person who is running the BOINC program on the local system? Why is the snapshot saved in a documents directory? It should be in a slot directory under the BOINC data directory. |
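
Editorial sketch: the task properties quoted in the first post can be sanity-checked with a little arithmetic. The Python below is only an illustration, not BOINC's actual estimator. It assumes the "Estimated time remaining" figure is a plain linear extrapolation from elapsed time and fraction done, which happens to reproduce the quoted 00:35:37. It also computes the CPU-time-to-elapsed-time ratio, one rough way to tell a slow but busy task from the near-idle one described later in the thread. All function names and the 10% idle threshold are hypothetical choices made for this example.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '2d 15:13:59' or '00:35:37' to seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = match.groups()
    return (int(days or 0) * 86400 + int(hours) * 3600
            + int(minutes) * 60 + int(seconds))


def format_duration(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS (hours may exceed 24)."""
    total = round(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def linear_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Remaining seconds if progress continued at the average rate so far.

    Assumption: progress is linear in time. The thread shows it is not for
    these tasks, which is why the estimate stays small for days near 99%.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1 - fraction_done) / fraction_done


def cpu_efficiency(cpu_s: float, elapsed_s: float) -> float:
    """Share of wall-clock time the task actually spent on the CPU."""
    return cpu_s / elapsed_s


def looks_idle(cpu_s: float, elapsed_s: float, threshold: float = 0.10) -> bool:
    """Hypothetical rule of thumb: flag tasks using under `threshold` of a core."""
    return cpu_efficiency(cpu_s, elapsed_s) < threshold


# Values quoted in the first post of the thread.
elapsed = parse_duration("2d 15:13:59")
cpu = parse_duration("2d 14:00:57")
remaining = linear_remaining(elapsed, 0.99070)

print(format_duration(remaining))              # 00:35:37
print(f"{cpu_efficiency(cpu, elapsed):.1%}")   # 98.1%
print(looks_idle(cpu, elapsed))                # False
```

For that task the CPU was busy about 98% of the time, so it was slow rather than hung. The ratio over the whole run can hide a task that stalled recently, so comparing CPU time between two readings taken some minutes apart would be a better test for a case like the 3% CPU usage reported later in the thread.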
©2024 Benoit DA MOTA - LERIA, University of Angers, France