Long work units.

Message boards : Number crunching : Long work units.

adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 977 - Posted: 27 Jul 2020, 14:13:31 UTC

>>> It would not be the first time someone has had a very long running task.

Indeed. I well recall climate prediction work units running for months.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 977 · Rating: 0
Yavanius
Joined: 22 Jul 20
Posts: 10
Credit: 21,000
RAC: 0
Message 978 - Posted: 27 Jul 2020, 15:07:10 UTC - in response to Message 977.  

>>> It would not be the first time someone has had a very long running task.

Indeed. I well recall climate prediction work units running for months.



That was intentionally designed like that. You couldn't just model it overnight... except maybe if you go to a desert or ice planet. <Annoying System Alert> Your people have died.
ID: 978 · Rating: 0
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 979 - Posted: 27 Jul 2020, 15:37:38 UTC

The time remaining has dropped to zero now, yet the task continues to run. Certainly rather odd work units.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 979 · Rating: 0
mikey
Joined: 12 Oct 19
Posts: 5
Credit: 337,959
RAC: 0
Message 980 - Posted: 27 Jul 2020, 16:17:53 UTC - in response to Message 979.  
Last modified: 27 Jul 2020, 16:28:00 UTC

The time remaining has dropped to zero now, yet the task continues to run. Certainly rather odd work units.


I have 2 units running on a new laptop with an AMD 4500H CPU, and they are using 0.2% of the CPU to crunch. Almost nothing else is using the PC, and memory is not an issue. One was at 99% complete at midnight last night, and right now at noon today, 12 hours later, it's at 99.691%, so it IS moving, but it's been running for 2 days 13+ hours so far!!! This is NOT a "long" workunit; this can't be normal, CAN IT? It says it has 11 minutes and 17 seconds to go, but that is not even close to being real!!!

The other workunit was at 99.5??% at midnight. It has also been running for over 2 days and 20 hours, and today at noon it's at 99.853% complete. It says it has 6 minutes and 4 seconds left, but I KNOW that isn't true either!!! When I click on Properties for the workunits, they both say they are progressing at 1.8% per hour, the exact same thing they said last night, so that isn't reliable either!!

Needless to say, I have turned off getting any future work, as this is just not normal to me. Is there a way to use a config file to get the PC to use multiple CPU cores on each workunit? Because if so, I would gladly do that to get through these WUs faster.
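
The figures in this post can be sanity-checked with a bit of arithmetic. The Python sketch below takes the two progress readings quoted above (99% at midnight, 99.691% at noon) and compares the observed rate with the 1.8% per hour shown in Properties. The function names are made up for illustration, and the linear extrapolation is only an assumption; nothing in the thread says these tasks progress linearly.

```python
def observed_rate(pct_start: float, pct_end: float, hours: float) -> float:
    """Average progress in percentage points per hour between two readings."""
    return (pct_end - pct_start) / hours


def hours_remaining(pct_now: float, rate_per_hour: float) -> float:
    """Linear extrapolation of the time left; only meaningful if rate > 0."""
    return (100.0 - pct_now) / rate_per_hour


# Readings from the post: 99% at midnight, 99.691% twelve hours later.
rate = observed_rate(99.0, 99.691, 12.0)
eta = hours_remaining(99.691, rate)

print(f"observed rate: {rate:.4f} %/hour (Properties claimed 1.8 %/hour)")
print(f"linear estimate of time left: {eta:.1f} hours")
```

On these numbers the task was advancing at roughly 0.058 percentage points per hour, about thirty times slower than the reported 1.8% per hour, so an estimate of 11 minutes remaining could not have been right.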
ID: 980 · Rating: 0
Yavanius
Joined: 22 Jul 20
Posts: 10
Credit: 21,000
RAC: 0
Message 981 - Posted: 28 Jul 2020, 0:08:40 UTC - in response to Message 980.  

I got two like that. One's been at 99.999 for a few hours now.

Another, which hasn't run quite as long, is at 99.997.

Approaching 1 day 8 hours and 1 day 4 hours respectively.


I noticed WUs were running longer, but this is ridiculous. Also: 200 points for 24 hour WUs?
ID: 981 · Rating: 0
Yavanius
Joined: 22 Jul 20
Posts: 10
Credit: 21,000
RAC: 0
Message 982 - Posted: 28 Jul 2020, 2:15:51 UTC

Not sure if it's related but I'm seeing errors when viewing the console:

udhcpc: sending discover  {5x}
udhcpc failed to get a lease
udhcpc: forking to background
...
 * MOUNT VBOXSF ...
/lib/rc/sh/openrc-run.sh: line 18: ebgin: not found
* RUN WORKER SCRIPT
/run.sh: line 9: taskset: command not found
/run.sh: line 11: taskset: command not found
ID: 982 · Rating: 0
Yavanius
Joined: 22 Jul 20
Posts: 10
Credit: 21,000
RAC: 0
Message 983 - Posted: 28 Jul 2020, 2:31:22 UTC - in response to Message 981.  
Last modified: 28 Jul 2020, 2:37:55 UTC


Approaching 1 day 8 hours and 1 day 4 hours respectively.


I gave up and made them crash by shutting them down in Vbox. ;)


I discovered on reviewing my Tasks on the website that they barely did any computation:

Run time (sec)    CPU time (sec)
    114,840.73            680.75
     96,949.64             95.30
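
To put those two rows in perspective, here is a minimal Python sketch that works out what share of the run time was actually spent on the CPU. It assumes both columns are in seconds, as BOINC project task pages normally report them; the helper name `cpu_efficiency` is made up for illustration.

```python
def cpu_efficiency(run_time_s: float, cpu_time_s: float) -> float:
    """Fraction of wall-clock run time during which the task used the CPU."""
    return cpu_time_s / run_time_s


# (run time, CPU time), assumed to be in seconds, copied from the two rows above.
tasks = [(114_840.73, 680.75), (96_949.64, 95.30)]

for run_s, cpu_s in tasks:
    print(f"run {run_s / 3600:.1f} h, CPU {cpu_s / 60:.1f} min, "
          f"efficiency {cpu_efficiency(run_s, cpu_s):.2%}")
```

Both tasks spent well under 1% of their run time on the CPU, which fits the observation that they barely did any computation.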



The last successful ones ran longer than a Long WU. Something is obviously misconfigured on the project's backend: the units are spinning but not getting very far, and the shorts are now running longer than the Long WUs (which essentially means getting only 200 points for what's basically a Long WU).


For those of us getting these abnormal WUs, getting long WU credit for trying to stick with them would be nice.
ID: 983 · Rating: 0
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 984 - Posted: 28 Jul 2020, 7:58:18 UTC
Last modified: 28 Jul 2020, 8:00:43 UTC

I've set no new tasks. There is clearly something wrong here. The job I had that showed 100.000% complete and 00:00:00 remaining yesterday is still there and "running" this morning; it is approaching 9 days of run time now. The task manager is not showing 100% system in use though. I've stopped and started other projects' work units, and nothing brings it back up to 100%. I'm aborting it now.

Sent: 19 Jul 2020, 10:18:01 UTC
Time reported: 28 Jul 2020, 7:55:37 UTC
Status: Aborted
Run time (sec): 752,166.99
CPU time (sec): 55,497.80
Credit: ---
Application: NWChem long v0.11 (vbox64_t1) windows_x86_64
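
For readers who do not think in seconds, the short Python sketch below converts the run time and CPU time of this task into days and hours. It assumes both values are in seconds, as is usual on BOINC task pages.

```python
from datetime import timedelta

# Values from the task line above, assumed to be in seconds.
run_time = timedelta(seconds=752_166.99)
cpu_time = timedelta(seconds=55_497.80)

print(f"run time: {run_time.total_seconds() / 86_400:.1f} days")
print(f"CPU time: {cpu_time.total_seconds() / 3_600:.1f} hours")
print(f"CPU share of run time: {cpu_time / run_time:.1%}")
```

So the task was "running" for nearly nine days of wall-clock time, but accumulated well under one day of CPU time.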
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 984 · Rating: 0
PHILIPPE
Joined: 4 Jan 20
Posts: 60
Credit: 516,736
RAC: 0
Message 987 - Posted: 28 Jul 2020, 19:55:55 UTC - in response to Message 984.  
Last modified: 28 Jul 2020, 19:59:00 UTC

The best way to know whether a work unit is behaving normally, on Windows hosts running the VirtualBox application, is to check it in the BOINC Manager.

You can't determine it by looking at the progress bar. (The work units of this project are not necessarily finishable, in theory.)

The method:

    Select a running work unit.

    Click the Properties button in the left panel.

    In the pop-up window, check the CPU time and the elapsed time displayed.



If there is a big discrepancy between these two values, then there is a good chance that the work unit is broken (either a busy host, or inefficient RAM management, or power micro waves, ...).

To be sure, do the same check 15 minutes later. If the CPU time has not increased between the two reads, you can abort the work unit without fear, because it will never end.

I know it's not much fun to babysit the work units, but it's the only way not to waste your electricity.

Unless, one day, someone succeeds in writing a reliable script to take care of the running work units, and perhaps includes it in BOINC or VirtualBox. It is curious that it has never been tried, to my knowledge... (Maybe too complicated...)

It could simplify life for many crunchers.
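
As a rough illustration of the check described in this post, the core comparison could look something like the Python sketch below. It is not an existing BOINC or VirtualBox feature. The function name `stalled_tasks`, the task names, and the one-second threshold are all hypothetical, and how the two CPU-time samples get collected (by hand from the Properties window, or by some tool) is deliberately left out.

```python
from typing import Dict, List


def stalled_tasks(first: Dict[str, float],
                  second: Dict[str, float],
                  min_cpu_gain_s: float = 1.0) -> List[str]:
    """Return names of tasks whose CPU time grew by less than min_cpu_gain_s
    between two samples (each sample maps task name -> CPU seconds)."""
    stalled = []
    for name, cpu_before in first.items():
        cpu_after = second.get(name)
        if cpu_after is None:
            continue  # task finished or was removed between the two reads
        if cpu_after - cpu_before < min_cpu_gain_s:
            stalled.append(name)
    return stalled


# Hypothetical samples taken 15 minutes apart.
sample_1 = {"task_a": 680.0, "task_b": 12_000.0}
sample_2 = {"task_a": 680.2, "task_b": 12_850.0}

print(stalled_tasks(sample_1, sample_2))
```

A task flagged this way is only a candidate for aborting; the threshold and the sampling interval would need tuning against real work units.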

ID: 987 · Rating: 0
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 988 - Posted: 29 Jul 2020, 8:32:04 UTC
Last modified: 29 Jul 2020, 8:40:24 UTC

This machine has 15 projects in its portfolio, probably half of which have work running at any one time, and show no problems, so I doubt it is...

>>> (either a busy host, or inefficient RAM management, or power micro waves, ...)

The other machine has a similar portfolio but without GPU projects; the GPU in that machine is older and showing signs of trouble (arrays of black spots on part of the screen, the parts with spots move around over time, a RAM issue I suspect).

The problem for me is that when your task gets into that state, it is preventing another project from using the resource. Looking at the line I posted above, that might be for a very long time. I am not available to watch it 24/7. I'll leave it at "no new tasks" for now.

Best of luck.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 988 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 998 - Posted: 3 Aug 2020, 15:28:00 UTC - in response to Message 988.  
Last modified: 3 Aug 2020, 15:36:32 UTC

This issue has been addressed many times before. Calculation times are unpredictable and sometimes extremely long. As long as there is CPU activity, nothing is lost.

EDIT: If CPU activity is low, something has gone wrong (a known bug, already mentioned in other threads; we are looking for an explanation). In this case, cancel the job.
ID: 998 · Rating: 0
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 1003 - Posted: 4 Aug 2020, 10:22:39 UTC

When you have several machines attached to numerous projects, spending time absorbing all threads on all forums of all projects is a non-starter. As I said, best of luck. I have detached my systems from the project.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 1003 · Rating: 0

©2024 Benoit DA MOTA - LERIA, University of Angers, France