Stuck tasks
Author | Message |
---|---|
Joined: 6 Jan 22 Posts: 1 Credit: 219,800 RAC: 0 |
I seem to be getting tasks that get stuck and don't really go anywhere. For example, I have one task that has been running for 8 days and 6 hours, but has only registered 01:24:00 of CPU time. These tasks also never seem to exceed 3 seconds of CPU time since the last checkpoint. Does anybody know what exactly is going on here? Should I just abort these tasks? |
Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
"Does anybody know what exactly is going on here? Should I just abort these tasks?" You must be on Windows, though your computers are hidden. The tasks get stuck; that is what is going on. Yes, abort them. (Linux doesn't have the problem for some reason.) |
Joined: 5 Sep 20 Posts: 103 Credit: 2,142,600 RAC: 0 |
I am running QuChem on a Linux virtual machine on a Windows 10 host. The OS is openSUSE Tumbleweed, a frequently updated development OS. The kernel is 5.16.1. Tullio |
Joined: 5 Sep 20 Posts: 103 Credit: 2,142,600 RAC: 0 |
On a Windows 11 PC with 12 GB RAM I have 3 QuChem tasks running, plus a Rosetta python task, which uses a lot of RAM. Tullio |
Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
Yogurt789 wrote: "I seem to be getting tasks that get stuck and don't really go anywhere. For example, I have one task that has been running for 8 days and 6 hours, but has only registered 01:24:00 of CPU time. These tasks also never seem to exceed 3 seconds of CPU time since the last checkpoint." I am running only Linux, hence am not observing this here at QuChemPedIA@home. But I get this phenomenon with the VirtualBox based "rosetta python projects" application of Rosetta@home occasionally. I am not aware of any other way to deal with those stuck tasks than to abort them. Here is a script which periodically checks for the presence of tasks with CPU time << elapsed time and aborts these. You need to edit the project URL in the script to adapt it from Rosetta@home to QuChemPedIA@home. Alas the script interpreter is 'bash', hence it is not entirely straightforward to run on Windows. Cygwin should work, WSL might work. Furthermore, the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work.

```bash
#!/bin/bash

# Edit this:
# a list of hosts, each optionally with GUI port number appended
# (may be just a single host, or dozens of hosts)
hosts=(
    "localhost"
    "computer_a"
    "computer_b:31420"
)

# Edit this:
# the password from gui_rpc_auth.cfg
# This script expects the same password on all hosts.
# Can be set to "" if you have empty gui_rpc_auth.cfg's.
password="$(cat /var/lib/boinc/gui_rpc_auth.cfg)"

# Edit this if you want to apply this to a different project.
project_url="https://boinc.bakerlab.org/rosetta/"

# Change this from "abort" to "suspend" if you prefer.
task_op="abort"

# Before a task hasn't been executing for some time, other task stats
# may still be imprecise. The script therefore does not touch any
# tasks which haven't been executing for at least this many seconds.
# You can use integer numbers here, but not floating point numbers.
# E.g.: 5 * 60 for 5 minutes.
min_elapsed_time=$((5 * 60))

# After tasks were aborted, boinc-client may cease to request
# new work due to "Communication deferred". To avoid this, should a
# project update be forced after one or more tasks were aborted?
# Set to 1 for yes, 0 for no.
force_project_update=1

# Loop intervals.
# You probably don't need to edit these.
check_every_n_minutes=10
timestamp_every_n_minutes=120

# That's it; there is probably no need to edit anything from here on.

delay=$((${check_every_n_minutes}*60/${#hosts[*]}+1))
ts=${timestamp_every_n_minutes}

echo "Monitoring ${hosts[*]}."

for ((;;))
do
    (( (ts += check_every_n_minutes) >= timestamp_every_n_minutes )) && { date; ts=0; }

    for host in ${hosts[*]}
    do
        # Edit this if you run on Cygwin:
        # boinccmd="/cygdrive/c/Program*Files/BOINC/boinccmd --host ${host} --passwd ${password}"
        if [ -n "${password}" ]
        then
            boinccmd="boinccmd --host ${host} --passwd ${password}"
        else
            boinccmd="boinccmd --host ${host}"
        fi

        tasks=$(${boinccmd} --get_tasks) || { sleep ${delay}; continue; }

        unset name url state ett cct
        while read line
        do
            case ${line} in
                [1-9]* )                 i=${line%)*};;
                "name: "* )              name[$i]=${line#*"name: "};;
                "project URL: "* )       url[$i]=${line#*"project URL: "};;
                "active_task_state: "* ) state[$i]=${line#*"active_task_state: "};;
                "elapsed task time: "* ) tmp=${line#*"elapsed task time: "}; ett[$i]=${tmp%.*};;
                "current CPU time: "* )  tmp=${line#*"current CPU time: "}; cct[$i]=${tmp%.*};;
            esac
        done <<< "${tasks}"

        n=0
        for j in ${!name[*]}
        do
            # Skip tasks
            #   - which do not belong to this project,
            #   - which are not currently running,
            #   - which have been running for less than $min_elapsed_time seconds,
            #   - which have a CPU time of more than 50% of elapsed time.
            [ "${url[$j]}" != "${project_url}" ] && continue
            [ "${state[$j]}" != "EXECUTING" ] && continue
            e=${ett[$j]}; ((e < min_elapsed_time)) && continue
            c=${cct[$j]}; ((e < 2*c)) && continue

            printf "${host}: ${task_op} ${name[$j]}\t"
            printf "(elapsed: %02d:%02d:%02d," $((e/3600)) $((e%3600/60)) $((e%60))
            printf " CPU: %02d:%02d:%02d)\n"   $((c/3600)) $((c%3600/60)) $((c%60))

            ${boinccmd} --task "${project_url}" "${name[$j]}" "${task_op}"
            ((n++))
        done

        ((force_project_update && n)) && { sleep 1; ${boinccmd} --project "${project_url}" update; }

        sleep ${delay}
    done
done
```

Source: AnandTech forum |
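For readers who want to see the script's decision rule in isolation, here is a minimal sketch. The function name `is_stuck` is hypothetical and not part of the script above; it only restates the two time-based skip conditions (minimum elapsed time, and CPU time above 50% of elapsed time) and applies them to the figures from the opening post.

```bash
#!/bin/bash
# Hypothetical helper, not part of the quoted script: succeeds (exit status 0)
# when a task looks stuck under the same rule the script applies.
# Arguments are whole seconds: elapsed time, CPU time, minimum elapsed time.
is_stuck() {
    local elapsed=$1 cpu=$2 min_elapsed=$3
    (( elapsed >= min_elapsed )) || return 1   # too young to judge
    (( elapsed >= 2 * cpu ))     || return 1   # CPU time above 50% of elapsed
    return 0
}

# The task from the opening post: 8 days 6 hours elapsed, 01:24:00 of CPU time.
elapsed=$(( 8*86400 + 6*3600 ))   # 712800 s
cpu=$(( 1*3600 + 24*60 ))         # 5040 s

if is_stuck "$elapsed" "$cpu" 300; then
    echo "stuck"
else
    echo "ok"
fi
```

With those numbers the CPU time is well under 1% of the elapsed time, so the sketch prints `stuck`. The 50% threshold is the script's choice; a stricter or looser ratio may suit other projects better.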
Joined: 2 Oct 21 Posts: 24 Credit: 68,200 RAC: 0 |
Since a bunch of us run both Rosetta and this project, I'm thinking that Python eats up a ton of resources and causes this project's tasks to get postponed or stuck. Have a look at my post, and the answer I got to it, in the VM thread. I have had Rosetta tasks get stuck as well; I knocked off 5 that got stuck the other week. I think the key is Python. We know how much those tasks drag a system down when 12 or more run at the same time. |
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
I had such a "stuck task" last night: https://quchempedia.univ-angers.fr/athome/result.php?resultid=10177061 Unfortunately, I only found out this morning, after 11-1/2 hours of runtime and only 21 seconds of CPU time. Too bad that such a task does not stop automatically once it becomes faulty. |
Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
"(Linux doesn't have the problem for some reason.)" But it does. I just aborted one that sat at 100% overnight. I don't get them very often. I'm not running any Python memory hogs or anything, just a few Gaia, TN-Grid & QC. I cut back to 8 QC per computer and they run much better now. https://quchempedia.univ-angers.fr/athome/result.php?resultid=10173747 |
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
I've had quite a number of such "stuck tasks" lately. Whereas before, what mostly happened was that the CPU stopped working after a few seconds (while the task kept running for hours and hours, until an unusually high runtime told me that something must be wrong), I recently had cases like this one https://quchempedia.univ-angers.fr/athome/result.php?resultid=10319732. Unfortunately, I did not notice until after almost 9 hours that this task must be faulty; when I checked the task properties in BOINC Manager, I saw that the CPU had been running for only 3 hrs 47 mins. This kind of behaviour seems new, at least to me. With several tasks running concurrently on 7 computers, it is of course difficult to monitor everything permanently in order to detect such faulty tasks early. And if these faulty tasks keep getting more frequent, it is kind of annoying, of course :-( |
Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
"With several tasks running concurrently on 7 computers, it is of course difficult to monitor everything permanently in order to detect such faulty tasks early." Have you tried BoincTasks? https://efmer.com/boinctasks/download-boinctasks/ It is easy to install, and runs along with BOINC Manager. In fact, you can use it to control the BOINC tasks instead of BM if you want to, but that is not necessary. It gives you an easy indication of the % of the CPU that any given task is using. So if it is not using much (i.e., will take forever to finish), then that will be easy to see. And you can then easily abort the task. I use it on my Win10 machine to monitor not only the work units on it, but also all my Ubuntu machines on the LAN. It makes them all readily available. |
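A similar per-task percentage can be approximated on the command line. The sketch below is not how BoincTasks computes its figure; it assumes task text laid out like `boinccmd --get_tasks` output, using only the field labels that the script quoted earlier in this thread matches (`name:`, `elapsed task time:`, `current CPU time:`). The function name `cpu_percent`, the task names, and the sample values are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: reads boinccmd-style task text on stdin and prints
# "<task name> <CPU time as an integer percentage of elapsed time>".
cpu_percent() {
    awk '
        function emit() {
            if (name != "" && elapsed > 0)
                printf "%s %d%%\n", name, int(100 * cpu / elapsed)
            name = ""; elapsed = 0; cpu = 0
        }
        # Strip the label (and leading blanks) from the current line.
        function value(label,   v) { v = $0; sub("^ *" label, "", v); return v }
        /^[0-9]+\)/               { emit() }   # a new task record starts
        /^ *name: /               { name    = value("name: ") }
        /^ *elapsed task time: /  { elapsed = value("elapsed task time: ") + 0 }
        /^ *current CPU time: /   { cpu     = value("current CPU time: ") + 0 }
        END                       { emit() }
    '
}

# Made-up sample input: one task that barely used the CPU, one healthy task.
cpu_percent <<'EOF'
1) -----------
   name: task_a
   elapsed task time: 41400.000000
   current CPU time: 21.000000
2) -----------
   name: task_b
   elapsed task time: 4020.000000
   current CPU time: 3900.000000
EOF
```

For the sample input this prints `task_a 0%` and `task_b 97%`. On a real host the function would presumably be fed with `boinccmd --get_tasks | cpu_percent`, provided the installed boinccmd prints those fields.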
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
"Have you tried BoincTasks?" Thanks, Jim, for the hint. I have now installed it, starting with the PC on which I run the highest number of tasks concurrently. And yes, it helps to see at a glance what's going on. Still, of course, the problem itself remains. So does the "postponed ..." problem, which is even worse in a way, since it prevents new tasks from being downloaded :-( |
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
The tool BoincTasks revealed something strange this morning: I noticed that for one of the running tasks, a CPU usage of 131.47% was shown. So I looked up the task properties in BOINC Manager and saw that the task has a runtime of 1:07 hrs and a CPU time of 1:28 hrs. How come? |
Joined: 5 Sep 20 Posts: 103 Credit: 2,142,600 RAC: 0 |
This looks like what happens on LHC ATLAS@home when I allow a single task to run on two processors: the CPU time is double the runtime. Tullio |
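The arithmetic behind a figure above 100% can be checked with a few lines of bash. This is only a sketch: the helper name `hm_to_seconds` is hypothetical, and the inputs are the rounded runtime and CPU time quoted in the question, so the result will only approximate what a monitoring tool computes from more precise values.

```bash
#!/bin/bash
# Hypothetical helper: convert an H:MM string to seconds.
# The 10# prefix forces base 10 so that minutes like "08" are not read as octal.
hm_to_seconds() {
    local h=${1%%:*} m=${1##*:}
    echo $(( 10#$h * 3600 + 10#$m * 60 ))
}

runtime=$(hm_to_seconds 1:07)    # 4020 s
cpu_time=$(hm_to_seconds 1:28)   # 5280 s

# CPU time as an integer percentage of runtime.
echo "CPU usage: $(( 100 * cpu_time / runtime ))%"
```

This prints `CPU usage: 131%`. CPU time is summed over all cores a task uses, so a ratio above 100% means the task kept more than one core busy on average, which fits the two-processor ATLAS example; whether that is what happened to this particular task is not established in the thread.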
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
The number of "stuck" tasks has increased here markedly within the past few days :-( Any explanation for this? |
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
"the number of 'stuck' tasks has increased here markedly within the past few days :-(" Am I the only one who is experiencing this problem? Or are there other people as well? |
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
"the number of 'stuck' tasks has increased here markedly within the past few days :-(" "am I the only one who is experiencing this problem? Or are there other people as well?" Indeed, no one else is having this problem? What's wrong with my computers ??? |
Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
"What's wrong with my computers ???" If you mean the "Vm job unmanageable" ones, it is just that you are on Windows. I had them too on Win10, but don't on Ubuntu. It could be something else though. |
Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
"What's wrong with my computers ???" "If you mean the 'Vm job unmanageable' ones ...." No, Jim, I am talking about those tasks for which CPU activity suddenly stops a few seconds after the start, or at any later time. The task itself, though, keeps running forever, until it is aborted manually. The tool BoincTasks which you recommended to me makes this problem more easily visible. Still, I am not sitting around my computers day and night :-) Which means it could well happen that a new task gives up CPU usage right at the beginning, but keeps running all night, uselessly and for nothing :-( The other topic which you mentioned, "vm job unmanageable", happens once in a while, thank God not too often. |
Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
"no, Jim, I am talking about those tasks for which CPU activity suddenly stops right a few seconds after start, or after any other time later." OK, I had not seen that one in the short time I was on Windows. It sounds like the same problem as with the Pythons on Rosetta. They would just get stuck with 0% resource usage and had to be aborted. It happened on both Windows and Linux there though. I do not see it here on Ubuntu. |
©2024 Benoit DA MOTA - LERIA, University of Angers, France