Stuck tasks

Message boards : Number crunching : Stuck tasks
Yogurt789

Joined: 6 Jan 22
Posts: 1
Credit: 219,800
RAC: 0
Message 1640 - Posted: 23 Jan 2022, 14:52:08 UTC

I seem to be getting tasks that get stuck and don't really go anywhere. For example, I have one task that has been running for 8 days and 6 hours, but has only registered 01:24:00 of CPU time. These tasks also never seem to exceed 3 seconds of CPU time since the last checkpoint.

Does anybody know what exactly is going on here? Should I just abort these tasks?
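To put a number on the mismatch described above, here is a minimal sketch that converts the two times to seconds and prints the CPU time as a share of the elapsed time. The helper names `to_seconds` and `cpu_share` are made up for this illustration and are not part of BOINC.

```bash
#!/bin/bash

# Convert "HH:MM:SS" plus an optional day count to seconds.
# The 10# prefix forces base 10 so that fields like "08" are not read as octal.
to_seconds() {
	local hms=$1 days=${2:-0}
	local h m s
	IFS=: read -r h m s <<< "${hms}"
	echo $(( days * 86400 + 10#${h} * 3600 + 10#${m} * 60 + 10#${s} ))
}

# CPU time as a percentage of elapsed time, with two decimals,
# using integer arithmetic only (the result is truncated, not rounded).
cpu_share() {
	local cpu=$1 elapsed=$2
	local x=$(( cpu * 10000 / elapsed ))
	printf '%d.%02d%%\n' $(( x / 100 )) $(( x % 100 ))
}

elapsed=$(to_seconds 06:00:00 8)   # 8 days and 6 hours
cpu=$(to_seconds 01:24:00)         # 01:24:00 of CPU time
cpu_share "${cpu}" "${elapsed}"
```

For the task in question this prints 0.70%, i.e. the task used the CPU for less than one percent of the time it was nominally running.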
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1641 - Posted: 23 Jan 2022, 15:42:32 UTC - in response to Message 1640.  

Does anybody know what exactly is going on here? Should I just abort these tasks?

You must be on Windows, though your computers are hidden.

They get stuck. That is what is going on.
Yes, abort them. (Linux doesn't have the problem for some reason.)
Tullio

Joined: 5 Sep 20
Posts: 103
Credit: 2,142,600
RAC: 0
Message 1642 - Posted: 27 Jan 2022, 13:22:11 UTC - in response to Message 1641.  
Last modified: 27 Jan 2022, 13:23:17 UTC

I am running QuChem on a Linux virtual machine on a Windows 10 host. The OS is openSUSE Tumbleweed, a frequently updated development OS. The kernel is 5.16.1.
Tullio
Tullio

Joined: 5 Sep 20
Posts: 103
Credit: 2,142,600
RAC: 0
Message 1643 - Posted: 27 Jan 2022, 14:33:05 UTC

On a Windows 11 PC with 12 GB RAM I have 3 QuChem tasks running plus a Rosetta python task, which uses a lot of RAM.
Tullio
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 1648 - Posted: 11 Feb 2022, 6:38:23 UTC - in response to Message 1640.  

Yogurt789 wrote:
I seem to be getting tasks that get stuck and don't really go anywhere. For example, I have one task that has been running for 8 days and 6 hours, but has only registered 01:24:00 of CPU time. These tasks also never seem to exceed 3 seconds of CPU time since the last checkpoint.

Does anybody know what exactly is going on here? Should I just abort these tasks?

I am running only Linux, hence I am not observing this here at QuChemPedIA@home. But I occasionally get this phenomenon with the VirtualBox-based "rosetta python projects" application of Rosetta@home. I am not aware of any way to deal with those stuck tasks other than to abort them.

Here is a script which periodically checks for the presence of tasks with CPU time << elapsed time and aborts these. You need to edit the project URL in the script to adapt it from Rosetta@home to QuChemPedIA@home. Alas the script interpreter is 'bash', hence it is not entirely straightforward to run on Windows. Cygwin should work, WSL might work. Furthermore, the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work.
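As a rough aid for the version requirement mentioned above, here is a minimal sketch of a version comparison. It assumes GNU `sort` with the `-V` (version sort) option; the helper name `version_ge` is made up for this illustration. The version strings are typed in by hand here, and 7.16.17 is used as the reference only because it is the oldest version reported to work in this post; the real threshold is not known.

```bash
#!/bin/bash

# Succeeds if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
	[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: oldest version reported to work; the true minimum is unknown.
known_good="7.16.17"

for v in 7.16.6 7.16.17
do
	if version_ge "${v}" "${known_good}"
	then
		echo "${v}: at least ${known_good}"
	else
		echo "${v}: older than ${known_good}, the script may not work"
	fi
done
```

A plain string comparison would get this wrong, because "7.16.6" sorts after "7.16.17" alphabetically; version sort compares the numeric fields instead.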

#!/bin/bash

# Edit this:
#    a list of hosts, each optionally with GUI port number appended
#    (may be just a single host, or dozens of hosts)
hosts=(
	"localhost"
	"computer_a"
	"computer_b:31420"
)

# Edit this:
#    the password from gui_rpc_auth.cfg
#    This script expects the same password on all hosts.
#    Can be set to "" if you have empty gui_rpc_auth.cfg's.
password="$(cat /var/lib/boinc/gui_rpc_auth.cfg)"

# Edit this if you want to apply this to a different project.
project_url="https://boinc.bakerlab.org/rosetta/"

# Change this from "abort" to "suspend" if you prefer.
task_op="abort"

# Until a task has been executing for some time, its stats
# may still be imprecise.  The script therefore does not touch any
# tasks which have been executing for less than this many seconds.
# You can use integer numbers here, but not floating point numbers.
# E.g.: 5 * 60 for 5 minutes.
min_elapsed_time=$((5 * 60))

# After tasks were aborted, boinc-client may cease to request
# new work due to "Communication deferred". To avoid this, should a
# project update be forced after one or more tasks were aborted?
# Set to 1 for yes, 0 for no.
force_project_update=1

# Loop intervals.
# You probably don't need to edit these.
check_every_n_minutes=10
timestamp_every_n_minutes=120

# That's it; there is probably no need to edit anything from here on.
delay=$((${check_every_n_minutes}*60/${#hosts[*]}+1))
ts=${timestamp_every_n_minutes}

echo "Monitoring ${hosts[*]}."
for ((;;))
do
	(( (ts += check_every_n_minutes) >= timestamp_every_n_minutes )) && { date; ts=0; }

	for host in "${hosts[@]}"
	do
		# Edit this if you run on Cygwin:
		#    boinccmd="/cygdrive/c/Program*Files/BOINC/boinccmd --host ${host} --passwd ${password}"
		if [ -n "${password}" ]
		then
			boinccmd="boinccmd --host ${host} --passwd ${password}"
		else
			boinccmd="boinccmd --host ${host}"
		fi

		tasks=$(${boinccmd} --get_tasks) || { sleep ${delay}; continue; }

		unset name url state ett cct
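		# Parse the task list line by line ("read" strips leading whitespace;
		# -r keeps backslashes literal).  A line starting with a digit opens
		# a new task entry; the other patterns pick out the fields we need.
		while read -r line
		do
			case ${line} in
			[1-9]* )                 i=${line%)*};;
			"name: "* )              name[$i]=${line#*"name: "};;
			"project URL: "* )       url[$i]=${line#*"project URL: "};;
			"active_task_state: "* ) state[$i]=${line#*"active_task_state: "};;
			"elapsed task time: "* ) tmp=${line#*"elapsed task time: "}; ett[$i]=${tmp%.*};;
			"current CPU time: "* )  tmp=${line#*"current CPU time: "};  cct[$i]=${tmp%.*};;
			esac
		done <<< "${tasks}"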

		n=0
		for j in ${!name[*]}
		do
			# Skip tasks
			#   - which do not belong to this project,
			#   - which are not currently running,
			#   - which have been running for less than $min_elapsed_time seconds,
			#   - which have a CPU time of more than 50% of elapsed time.
			[ "${url[$j]}"   != "${project_url}" ] && continue
			[ "${state[$j]}" != "EXECUTING"      ] && continue
			e=${ett[$j]}; ((e < min_elapsed_time)) && continue
			c=${cct[$j]}; ((e < 2*c)) && continue

			# Pass values as arguments so that a "%" in a name is not read as a format.
			printf "%s: %s %s\t" "${host}" "${task_op}" "${name[$j]}"
			printf "(elapsed: %02d:%02d:%02d," $((e/3600)) $((e%3600/60)) $((e%60))
			printf " CPU: %02d:%02d:%02d)\n"   $((c/3600)) $((c%3600/60)) $((c%60))
			${boinccmd} --task "${project_url}" "${name[$j]}" "${task_op}"
			((n++))
		done

		((force_project_update && n)) && { sleep 1; ${boinccmd} --project "${project_url}" update; }

		sleep ${delay}
	done
done

Source: AnandTech forum
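To see the selection rule of the script in isolation, here is a minimal sketch that applies the same parsing and the same "CPU time at most half of elapsed time" test to a canned text instead of live `boinccmd` output. The sample is hypothetical: it only uses the field labels which the script matches, and the task names and numbers are made up (the first one mimics a task with 11.5 hours of runtime and 21 seconds of CPU time). The function name `find_stuck` is likewise an assumption for this illustration.

```bash
#!/bin/bash

# Hypothetical sample in the field layout that the script above parses.
# Task names and numbers are made up for illustration.
sample='1) -----------
   name: task_alpha
   active_task_state: EXECUTING
   current CPU time: 21.000000
   elapsed task time: 41400.000000
2) -----------
   name: task_beta
   active_task_state: EXECUTING
   current CPU time: 3500.000000
   elapsed task time: 3600.000000'

min_elapsed_time=$((5 * 60))

# Print the names of executing tasks whose CPU time is at most
# half of their elapsed time.
find_stuck() {
	local line i tmp j
	local -a name state ett cct
	while read -r line
	do
		case ${line} in
		[1-9]*)                 i=${line%%")"*};;
		"name: "*)              name[$i]=${line#"name: "};;
		"active_task_state: "*) state[$i]=${line#"active_task_state: "};;
		"elapsed task time: "*) tmp=${line#"elapsed task time: "}; ett[$i]=${tmp%.*};;
		"current CPU time: "*)  tmp=${line#"current CPU time: "};  cct[$i]=${tmp%.*};;
		esac
	done <<< "$1"

	for j in "${!name[@]}"
	do
		[ "${state[$j]}" != "EXECUTING" ] && continue
		(( ${ett[$j]:-0} < min_elapsed_time )) && continue
		(( ${ett[$j]:-0} < 2 * ${cct[$j]:-0} )) && continue
		echo "${name[$j]}"
	done
}

find_stuck "${sample}"
```

With this sample only task_alpha is reported; task_beta has used the CPU for most of its runtime and is left alone. Trying the rule on canned text like this is a cheap way to check a changed threshold before letting the real script abort anything.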
Greg_BE

Joined: 2 Oct 21
Posts: 24
Credit: 68,200
RAC: 0
Message 1651 - Posted: 11 Feb 2022, 23:41:43 UTC

Since a bunch of us run both Rosetta and this project, I suspect that the Rosetta Python tasks eat up a ton of resources and cause the tasks of this project to get postponed or stuck.

Have a look at my post and the answer I got to it in the VM thread.
I have had Rosetta tasks get stuck as well.
I knocked off 5 that got stuck the other week.
I think the key is Python. We know how much they drag a system down when 12 or more run at the same time.
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1708 - Posted: 11 Mar 2022, 6:36:32 UTC

I had such a "stuck task" last night:

https://quchempedia.univ-angers.fr/athome/result.php?resultid=10177061

Unfortunately, I found out only this morning, after 11-1/2 hours of runtime and only 21 seconds of CPU time.

Too bad that such a task does not stop automatically once it becomes faulty.
Aurum

Joined: 14 Dec 19
Posts: 68
Credit: 45,744,261
RAC: 0
Message 1709 - Posted: 11 Mar 2022, 19:54:40 UTC - in response to Message 1641.  
Last modified: 11 Mar 2022, 19:55:23 UTC

(Linux doesn't have the problem for some reason.)

But it does. I just aborted one that sat at 100% overnight. I don't get them very often.
I'm not running any python memory hogs or anything, just a few Gaia, TN-Grid & QC.
Cut back to 8 QC per computer and they run much better now.
https://quchempedia.univ-angers.fr/athome/result.php?resultid=10173747
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1713 - Posted: 24 Mar 2022, 16:25:15 UTC - in response to Message 1709.  

I've had quite a number of such "stuck tasks" lately.
Before, what mostly happened was that CPU activity stopped after a few seconds while the task kept running for hours and hours, until I noticed from an unusually high runtime that something must be wrong. Recently, however, I have had cases like this one: https://quchempedia.univ-angers.fr/athome/result.php?resultid=10319732
Unfortunately, I did not notice until after almost 9 hours that this task must be faulty. When I checked the task properties in BOINC Manager, I saw that the CPU time was only 3 hrs 47 mins.
This kind of behaviour seems new, at least to me.
With several tasks running concurrently on 7 computers, it is of course difficult to monitor everything permanently in order to detect such faulty tasks early.
And if these faulty tasks become more and more frequent, it is of course rather annoying :-(
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1714 - Posted: 24 Mar 2022, 23:02:24 UTC - in response to Message 1713.  
Last modified: 24 Mar 2022, 23:03:34 UTC

With several tasks running concurrently on 7 computers, it is of course difficult to monitor everything permanently in order to detect such faulty tasks early.

Have you tried BoincTasks?
https://efmer.com/boinctasks/download-boinctasks/

It is easy to install, and runs along with BOINC Manager. In fact, you can use it to control the BOINC tasks instead of BM if you want to, but that is not necessary.
It gives you an easy indication of the % of the CPU that any given task is using. So if it is not using much (i.e., will take forever to finish), then that will be easy to see.
And you can then easily abort the task.

I use it on my Win10 machine to monitor not only the work units on it, but also all my Ubuntu machines on the LAN. It makes them all readily available.
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1715 - Posted: 27 Mar 2022, 18:12:15 UTC - in response to Message 1714.  

Have you tried BoincTasks?
https://efmer.com/boinctasks/download-boinctasks/

It is easy to install, and runs along with BOINC Manager. In fact, you can use it to control the BOINC tasks instead of BM if you want to, but that is not necessary.
It gives you an easy indication of the % of the CPU that any given task is using. So if it is not using much (i.e., will take forever to finish), then that will be easy to see.
And you can then easily abort the task.

I use it on my Win10 machine to monitor not only the work units on it, but also all my Ubuntu machines on the LAN. It makes them all readily available.

Thanks, Jim, for the hint. I have now installed it, starting with the PC on which I run the highest number of tasks concurrently.
And yes, it helps to monitor at a glance what's going on.

Still, of course, the problem itself remains.

So does the problem of "postponed ..." tasks, which is even worse in a way since it prevents new tasks from being downloaded :-(
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1716 - Posted: 31 Mar 2022, 6:39:37 UTC

The BoincTasks tool revealed something strange this morning:

I noticed that for one of the running tasks, a CPU usage of 131.47% was shown.
So I looked up the task properties in the BOINC Manager and saw that the task has a runtime of 1:07 hrs and a CPU time of 1:28 hrs. How come?
Tullio

Joined: 5 Sep 20
Posts: 103
Credit: 2,142,600
RAC: 0
Message 1718 - Posted: 1 Apr 2022, 15:41:09 UTC

This looks like what happens on LHC ATLAS@home when I allow a single task to run on two processors: the CPU time is double the runtime.
Tullio
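The figure of about 131% fits the task properties quoted earlier in the thread: CPU time divided by runtime gives the average number of cores kept busy, so any value above 1 means the task used more than one core at least part of the time. A minimal sketch of that arithmetic, with times in whole minutes (the helper name `avg_cores` is made up for this illustration):

```bash
#!/bin/bash

# Average number of busy cores = CPU time / elapsed time,
# printed with two decimals using integer arithmetic (truncated).
avg_cores() {
	local cpu_min=$1 elapsed_min=$2
	local x=$(( cpu_min * 100 / elapsed_min ))
	printf '%d.%02d\n' $(( x / 100 )) $(( x % 100 ))
}

avg_cores 88 67    # CPU time 1:28 hrs, runtime 1:07 hrs
```

This prints 1.31, i.e. roughly 131%. The small gap to the displayed value is presumably due to the seconds that the hh:mm figures drop and to the two readings being taken at slightly different moments.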
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1737 - Posted: 29 Apr 2022, 14:15:11 UTC

the number of "stuck" tasks has increased here markedly within the past few days :-(
any explanation for this?
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1738 - Posted: 1 May 2022, 16:21:20 UTC - in response to Message 1737.  

the number of "stuck" tasks has increased here markedly within the past few days :-(
any explanation for this?

am I the only one who is experiencing this problem? Or are there other people as well?
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1741 - Posted: 8 May 2022, 15:26:42 UTC - in response to Message 1738.  

the number of "stuck" tasks has increased here markedly within the past few days :-(
any explanation for this?
am I the only one who is experiencing this problem? Or are there other people as well?

Indeed, no one else is having this problem?
What's wrong with my computers ???
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1742 - Posted: 8 May 2022, 15:31:04 UTC - in response to Message 1741.  

What's wrong with my computers ???

If you mean the "Vm job unmanageable" ones, it is just that you are on Windows.
I had them too on Win10, but don't on Ubuntu.

It could be something else though.
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1746 - Posted: 12 May 2022, 14:17:31 UTC - in response to Message 1742.  

What's wrong with my computers ???
If you mean the "Vm job unmanageable" ones ....

no, Jim, I am talking about those tasks for which CPU activity suddenly stops right a few seconds after start, or after any other time later. The task itself, though, keeps running forever until it is aborted manually. The BoincTasks tool which you recommended makes this problem more easily visible. Still, I am not sitting at my computers day and night :-) Which means it could well happen that a new task stops using the CPU right at the beginning but keeps running all night, uselessly and for nothing :-(

The other topic which you mentioned, "vm job unmanageable", happens once in a while, thank God not too often.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1747 - Posted: 12 May 2022, 15:10:55 UTC - in response to Message 1746.  

no, Jim, I am talking about those tasks for which CPU activity suddenly stops right a few seconds after start, or after any other time later.

OK, I had not seen that one in the short time I was on Windows.

It sounds like the same problem with the Pythons on Rosetta. They would just get stuck with 0% resource usage and had to be aborted.
It happened on both Windows and Linux there though. I do not see it here on Ubuntu.

©2024 Benoit DA MOTA - LERIA, University of Angers, France