Never ending tasks

Message boards : Number crunching : Never ending tasks
Message board moderation

To post messages, you must log in.

1 · 2 · Next

AuthorMessage
Michael H.W. Weber
Avatar

Send message
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 762 - Posted: 13 Apr 2020, 16:11:44 UTC

...like this one which I had to abort after approx. 5.5 hrs when noticing that there was no CPU usage and no checkpoint:

https://quchempedia.univ-angers.fr/athome/result.php?resultid=2237771

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 762 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 771 - Posted: 16 Apr 2020, 9:07:51 UTC - in response to Message 762.  

The vbox wrapper didn't turn off properly. I couldn't compile a better wrapper, so I'm using the official from Berkeley.
ID: 771 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Michael H.W. Weber
Avatar

Send message
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 775 - Posted: 17 Apr 2020, 9:55:40 UTC - in response to Message 771.  

The vbox wrapper didn't turn off properly. I couldn't compile a better wrapper, so I'm using the official from Berkeley.

Well, it appears not to solve the issue, right?
It might help if you provide the full procedure of creating your wrapper: To my experince, there are usually people around who might have more expertise with this than us. ;-)

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 775 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Michael H.W. Weber
Avatar

Send message
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 786 - Posted: 18 Apr 2020, 11:46:41 UTC
Last modified: 18 Apr 2020, 11:57:47 UTC

I have taken a quick look into these problematic tasks which do not finish.
My conclusion: It seems not to be a problem of not finishing, it appears that these Virtualbox-based VMs are not properly initiating. Please check the logs and the Virtualbox manual in detail: There are numerous error messages given. Here a few examples from_

https://quchempedia.univ-angers.fr/athome/result.php?resultid=2268100

...

Hypervisor System Log:

04:57:37.111095          Saving settings file "C:\ProgramData\BOINC\slots\1\boinc_b6cb3843252a5eb9\boinc_b6cb3843252a5eb9.vbox" with version "1.16-windows"
04:57:39.759996          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0

...

05:02:56.055177          ERROR [COM]: aRC=E_FAIL (0x80004005) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={SessionMachine} aText={This machine does not have any snapshots}, preserve=false aResultDetail=0

...

05:02:56.070793          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
05:02:56.071771          Saving settings file "C:\Users\weber\.VirtualBox\VirtualBox.xml" with version "1.12-windows"
05:02:56.076649          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={The object is not ready}, preserve=false aResultDetail=0
05:02:56.076649          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
05:03:08.894448          ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={9570b9d5-f1a1-448a-10c5-e12f5285adad} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_68a02bed83281783'}, preserve=false aResultDetail=0
0

...

05:03:11.538433          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0

...

When I checked the running tasks by selecting them in the BOINC manager, right-clicking and looking for properties it says that CPU time as well as CPU time after last checkpoint are both zero. So, computation never starts? That would be in line with the observation that these tasks do not appear to consume CPU load.
For correctly running VM tasks CPU time is almost identical to CPU time after last checkpoint, which is also kind of strange given the fact that I set all my systems to write checkpoints to disk only once every hour, because it is known that if the checkpoint interval is set too short, Virtualbox causes I/O issues (depending on the VM size of course).

Michael.

P.S.: You might also consider using Docker instead of Virtualbox. The latter nowadays appears to be a "heavyweight" compared to the former.
(A curious notion made by the person who, to the best of my knowledge, first suggested using VMs as universal checkpointing option for BOINC scientific apps during the Barcelona BOINC Workshop in 2009 where I presented RNA World for the first time on behalf of Rechenkraft.net - and CERN independently even presented a first implementation of that idea at the same conference). ;-)
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 786 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Michael H.W. Weber
Avatar

Send message
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 787 - Posted: 18 Apr 2020, 11:50:27 UTC
Last modified: 18 Apr 2020, 12:04:29 UTC

All the manually aborted tasks (last row: errors) belong to the "never ending task issue":

Laptop: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1969
Desktop: https://quchempedia.univ-angers.fr/athome/results.php?hostid=2004

Note that there are two different types:
One with CPU time zero (yielding a virtually empty log) and one with CPU time (plus a log).
The latter may be investigated in more detail.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 787 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Michael H.W. Weber
Avatar

Send message
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 1009 - Posted: 6 Aug 2020, 10:01:07 UTC

Today, again I have identified a task that ran more than a day on my Intel i5 system without consuming any noticeable CPU. And again it appears to have stalled before properly initializing the Virtualbox-based VM (which is the core of the issue)?
I just aborted this task to free up the CPU for correctly running WUs - it is in this packet and here is the error log:

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
2020-08-05 07:58:03 (5600): vboxwrapper (7.9.26200): starting
2020-08-05 07:58:03 (5600): Feature: Checkpoint interval offset (335 seconds)
2020-08-05 07:58:03 (5600): Feature: Enabling trickle-ups (Interval: 1800.000000)
2020-08-05 07:58:03 (5600): Detected: VirtualBox COM Interface (Version: 5.1.22)
2020-08-05 07:58:03 (5600): Detected: Minimum checkpoint interval (600.000000 seconds)
2020-08-05 07:58:03 (5600): Create VM. (boinc_1e975eb9682dd350, slot#2)
2020-08-05 07:58:03 (5600): Setting Memory Size for VM. (1900MB)
2020-08-05 07:58:03 (5600): Setting CPU Count for VM. (1)
2020-08-05 07:58:03 (5600): Setting Chipset Options for VM.
2020-08-05 07:58:03 (5600): Setting Boot Options for VM.
2020-08-05 07:58:03 (5600): Disabling VM Network Access.
2020-08-05 07:58:03 (5600): Setting Network Configuration for NAT.
2020-08-05 07:58:03 (5600): Disabling USB Support for VM.
2020-08-05 07:58:03 (5600): Disabling COM Port Support for VM.
2020-08-05 07:58:03 (5600): Disabling LPT Port Support for VM.
2020-08-05 07:58:03 (5600): Disabling Audio Support for VM.
2020-08-05 07:58:03 (5600): Disabling Clipboard Support for VM.
2020-08-05 07:58:03 (5600): Disabling Drag and Drop Support for VM.
2020-08-05 07:58:03 (5600): Adding storage controller(s) to VM.
2020-08-05 07:58:03 (5600): Adding virtual disk drive to VM. (vm_image.vdi)
2020-08-05 07:58:03 (5600): Adding VirtualBox Guest Additions to VM.
2020-08-05 07:58:03 (5600): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2020-08-05 07:58:03 (5600): Enabling shared directory for VM.
2020-08-05 07:58:03 (5600): Starting VM. (boinc_1e975eb9682dd350, slot#2)
2020-08-05 07:58:11 (5600): Guest Log: BIOS: VirtualBox 5.1.22

2020-08-05 07:58:11 (5600): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63

2020-08-05 07:58:11 (5600): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032

2020-08-05 07:58:11 (5600): Guest Log: BIOS: Booting from Hard Disk...

2020-08-05 07:58:11 (5600): Successfully started VM. (PID = '10956')
2020-08-05 07:58:11 (5600): Reporting VM Process ID to BOINC.
2020-08-05 07:58:16 (5600): Guest Log: BIOS: KBD: unsupported int 16h function 03

2020-08-05 07:58:16 (5600): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 

2020-08-05 07:58:16 (5600): VM state change detected. (old = 'poweroff', new = 'running')
2020-08-05 07:58:26 (5600): Preference change detected
2020-08-05 07:58:26 (5600): Setting CPU throttle for VM. (100%)
2020-08-05 07:58:26 (5600): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds))
2020-08-05 08:28:27 (5600): Status Report: Trickle-Up Event.
2020-08-05 08:58:29 (5600): Status Report: Trickle-Up Event.
2020-08-05 09:04:04 (5600): Creating new snapshot for VM.
2020-08-05 09:04:09 (5600): Checkpoint completed.
2020-08-05 09:28:30 (5600): Status Report: Trickle-Up Event.
2020-08-05 09:38:30 (5600): Status Report: Elapsed Time: '6004.861517'
2020-08-05 09:38:30 (5600): Status Report: CPU Time: '2.698817'
2020-08-05 09:58:33 (5600): Status Report: Trickle-Up Event.
2020-08-05 10:04:13 (5600): Creating new snapshot for VM.
2020-08-05 10:04:18 (5600): Deleting stale snapshot.
2020-08-05 10:04:19 (5600): Checkpoint completed.
2020-08-05 10:28:35 (5600): Status Report: Trickle-Up Event.
2020-08-05 10:58:37 (5600): Status Report: Trickle-Up Event.
2020-08-05 11:04:22 (5600): Creating new snapshot for VM.
2020-08-05 11:04:27 (5600): Deleting stale snapshot.
2020-08-05 11:04:28 (5600): Checkpoint completed.
2020-08-05 11:18:33 (5600): Status Report: Elapsed Time: '12007.549265'
2020-08-05 11:18:33 (5600): Status Report: CPU Time: '3.697224'
2020-08-05 11:28:38 (5600): Status Report: Trickle-Up Event.
2020-08-05 11:58:42 (5600): Status Report: Trickle-Up Event.
2020-08-05 12:04:32 (5600): Creating new snapshot for VM.
2020-08-05 12:04:37 (5600): Deleting stale snapshot.
2020-08-05 12:04:38 (5600): Checkpoint completed.
2020-08-05 12:28:44 (5600): Status Report: Trickle-Up Event.
2020-08-05 12:58:36 (5600): Status Report: Elapsed Time: '18010.281513'
2020-08-05 12:58:36 (5600): Status Report: CPU Time: '4.243227'
2020-08-05 12:58:46 (5600): Status Report: Trickle-Up Event.
2020-08-05 13:04:41 (5600): Creating new snapshot for VM.
2020-08-05 13:04:47 (5600): Deleting stale snapshot.
2020-08-05 13:04:47 (5600): Checkpoint completed.
2020-08-05 13:28:48 (5600): Status Report: Trickle-Up Event.
2020-08-05 13:58:49 (5600): Status Report: Trickle-Up Event.
2020-08-05 14:04:50 (5600): Creating new snapshot for VM.
2020-08-05 14:04:55 (5600): Deleting stale snapshot.
2020-08-05 14:04:56 (5600): Checkpoint completed.
2020-08-05 14:28:51 (5600): Status Report: Trickle-Up Event.
2020-08-05 14:38:37 (5600): Status Report: Elapsed Time: '24011.322550'
2020-08-05 14:38:37 (5600): Status Report: CPU Time: '5.335234'
2020-08-05 14:58:53 (5600): Status Report: Trickle-Up Event.
2020-08-05 15:04:59 (5600): Creating new snapshot for VM.
2020-08-05 15:05:04 (5600): Deleting stale snapshot.
2020-08-05 15:05:05 (5600): Checkpoint completed.
2020-08-05 15:28:55 (5600): Status Report: Trickle-Up Event.
2020-08-05 15:58:58 (5600): Status Report: Trickle-Up Event.
2020-08-05 16:05:08 (5600): Creating new snapshot for VM.
2020-08-05 16:05:13 (5600): Deleting stale snapshot.
2020-08-05 16:05:14 (5600): Checkpoint completed.
2020-08-05 16:18:39 (5600): Status Report: Elapsed Time: '30013.723759'
2020-08-05 16:18:39 (5600): Status Report: CPU Time: '6.458441'
2020-08-05 16:29:00 (5600): Status Report: Trickle-Up Event.
2020-08-05 16:59:03 (5600): Status Report: Trickle-Up Event.
2020-08-05 17:05:18 (5600): Creating new snapshot for VM.
2020-08-05 17:05:23 (5600): Deleting stale snapshot.
2020-08-05 17:05:24 (5600): Checkpoint completed.
2020-08-05 17:29:05 (5600): Status Report: Trickle-Up Event.
2020-08-05 17:58:44 (5600): Status Report: Elapsed Time: '36018.600281'
2020-08-05 17:58:44 (5600): Status Report: CPU Time: '7.020045'
2020-08-05 17:59:09 (5600): Status Report: Trickle-Up Event.
2020-08-05 18:05:29 (5600): Creating new snapshot for VM.
2020-08-05 18:05:35 (5600): Deleting stale snapshot.
2020-08-05 18:05:35 (5600): Checkpoint completed.
2020-08-05 18:29:11 (5600): Status Report: Trickle-Up Event.
2020-08-05 18:59:13 (5600): Status Report: Trickle-Up Event.
2020-08-05 19:05:38 (5600): Creating new snapshot for VM.
2020-08-05 19:05:44 (5600): Deleting stale snapshot.
2020-08-05 19:05:44 (5600): Checkpoint completed.
2020-08-05 19:29:15 (5600): Status Report: Trickle-Up Event.
2020-08-05 19:38:45 (5600): Status Report: Elapsed Time: '42019.959356'
2020-08-05 19:38:45 (5600): Status Report: CPU Time: '8.034052'
2020-08-05 19:59:19 (5600): Status Report: Trickle-Up Event.
2020-08-05 20:05:50 (5600): Creating new snapshot for VM.
2020-08-05 20:05:55 (5600): Deleting stale snapshot.
2020-08-05 20:05:56 (5600): Checkpoint completed.
2020-08-05 20:29:22 (5600): Status Report: Trickle-Up Event.
2020-08-05 20:59:25 (5600): Status Report: Trickle-Up Event.
2020-08-05 21:06:01 (5600): Creating new snapshot for VM.
2020-08-05 21:06:06 (5600): Deleting stale snapshot.
2020-08-05 21:06:06 (5600): Checkpoint completed.
2020-08-05 21:18:47 (5600): Status Report: Elapsed Time: '48021.486954'
2020-08-05 21:18:47 (5600): Status Report: CPU Time: '9.032458'
2020-08-05 21:29:28 (5600): Status Report: Trickle-Up Event.
2020-08-05 21:59:30 (5600): Status Report: Trickle-Up Event.
2020-08-05 22:06:10 (5600): Creating new snapshot for VM.
2020-08-05 22:06:16 (5600): Deleting stale snapshot.
2020-08-05 22:06:16 (5600): Checkpoint completed.
2020-08-05 22:29:32 (5600): Status Report: Trickle-Up Event.
2020-08-05 22:58:49 (5600): Status Report: Elapsed Time: '54023.607628'
2020-08-05 22:58:49 (5600): Status Report: CPU Time: '9.547261'
2020-08-05 22:59:34 (5600): Status Report: Trickle-Up Event.
2020-08-05 23:06:20 (5600): Creating new snapshot for VM.
2020-08-05 23:06:25 (5600): Deleting stale snapshot.
2020-08-05 23:06:25 (5600): Checkpoint completed.
2020-08-05 23:29:37 (5600): Status Report: Trickle-Up Event.
2020-08-05 23:59:39 (5600): Status Report: Trickle-Up Event.
2020-08-06 00:06:29 (5600): Creating new snapshot for VM.
2020-08-06 00:06:35 (5600): Deleting stale snapshot.
2020-08-06 00:06:35 (5600): Checkpoint completed.
2020-08-06 00:29:41 (5600): Status Report: Trickle-Up Event.
2020-08-06 00:38:52 (5600): Status Report: Elapsed Time: '60026.398884'
2020-08-06 00:38:52 (5600): Status Report: CPU Time: '10.592468'
2020-08-06 00:59:44 (5600): Status Report: Trickle-Up Event.
2020-08-06 01:06:39 (5600): Creating new snapshot for VM.
2020-08-06 01:06:45 (5600): Deleting stale snapshot.
2020-08-06 01:06:45 (5600): Checkpoint completed.
2020-08-06 01:29:46 (5600): Status Report: Trickle-Up Event.
2020-08-06 01:59:49 (5600): Status Report: Trickle-Up Event.
2020-08-06 02:06:50 (5600): Creating new snapshot for VM.
2020-08-06 02:06:55 (5600): Deleting stale snapshot.
2020-08-06 02:06:55 (5600): Checkpoint completed.
2020-08-06 02:18:56 (5600): Status Report: Elapsed Time: '66030.553316'
2020-08-06 02:18:56 (5600): Status Report: CPU Time: '11.559674'
2020-08-06 02:29:52 (5600): Status Report: Trickle-Up Event.
2020-08-06 02:59:54 (5600): Status Report: Trickle-Up Event.
2020-08-06 03:06:59 (5600): Creating new snapshot for VM.
2020-08-06 03:07:05 (5600): Deleting stale snapshot.
2020-08-06 03:07:05 (5600): Checkpoint completed.
2020-08-06 03:29:56 (5600): Status Report: Trickle-Up Event.
2020-08-06 03:58:58 (5600): Status Report: Elapsed Time: '72032.345943'
2020-08-06 03:58:58 (5600): Status Report: CPU Time: '12.152478'
2020-08-06 03:59:58 (5600): Status Report: Trickle-Up Event.
2020-08-06 04:07:08 (5600): Creating new snapshot for VM.
2020-08-06 04:07:14 (5600): Deleting stale snapshot.
2020-08-06 04:07:14 (5600): Checkpoint completed.
2020-08-06 04:30:00 (5600): Status Report: Trickle-Up Event.
2020-08-06 05:00:02 (5600): Status Report: Trickle-Up Event.
2020-08-06 05:07:18 (5600): Creating new snapshot for VM.
2020-08-06 05:07:23 (5600): Deleting stale snapshot.
2020-08-06 05:07:24 (5600): Checkpoint completed.
2020-08-06 05:30:05 (5600): Status Report: Trickle-Up Event.
2020-08-06 05:39:00 (5600): Status Report: Elapsed Time: '78034.852169'
2020-08-06 05:39:00 (5600): Status Report: CPU Time: '13.197685'
2020-08-06 06:00:09 (5600): Status Report: Trickle-Up Event.
2020-08-06 06:07:29 (5600): Creating new snapshot for VM.
2020-08-06 06:07:34 (5600): Deleting stale snapshot.
2020-08-06 06:07:35 (5600): Checkpoint completed.
2020-08-06 06:30:11 (5600): Status Report: Trickle-Up Event.
2020-08-06 07:00:14 (5600): Status Report: Trickle-Up Event.
2020-08-06 07:07:39 (5600): Creating new snapshot for VM.
2020-08-06 07:07:45 (5600): Deleting stale snapshot.
2020-08-06 07:07:45 (5600): Checkpoint completed.
2020-08-06 07:19:01 (5600): Status Report: Elapsed Time: '84035.509655'
2020-08-06 07:19:01 (5600): Status Report: CPU Time: '14.211691'
2020-08-06 07:30:17 (5600): Status Report: Trickle-Up Event.
2020-08-06 08:00:19 (5600): Status Report: Trickle-Up Event.
2020-08-06 08:07:50 (5600): Creating new snapshot for VM.
2020-08-06 08:07:55 (5600): Deleting stale snapshot.
2020-08-06 08:07:56 (5600): Checkpoint completed.
2020-08-06 08:30:22 (5600): Status Report: Trickle-Up Event.
2020-08-06 08:59:06 (5600): Status Report: Elapsed Time: '90040.259159'
2020-08-06 08:59:06 (5600): Status Report: CPU Time: '14.710894'
2020-08-06 09:00:26 (5600): Status Report: Trickle-Up Event.
2020-08-06 09:08:01 (5600): Creating new snapshot for VM.
2020-08-06 09:08:07 (5600): Deleting stale snapshot.
2020-08-06 09:08:07 (5600): Checkpoint completed.
2020-08-06 09:30:29 (5600): Status Report: Trickle-Up Event.
2020-08-06 10:00:31 (5600): Status Report: Trickle-Up Event.
2020-08-06 10:08:12 (5600): Creating new snapshot for VM.
2020-08-06 10:08:17 (5600): Deleting stale snapshot.
2020-08-06 10:08:18 (5600): Checkpoint completed.
2020-08-06 10:30:34 (5600): Status Report: Trickle-Up Event.
2020-08-06 10:39:10 (5600): Status Report: Elapsed Time: '96044.118551'
2020-08-06 10:39:10 (5600): Status Report: CPU Time: '15.678100'
2020-08-06 11:00:36 (5600): Status Report: Trickle-Up Event.
2020-08-06 11:08:22 (5600): Creating new snapshot for VM.
2020-08-06 11:08:27 (5600): Deleting stale snapshot.
2020-08-06 11:08:28 (5600): Checkpoint completed.
2020-08-06 11:30:39 (5600): Status Report: Trickle-Up Event.
2020-08-06 11:33:59 (5600): Powering off VM.
2020-08-06 11:34:01 (5600): Successfully stopped VM.
2020-08-06 11:34:06 (5600): Deregistering VM. (boinc_1e975eb9682dd350, slot#2)
2020-08-06 11:34:06 (5600): Removing virtual disk drive(s) from VM.
2020-08-06 11:34:06 (5600): Removing network bandwidth throttle group from VM.
2020-08-06 11:34:06 (5600): Removing storage controller(s) from VM.
2020-08-06 11:34:06 (5600): Removing VM from VirtualBox.

    Hypervisor System Log:

30:18:37.161964          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:38.630150          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:41.380499          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:42.164599          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:43.632785          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.540655          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.582660          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.582660          ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={0169423f-46b4-cde9-91af-1e9d5b6cd945} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_596cdf7fa7e15082'}, preserve=false aResultDetail=0
30:18:46.663170          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.663170          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.663670          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.663670          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.691174          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.692174          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:47.168234          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:48.635921          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:49.612045          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0
30:18:49.612545          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.171370          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.548418          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.557419 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.581922          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.582422          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.596924          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.603425          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
30:18:52.621927          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={The object is not ready}, preserve=false aResultDetail=0
30:18:52.621927          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
30:18:53.639056          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:54.622181          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:57.175005          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:58.641691          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:59.627317          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:19:02.177640          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0

    VM Execution Log:


    VM Startup Log:


    VM Trace Log:

11:34:16 (5600): called boinc_finish(194)

</stderr_txt>
]]>

Please note that the same machine has completed many validated tasks (actually most of which I have returned so far).

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 1009 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1012 - Posted: 7 Aug 2020, 11:12:30 UTC - in response to Message 1009.  

there are random crashes... This is difficult to reproduce and therefore difficult to solve :(
ID: 1012 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Brian Uitti

Send message
Joined: 7 Jan 20
Posts: 1
Credit: 26,000
RAC: 0
Message 1013 - Posted: 10 Aug 2020, 4:17:13 UTC

So, do we (the project and the contributor) get nothing for running nine days and still does not end? (work units have showing 100% for days)

.. and there is no report to BOINC manager for being timed out? Am I supposed to watch the logs manually every day .. or every hour?
ID: 1013 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Michael H.W. Weber
Avatar

Send message
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 1016 - Posted: 11 Aug 2020, 11:43:16 UTC - in response to Message 1012.  
Last modified: 11 Aug 2020, 11:43:39 UTC

there are random crashes... This is difficult to reproduce and therefore difficult to solve :(

Sorry, it is a machine delivering proper results most of the time, so there has to be some deterministic cause for this.

As a human observer I can immediately detect these broken WUs by simply inspecting the WUs "property" panel in the BOINC manager GUI: These WUs - and ONLY these WUs - show a non-changing checkpointing interval of 0 seconds (sometimes a few seconds). It apears that these WUs are caught in a checkpoint writing loop.

BOINC is configured by default to write checkpoints every minute. If this default setting is kept with BOINC projects using Virtualbox-based VMs, trouble is pre-programmed. Reason: The effort to write a whole VM snapshot might take the entire 1 minute interval reserved for computation (depending on the VM size). Hence, the task will end up in checkpoint writing rather than computing. Total I/O overload.
For this reason I set my checkpoint interval to 3600 seconds (every hour one checkpoint). When I observe a WU which indicates that the last checkpoint was written 0 seconds ago and this indication persists for, say more than 5 minutes, you can bet on the presence of a broken WU.

So, again: It is easy to detect these types of WUs for humans.

Bottom line:
What's required is a script which inspects the snapshot time stamp and ensure it is changing ONLY in the proper time intervals.
If not: automated WU abortion.

Practical problem: the stderr.txt file which is found in the slot folder of a given WU does NOT indicate any issues with the checkpoint writing (just check stderr.txt shown in my above posting and compare to valid tasks). It is indistinguishable from the properly running WUs. So, you need to check the system time stamp of the snapshot file hoping that this one is behaving unusual. Unfortunately, so far I have never taken the time to manually follow up the state of this file. I think it might be worth taking a look at this as soon as such a task reappears on anyone's Windows system.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ID: 1016 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1034 - Posted: 24 Aug 2020, 15:20:13 UTC - in response to Message 1016.  

I have never seen this case on my machines, but thank you for raising and investigating. This is good news if I understand correctly. It might take a little while before I have enough time but it seems feasible.
ID: 1034 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1035 - Posted: 24 Aug 2020, 15:22:49 UTC - in response to Message 1034.  

I have never seen this case on my machines, but thank you for raising and investigating. This is good news if I understand correctly. It might take a little while before I have enough time but it seems feasible.


by the way, I'm going to put the VM applications in beta again. they're too unstable and it creates misunderstandings.
ID: 1035 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Oliver

Send message
Joined: 25 Aug 20
Posts: 1
Credit: 282,200
RAC: 0
Message 1040 - Posted: 28 Aug 2020, 20:54:57 UTC - in response to Message 1035.  

I had five tasks today which I cancelled due to no activity for a long time while being at 99,9 %.

https://quchempedia.univ-angers.fr/athome/result.php?resultid=2617958
https://quchempedia.univ-angers.fr/athome/result.php?resultid=2608927
https://quchempedia.univ-angers.fr/athome/result.php?resultid=2608789
https://quchempedia.univ-angers.fr/athome/result.php?resultid=2602274
https://quchempedia.univ-angers.fr/athome/result.php?resultid=2602346
ID: 1040 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1044 - Posted: 31 Aug 2020, 9:25:47 UTC - in response to Message 1040.  

Thank you for reporting.
ID: 1044 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
UBT - Timbo

Send message
Joined: 8 Dec 19
Posts: 13
Credit: 652,594
RAC: 0
Message 1106 - Posted: 28 Sep 2020, 15:11:45 UTC
Last modified: 28 Sep 2020, 15:14:55 UTC

Hi

I'm in the same boat - I have three hosts running Windows and using V Box and some tasks do successfully complete (and in a short period of time) and earn credits, whilst others seem to get stuck and use 0 CPU time.

These tasks have been going for days and only now have I found this posting where other members have found the same problem with "zombie" like tasks.

This of course means that each host has to be "watched over" in case working CPUs get "tied up" doing nothing, which means wasting electrical power if no work is being done by them.


I have today aborted 4 such tasks from one host:

https://quchempedia.univ-angers.fr/athome/result.php?resultid=3104633 for task #1673397

https://quchempedia.univ-angers.fr/athome/result.php?resultid=3104166 for task #1673163

https://quchempedia.univ-angers.fr/athome/result.php?resultid=3103980 for task #1673070

https://quchempedia.univ-angers.fr/athome/result.php?resultid=3104741 for task #1673451


The stderr logs shows a whole load of information (far too long to copy and paste here) but it does seem that there is an issue with some tasks and not with others, which I assume is down to how the tasks are originally generated?

regards
Tim
ID: 1106 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1109 - Posted: 29 Sep 2020, 8:26:00 UTC - in response to Message 1106.  

Yes indeed, I'm sorry about that. I don't have the funds and human resources to improve the virtualbox application for Windows. I put it back in beta after this observation.

The best solution I know is to install a Linux VM and run Boinc and Quchempedia@home in it. That way you get full control
ID: 1109 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
UBT - Timbo

Send message
Joined: 8 Dec 19
Posts: 13
Credit: 652,594
RAC: 0
Message 1116 - Posted: 30 Sep 2020, 21:01:48 UTC - in response to Message 1109.  

Yes indeed, I'm sorry about that. I don't have the funds and human resources to improve the virtualbox application for Windows. I put it back in beta after this observation.

The best solution I know is to install a Linux VM and run Boinc and Quchempedia@home in it. That way you get full control


Thanks for the reply.

One wonders what the issue is then, if your tasks and a "standard installation" of V Box are causing intermittent issues?

I know that LHC@home, Cosmology@home and nanoHub@home also use V Box and (AFAIK) their tasks do not seem to have any issues with V Box.

Also, it seems strange that each QuChem task requires V Box to set aside 2 Gb of RAM for each task, when the actual requirement is far less?

So, perhaps you could downsize the memory requirement for these tasks...as that might then allow more to run...and if it is a memory issue (or a lack of, at the point when each task starts) that is causing some of these "zombie" tasks, then maybe this might help?

regards
Tim
ID: 1116 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Send message
Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1117 - Posted: 1 Oct 2020, 9:37:26 UTC - in response to Message 1116.  

the trainee who developed the windows application is gone and it's really not my area of expertise...
ID: 1117 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
UBT - Timbo

Send message
Joined: 8 Dec 19
Posts: 13
Credit: 652,594
RAC: 0
Message 1290 - Posted: 26 Dec 2020, 20:08:04 UTC - in response to Message 1117.  
Last modified: 26 Dec 2020, 20:09:28 UTC

Hi

I now have a number of tasks that are all stuck at 100% - the error log shows they are still sending trickles, but still there's no sign of them ending :-(

Still crunching:

after: 6d 12hr.

https://quchempedia.univ-angers.fr/athome/result.php?resultid=5084462

after 4d 12hr

https://quchempedia.univ-angers.fr/athome/result.php?resultid=5077365

after 6d 22hr

https://quchempedia.univ-angers.fr/athome/result.php?resultid=4320138

after 6 d

https://quchempedia.univ-angers.fr/athome/result.php?resultid=4998674

So, do I leave them be or cancel them and do some other projects tasks (as there are clearly issues with this projects tasks that the admin cannot fix).

regards
Tim
ID: 1290 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
swiftmallard
Avatar

Send message
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 1291 - Posted: 26 Dec 2020, 21:35:50 UTC - in response to Message 1290.  

Check the Properties of the task in question. If the difference between the CPU time and the Elapsed time is more than a few minutes, that WU has stopped processing. It will proceed no further and should be aborted.
ID: 1291 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
UBT - Timbo

Send message
Joined: 8 Dec 19
Posts: 13
Credit: 652,594
RAC: 0
Message 1292 - Posted: 26 Dec 2020, 23:58:40 UTC - in response to Message 1291.  
Last modified: 27 Dec 2020, 0:00:27 UTC

Check the Properties of the task in question. If the difference between the CPU time and the Elapsed time is more than a few minutes, that WU has stopped processing. It will proceed no further and should be aborted.


Thanks for the tip.

This is what shows in the Properties of ONE task:

State Running
Received 19/12/2020 20:48:28
Report deadline 02/01/2021 20:48:41
Estimated computation size 20,000 GFLOPs
CPU time 00:11:12
CPU time since checkpoint 00:00:00
Elapsed time 6d 14:57:29
Estimated time remaining 00:00:00
Fraction done 100.000%


Various files in the QuChem task "Slots" directory are still incrementing and do not appear to be "stuck" on an older date/time than the current time now.

VBox claims the tasks are still "Running" - but I have my doubts.

regards
Tim
ID: 1292 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
1 · 2 · Next

Message boards : Number crunching : Never ending tasks

©2024 Benoit DA MOTA - LERIA, University of Angers, France