Message boards :
Number crunching :
Never ending tasks
Message board moderation
Author | Message |
---|---|
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
...like this one which I had to abort after approx. 5.5 hrs when noticing that there was no CPU usage and no checkpoint: https://quchempedia.univ-angers.fr/athome/result.php?resultid=2237771 Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
The vbox wrapper didn't turn off properly. I couldn't compile a better wrapper, so I'm using the official from Berkeley. |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
The vbox wrapper didn't turn off properly. I couldn't compile a better wrapper, so I'm using the official from Berkeley. Well, it appears not to solve the issue, right? It might help if you provide the full procedure of creating your wrapper: To my experince, there are usually people around who might have more expertise with this than us. ;-) Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
I have taken a quick look into these problematic tasks which do not finish. My conclusion: It seems not to be a problem of not finishing, it appears that these Virtualbox-based VMs are not properly initiating. Please check the logs and the Virtualbox manual in detail: There are numerous error messages given. Here a few examples from_ https://quchempedia.univ-angers.fr/athome/result.php?resultid=2268100 ... Hypervisor System Log: 04:57:37.111095 Saving settings file "C:\ProgramData\BOINC\slots\1\boinc_b6cb3843252a5eb9\boinc_b6cb3843252a5eb9.vbox" with version "1.16-windows" 04:57:39.759996 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0 ... 05:02:56.055177 ERROR [COM]: aRC=E_FAIL (0x80004005) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={SessionMachine} aText={This machine does not have any snapshots}, preserve=false aResultDetail=0 ... 05:02:56.070793 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0 05:02:56.071771 Saving settings file "C:\Users\weber\.VirtualBox\VirtualBox.xml" with version "1.12-windows" 05:02:56.076649 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={The object is not ready}, preserve=false aResultDetail=0 05:02:56.076649 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0 05:03:08.894448 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={9570b9d5-f1a1-448a-10c5-e12f5285adad} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_68a02bed83281783'}, preserve=false aResultDetail=0 0 ... 05:03:11.538433 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0 ... When I checked the running tasks by selecting them in the BOINC manager, right-clicking and looking for properties it says that CPU time as well as CPU time after last checkpoint are both zero. So, computation never starts? That would be in line with the observation that these tasks do not appear to consume CPU load. For correctly running VM tasks CPU time is almost identical to CPU time after last checkpoint, which is also kind of strange given the fact that I set all my systems to write checkpoints to disk only once every hour, because it is known that if the checkpoint interval is set too short, Virtualbox causes I/O issues (depending on the VM size of course). Michael. P.S.: You might also consider using Docker instead of Virtualbox. The latter nowadays appears to be a "heavyweight" compared to the former. (A curious notion made by the person who, to the best of my knowledge, first suggested using VMs as universal checkpointing option for BOINC scientific apps during the Barcelona BOINC Workshop in 2009 where I presented RNA World for the first time on behalf of Rechenkraft.net - and CERN independently even presented a first implementation of that idea at the same conference). ;-) President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
All the manually aborted tasks (last row: errors) belong to the "never ending task issue": Laptop: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1969 Desktop: https://quchempedia.univ-angers.fr/athome/results.php?hostid=2004 Note that there are two different types: One with CPU time zero (yielding a virtually empty log) and one with CPU time (plus a log). The latter may be investigated in more detail. Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
Today, again I have identified a task that ran more than a day on my Intel i5 system without consuming any noticeable CPU. And again it appears to have stalled before properly initializing the Virtualbox-based VM (which is the core of the issue)? I just aborted this task to free up the CPU for correctly running WUs - it is in this packet and here is the error log: <core_client_version>7.16.7</core_client_version> <![CDATA[ <message> aborted by user</message> <stderr_txt> 2020-08-05 07:58:03 (5600): vboxwrapper (7.9.26200): starting 2020-08-05 07:58:03 (5600): Feature: Checkpoint interval offset (335 seconds) 2020-08-05 07:58:03 (5600): Feature: Enabling trickle-ups (Interval: 1800.000000) 2020-08-05 07:58:03 (5600): Detected: VirtualBox COM Interface (Version: 5.1.22) 2020-08-05 07:58:03 (5600): Detected: Minimum checkpoint interval (600.000000 seconds) 2020-08-05 07:58:03 (5600): Create VM. (boinc_1e975eb9682dd350, slot#2) 2020-08-05 07:58:03 (5600): Setting Memory Size for VM. (1900MB) 2020-08-05 07:58:03 (5600): Setting CPU Count for VM. (1) 2020-08-05 07:58:03 (5600): Setting Chipset Options for VM. 2020-08-05 07:58:03 (5600): Setting Boot Options for VM. 2020-08-05 07:58:03 (5600): Disabling VM Network Access. 2020-08-05 07:58:03 (5600): Setting Network Configuration for NAT. 2020-08-05 07:58:03 (5600): Disabling USB Support for VM. 2020-08-05 07:58:03 (5600): Disabling COM Port Support for VM. 2020-08-05 07:58:03 (5600): Disabling LPT Port Support for VM. 2020-08-05 07:58:03 (5600): Disabling Audio Support for VM. 2020-08-05 07:58:03 (5600): Disabling Clipboard Support for VM. 2020-08-05 07:58:03 (5600): Disabling Drag and Drop Support for VM. 2020-08-05 07:58:03 (5600): Adding storage controller(s) to VM. 2020-08-05 07:58:03 (5600): Adding virtual disk drive to VM. (vm_image.vdi) 2020-08-05 07:58:03 (5600): Adding VirtualBox Guest Additions to VM. 2020-08-05 07:58:03 (5600): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB) 2020-08-05 07:58:03 (5600): Enabling shared directory for VM. 2020-08-05 07:58:03 (5600): Starting VM. (boinc_1e975eb9682dd350, slot#2) 2020-08-05 07:58:11 (5600): Guest Log: BIOS: VirtualBox 5.1.22 2020-08-05 07:58:11 (5600): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63 2020-08-05 07:58:11 (5600): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032 2020-08-05 07:58:11 (5600): Guest Log: BIOS: Booting from Hard Disk... 2020-08-05 07:58:11 (5600): Successfully started VM. (PID = '10956') 2020-08-05 07:58:11 (5600): Reporting VM Process ID to BOINC. 2020-08-05 07:58:16 (5600): Guest Log: BIOS: KBD: unsupported int 16h function 03 2020-08-05 07:58:16 (5600): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 2020-08-05 07:58:16 (5600): VM state change detected. (old = 'poweroff', new = 'running') 2020-08-05 07:58:26 (5600): Preference change detected 2020-08-05 07:58:26 (5600): Setting CPU throttle for VM. (100%) 2020-08-05 07:58:26 (5600): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds)) 2020-08-05 08:28:27 (5600): Status Report: Trickle-Up Event. 2020-08-05 08:58:29 (5600): Status Report: Trickle-Up Event. 2020-08-05 09:04:04 (5600): Creating new snapshot for VM. 2020-08-05 09:04:09 (5600): Checkpoint completed. 2020-08-05 09:28:30 (5600): Status Report: Trickle-Up Event. 2020-08-05 09:38:30 (5600): Status Report: Elapsed Time: '6004.861517' 2020-08-05 09:38:30 (5600): Status Report: CPU Time: '2.698817' 2020-08-05 09:58:33 (5600): Status Report: Trickle-Up Event. 2020-08-05 10:04:13 (5600): Creating new snapshot for VM. 2020-08-05 10:04:18 (5600): Deleting stale snapshot. 2020-08-05 10:04:19 (5600): Checkpoint completed. 2020-08-05 10:28:35 (5600): Status Report: Trickle-Up Event. 2020-08-05 10:58:37 (5600): Status Report: Trickle-Up Event. 2020-08-05 11:04:22 (5600): Creating new snapshot for VM. 2020-08-05 11:04:27 (5600): Deleting stale snapshot. 2020-08-05 11:04:28 (5600): Checkpoint completed. 2020-08-05 11:18:33 (5600): Status Report: Elapsed Time: '12007.549265' 2020-08-05 11:18:33 (5600): Status Report: CPU Time: '3.697224' 2020-08-05 11:28:38 (5600): Status Report: Trickle-Up Event. 2020-08-05 11:58:42 (5600): Status Report: Trickle-Up Event. 2020-08-05 12:04:32 (5600): Creating new snapshot for VM. 2020-08-05 12:04:37 (5600): Deleting stale snapshot. 2020-08-05 12:04:38 (5600): Checkpoint completed. 2020-08-05 12:28:44 (5600): Status Report: Trickle-Up Event. 2020-08-05 12:58:36 (5600): Status Report: Elapsed Time: '18010.281513' 2020-08-05 12:58:36 (5600): Status Report: CPU Time: '4.243227' 2020-08-05 12:58:46 (5600): Status Report: Trickle-Up Event. 2020-08-05 13:04:41 (5600): Creating new snapshot for VM. 2020-08-05 13:04:47 (5600): Deleting stale snapshot. 2020-08-05 13:04:47 (5600): Checkpoint completed. 2020-08-05 13:28:48 (5600): Status Report: Trickle-Up Event. 2020-08-05 13:58:49 (5600): Status Report: Trickle-Up Event. 2020-08-05 14:04:50 (5600): Creating new snapshot for VM. 2020-08-05 14:04:55 (5600): Deleting stale snapshot. 2020-08-05 14:04:56 (5600): Checkpoint completed. 2020-08-05 14:28:51 (5600): Status Report: Trickle-Up Event. 2020-08-05 14:38:37 (5600): Status Report: Elapsed Time: '24011.322550' 2020-08-05 14:38:37 (5600): Status Report: CPU Time: '5.335234' 2020-08-05 14:58:53 (5600): Status Report: Trickle-Up Event. 2020-08-05 15:04:59 (5600): Creating new snapshot for VM. 2020-08-05 15:05:04 (5600): Deleting stale snapshot. 2020-08-05 15:05:05 (5600): Checkpoint completed. 2020-08-05 15:28:55 (5600): Status Report: Trickle-Up Event. 2020-08-05 15:58:58 (5600): Status Report: Trickle-Up Event. 2020-08-05 16:05:08 (5600): Creating new snapshot for VM. 2020-08-05 16:05:13 (5600): Deleting stale snapshot. 2020-08-05 16:05:14 (5600): Checkpoint completed. 2020-08-05 16:18:39 (5600): Status Report: Elapsed Time: '30013.723759' 2020-08-05 16:18:39 (5600): Status Report: CPU Time: '6.458441' 2020-08-05 16:29:00 (5600): Status Report: Trickle-Up Event. 2020-08-05 16:59:03 (5600): Status Report: Trickle-Up Event. 2020-08-05 17:05:18 (5600): Creating new snapshot for VM. 2020-08-05 17:05:23 (5600): Deleting stale snapshot. 2020-08-05 17:05:24 (5600): Checkpoint completed. 2020-08-05 17:29:05 (5600): Status Report: Trickle-Up Event. 2020-08-05 17:58:44 (5600): Status Report: Elapsed Time: '36018.600281' 2020-08-05 17:58:44 (5600): Status Report: CPU Time: '7.020045' 2020-08-05 17:59:09 (5600): Status Report: Trickle-Up Event. 2020-08-05 18:05:29 (5600): Creating new snapshot for VM. 2020-08-05 18:05:35 (5600): Deleting stale snapshot. 2020-08-05 18:05:35 (5600): Checkpoint completed. 2020-08-05 18:29:11 (5600): Status Report: Trickle-Up Event. 2020-08-05 18:59:13 (5600): Status Report: Trickle-Up Event. 2020-08-05 19:05:38 (5600): Creating new snapshot for VM. 2020-08-05 19:05:44 (5600): Deleting stale snapshot. 2020-08-05 19:05:44 (5600): Checkpoint completed. 2020-08-05 19:29:15 (5600): Status Report: Trickle-Up Event. 2020-08-05 19:38:45 (5600): Status Report: Elapsed Time: '42019.959356' 2020-08-05 19:38:45 (5600): Status Report: CPU Time: '8.034052' 2020-08-05 19:59:19 (5600): Status Report: Trickle-Up Event. 2020-08-05 20:05:50 (5600): Creating new snapshot for VM. 2020-08-05 20:05:55 (5600): Deleting stale snapshot. 2020-08-05 20:05:56 (5600): Checkpoint completed. 2020-08-05 20:29:22 (5600): Status Report: Trickle-Up Event. 2020-08-05 20:59:25 (5600): Status Report: Trickle-Up Event. 2020-08-05 21:06:01 (5600): Creating new snapshot for VM. 2020-08-05 21:06:06 (5600): Deleting stale snapshot. 2020-08-05 21:06:06 (5600): Checkpoint completed. 2020-08-05 21:18:47 (5600): Status Report: Elapsed Time: '48021.486954' 2020-08-05 21:18:47 (5600): Status Report: CPU Time: '9.032458' 2020-08-05 21:29:28 (5600): Status Report: Trickle-Up Event. 2020-08-05 21:59:30 (5600): Status Report: Trickle-Up Event. 2020-08-05 22:06:10 (5600): Creating new snapshot for VM. 2020-08-05 22:06:16 (5600): Deleting stale snapshot. 2020-08-05 22:06:16 (5600): Checkpoint completed. 2020-08-05 22:29:32 (5600): Status Report: Trickle-Up Event. 2020-08-05 22:58:49 (5600): Status Report: Elapsed Time: '54023.607628' 2020-08-05 22:58:49 (5600): Status Report: CPU Time: '9.547261' 2020-08-05 22:59:34 (5600): Status Report: Trickle-Up Event. 2020-08-05 23:06:20 (5600): Creating new snapshot for VM. 2020-08-05 23:06:25 (5600): Deleting stale snapshot. 2020-08-05 23:06:25 (5600): Checkpoint completed. 2020-08-05 23:29:37 (5600): Status Report: Trickle-Up Event. 2020-08-05 23:59:39 (5600): Status Report: Trickle-Up Event. 2020-08-06 00:06:29 (5600): Creating new snapshot for VM. 2020-08-06 00:06:35 (5600): Deleting stale snapshot. 2020-08-06 00:06:35 (5600): Checkpoint completed. 2020-08-06 00:29:41 (5600): Status Report: Trickle-Up Event. 2020-08-06 00:38:52 (5600): Status Report: Elapsed Time: '60026.398884' 2020-08-06 00:38:52 (5600): Status Report: CPU Time: '10.592468' 2020-08-06 00:59:44 (5600): Status Report: Trickle-Up Event. 2020-08-06 01:06:39 (5600): Creating new snapshot for VM. 2020-08-06 01:06:45 (5600): Deleting stale snapshot. 2020-08-06 01:06:45 (5600): Checkpoint completed. 2020-08-06 01:29:46 (5600): Status Report: Trickle-Up Event. 2020-08-06 01:59:49 (5600): Status Report: Trickle-Up Event. 2020-08-06 02:06:50 (5600): Creating new snapshot for VM. 2020-08-06 02:06:55 (5600): Deleting stale snapshot. 2020-08-06 02:06:55 (5600): Checkpoint completed. 2020-08-06 02:18:56 (5600): Status Report: Elapsed Time: '66030.553316' 2020-08-06 02:18:56 (5600): Status Report: CPU Time: '11.559674' 2020-08-06 02:29:52 (5600): Status Report: Trickle-Up Event. 2020-08-06 02:59:54 (5600): Status Report: Trickle-Up Event. 2020-08-06 03:06:59 (5600): Creating new snapshot for VM. 2020-08-06 03:07:05 (5600): Deleting stale snapshot. 2020-08-06 03:07:05 (5600): Checkpoint completed. 2020-08-06 03:29:56 (5600): Status Report: Trickle-Up Event. 2020-08-06 03:58:58 (5600): Status Report: Elapsed Time: '72032.345943' 2020-08-06 03:58:58 (5600): Status Report: CPU Time: '12.152478' 2020-08-06 03:59:58 (5600): Status Report: Trickle-Up Event. 2020-08-06 04:07:08 (5600): Creating new snapshot for VM. 2020-08-06 04:07:14 (5600): Deleting stale snapshot. 2020-08-06 04:07:14 (5600): Checkpoint completed. 2020-08-06 04:30:00 (5600): Status Report: Trickle-Up Event. 2020-08-06 05:00:02 (5600): Status Report: Trickle-Up Event. 2020-08-06 05:07:18 (5600): Creating new snapshot for VM. 2020-08-06 05:07:23 (5600): Deleting stale snapshot. 2020-08-06 05:07:24 (5600): Checkpoint completed. 2020-08-06 05:30:05 (5600): Status Report: Trickle-Up Event. 2020-08-06 05:39:00 (5600): Status Report: Elapsed Time: '78034.852169' 2020-08-06 05:39:00 (5600): Status Report: CPU Time: '13.197685' 2020-08-06 06:00:09 (5600): Status Report: Trickle-Up Event. 2020-08-06 06:07:29 (5600): Creating new snapshot for VM. 2020-08-06 06:07:34 (5600): Deleting stale snapshot. 2020-08-06 06:07:35 (5600): Checkpoint completed. 2020-08-06 06:30:11 (5600): Status Report: Trickle-Up Event. 2020-08-06 07:00:14 (5600): Status Report: Trickle-Up Event. 2020-08-06 07:07:39 (5600): Creating new snapshot for VM. 2020-08-06 07:07:45 (5600): Deleting stale snapshot. 2020-08-06 07:07:45 (5600): Checkpoint completed. 2020-08-06 07:19:01 (5600): Status Report: Elapsed Time: '84035.509655' 2020-08-06 07:19:01 (5600): Status Report: CPU Time: '14.211691' 2020-08-06 07:30:17 (5600): Status Report: Trickle-Up Event. 2020-08-06 08:00:19 (5600): Status Report: Trickle-Up Event. 2020-08-06 08:07:50 (5600): Creating new snapshot for VM. 2020-08-06 08:07:55 (5600): Deleting stale snapshot. 2020-08-06 08:07:56 (5600): Checkpoint completed. 2020-08-06 08:30:22 (5600): Status Report: Trickle-Up Event. 2020-08-06 08:59:06 (5600): Status Report: Elapsed Time: '90040.259159' 2020-08-06 08:59:06 (5600): Status Report: CPU Time: '14.710894' 2020-08-06 09:00:26 (5600): Status Report: Trickle-Up Event. 2020-08-06 09:08:01 (5600): Creating new snapshot for VM. 2020-08-06 09:08:07 (5600): Deleting stale snapshot. 2020-08-06 09:08:07 (5600): Checkpoint completed. 2020-08-06 09:30:29 (5600): Status Report: Trickle-Up Event. 2020-08-06 10:00:31 (5600): Status Report: Trickle-Up Event. 2020-08-06 10:08:12 (5600): Creating new snapshot for VM. 2020-08-06 10:08:17 (5600): Deleting stale snapshot. 2020-08-06 10:08:18 (5600): Checkpoint completed. 2020-08-06 10:30:34 (5600): Status Report: Trickle-Up Event. 2020-08-06 10:39:10 (5600): Status Report: Elapsed Time: '96044.118551' 2020-08-06 10:39:10 (5600): Status Report: CPU Time: '15.678100' 2020-08-06 11:00:36 (5600): Status Report: Trickle-Up Event. 2020-08-06 11:08:22 (5600): Creating new snapshot for VM. 2020-08-06 11:08:27 (5600): Deleting stale snapshot. 2020-08-06 11:08:28 (5600): Checkpoint completed. 2020-08-06 11:30:39 (5600): Status Report: Trickle-Up Event. 2020-08-06 11:33:59 (5600): Powering off VM. 2020-08-06 11:34:01 (5600): Successfully stopped VM. 2020-08-06 11:34:06 (5600): Deregistering VM. (boinc_1e975eb9682dd350, slot#2) 2020-08-06 11:34:06 (5600): Removing virtual disk drive(s) from VM. 2020-08-06 11:34:06 (5600): Removing network bandwidth throttle group from VM. 2020-08-06 11:34:06 (5600): Removing storage controller(s) from VM. 2020-08-06 11:34:06 (5600): Removing VM from VirtualBox. Hypervisor System Log: 30:18:37.161964 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:38.630150 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:41.380499 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:42.164599 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:43.632785 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.540655 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.582660 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.582660 ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={0169423f-46b4-cde9-91af-1e9d5b6cd945} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_596cdf7fa7e15082'}, preserve=false aResultDetail=0 30:18:46.663170 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.663170 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.663670 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.663670 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.691174 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:46.692174 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:47.168234 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:48.635921 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:49.612045 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0 30:18:49.612545 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:52.171370 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:52.548418 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:52.557419 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true aResultDetail=0 30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true aResultDetail=0 30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true aResultDetail=0 30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true aResultDetail=0 30:18:52.581922 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:52.582422 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:52.596924 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:52.603425 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0 30:18:52.621927 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={The object is not ready}, preserve=false aResultDetail=0 30:18:52.621927 ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0 30:18:53.639056 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:54.622181 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:57.175005 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:58.641691 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:18:59.627317 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 30:19:02.177640 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0 VM Execution Log: VM Startup Log: VM Trace Log: 11:34:16 (5600): called boinc_finish(194) </stderr_txt> ]]> Please note that the same machine has completed many validated tasks (actually most of which I have returned so far). Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
there are random crashes... This is difficult to reproduce and therefore difficult to solve :( |
Send message Joined: 7 Jan 20 Posts: 1 Credit: 26,000 RAC: 0 |
So, do we (the project and the contributor) get nothing for running nine days and still does not end? (work units have showing 100% for days) .. and there is no report to BOINC manager for being timed out? Am I supposed to watch the logs manually every day .. or every hour? |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
there are random crashes... This is difficult to reproduce and therefore difficult to solve :( Sorry, it is a machine delivering proper results most of the time, so there has to be some deterministic cause for this. As a human observer I can immediately detect these broken WUs by simply inspecting the WUs "property" panel in the BOINC manager GUI: These WUs - and ONLY these WUs - show a non-changing checkpointing interval of 0 seconds (sometimes a few seconds). It apears that these WUs are caught in a checkpoint writing loop. BOINC is configured by default to write checkpoints every minute. If this default setting is kept with BOINC projects using Virtualbox-based VMs, trouble is pre-programmed. Reason: The effort to write a whole VM snapshot might take the entire 1 minute interval reserved for computation (depending on the VM size). Hence, the task will end up in checkpoint writing rather than computing. Total I/O overload. For this reason I set my checkpoint interval to 3600 seconds (every hour one checkpoint). When I observe a WU which indicates that the last checkpoint was written 0 seconds ago and this indication persists for, say more than 5 minutes, you can bet on the presence of a broken WU. So, again: It is easy to detect these types of WUs for humans. Bottom line: What's required is a script which inspects the snapshot time stamp and ensure it is changing ONLY in the proper time intervals. If not: automated WU abortion. Practical problem: the stderr.txt file which is found in the slot folder of a given WU does NOT indicate any issues with the checkpoint writing (just check stderr.txt shown in my above posting and compare to valid tasks). It is indistinguishable from the properly running WUs. So, you need to check the system time stamp of the snapshot file hoping that this one is behaving unusual. Unfortunately, so far I have never taken the time to manually follow up the state of this file. I think it might be worth taking a look at this as soon as such a task reappears on anyone's Windows system. Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
I have never seen this case on my machines, but thank you for raising and investigating. This is good news if I understand correctly. It might take a little while before I have enough time but it seems feasible. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
I have never seen this case on my machines, but thank you for raising and investigating. This is good news if I understand correctly. It might take a little while before I have enough time but it seems feasible. by the way, I'm going to put the VM applications in beta again. they're too unstable and it creates misunderstandings. |
Send message Joined: 25 Aug 20 Posts: 1 Credit: 282,200 RAC: 0 |
I had five tasks today which I cancelled due to no activity for a long time while being at 99,9 %. https://quchempedia.univ-angers.fr/athome/result.php?resultid=2617958 https://quchempedia.univ-angers.fr/athome/result.php?resultid=2608927 https://quchempedia.univ-angers.fr/athome/result.php?resultid=2608789 https://quchempedia.univ-angers.fr/athome/result.php?resultid=2602274 https://quchempedia.univ-angers.fr/athome/result.php?resultid=2602346 |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Thank you for reporting. |
Send message Joined: 8 Dec 19 Posts: 13 Credit: 652,594 RAC: 0 |
Hi I'm in the same boat - I have three hosts running Windows and using V Box and some tasks do successfully complete (and in a short period of time) and earn credits, whilst others seem to get stuck and use 0 CPU time. These tasks have been going for days and only now have I found this posting where other members have found the same problem with "zombie" like tasks. This of course means that each host has to be "watched over" in case working CPUs get "tied up" doing nothing, which means wasting electrical power if no work is being done by them. I have today aborted 4 such tasks from one host: https://quchempedia.univ-angers.fr/athome/result.php?resultid=3104633 for task #1673397 https://quchempedia.univ-angers.fr/athome/result.php?resultid=3104166 for task #1673163 https://quchempedia.univ-angers.fr/athome/result.php?resultid=3103980 for task #1673070 https://quchempedia.univ-angers.fr/athome/result.php?resultid=3104741 for task #1673451 The stderr logs shows a whole load of information (far too long to copy and paste here) but it does seem that there is an issue with some tasks and not with others, which I assume is down to how the tasks are originally generated? regards Tim |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Yes indeed, I'm sorry about that. I don't have the funds and human resources to improve the virtualbox application for Windows. I put it back in beta after this observation. The best solution I know is to install a Linux VM and run Boinc and Quchempedia@home in it. That way you get full control |
Send message Joined: 8 Dec 19 Posts: 13 Credit: 652,594 RAC: 0 |
Yes indeed, I'm sorry about that. I don't have the funds and human resources to improve the virtualbox application for Windows. I put it back in beta after this observation. Thanks for the reply. One wonders what the issue is then, if your tasks and a "standard installation" of V Box are causing intermittent issues? I know that LHC@home, Cosmology@home and nanoHub@home also use V Box and (AFAIK) their tasks do not seem to have any issues with V Box. Also, it seems strange that each QuChem task requires V Box to set aside 2 Gb of RAM for each task, when the actual requirement is far less? So, perhaps you could downsize the memory requirement for these tasks...as that might then allow more to run...and if it is a memory issue (or a lack of, at the point when each task starts) that is causing some of these "zombie" tasks, then maybe this might help? regards Tim |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
the trainee who developed the windows application is gone and it's really not my area of expertise... |
Send message Joined: 8 Dec 19 Posts: 13 Credit: 652,594 RAC: 0 |
Hi I now have a number of tasks that are all stuck at 100% - the error log shows they are still sending trickles, but still there's no sign of them ending :-( Still crunching: after: 6d 12hr. https://quchempedia.univ-angers.fr/athome/result.php?resultid=5084462 after 4d 12hr https://quchempedia.univ-angers.fr/athome/result.php?resultid=5077365 after 6d 22hr https://quchempedia.univ-angers.fr/athome/result.php?resultid=4320138 after 6 d https://quchempedia.univ-angers.fr/athome/result.php?resultid=4998674 So, do I leave them be or cancel them and do some other projects tasks (as there are clearly issues with this projects tasks that the admin cannot fix). regards Tim |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
Check the Properties of the task in question. If the difference between the CPU time and the Elapsed time is more than a few minutes, that WU has stopped processing. It will proceed no further and should be aborted. |
Send message Joined: 8 Dec 19 Posts: 13 Credit: 652,594 RAC: 0 |
Check the Properties of the task in question. If the difference between the CPU time and the Elapsed time is more than a few minutes, that WU has stopped processing. It will proceed no further and should be aborted. Thanks for the tip. This is what shows in the Properties of ONE task: State Running Various files in the QuChem task "Slots" directory are still incrementing and do not appear to be "stuck" on an older date/time than the current time now. VBox claims the tasks are still "Running" - but I have my doubts. regards Tim |
©2024 Benoit DA MOTA - LERIA, University of Angers, France