Posts by Michael H.W. Weber

1) Message boards : News : Project update (Message 1786)
Posted 29 Aug 2022 by Michael H.W. Weber
Post:
Thank you very much for the project update. Of course, we all know and experience the issues with research funding these days.
I hope this does not mean that the project will close down?
If you have indeed found a way to greatly reduce the non-stable tasks, that would be a VERY good reason to continue the good work, I believe...

Michael.
2) Message boards : Number crunching : Validation Inconclusive (Message 1030)
Posted 23 Aug 2020 by Michael H.W. Weber
Post:
Well, for some computations, cross-validation between Linux and Windows does not work.
If this is the case here, too, the project lead should quickly reconfigure the server so that WUs are not validated across operating systems.
A simple test comparison of a set of identical tasks calculated on Linux and Windows machines should do to clarify what is going on.
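
Such a test comparison could be sketched as follows. This is only an illustration: the result format (a dict of named floating-point values), the example numbers and the tolerance are assumptions, not the project's actual validator or output.

```python
import math

def results_agree(linux: dict, windows: dict, rel_tol: float = 1e-6) -> bool:
    """Return True if both hosts report the same keys and numerically close values."""
    if linux.keys() != windows.keys():
        return False
    return all(math.isclose(linux[k], windows[k], rel_tol=rel_tol) for k in linux)

# Hypothetical values for one identical task computed on both systems.
linux_result = {"total_energy": -76.4321987, "dipole": 1.8546}
windows_result = {"total_energy": -76.4321989, "dipole": 1.8546}
print(results_agree(linux_result, windows_result))
```

If identical tasks only agree with a loose tolerance, or not at all, that would point to a cross-OS validation problem rather than to faulty hosts.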

Michael.

[edit]: In my case, however, the LONG tasks are mainly delivered to Linux systems (only a beta-tester app for Windows has been released for these LONG tasks). Still, ALL tasks are invalidated, so it is not a cross-validation issue here.
3) Message boards : Number crunching : Validation Inconclusive (Message 1026)
Posted 22 Aug 2020 by Michael H.W. Weber
Post:
Again, ALL "long" tasks released after August 11th appear faulty.
For each invalidated task, you will find wingmen by the crate who confirm that these tasks are buggy. Why recirculate them so often, then?

Can anyone show me a properly validated one meeting the specs I described above (Linux, long, released after August 11th)?

Michael.
4) Message boards : Number crunching : Validation Inconclusive (Message 1024)
Posted 22 Aug 2020 by Michael H.W. Weber
Post:
A typical log:

Stderr output

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<stderr_txt>
04:57:24 (6476): wrapper (7.5.26014): starting
04:57:24 (6476): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive
OPT.out
Normal termination.
13:30:20 (6476): worker.sh exited; CPU time 1.953940
13:30:20 (6476): called boinc_finish(0)

</stderr_txt>
]]>

Since around August 11th, I have only been processing the more demanding NWChem long tasks, and only on Linux. Before that, I had all task types in the works. The last correctly validated "long" tasks were returned on August 9th.

Maybe it is just an issue with these long tasks then?
The project lead has to specifically take a look at these work packets injected into the system around August 11th, I think.

Michael.
5) Message boards : Number crunching : Validation Inconclusive (Message 1022)
Posted 21 Aug 2020 by Michael H.W. Weber
Post:
On all my machines, not a single task has been validated since August 11th.
That is a total of 37 tasks, many of which run quite long (sometimes a few days).
I suspect something is generally wrong with the work packets handed out since that date for Linux. (I have actually stopped supporting this project on Windows because the Virtualbox approach is far too resource-hungry, while the similar Linux tasks ran smoothly until August 11th.)

I checked all work packets and found that NONE of the many wingmen working on the companion tasks has returned a single valid task either.
That is the reason why I believe you need to check your work packets.
I have now suspended retrieving work packets until this issue has been resolved.
The machines returned most of this project's tasks properly before August 11th and currently work flawlessly for other DC projects in parallel. So it is no issue at my end.

Michael.
6) Message boards : Number crunching : Never ending tasks (Message 1016)
Posted 11 Aug 2020 by Michael H.W. Weber
Post:
there are random crashes... This is difficult to reproduce and therefore difficult to solve :(

Sorry, it is a machine delivering proper results most of the time, so there has to be some deterministic cause for this.

As a human observer I can immediately detect these broken WUs by simply inspecting the WU's "properties" panel in the BOINC manager GUI: These WUs - and ONLY these WUs - show a non-changing checkpointing interval of 0 seconds (sometimes a few seconds). It appears that these WUs are caught in a checkpoint writing loop.

BOINC is configured by default to write checkpoints every minute. If this default setting is kept for BOINC projects using Virtualbox-based VMs, trouble is inevitable. Reason: Writing a whole VM snapshot may take up the entire 1-minute interval reserved for computation (depending on the VM size). Hence, the task ends up writing checkpoints rather than computing. Total I/O overload.
For this reason I set my checkpoint interval to 3600 seconds (one checkpoint per hour). If a WU indicates that its last checkpoint was written 0 seconds ago and this indication persists for, say, more than 5 minutes, you can bet that it is a broken WU.
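
The arithmetic behind this can be sketched in a few lines. The function name and the 60-second snapshot duration are assumptions chosen to match the worst case described above; the real duration depends on the VM size.

```python
def checkpoint_overhead(snapshot_seconds: float, interval_seconds: float) -> float:
    """Fraction of wall-clock time spent writing snapshots, capped at 1.0."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return min(1.0, snapshot_seconds / interval_seconds)

# Assumed worst case: one snapshot takes a full minute to write.
default_interval = checkpoint_overhead(60, 60)     # checkpoint every minute
hourly_interval = checkpoint_overhead(60, 3600)    # checkpoint every hour
print(f"{default_interval:.1%} vs. {hourly_interval:.1%}")
```

Under that assumption the default interval leaves no time for computation at all, while the hourly interval costs less than 2 % of the run time.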

So, again: It is easy to detect these types of WUs for humans.

Bottom line:
What's required is a script which inspects the snapshot time stamp and ensures it is changing ONLY in the proper time intervals.
If not: automated WU abortion.
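
A minimal sketch of the detection logic such a script might use. Everything here is an assumption for illustration: the function name, the thresholds, and the idea that the samples come from polling the snapshot file's modification time (for example with `os.path.getmtime`). The actual abort step is left out.

```python
from typing import Iterable, Optional, Tuple

def looks_stuck(samples: Iterable[Tuple[float, float]],
                fresh_age: float = 10.0,
                max_fresh_duration: float = 300.0) -> bool:
    """samples: (observation_time, snapshot_mtime) pairs in seconds, in time order.

    Returns True if the snapshot kept looking freshly written (age below
    fresh_age) for longer than max_fresh_duration without interruption.
    """
    fresh_since: Optional[float] = None
    for now, mtime in samples:
        if now - mtime < fresh_age:
            if fresh_since is None:
                fresh_since = now
            if now - fresh_since > max_fresh_duration:
                return True
        else:
            fresh_since = None  # snapshot has aged normally, reset the timer
    return False

# Polled once per minute for ten minutes (hypothetical data).
healthy = [(float(t), 0.0) for t in range(0, 601, 60)]     # one checkpoint at t=0
stuck = [(float(t), t - 1.0) for t in range(0, 601, 60)]   # always "1 second ago"
print(looks_stuck(healthy), looks_stuck(stuck))
```

The 5-minute threshold mirrors the rule of thumb above; with an hourly checkpoint interval a healthy WU should never look freshly checkpointed for that long.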

Practical problem: the stderr.txt file found in the slot folder of a given WU does NOT indicate any issues with the checkpoint writing (just check the stderr.txt shown in my posting above and compare it to valid tasks). It is indistinguishable from that of properly running WUs. So, you need to check the system time stamp of the snapshot file, hoping that this one behaves unusually. Unfortunately, so far I have never taken the time to manually follow up on the state of this file. I think it might be worth taking a look at this as soon as such a task reappears on anyone's Windows system.

Michael.
7) Message boards : Number crunching : Never ending tasks (Message 1009)
Posted 6 Aug 2020 by Michael H.W. Weber
Post:
Today, I have again identified a task that ran for more than a day on my Intel i5 system without consuming any noticeable CPU time. And again it appears to have stalled before properly initializing the Virtualbox-based VM (which may be the core of the issue).
I just aborted this task to free up the CPU for correctly running WUs - it is in this packet and here is the error log:

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
2020-08-05 07:58:03 (5600): vboxwrapper (7.9.26200): starting
2020-08-05 07:58:03 (5600): Feature: Checkpoint interval offset (335 seconds)
2020-08-05 07:58:03 (5600): Feature: Enabling trickle-ups (Interval: 1800.000000)
2020-08-05 07:58:03 (5600): Detected: VirtualBox COM Interface (Version: 5.1.22)
2020-08-05 07:58:03 (5600): Detected: Minimum checkpoint interval (600.000000 seconds)
2020-08-05 07:58:03 (5600): Create VM. (boinc_1e975eb9682dd350, slot#2)
2020-08-05 07:58:03 (5600): Setting Memory Size for VM. (1900MB)
2020-08-05 07:58:03 (5600): Setting CPU Count for VM. (1)
2020-08-05 07:58:03 (5600): Setting Chipset Options for VM.
2020-08-05 07:58:03 (5600): Setting Boot Options for VM.
2020-08-05 07:58:03 (5600): Disabling VM Network Access.
2020-08-05 07:58:03 (5600): Setting Network Configuration for NAT.
2020-08-05 07:58:03 (5600): Disabling USB Support for VM.
2020-08-05 07:58:03 (5600): Disabling COM Port Support for VM.
2020-08-05 07:58:03 (5600): Disabling LPT Port Support for VM.
2020-08-05 07:58:03 (5600): Disabling Audio Support for VM.
2020-08-05 07:58:03 (5600): Disabling Clipboard Support for VM.
2020-08-05 07:58:03 (5600): Disabling Drag and Drop Support for VM.
2020-08-05 07:58:03 (5600): Adding storage controller(s) to VM.
2020-08-05 07:58:03 (5600): Adding virtual disk drive to VM. (vm_image.vdi)
2020-08-05 07:58:03 (5600): Adding VirtualBox Guest Additions to VM.
2020-08-05 07:58:03 (5600): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2020-08-05 07:58:03 (5600): Enabling shared directory for VM.
2020-08-05 07:58:03 (5600): Starting VM. (boinc_1e975eb9682dd350, slot#2)
2020-08-05 07:58:11 (5600): Guest Log: BIOS: VirtualBox 5.1.22

2020-08-05 07:58:11 (5600): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63

2020-08-05 07:58:11 (5600): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032

2020-08-05 07:58:11 (5600): Guest Log: BIOS: Booting from Hard Disk...

2020-08-05 07:58:11 (5600): Successfully started VM. (PID = '10956')
2020-08-05 07:58:11 (5600): Reporting VM Process ID to BOINC.
2020-08-05 07:58:16 (5600): Guest Log: BIOS: KBD: unsupported int 16h function 03

2020-08-05 07:58:16 (5600): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 

2020-08-05 07:58:16 (5600): VM state change detected. (old = 'poweroff', new = 'running')
2020-08-05 07:58:26 (5600): Preference change detected
2020-08-05 07:58:26 (5600): Setting CPU throttle for VM. (100%)
2020-08-05 07:58:26 (5600): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds))
2020-08-05 08:28:27 (5600): Status Report: Trickle-Up Event.
2020-08-05 08:58:29 (5600): Status Report: Trickle-Up Event.
2020-08-05 09:04:04 (5600): Creating new snapshot for VM.
2020-08-05 09:04:09 (5600): Checkpoint completed.
2020-08-05 09:28:30 (5600): Status Report: Trickle-Up Event.
2020-08-05 09:38:30 (5600): Status Report: Elapsed Time: '6004.861517'
2020-08-05 09:38:30 (5600): Status Report: CPU Time: '2.698817'
2020-08-05 09:58:33 (5600): Status Report: Trickle-Up Event.
2020-08-05 10:04:13 (5600): Creating new snapshot for VM.
2020-08-05 10:04:18 (5600): Deleting stale snapshot.
2020-08-05 10:04:19 (5600): Checkpoint completed.
2020-08-05 10:28:35 (5600): Status Report: Trickle-Up Event.
2020-08-05 10:58:37 (5600): Status Report: Trickle-Up Event.
2020-08-05 11:04:22 (5600): Creating new snapshot for VM.
2020-08-05 11:04:27 (5600): Deleting stale snapshot.
2020-08-05 11:04:28 (5600): Checkpoint completed.
2020-08-05 11:18:33 (5600): Status Report: Elapsed Time: '12007.549265'
2020-08-05 11:18:33 (5600): Status Report: CPU Time: '3.697224'
2020-08-05 11:28:38 (5600): Status Report: Trickle-Up Event.
2020-08-05 11:58:42 (5600): Status Report: Trickle-Up Event.
2020-08-05 12:04:32 (5600): Creating new snapshot for VM.
2020-08-05 12:04:37 (5600): Deleting stale snapshot.
2020-08-05 12:04:38 (5600): Checkpoint completed.
2020-08-05 12:28:44 (5600): Status Report: Trickle-Up Event.
2020-08-05 12:58:36 (5600): Status Report: Elapsed Time: '18010.281513'
2020-08-05 12:58:36 (5600): Status Report: CPU Time: '4.243227'
2020-08-05 12:58:46 (5600): Status Report: Trickle-Up Event.
2020-08-05 13:04:41 (5600): Creating new snapshot for VM.
2020-08-05 13:04:47 (5600): Deleting stale snapshot.
2020-08-05 13:04:47 (5600): Checkpoint completed.
2020-08-05 13:28:48 (5600): Status Report: Trickle-Up Event.
2020-08-05 13:58:49 (5600): Status Report: Trickle-Up Event.
2020-08-05 14:04:50 (5600): Creating new snapshot for VM.
2020-08-05 14:04:55 (5600): Deleting stale snapshot.
2020-08-05 14:04:56 (5600): Checkpoint completed.
2020-08-05 14:28:51 (5600): Status Report: Trickle-Up Event.
2020-08-05 14:38:37 (5600): Status Report: Elapsed Time: '24011.322550'
2020-08-05 14:38:37 (5600): Status Report: CPU Time: '5.335234'
2020-08-05 14:58:53 (5600): Status Report: Trickle-Up Event.
2020-08-05 15:04:59 (5600): Creating new snapshot for VM.
2020-08-05 15:05:04 (5600): Deleting stale snapshot.
2020-08-05 15:05:05 (5600): Checkpoint completed.
2020-08-05 15:28:55 (5600): Status Report: Trickle-Up Event.
2020-08-05 15:58:58 (5600): Status Report: Trickle-Up Event.
2020-08-05 16:05:08 (5600): Creating new snapshot for VM.
2020-08-05 16:05:13 (5600): Deleting stale snapshot.
2020-08-05 16:05:14 (5600): Checkpoint completed.
2020-08-05 16:18:39 (5600): Status Report: Elapsed Time: '30013.723759'
2020-08-05 16:18:39 (5600): Status Report: CPU Time: '6.458441'
2020-08-05 16:29:00 (5600): Status Report: Trickle-Up Event.
2020-08-05 16:59:03 (5600): Status Report: Trickle-Up Event.
2020-08-05 17:05:18 (5600): Creating new snapshot for VM.
2020-08-05 17:05:23 (5600): Deleting stale snapshot.
2020-08-05 17:05:24 (5600): Checkpoint completed.
2020-08-05 17:29:05 (5600): Status Report: Trickle-Up Event.
2020-08-05 17:58:44 (5600): Status Report: Elapsed Time: '36018.600281'
2020-08-05 17:58:44 (5600): Status Report: CPU Time: '7.020045'
2020-08-05 17:59:09 (5600): Status Report: Trickle-Up Event.
2020-08-05 18:05:29 (5600): Creating new snapshot for VM.
2020-08-05 18:05:35 (5600): Deleting stale snapshot.
2020-08-05 18:05:35 (5600): Checkpoint completed.
2020-08-05 18:29:11 (5600): Status Report: Trickle-Up Event.
2020-08-05 18:59:13 (5600): Status Report: Trickle-Up Event.
2020-08-05 19:05:38 (5600): Creating new snapshot for VM.
2020-08-05 19:05:44 (5600): Deleting stale snapshot.
2020-08-05 19:05:44 (5600): Checkpoint completed.
2020-08-05 19:29:15 (5600): Status Report: Trickle-Up Event.
2020-08-05 19:38:45 (5600): Status Report: Elapsed Time: '42019.959356'
2020-08-05 19:38:45 (5600): Status Report: CPU Time: '8.034052'
2020-08-05 19:59:19 (5600): Status Report: Trickle-Up Event.
2020-08-05 20:05:50 (5600): Creating new snapshot for VM.
2020-08-05 20:05:55 (5600): Deleting stale snapshot.
2020-08-05 20:05:56 (5600): Checkpoint completed.
2020-08-05 20:29:22 (5600): Status Report: Trickle-Up Event.
2020-08-05 20:59:25 (5600): Status Report: Trickle-Up Event.
2020-08-05 21:06:01 (5600): Creating new snapshot for VM.
2020-08-05 21:06:06 (5600): Deleting stale snapshot.
2020-08-05 21:06:06 (5600): Checkpoint completed.
2020-08-05 21:18:47 (5600): Status Report: Elapsed Time: '48021.486954'
2020-08-05 21:18:47 (5600): Status Report: CPU Time: '9.032458'
2020-08-05 21:29:28 (5600): Status Report: Trickle-Up Event.
2020-08-05 21:59:30 (5600): Status Report: Trickle-Up Event.
2020-08-05 22:06:10 (5600): Creating new snapshot for VM.
2020-08-05 22:06:16 (5600): Deleting stale snapshot.
2020-08-05 22:06:16 (5600): Checkpoint completed.
2020-08-05 22:29:32 (5600): Status Report: Trickle-Up Event.
2020-08-05 22:58:49 (5600): Status Report: Elapsed Time: '54023.607628'
2020-08-05 22:58:49 (5600): Status Report: CPU Time: '9.547261'
2020-08-05 22:59:34 (5600): Status Report: Trickle-Up Event.
2020-08-05 23:06:20 (5600): Creating new snapshot for VM.
2020-08-05 23:06:25 (5600): Deleting stale snapshot.
2020-08-05 23:06:25 (5600): Checkpoint completed.
2020-08-05 23:29:37 (5600): Status Report: Trickle-Up Event.
2020-08-05 23:59:39 (5600): Status Report: Trickle-Up Event.
2020-08-06 00:06:29 (5600): Creating new snapshot for VM.
2020-08-06 00:06:35 (5600): Deleting stale snapshot.
2020-08-06 00:06:35 (5600): Checkpoint completed.
2020-08-06 00:29:41 (5600): Status Report: Trickle-Up Event.
2020-08-06 00:38:52 (5600): Status Report: Elapsed Time: '60026.398884'
2020-08-06 00:38:52 (5600): Status Report: CPU Time: '10.592468'
2020-08-06 00:59:44 (5600): Status Report: Trickle-Up Event.
2020-08-06 01:06:39 (5600): Creating new snapshot for VM.
2020-08-06 01:06:45 (5600): Deleting stale snapshot.
2020-08-06 01:06:45 (5600): Checkpoint completed.
2020-08-06 01:29:46 (5600): Status Report: Trickle-Up Event.
2020-08-06 01:59:49 (5600): Status Report: Trickle-Up Event.
2020-08-06 02:06:50 (5600): Creating new snapshot for VM.
2020-08-06 02:06:55 (5600): Deleting stale snapshot.
2020-08-06 02:06:55 (5600): Checkpoint completed.
2020-08-06 02:18:56 (5600): Status Report: Elapsed Time: '66030.553316'
2020-08-06 02:18:56 (5600): Status Report: CPU Time: '11.559674'
2020-08-06 02:29:52 (5600): Status Report: Trickle-Up Event.
2020-08-06 02:59:54 (5600): Status Report: Trickle-Up Event.
2020-08-06 03:06:59 (5600): Creating new snapshot for VM.
2020-08-06 03:07:05 (5600): Deleting stale snapshot.
2020-08-06 03:07:05 (5600): Checkpoint completed.
2020-08-06 03:29:56 (5600): Status Report: Trickle-Up Event.
2020-08-06 03:58:58 (5600): Status Report: Elapsed Time: '72032.345943'
2020-08-06 03:58:58 (5600): Status Report: CPU Time: '12.152478'
2020-08-06 03:59:58 (5600): Status Report: Trickle-Up Event.
2020-08-06 04:07:08 (5600): Creating new snapshot for VM.
2020-08-06 04:07:14 (5600): Deleting stale snapshot.
2020-08-06 04:07:14 (5600): Checkpoint completed.
2020-08-06 04:30:00 (5600): Status Report: Trickle-Up Event.
2020-08-06 05:00:02 (5600): Status Report: Trickle-Up Event.
2020-08-06 05:07:18 (5600): Creating new snapshot for VM.
2020-08-06 05:07:23 (5600): Deleting stale snapshot.
2020-08-06 05:07:24 (5600): Checkpoint completed.
2020-08-06 05:30:05 (5600): Status Report: Trickle-Up Event.
2020-08-06 05:39:00 (5600): Status Report: Elapsed Time: '78034.852169'
2020-08-06 05:39:00 (5600): Status Report: CPU Time: '13.197685'
2020-08-06 06:00:09 (5600): Status Report: Trickle-Up Event.
2020-08-06 06:07:29 (5600): Creating new snapshot for VM.
2020-08-06 06:07:34 (5600): Deleting stale snapshot.
2020-08-06 06:07:35 (5600): Checkpoint completed.
2020-08-06 06:30:11 (5600): Status Report: Trickle-Up Event.
2020-08-06 07:00:14 (5600): Status Report: Trickle-Up Event.
2020-08-06 07:07:39 (5600): Creating new snapshot for VM.
2020-08-06 07:07:45 (5600): Deleting stale snapshot.
2020-08-06 07:07:45 (5600): Checkpoint completed.
2020-08-06 07:19:01 (5600): Status Report: Elapsed Time: '84035.509655'
2020-08-06 07:19:01 (5600): Status Report: CPU Time: '14.211691'
2020-08-06 07:30:17 (5600): Status Report: Trickle-Up Event.
2020-08-06 08:00:19 (5600): Status Report: Trickle-Up Event.
2020-08-06 08:07:50 (5600): Creating new snapshot for VM.
2020-08-06 08:07:55 (5600): Deleting stale snapshot.
2020-08-06 08:07:56 (5600): Checkpoint completed.
2020-08-06 08:30:22 (5600): Status Report: Trickle-Up Event.
2020-08-06 08:59:06 (5600): Status Report: Elapsed Time: '90040.259159'
2020-08-06 08:59:06 (5600): Status Report: CPU Time: '14.710894'
2020-08-06 09:00:26 (5600): Status Report: Trickle-Up Event.
2020-08-06 09:08:01 (5600): Creating new snapshot for VM.
2020-08-06 09:08:07 (5600): Deleting stale snapshot.
2020-08-06 09:08:07 (5600): Checkpoint completed.
2020-08-06 09:30:29 (5600): Status Report: Trickle-Up Event.
2020-08-06 10:00:31 (5600): Status Report: Trickle-Up Event.
2020-08-06 10:08:12 (5600): Creating new snapshot for VM.
2020-08-06 10:08:17 (5600): Deleting stale snapshot.
2020-08-06 10:08:18 (5600): Checkpoint completed.
2020-08-06 10:30:34 (5600): Status Report: Trickle-Up Event.
2020-08-06 10:39:10 (5600): Status Report: Elapsed Time: '96044.118551'
2020-08-06 10:39:10 (5600): Status Report: CPU Time: '15.678100'
2020-08-06 11:00:36 (5600): Status Report: Trickle-Up Event.
2020-08-06 11:08:22 (5600): Creating new snapshot for VM.
2020-08-06 11:08:27 (5600): Deleting stale snapshot.
2020-08-06 11:08:28 (5600): Checkpoint completed.
2020-08-06 11:30:39 (5600): Status Report: Trickle-Up Event.
2020-08-06 11:33:59 (5600): Powering off VM.
2020-08-06 11:34:01 (5600): Successfully stopped VM.
2020-08-06 11:34:06 (5600): Deregistering VM. (boinc_1e975eb9682dd350, slot#2)
2020-08-06 11:34:06 (5600): Removing virtual disk drive(s) from VM.
2020-08-06 11:34:06 (5600): Removing network bandwidth throttle group from VM.
2020-08-06 11:34:06 (5600): Removing storage controller(s) from VM.
2020-08-06 11:34:06 (5600): Removing VM from VirtualBox.

    Hypervisor System Log:

30:18:37.161964          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:38.630150          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:41.380499          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:42.164599          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:43.632785          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.540655          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.582660          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.582660          ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={0169423f-46b4-cde9-91af-1e9d5b6cd945} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_596cdf7fa7e15082'}, preserve=false aResultDetail=0
30:18:46.663170          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.663170          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.663670          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.663670          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.691174          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:46.692174          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:47.168234          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:48.635921          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:49.612045          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0
30:18:49.612545          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.171370          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.548418          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.557419 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.565920 DeleteSnap ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=true  aResultDetail=0
30:18:52.581922          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.582422          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.596924          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:52.603425          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
30:18:52.621927          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={The object is not ready}, preserve=false aResultDetail=0
30:18:52.621927          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
30:18:53.639056          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:54.622181          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:57.175005          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:58.641691          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:18:59.627317          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0
30:19:02.177640          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={b2547866-a0a1-4391-8b86-6952d82efaa0} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0

    VM Execution Log:


    VM Startup Log:


    VM Trace Log:

11:34:16 (5600): called boinc_finish(194)

</stderr_txt>
]]>

Please note that the same machine has completed many validated tasks (actually most of which I have returned so far).
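
How little CPU time this task consumed can be read directly from the "Status Report" lines in the log above. A small sketch, assuming the log is available as plain text; the function name is made up for illustration and the two sample lines are copied from the last status report.

```python
import re

STATUS = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")

def cpu_utilisation(log_text: str) -> float:
    """Return the last reported CPU time divided by the last reported elapsed time."""
    last = {}
    for kind, value in STATUS.findall(log_text):
        last[kind] = float(value)
    return last["CPU Time"] / last["Elapsed Time"]

sample = """\
2020-08-06 10:39:10 (5600): Status Report: Elapsed Time: '96044.118551'
2020-08-06 10:39:10 (5600): Status Report: CPU Time: '15.678100'
"""
ratio = cpu_utilisation(sample)
print(f"CPU utilisation: {ratio:.4%}")
```

Roughly 16 seconds of CPU time in more than 26 hours of elapsed time is a utilisation of about 0.02 %, which fits a VM that never started computing.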

Michael.
8) Message boards : Number crunching : Credits not proportional to compute effort (Message 1008)
Posted 6 Aug 2020 by Michael H.W. Weber
Post:
I observed that although on a given Linux system the run time for a given WU type varies up to 200% (referring to the fastest task completion), the credits are fixed (200 credits for the regular and 5000 for the long runners). This is true for both types of WUs (long and regular).

For Windows it is the same with the normal Virtualbox WUs.

So, the virtual credits do not represent the invested compute efforts - is there maybe a possibility for improvement some time in the future?
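
To illustrate the effect: with a fixed credit per WU, the credit earned per hour drops as the run time grows. The run times below are hypothetical; only the 200 credits per regular WU are taken from the observation above.

```python
def credits_per_hour(credits: float, runtime_hours: float) -> float:
    """Credit rate for one task with a fixed credit award."""
    return credits / runtime_hours

# Hypothetical run times: fastest task 1 hour, slowest task 3 hours.
fast = credits_per_hour(200, 1.0)
slow = credits_per_hour(200, 3.0)
print(f"{fast:.1f} vs. {slow:.1f} credits/hour")
```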

Michael.
9) Message boards : Number crunching : RAM requirements per WU (Message 1007)
Posted 6 Aug 2020 by Michael H.W. Weber
Post:
Well, so far I have never seen more than 250 MB of RAM used on Linux systems - regardless of the task type (long/regular).

Michael.
10) Message boards : Number crunching : RAM requirements per WU (Message 1004)
Posted 4 Aug 2020 by Michael H.W. Weber
Post:
Is there an overview of the RAM requirements for the long vs. regular WU types under Linux / Windows?
Is it correct that the Linux client does not require Virtualbox?

Michael.
11) Message boards : Number crunching : Tasks incorrectly marked as invalid: Please check validation rules (Message 823)
Posted 28 Apr 2020 by Michael H.W. Weber
Post:
A plain rounding issue could be fatal.
Maybe contact Prof. Gernot Frenking, theoretical chemist at Philipps-University of Marburg/Germany.

Michael.
12) Message boards : Number crunching : Tasks incorrectly marked as invalid: Please check validation rules (Message 788)
Posted 18 Apr 2020 by Michael H.W. Weber
Post:
I explained this a lot to the different project participants at the beginning of the project. You can't know in advance if the calculation will converge and how long it will take. The consequence is unpredictable calculation times and invalid tasks when the chemistry doesn't work. This is how we explore the frontier between combinatorial chemistry and what is theoretically possible and stable. Scientifically, the invalid tasks are very important because they will allow us to train models (Artificial Intelligence) which will allow us in the future to predict the validity or not of the molecules studied with a good precision we hope. At the moment, we are already using these data and the results are encouraging.

Now I understand. Thanks.

But then there is still this VM configuration issue (never ending tasks) for which I posted a few error log notes here.

Michael.
13) Message boards : Number crunching : Never ending tasks (Message 787)
Posted 18 Apr 2020 by Michael H.W. Weber
Post:
All the manually aborted tasks (last row: errors) belong to the "never ending task issue":

Laptop: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1969
Desktop: https://quchempedia.univ-angers.fr/athome/results.php?hostid=2004

Note that there are two different types:
One with CPU time zero (yielding a virtually empty log) and one with CPU time (plus a log).
The latter may be investigated in more detail.

Michael.
14) Message boards : Number crunching : Never ending tasks (Message 786)
Posted 18 Apr 2020 by Michael H.W. Weber
Post:
I have taken a quick look into these problematic tasks which do not finish.
My conclusion: It does not seem to be a problem of not finishing; rather, it appears that these Virtualbox-based VMs are not initializing properly. Please check the logs and the Virtualbox manual in detail: There are numerous error messages given. Here are a few examples from:

https://quchempedia.univ-angers.fr/athome/result.php?resultid=2268100

...

Hypervisor System Log:

04:57:37.111095          Saving settings file "C:\ProgramData\BOINC\slots\1\boinc_b6cb3843252a5eb9\boinc_b6cb3843252a5eb9.vbox" with version "1.16-windows"
04:57:39.759996          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0

...

05:02:56.055177          ERROR [COM]: aRC=E_FAIL (0x80004005) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={SessionMachine} aText={This machine does not have any snapshots}, preserve=false aResultDetail=0

...

05:02:56.070793          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
05:02:56.071771          Saving settings file "C:\Users\weber\.VirtualBox\VirtualBox.xml" with version "1.12-windows"
05:02:56.076649          ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={The object is not ready}, preserve=false aResultDetail=0
05:02:56.076649          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={4afe423b-43e0-e9d0-82e8-ceb307940dda} aComponent={MediumWrap} aText={Medium 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' is locked for reading by another task}, preserve=false aResultDetail=0
05:03:08.894448          ERROR [COM]: aRC=VBOX_E_OBJECT_NOT_FOUND (0x80bb0001) aIID={9570b9d5-f1a1-448a-10c5-e12f5285adad} aComponent={VirtualBoxWrap} aText={Could not find a registered machine named 'boinc_68a02bed83281783'}, preserve=false aResultDetail=0
0

...

05:03:11.538433          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0

...
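To get an overview of which errors dominate such a log, the ERROR [COM] lines can be tallied per result code and component. The following is only a minimal sketch in Python: the function name tally_com_errors and the regular expression are my own assumptions, derived solely from the line format quoted above, and the sample consists of three of the lines shown in this post.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the "ERROR [COM]" lines quoted above.
ERROR_RE = re.compile(
    r"ERROR \[COM\]: aRC=(?P<rc>\w+) \((?P<code>0x[0-9a-fA-F]+)\).*?"
    r"aComponent=\{(?P<component>[^}]*)\}"
)


def tally_com_errors(log_text):
    """Count ERROR [COM] lines per (result code, component) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            counts[(match.group("rc"), match.group("component"))] += 1
    return counts


# Three lines copied from the hypervisor log excerpt above.
sample = """\
04:57:39.759996          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0
05:02:56.055177          ERROR [COM]: aRC=E_FAIL (0x80004005) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={SessionMachine} aText={This machine does not have any snapshots}, preserve=false aResultDetail=0
05:03:11.538433          ERROR [COM]: aRC=VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The given session is busy}, preserve=false aResultDetail=0
"""

counts = tally_com_errors(sample)
for (rc, component), n in counts.most_common():
    print(f"{n} x {rc} in {component}")
```

Running the same tally over the logs of a failed and a successful task might show quickly which errors are specific to the tasks that never start.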

When I checked the running tasks by selecting them in the BOINC manager, right-clicking and opening the properties, both CPU time and CPU time after last checkpoint were zero. So, computation never starts? That would be in line with the observation that these tasks do not appear to consume CPU load.
For correctly running VM tasks, CPU time is almost identical to CPU time after last checkpoint. That is also rather strange, given that I set all my systems to write checkpoints to disk only once every hour (it is known that if the checkpoint interval is set too short, Virtualbox causes I/O issues, depending on the VM size of course).

Michael.

P.S.: You might also consider using Docker instead of Virtualbox. The latter nowadays appears to be a "heavyweight" compared to the former.
(A curious suggestion, coming from the person who, to the best of my knowledge, first proposed using VMs as a universal checkpointing option for BOINC scientific apps during the Barcelona BOINC Workshop in 2009, where I presented RNA World for the first time on behalf of Rechenkraft.net - and CERN independently even presented a first implementation of that idea at the same conference). ;-)
15) Message boards : Number crunching : Checkpoint? (Message 785)
Posted 18 Apr 2020 by Michael H.W. Weber
Post:
Yes, and that's usually pretty decent.

Excellent.

Michael.
16) Message boards : Number crunching : Checkpoint? (Message 779)
Posted 17 Apr 2020 by Michael H.W. Weber
Post:
This option looks nice. We have to make some tests to validate the behaviour!
I will add this to the todo list.

Has this built-in checkpointing ability been enabled for the Linux native apps by now?

Michael.
17) Message boards : Number crunching : Tasks incorrectly marked as invalid: Please check validation rules (Message 778)
Posted 17 Apr 2020 by Michael H.W. Weber
Post:
I think a first step would be to cleanly separate the validation errors from the execution errors and then see how, for example, the tasks in the execution error group differ from the successfully completed ones. Do these have a greater RAM requirement, etc.?

The fact that the same machines also return valid tasks again suggests to me that there is no hardware issue.

Michael.
18) Message boards : Number crunching : Tasks incorrectly marked as invalid: Please check validation rules (Message 777)
Posted 17 Apr 2020 by Michael H.W. Weber
Post:
There are execution and validation errors. Your workunits have been marked as invalid because as soon as the task is returned, without comparison, it is possible to see that the calculation went wrong.

If that is the case, then you need to really check what is going wrong in detail - and here is why:

In my case, of 307 tasks processed on two Windows machines in total, at present 7 had to be aborted manually because they would otherwise run indefinitely, and another 72 tasks (!) are marked as invalid due to the two issues we are discussing here (execution AND validation errors, as you said above). That is a failure rate of approx. 25%, meaning that a quarter of our compute time by definition goes to waste (and a quarter of our electricity heats up the air for nothing).
If that were a problem at my end, I would probably no longer run my machines, but that is certainly NOT the case: I have been running all my machines 24/7 for distributed computing projects since 2001. Doing that, I have acquired some experience. The two machines in question here run all other currently active DC projects without producing errors, so there must be some sort of issue with your client.
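For the record, the failure rate quoted above follows directly from the three task counts. A minimal sketch in Python, using only the numbers given in this post (the variable names are mine):

```python
# Task counts as stated in the post above.
total_tasks = 307
aborted = 7    # manually aborted tasks
invalid = 72   # tasks marked as invalid

failed = aborted + invalid
failure_rate = failed / total_tasks

print(f"{failed} of {total_tasks} tasks failed: {failure_rate:.1%}")
# prints "79 of 307 tasks failed: 25.7%"
```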

Your project is rather new and it is known that sometimes there are issues with virtualbox (again, knowing this, I do not run virtualbox on eight of the possible cores but only on a fraction of that, and there are no other DC projects or applications worth mentioning running in parallel). So, I think it is normal that your project has some issues at this stage. But you will need to acquire some expertise in identifying exactly what is going wrong. Usually, the DC community is excellently suited to help find this out. Try to test different Virtualbox versions (choosing the correct version has been a solution for several other DC project issues with virtualbox). Is it possible that the RAM size of the virtualbox environment is set too small in some cases? Check rounding inconsistencies when validating tasks (AMD vs. AMD, Intel vs. Intel, not cross-validation - probably long known by you). Ask people to run some of the invalid tasks on their machines for comparison of the environment. You may even deploy hardware testing apps in the beta testing section and ask specific people to participate here if you really suspect a machine issue.
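To illustrate the rounding point: a validator that compares floating-point results bit for bit will reject results that differ only in the last digits, whereas a comparison with a tolerance will accept them. The following is just a sketch in Python, not your project's validator; the function name results_agree, the example values and the tolerance values are assumptions of mine and would have to be chosen for the actual quantities being computed.

```python
import math


def results_agree(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if two result vectors match within a tolerance.

    rel_tol and abs_tol are illustrative defaults, not project values.
    """
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# Hypothetical numbers: two hosts returning the same values up to rounding.
host_a = [-76.4089123456, 0.0012345678]
host_b = [-76.4089123457, 0.0012345679]

print(host_a == host_b)               # exact comparison
print(results_agree(host_a, host_b))  # tolerant comparison
```

If two hosts disagree under the exact comparison but agree under the tolerant one, that would point to rounding differences rather than to a faulty computation.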

These are just a few ideas, of course. I do not want to rant about the problems. I like your project and I would just like to help you make it better by reporting all inconsistencies I can find. A 25% failure rate is not a nice thing to have. You can assume that most other participants, at least those who have been supporting DC for a longer period of time, will be helpful.

Michael.
19) Message boards : Number crunching : Differences in logs of valid tasks (Message 776)
Posted 17 Apr 2020 by Michael H.W. Weber
Post:
My core question was: why are the logs so far from being virtually identical, given that these machines processed the same task?

Michael.
20) Message boards : Number crunching : Never ending tasks (Message 775)
Posted 17 Apr 2020 by Michael H.W. Weber
Post:
The vbox wrapper didn't turn off properly. I couldn't compile a better wrapper, so I'm using the official from Berkeley.

Well, it appears not to solve the issue, right?
It might help if you provide the full procedure for creating your wrapper: In my experience, there are usually people around who might have more expertise with this than us. ;-)

Michael.

©2024 Benoit DA MOTA - LERIA, University of Angers, France