ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time

Greg_BE

Joined: 2 Oct 21
Posts: 24
Credit: 68,200
RAC: 0
Message 1638 - Posted: 13 Jan 2022, 23:00:34 UTC
Last modified: 13 Jan 2022, 23:01:39 UTC

2022-01-13 20:56:42 (16248): VM state change detected. (old = 'poweroff', new = 'running')
2022-01-13 20:56:47 (16248): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds

2022-01-13 20:56:47 (16248): Guest Log: vboxguest: misc device minor 59, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)

2022-01-13 20:56:52 (16248): Preference change detected
2022-01-13 20:56:52 (16248): Setting CPU throttle for VM. (100%)
2022-01-13 20:56:53 (16248): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 600 seconds))
2022-01-13 20:57:08 (16248): Guest Log: vboxsf: g_fHostFeatures=0x8000000f g_fSfFeatures=0x1 g_uSfLastFunction=29

2022-01-13 21:13:07 (16248): Creating new snapshot for VM.
2022-01-13 21:13:16 (16248): Checkpoint completed.
2022-01-13 21:19:46 (16248): ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time.
2022-01-13 21:19:46 (16248): Powering off VM.
2022-01-13 21:19:46 (16248): Successfully stopped VM.


Why? If it's running on just one core, then it should be fine.
Only RAH Python and LHC ATLAS are using Vbox as well.
RAH 1 core and ATLAS 4 cores.
Can't Vbox handle running 6 cores at once?
If I close Boinc Mgr and restart it, the task will restart and run fine to completion.

I've got 49 Gigs of memory just to handle all these big projects and I am only using 44% without QuChem. I haven't seen the combined usage even approach 50%.

So what else is going on?
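
One way to narrow this down is to measure how long the VM had been running, and how long it had been since the last checkpoint, when the error fired. Below is a minimal Python sketch that does this for a vboxwrapper log in the format shown above. The function names (parse_lines, summarize) are made up for illustration, and the line format is assumed from this one excerpt only; it is not a tool shipped with BOINC.

```python
import re
from datetime import datetime

# Matches vboxwrapper log lines such as:
#   2022-01-13 21:19:46 (16248): Powering off VM.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")
LOST_COMM = "Vboxwrapper lost communication with VirtualBox"


def parse_lines(text):
    """Yield (timestamp, pid, message) for every well-formed log line."""
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            yield ts, int(m.group(2)), m.group(3)


def summarize(text):
    """Return timing info for the first lost-communication error, or None."""
    started = None
    last_checkpoint = None
    for ts, _pid, msg in parse_lines(text):
        if "new = 'running'" in msg:
            started = ts
        elif msg.startswith("Checkpoint completed"):
            last_checkpoint = ts
        elif LOST_COMM in msg:
            return {
                "error_at": ts,
                "since_start_s": (ts - started).total_seconds() if started else None,
                "since_checkpoint_s": (
                    (ts - last_checkpoint).total_seconds() if last_checkpoint else None
                ),
            }
    return None
```

Pass summarize() the text of a task's stderr output. For the excerpt above, the error comes 390 seconds after "Checkpoint completed.", so the snapshot itself had finished several minutes before communication was lost.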
ID: 1638 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 1649 - Posted: 11 Feb 2022, 6:50:32 UTC
Last modified: 11 Feb 2022, 7:01:38 UTC

I am running QuChem on Linux, therefore am not observing this here. But I got the same occasionally at Cosmology@home with the "camb_boinc2docker" application, and very frequently at Rosetta@home with the "rosetta python projects" application. (I've got Vbox 6.1.28, that's apparently a factor for the frequency of such events.)

I suspect that vboxwrapper simply doesn't cope with the large latencies which a Vbox VM can sometimes exhibit. IOW my guess is that someone set a timeout too small somewhere.

I am currently running "rosetta python projects" (merely 16 or fewer tasks at once on a computer with plenty of cores and 256 GB RAM) and am restarting the boinc client twice a day. Otherwise the client would run out of work eventually, since it does not request new work as long as there is one or more "postponed" task in the buffer. :-(
ID: 1649 · Rating: 0
Greg_BE

Joined: 2 Oct 21
Posts: 24
Credit: 68,200
RAC: 0
Message 1650 - Posted: 11 Feb 2022, 23:37:30 UTC - in response to Message 1649.  

I am running QuChem on Linux, therefore am not observing this here. But I got the same occasionally at Cosmology@home with the "camb_boinc2docker" application, and very frequently at Rosetta@home with the "rosetta python projects" application. (I've got Vbox 6.1.28, that's apparently a factor for the frequency of such events.)

I suspect that vboxwrapper simply doesn't cope with the large latencies which a Vbox VM can sometimes exhibit. IOW my guess is that someone set a timeout too small somewhere.

I am currently running "rosetta python projects" (merely 16 or fewer tasks at once on a computer with plenty of cores and 256 GB RAM) and am restarting the boinc client twice a day. Otherwise the client would run out of work eventually, since it does not request new work as long as there is one or more "postponed" task in the buffer. :-(



I am an old-time Rosetta cruncher. At times Python stuffs 12 or more tasks on my system.
Maybe that is when QuChem crashes. I have never really paid attention.
Python uses up a ton of resources in all the key elements, so that may be a sign.
ID: 1650 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 1652 - Posted: 12 Feb 2022, 7:21:12 UTC

I found the following idea via the Rosetta@home message board, originally posted by @computezrmle at the Cosmology@home message board:
http://www.cosmologyathome.org/forum_thread.php?id=7769&postid=22921

On Dec 5 2021 computezrmle wrote:
Volunteers frequently affected by the postponed issue may try a different vboxwrapper.

BOINC's wiki pages mention communication problems between vboxwrapper and VirtualBox 6.x, especially on Windows.
They offer premade executables that may solve the problems:
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables

It would be the job of the project developers to test those vboxwrappers and distribute them to the clients.
As long as this is not done volunteers could use the following steps as a workaround:

1. Download an alternative vboxwrapper from the page mentioned above (or use one you got from another project, e.g. LHC@home)
2. Start the BOINC client but suspend computing
3. Change to the project directory, e.g. projects/www.cosmologyathome.org, and replace the vboxwrapper there with the test version; the filename must be the name of the old vboxwrapper
4. Resume computing -> check the logfiles of tasks started after the patch


Each restart of the BOINC client will replace the patch with the original vboxwrapper from the project server.
This can be avoided by setting <dont_check_file_sizes>1</dont_check_file_sizes> in cc_config.xml, but then all other automatic updates will also not work.
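
For the cc_config.xml part, here is a minimal Python sketch that adds the flag to an existing config text (or builds a new one) without touching other settings. It assumes the flag belongs in the options section under the cc_config root, which is where the BOINC client configuration documentation lists it as far as I know. The function name set_dont_check_file_sizes is made up for illustration.

```python
import xml.etree.ElementTree as ET


def set_dont_check_file_sizes(xml_text, enabled=True):
    """Return cc_config.xml text with dont_check_file_sizes set under options.

    An empty xml_text produces a fresh config with only this flag.
    Assumption: the flag lives in the options section of cc_config.
    """
    root = ET.fromstring(xml_text) if xml_text.strip() else ET.Element("cc_config")
    if root.tag != "cc_config":
        raise ValueError("expected a cc_config root element")

    # Reuse the existing options section so other settings are preserved.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1" if enabled else "0"

    return ET.tostring(root, encoding="unicode")
```

Read the current cc_config.xml from the BOINC data directory, pass its text through the function, and write the result back. As noted above, the flag turns off the file-size checks for all projects, so it is worth setting it back to 0 once the test is over.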

I haven't tried this myself yet.
ID: 1652 · Rating: 0
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1653 - Posted: 12 Feb 2022, 12:22:23 UTC - in response to Message 1652.  
Last modified: 12 Feb 2022, 12:36:44 UTC

I haven't tried this myself yet.

I have. BOINC does a "Signature verification" and won't accept the new wrapper.
But maybe you should try it and see if you can make it work.
ID: 1653 · Rating: 0
Greg_BE

Joined: 2 Oct 21
Posts: 24
Credit: 68,200
RAC: 0
Message 1654 - Posted: 12 Feb 2022, 17:44:05 UTC - in response to Message 1652.  

I found the following idea via the Rosetta@home message board, originally posted by @computezrmle at the Cosmology@home message board:
http://www.cosmologyathome.org/forum_thread.php?id=7769&postid=22921

On Dec 5 2021 computezrmle wrote:
Volunteers frequently affected by the postponed issue may try a different vboxwrapper.

BOINC's wiki pages mention communication problems between vboxwrapper and VirtualBox 6.x, especially on Windows.
They offer premade executables that may solve the problems:
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables

It would be the job of the project developers to test those vboxwrappers and distribute them to the clients.
As long as this is not done volunteers could use the following steps as a workaround:

1. Download an alternative vboxwrapper from the page mentioned above (or use one you got from another project, e.g. LHC@home)
2. Start the BOINC client but suspend computing
3. Change to the project directory, e.g. projects/www.cosmologyathome.org, and replace the vboxwrapper there with the test version; the filename must be the name of the old vboxwrapper
4. Resume computing -> check the logfiles of tasks started after the patch


Each restart of the BOINC client will replace the patch with the original vboxwrapper from the project server.
This can be avoided by setting <dont_check_file_sizes>1</dont_check_file_sizes> in cc_config.xml, but then all other automatic updates will also not work.

I haven't tried this myself yet.


Not worth the hassle. Every night when I go to bed I shut down the system, so I would have to redo this every morning? Forget it. And if other projects have issues, I would have to do it again? Nah... if QuChem can't figure this out, then they get the data back when I notice the problem or when BOINC restarts the next day.
ID: 1654 · Rating: 0
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1701 - Posted: 9 Mar 2022, 8:39:48 UTC
Last modified: 9 Mar 2022, 8:45:51 UTC

Yesterday, I finally managed to attach to this project.
Since then, I've had several such cases with the "postponed" issue.
The even worse thing, though, is: as long as the faulty task is not removed manually, no new tasks are being downloaded. In the BOINC event log it says "...don't need", regardless of how big the buffer is in the settings (even several days).
So, in each such "postponed" case one needs to abort the task manually; only then can new tasks be downloaded. Which is nonsense, of course.
Hope that the project people can iron this problem out ASAP.
ID: 1701 · Rating: 0
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1702 - Posted: 9 Mar 2022, 9:04:06 UTC - in response to Message 1654.  

It would be the job of the project developers to test those vboxwrappers and distribute them to the clients.
As long as this is not done volunteers could use the following steps as a workaround:

1. Download an alternative vboxwrapper from the page mentioned above (or use one you got from another project, e.g. LHC@home)
2. Start the BOINC client but suspend computing
3. Change to the project directory, e.g. projects/www.cosmologyathome.org, and replace the vboxwrapper there with the test version; the filename must be the name of the old vboxwrapper
4. Resume computing -> check the logfiles of tasks started after the patch

The vboxwrapper used here is vboxwrapper_26200_windows_x86_64.exe. The one from the link above is newer and is the same one used by LHC: vboxwrapper_26203_windows_x86_64.exe.
So after replacing the 26200 version with the 26203 version, the newer one needs to be renamed to read 26200?
ID: 1702 · Rating: 0
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1703 - Posted: 9 Mar 2022, 11:55:33 UTC - in response to Message 1702.  

I have now exchanged the vboxwrapper file as described above, i.e. vboxwrapper_26203_windows_x86_64.exe is working under the name vboxwrapper_26200_windows_x86_64.exe, and three tasks have begun being processed "normally".
So I will find out soon whether this helps to eliminate the "postponed" problem or not.
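
For anyone repeating the swap, here is a minimal Python sketch of the copy-under-the-old-name step, with a one-time backup of the project-supplied file. The function name swap_wrapper and the ".orig" backup suffix are made up for illustration, and the paths in the usage comment are placeholders. Run it only while computing is suspended. Note the report earlier in the thread that BOINC's signature verification may reject the replaced file, so this may not work on every setup.

```python
import shutil
from pathlib import Path


def swap_wrapper(project_dir, old_name, new_wrapper, backup_suffix=".orig"):
    """Copy new_wrapper into project_dir under old_name.

    The existing file is backed up once (old_name + backup_suffix) so the
    project-supplied wrapper can be restored later. Returns the backup path.
    """
    target = Path(project_dir) / old_name
    if not target.is_file():
        raise FileNotFoundError(f"no existing wrapper at {target}")

    backup = target.with_name(target.name + backup_suffix)
    if not backup.exists():
        shutil.copy2(target, backup)  # keep the original only on the first swap

    shutil.copy2(new_wrapper, target)  # the newer binary now carries the old name
    return backup


# Usage with placeholder paths (adjust to your own BOINC data directory):
# swap_wrapper(
#     "path/to/BOINC/projects/<project directory>",
#     "vboxwrapper_26200_windows_x86_64.exe",
#     "path/to/downloads/vboxwrapper_26203_windows_x86_64.exe",
# )
```

Copying instead of moving leaves the downloaded 26203 file in place, which helps if the client restores the original wrapper on its next restart and the swap has to be repeated.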
ID: 1703 · Rating: 0


©2024 Benoit DA MOTA - LERIA, University of Angers, France