VM job unmanageable

zombie67 [MM]
Joined: 3 Oct 19
Posts: 11
Credit: 5,443,793
RAC: 0
Message 10 - Posted: 4 Oct 2019, 5:50:33 UTC
Last modified: 4 Oct 2019, 5:52:07 UTC

On my Windows machines, I am getting many tasks stuck in the "VM job unmanageable" state. It requires quitting and restarting BOINC to get them unstuck. This is with the latest versions of BOINC and vbox, and with version 0.07 of the application. I notice there is a new 0.08 version. Does this new version fix the issue?

Edit: I also found a task in the same state on one of my Macs. So the problem is not limited to Windows.
Reno, NV
Team: SETI.USA
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 13 - Posted: 4 Oct 2019, 7:33:02 UTC - in response to Message 10.  

Same on my computer (Windows), and I have already seen this problem on other projects. It is probably a known VirtualBox issue. For me, two strategies improve things:
    - keeping suspended tasks in memory
    - no task interruption (e.g. only one project, or a very long switching interval, such as several days)
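
For anyone who prefers a file over the Manager dialogs, here is a minimal sketch of how the two settings above could be written to a `global_prefs_override.xml`. It assumes the standard BOINC preference tags `leave_apps_in_memory` and `cpu_scheduling_period_minutes`; the helper name `build_override` and the 3-day default are only illustrations, not a confirmed fix.

```python
import xml.etree.ElementTree as ET


def build_override(switch_minutes: int = 3 * 24 * 60) -> str:
    """Return text for global_prefs_override.xml (hypothetical helper).

    leave_apps_in_memory=1 keeps suspended tasks in RAM;
    cpu_scheduling_period_minutes is the "switch between tasks every" interval.
    """
    if switch_minutes < 1:
        raise ValueError("switch_minutes must be at least 1")
    text = (
        "<global_preferences>\n"
        "   <leave_apps_in_memory>1</leave_apps_in_memory>\n"
        f"   <cpu_scheduling_period_minutes>{switch_minutes}</cpu_scheduling_period_minutes>\n"
        "</global_preferences>\n"
    )
    ET.fromstring(text)  # sanity check: raises ParseError if the XML is malformed
    return text


if __name__ == "__main__":
    # Print the file content; save it in the BOINC data directory yourself.
    print(build_override())
```

The file goes in the BOINC data directory, and the client should pick it up after a restart or after `boinccmd --read_global_prefs_override`.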

zombie67 [MM]
Message 24 - Posted: 4 Oct 2019, 14:30:39 UTC
Last modified: 4 Oct 2019, 14:38:33 UTC

For what it's worth, I already had both of those set as you suggest. For example, I went to sleep with a 28-core machine (HT turned off) and no other CPU projects running. When I woke up six hours later, I had only 6 tasks running and 52 tasks stalled. This is a serious problem, IMO.

Edit: If task interruption is really a cause of this problem, then the situation is doubly bad, because the only way to fix the stalled tasks is to quit/restart BOINC, which then interrupts the tasks that were still running. But I don't think task interruption is really the cause either. There are many other vbox projects out there that do not have this issue.
damotbe
Message 26 - Posted: 4 Oct 2019, 16:12:01 UTC - in response to Message 24.  

Curious behaviour... At the moment, I have no idea.

An energy-saving configuration? The BOINC client deciding to switch tasks?
DoctorNow
Joined: 4 Oct 19
Posts: 1
Credit: 243,053
RAC: 0
Message 30 - Posted: 4 Oct 2019, 17:53:40 UTC
Last modified: 4 Oct 2019, 17:55:56 UTC

Same problem here on my two hosts.
Almost every WU runs for a few hours and then ends up in this state.
At first I suspected memory problems and reduced the number of simultaneously running tasks to 2, but that doesn't seem to help.
And I have two different systems with different VirtualBox versions:
Win 7 with VB 6.0.4 and Win 10 with VB 6.0.12

Edit:
Only 1 task has finished so far, that was this one
zombie67 [MM]
Message 36 - Posted: 4 Oct 2019, 23:41:39 UTC - in response to Message 26.  

> Curious behaviour... At the moment, I have no idea.
>
> An energy-saving configuration? The BOINC client deciding to switch tasks?


Nope. These are dedicated crunchers running 24/7. And there are no other projects to switch to.
[AF>Le_Pommier] Jerome_C2005

Joined: 26 Aug 19
Posts: 15
Credit: 1,265,326
RAC: 0
Message 39 - Posted: 5 Oct 2019, 13:22:14 UTC

I remember this was an old recurring issue with LHC when they started with VM applications years ago (and BOINC/Rom Walton was fighting to get a stable working wrapper). It would especially happen when "mixing" several VM tasks at the same time (and other VM projects like RNA...).

I don't know how they solved this; I haven't had it since (I think) with any LHC sub-projects.

Some days ago I think I found one QCPIA task in that same status on my iMac, but I also think it was at the start of the AF RAID, so I must have killed it, and I didn't want to try anything at that time.

Maybe try searching or asking on the LHC forum?
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 41 - Posted: 5 Oct 2019, 13:31:22 UTC - in response to Message 10.  
Last modified: 5 Oct 2019, 13:32:12 UTC

> On my Windows machines, I am getting many tasks stuck in the "VM job unmanageable" state. It requires quitting and restarting BOINC to get them unstuck. This is with the latest versions of BOINC and vbox.

I see that on nanoHUB once every few days with VBox 5.2.10 and Ubuntu 18.04, and BOINC 7.14.1 (and 7.16.1). The work units are very short (less than 5 minutes), so I just abort them.
zombie67 [MM]
Message 42 - Posted: 5 Oct 2019, 14:21:35 UTC

Yep. Nanohub has the same problem.
damotbe
Message 47 - Posted: 6 Oct 2019, 8:05:26 UTC - in response to Message 42.  

Did they find a solution to this problem?
[AF>Le_Pommier] Jerome_C2005

Message 49 - Posted: 6 Oct 2019, 11:14:21 UTC

I didn't see that with Nanohub on my iMac.

And for LHC yes they did, but no idea how...
Jim1348

Message 51 - Posted: 6 Oct 2019, 13:27:08 UTC - in response to Message 47.  

> Did they find a solution to this problem?

Not on nanoHUB. The VM Unmanageable is relatively rare for them, and they have bigger problems than that at the moment.

It does happen on LHC too. There are several threads on it.
The most knowledgeable guy over there, computezrmle, has this to say about it:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4628&postid=34506#34506
[AF>Le_Pommier] Jerome_C2005

Message 66 - Posted: 7 Oct 2019, 20:14:19 UTC

This is a very interesting thread. I read a good part of it: people still experience this issue in recent times, depending on the LHC subproject.

It seems to be a very subtle problem depending on many factors:

- the amount of RAM and resources available on the machine, plus the BOINC parameters for memory, task switch frequency...
- how many VM tasks you run concurrently (it seems that "the fewer the better"),
- the version of VB installed on the machine (it seems "the most recent is not necessarily the best one"),
- the OS used,
- the version of the VB BOINC wrapper used by the project application (not all LHC subprojects use the same wrapper version)
- and maybe other factors...

So it's a bit of alchemy!!
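
To experiment with the "how many VM tasks you run concurrently" factor, one option is an `app_config.xml` in the project's folder. The sketch below only assumes the standard BOINC `project_max_concurrent` tag; the helper name `build_app_config` and the value 2 are illustrative, and nothing in this thread confirms that limiting concurrency cures the problem.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_vm_tasks: int = 2) -> str:
    """Return text for app_config.xml capping concurrent tasks (hypothetical helper)."""
    if max_vm_tasks < 1:
        raise ValueError("max_vm_tasks must be at least 1")
    text = (
        "<app_config>\n"
        f"   <project_max_concurrent>{max_vm_tasks}</project_max_concurrent>\n"
        "</app_config>\n"
    )
    ET.fromstring(text)  # sanity check: raises ParseError if the XML is malformed
    return text


if __name__ == "__main__":
    # Print the file content; save it in the project's directory yourself.
    print(build_app_config(2))
```

The file belongs in the project's own directory under the BOINC data folder. The client should read it at startup, or when asked to re-read its config files.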
damotbe
Message 72 - Posted: 8 Oct 2019, 7:40:00 UTC - in response to Message 66.  

> This is a very interesting thread. I read a good part of it: people still experience this issue in recent times, depending on the LHC subproject.
>
> It seems to be a very subtle problem depending on many factors:
>
> - the amount of RAM and resources available on the machine, plus the BOINC parameters for memory, task switch frequency...
> - how many VM tasks you run concurrently (it seems that "the fewer the better"),
> - the version of VB installed on the machine (it seems "the most recent is not necessarily the best one"),
> - the OS used,
> - the version of the VB BOINC wrapper used by the project application (not all LHC subprojects use the same wrapper version)
> - and maybe other factors...
>
> So it's a bit of alchemy!!


A good way to get a good headache!
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 130 - Posted: 15 Oct 2019, 21:47:59 UTC
Last modified: 15 Oct 2019, 21:54:13 UTC

Moved to VirtualBox thread
UBT - Timbo

Joined: 8 Dec 19
Posts: 13
Credit: 652,594
RAC: 0
Message 389 - Posted: 6 Jan 2020, 10:57:57 UTC

Hi all

I found this message thread on GitHub that might interest those affected by this issue:

https://github.com/BOINC/boinc/issues/3173

It seems that VBox thinks there isn't enough memory to complete the job and hence it delays restarting for 1 day.

The fix seems to be to restart the BOINC Manager client; VBox should then restart, assuming any local memory-intensive apps have ceased.

A YouTube video claims that reducing the "Computing preferences > Computing > Use at most __% of the CPU time" setting before restarting BOINC Manager might also fix this.

https://www.youtube.com/watch?v=2CK8Yxxylnw

regards
Tim

©2024 Benoit DA MOTA - LERIA, University of Angers, France