Native Linux WU refuses to suspend

marmot

Joined: 29 Aug 19
Posts: 15
Credit: 159,816
RAC: 0
Message 78 - Posted: 8 Oct 2019, 20:09:17 UTC

Up until October 1st, my electric bill was on a summer plan priced at 31 cents per kWh from 2pm to 7pm and 8 cents the rest of the day and on weekends.
All the BOINC installs were set to suspend work from 2pm till 7pm.

In late September I noticed that the BOINC-dedicated Linux VM (my own build) was still using 95% CPU at 4pm.

nwchem refuses to acknowledge the suspension order from BOINCmgr, although BOINCmgr reports that the WU is suspended. MATE System Monitor showed the nwchem process happily crunching away at 31 cents per kWh...

Also, when TBrada had 4 high-priority WUs, nwchem again refused to suspend when BOINCmgr applied its resource-management suspension logic. So BOINC was allowed 4 cores but was actually running 6 threads: the 2 extra threads belonged to a t2 version of nwchem that refused to suspend.

There is a lack of checkpoints, but suspended WUs can remain in RAM and not lose their progress. nwchem uses a maximum (personally measured) of 580 MB per thread, so suspending in RAM could mean a significant loss of RAM to other projects while suspended. But that is better than losing work progress.

Confirmed on Sunday that this behavior still exists in the WU on my machine.
ID: 78 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 84 - Posted: 9 Oct 2019, 6:10:57 UTC - in response to Message 78.  

It seems that your BOINCmgr has lost the ability to manage nwchem. It probably stops our wrapper (run.sh if I remember correctly, which launches nwchem several times) but detaches the child processes. Can you report the hierarchy of the tasks involved (PID and PPID)?
ID: 84 · Rating: 0
Bryan

Joined: 3 Oct 19
Posts: 14
Credit: 32,908,253
RAC: 0
Message 90 - Posted: 9 Oct 2019, 14:21:28 UTC - in response to Message 84.  
Last modified: 9 Oct 2019, 14:21:56 UTC

I noticed on one machine that after all WUs had finished and been turned in, I still had 8 nwchem processes running. I had to kill them using their PIDs. They apparently were orphaned.
ID: 90 · Rating: 0
damotbe
Message 95 - Posted: 10 Oct 2019, 11:58:53 UTC - in response to Message 90.  

Yes, I noticed this when we worked on the affinity issue.

The BOINC wrapper calls mpirun (multiple times), which invokes nwchem, but the nwchem processes are not attached to mpirun. We will investigate this issue shortly after the affinity patch.
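
A stop signal delivered only to the top of that chain leaves the nwchem workers running, so one manual workaround is to signal every descendant. The Python sketch below is an illustration under assumptions, not the BOINC wrapper's code: it assumes a Linux /proc filesystem, and `read_ppid`, `descendants` and `signal_tree` are hypothetical helper names.

```python
import os
import signal


def read_ppid(pid):
    """Return the parent PID of `pid` from /proc/<pid>/stat, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before reading the remaining fields.
    fields = data.rsplit(")", 1)[1].split()
    return int(fields[1])  # fields[0] is the state letter, fields[1] is the PPID


def descendants(root_pid):
    """Return every PID whose ancestor chain includes root_pid."""
    children = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            ppid = read_ppid(int(entry))
            if ppid is not None:
                children.setdefault(ppid, []).append(int(entry))
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def signal_tree(root_pid, sig):
    """Send `sig` to root_pid and to all of its descendants."""
    for pid in [root_pid] + descendants(root_pid):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited between listing and signalling
```

With the wrapper's PID in hand, `signal_tree(wrapper_pid, signal.SIGSTOP)` would pause the whole tree and `signal_tree(wrapper_pid, signal.SIGCONT)` would resume it. Processes that have already been orphaned (re-parented away from the wrapper) would not be found this way.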
ID: 95 · Rating: 0
marmot

Message 99 - Posted: 10 Oct 2019, 20:26:34 UTC - in response to Message 84.  

[quote]It seems that your BOINCmgr has lost the ability to manage nwchem. It probably stops our wrapper (run.sh if I remember correctly, which launches nwchem several times) but detaches the child processes. Can you report the hierarchy of the tasks involved (PID and PPID)?[/quote]


This system monitor in antiX is MATE's version. It doesn't clearly represent the parent/child hierarchy.
Oh, I found a view setting Dependencies that shows hierarchy graphically.

This is a new work unit that I have NOT tried to pause or suspend in any way.

Process manager shows BOINC launched:

wrapper PID 1100 (launched Tuesday 7:27am)
-- bash PID 1102 (launched Tuesday 7:27am)
---- bash PID 1124 (launched Tuesday 7:27am)
------ mpirun PID 19982 (launched Thursday 12:58am)
-------- nwchem PID 19987 (launched Thursday 12:58am)
-------- nwchem PID 19988 (launched Thursday 12:58am)

I see under properties:
nwchem ( BOINC mgr reports 2 days 7 hours 43 minutes)
launched: today 12:58am
PID 19988
command line: nwchem TD_singlet.nw
ID: 99 · Rating: 0
damotbe
Message 100 - Posted: 11 Oct 2019, 5:50:21 UTC - in response to Message 99.  

OK, the hierarchy is correct (and nwchem is attached to mpirun). If you try to suspend, can you verify the state of all the processes involved?
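
One way to run that check, sketched in Python and assuming a Linux /proc filesystem: read the state letter of each PID in the hierarchy. A process that honored the suspend shows `T` (stopped), while one that is still crunching shows `R`. `process_info` and `report` are hypothetical helper names.

```python
import os

# State letters as documented in proc(5).
STATE_NAMES = {
    "R": "running", "S": "sleeping", "D": "disk sleep",
    "T": "stopped", "t": "tracing stop", "Z": "zombie", "I": "idle",
}


def process_info(pid):
    """Parse name, state and parent PID out of /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The name is enclosed in parentheses and may itself contain ')'.
    head, tail = data.rsplit(")", 1)
    name = head.split("(", 1)[1]
    fields = tail.split()
    return {"pid": pid, "name": name, "state": fields[0], "ppid": int(fields[1])}


def report(pids):
    """Return one text line per PID: PID, PPID, state, command name."""
    lines = []
    for pid in pids:
        try:
            info = process_info(pid)
        except OSError:
            lines.append(f"{pid:>7}  (no such process)")
            continue
        label = STATE_NAMES.get(info["state"], "other")
        lines.append(
            f"{info['pid']:>7} {info['ppid']:>7}  {info['state']} ({label})  {info['name']}"
        )
    return lines
```

For the work unit in Message 99 this would be called as `print("\n".join(report([1100, 1102, 1124, 19982, 19987, 19988])))`, once before and once after asking BOINC to suspend the task.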
ID: 100 · Rating: 0
Thundergrid

Joined: 8 Oct 19
Posts: 3
Credit: 18,600
RAC: 0
Message 154 - Posted: 17 Oct 2019, 14:47:03 UTC

Hi,

I've run into the same issue. Tasks don't respect the CPU usage limit either; usage stays at 100% all the time no matter the configuration settings.

Balanced CPU usage is critical to BOINC's success. If a task consumes CPU cycles that I need for my own work, it becomes a resource hog, and nobody likes those.

I hope you find the problem and correct it. Checkpoints would also be a great feature with such long WUs, but respecting CPU usage seems more important at the moment.

Regards.
ID: 154 · Rating: 0
damotbe
Message 162 - Posted: 18 Oct 2019, 6:28:56 UTC - in response to Message 154.  

Checkpoints are impossible, but unsuspendable tasks are a concern...

A VM solves these 2 issues, but has a lot of other drawbacks.
ID: 162 · Rating: 0
marmot

Message 165 - Posted: 18 Oct 2019, 7:47:04 UTC
Last modified: 18 Oct 2019, 8:05:48 UTC

OK, so the new WU came down.
Received 12 T1 WUs and there are 4 cores available.

The new affinity coding has all 4 running T1 WUs attaching to core 1 and ignoring cores 2, 3 and 4. Each nwchem process uses 25% of a single core's time slices.

I edited app_config.xml and added <project_max_concurrent>1</project_max_concurrent> and told BOINC Mgr to read the new config file. After reading the config file, BOINC Mgr sent pause commands to 3 of the 4 running nwchem WUs. BOINC Mgr now shows 3 T1 WUs as waiting, but the Linux process manager shows all 4 nwchem processes running as before (on CPU core 1, using 25% of its time each).
If they finish while BOINC Mgr has them in waiting state then they will possibly end in an error.
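
For reference, the app_config.xml edit described above amounts to a very small file. The Python sketch below builds it with the standard library. `build_app_config` is a hypothetical helper name, and the sketch only produces the text: saving it into the project's folder inside the BOINC data directory is left out because that path depends on the install.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent):
    """Return app_config.xml text that limits the project's concurrent tasks."""
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_concurrent)
    # Manual whitespace so the output is readable without ET.indent,
    # which only exists in newer Python versions.
    root.text = "\n  "
    limit.tail = "\n"
    return ET.tostring(root, encoding="unicode")


print(build_app_config(1))
```

The limit only affects which tasks BOINC schedules. As the rest of this post shows, it does not help if the running nwchem processes ignore the resulting pause.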

*This WU version still refuses to acknowledge suspension command issued by BOINC Mgr.
ID: 165 · Rating: 0
damotbe
Message 168 - Posted: 18 Oct 2019, 12:03:41 UTC - in response to Message 165.  

The boinc_wrapper (provided by BOINC) is not working as intended... Once the old WUs have cleared, the affinity problem should be solved, but suspension is a concern. Once again, the VM doesn't experience this issue.
ID: 168 · Rating: 0
marmot

Message 174 - Posted: 18 Oct 2019, 18:40:38 UTC - in response to Message 168.  
Last modified: 18 Oct 2019, 18:41:18 UTC

[quote]Once again, VM doesn't experience this issue.[/quote]


Personally, I choose to not use your VM because it is inefficient in RAM usage (native nwchem is only using 1 GB RAM while 2 other projects get the other 1GB) and forces my machines into your project for days. The opportunity cost of missing high priority, rare work from other projects is toooooo steep.

A much higher percentage (compared to overall installed market share of OSes) of the user base that crunches BOINC has native Linux machines doing the work.
It appears that each year the percentage of Linux machines increases in the BOINC user base (especially among the high-end equipment) and so, in the future, an increasing amount of the work done on your project will likely be native Linux WUs.
ID: 174 · Rating: 0
Gunnar Hjern

Joined: 14 Oct 19
Posts: 7
Credit: 2,614,863
RAC: 0
Message 179 - Posted: 19 Oct 2019, 13:08:46 UTC - in response to Message 168.  

I'm experiencing the same problems - work units will not pause when I order it via the boincmgr.
They will however stop when I abort a task.
Two of the computers were installed Mon. the 14th, and one computer was installed today.

I'm also seeing the problem of core affinity, having multiple tasks sharing the same CPU core.
See the thread "New T1 native nwchem work unit affinity problem"
https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=23
for that issue.

I have several machines on which I cannot have VMs running, so Linux native tasks must work
if I am to continue crunching.

Have a nice weekend!!

//Gunnar
ID: 179 · Rating: 0
[AF>Le_Pommier] Jerome_C2005

Joined: 26 Aug 19
Posts: 15
Credit: 1,265,326
RAC: 0
Message 183 - Posted: 20 Oct 2019, 14:27:28 UTC - in response to Message 174.  
Last modified: 20 Oct 2019, 14:28:50 UTC

[quote]A much higher percentage (compared to overall installed market share of OSes) of the user base that crunches BOINC has native Linux machines doing the work.[/quote]

BoincStats disagrees.

Only Linux is summarized there (173k), but if you sum up only the first 10 rows of Windows versions you are above 300k...
(and the RAC of only the first Windows row is more than twice the global Linux RAC)

So I don't know where you get that idea from.

[quote]It appears that each year the percentage of Linux machines increases in the BOINC user base (especially among the high-end equipment) and so, in the future, an increasing amount of the work done on your project will likely be native Linux WUs.[/quote]

And for that I need proof too. I'm not saying that the Linux base is not increasing; I'm saying it will never be bigger than Windows.

And for the Mac, oh well, I know, I know... (I'm a Mac user)
ID: 183 · Rating: 0
marmot

Message 198 - Posted: 22 Oct 2019, 6:01:06 UTC - in response to Message 183.  
Last modified: 22 Oct 2019, 6:03:58 UTC

[quote][quote]A much higher percentage (compared to overall installed market share of OSes) of the user base that crunches BOINC has native Linux machines doing the work.[/quote]

BoincStats disagrees.

Only Linux is summarized there (173k), but if you sum up only the first 10 rows of Windows versions you are above 300k...
(and the RAC of only the first Windows row is more than twice the global Linux RAC)

So I don't know where you get that idea from.[/quote]


I said compared to market share.

Linux is at 2.1% of market share while Windows is at 87.4% (source: Net Marketshare). (Mac is actually gaining share; it is a Unix-like OS with Apple's custom GUI.)

Linux RAC in BOINC is ~22% of Windows RAC (397k / 1750k [first 10 lines of Windows machines]): https://www.boincstats.com/stats/-5/host/breakdown/os/0/6/0

So 22% actual RAC compared to 2.1% market share means that Linux machines are pulling roughly 10x their demographic weight in BOINC work.
A higher percentage of Linux OS cores are dedicated to BOINC than Windows OS cores are.
In other words, people with Linux dedicated cores are more likely to run BOINC.
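
The arithmetic behind that estimate can be reproduced in a few lines of Python. The figures are the ones quoted in this post. Note that the 22% is Linux RAC relative to the first 10 Windows rows rather than a share of all BOINC RAC, so the multiplier is a rough indicator only.

```python
# Figures as quoted in the post (BOINCStats RAC and Net Marketshare percentages).
linux_rac = 397_000
windows_rac = 1_750_000      # first 10 rows of Windows versions
linux_market_share = 0.021   # 2.1% of installed OS market share

linux_vs_windows = linux_rac / windows_rac
over_representation = linux_vs_windows / linux_market_share

print(f"Linux RAC relative to Windows RAC: {linux_vs_windows:.1%}")      # 22.7%
print(f"Multiple of Linux market share:    {over_representation:.1f}x")  # 10.8x
```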

[quote][quote]It appears that each year the percentage of Linux machines increases in the BOINC user base (especially among the high-end equipment) and so, in the future, an increasing amount of the work done on your project will likely be native Linux WUs.[/quote]

And for that I need proof too. I'm not saying that the Linux base is not increasing; I'm saying it will never be bigger than Windows.

And for the Mac, oh well, I know, I know... (I'm a Mac user)[/quote]


Many of the Gridcoin devs/help desk are of the opinion that Linux RAC has been increasing. Most of our team's computing power is on Linux cores. They have been trying to get me off Windows for 3 years.
Proving it would need a graph of Linux RAC vs Windows RAC over the last 5-10 years. I can ask if someone on the team has evidence. Or I can go through BOINCStats and use a spreadsheet....
ID: 198 · Rating: 0
marmot

Message 199 - Posted: 22 Oct 2019, 6:14:44 UTC - in response to Message 165.  

[quote]OK, so the new WU came down.
Received 12 T1 WUs and there are 4 cores available.

The new affinity coding has all 4 running T1 WUs attaching to core 1 and ignoring cores 2, 3 and 4. Each nwchem process uses 25% of a single core's time slices.

I edited app_config.xml and added <project_max_concurrent>1</project_max_concurrent> and told BOINC Mgr to read the new config file. After reading the config file, BOINC Mgr sent pause commands to 3 of the 4 running nwchem WUs. BOINC Mgr now shows 3 T1 WUs as waiting, but the Linux process manager shows all 4 nwchem processes running as before (on CPU core 1, using 25% of its time each).
If they finish while BOINC Mgr has them in waiting state then they will possibly end in an error.[/quote]


Update:
They didn't end in error although BOINCMgr thought they were suspended (I actually chose the SUSPEND option on each task in BOINCMgr).
They completed successfully while bound to a single core.

Task | Work unit | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
56497 | 12761 | 17 Oct 2019, 3:00:22 UTC | 21 Oct 2019, 16:52:02 UTC | Completed and validated | 107,601.70 | 107,601.70 | 506.65 | NWChem v0.11 (t1) x86_64-pc-linux-gnu
56490 | 12575 | 17 Oct 2019, 3:00:21 UTC | 21 Oct 2019, 14:39:02 UTC | Completed and validated | 101,876.40 | 101,876.40 | 479.69 | NWChem v0.11 (t1) x86_64-pc-linux-gnu
56500 | 12773 | 17 Oct 2019, 3:00:07 UTC | 21 Oct 2019, 13:42:44 UTC | Completed and validated | 175,668.19 | 103,707.20 | 827.14 | NWChem v0.11 (t1) x86_64-pc-linux-gnu
56488 | 12561 | 17 Oct 2019, 2:59:51 UTC | 20 Oct 2019, 16:41:26 UTC | Completed and validated | 298,843.84 | 78,637.86 | 1,407.11 | NWChem v0.11 (t1) x86_64-pc-linux-gnu
ID: 199 · Rating: 0
Thundergrid

Message 258 - Posted: 7 Nov 2019, 13:46:55 UTC - in response to Message 162.  

[quote]Checkpoints are impossible, but unsuspendable tasks are a concern...
A VM solves these 2 issues, but has a lot of other drawbacks.[/quote]


OK, I've installed VBox 6.0, but BOINC still downloads the native Linux app. How can I force it to download the VM version?
Regards
ID: 258 · Rating: 0
[AF>Le_Pommier] Jerome_C2005

Message 264 - Posted: 9 Nov 2019, 11:42:46 UTC

Damotbe will confirm, but I think the previous dev (the trainee) completely discontinued the VM task for Linux once he had come up with the native client; he thought it was a better solution.

So I don't think you can do this now; there is no choice.

He was also planning to do the same with Windows and macOS but his training period ended before that :)
ID: 264 · Rating: 0
damotbe
Message 265 - Posted: 10 Nov 2019, 9:28:15 UTC - in response to Message 264.  

Yes, I confirm.

Btw, you can run Quchempedia in a Linux VM on Linux. This way, you can control precisely what you want.
ID: 265 · Rating: 0


©2024 Benoit DA MOTA - LERIA, University of Angers, France