Native Linux WU refuses to suspend
Author | Message |
---|---|
Joined: 29 Aug 19 Posts: 15 Credit: 159,816 RAC: 0 |
Up until October 1st, my electric bill was on a summer plan priced at 31 cents per kWh from 2pm to 7pm and 8 cents the rest of the day and on weekends. All the BOINC installs were set to suspend work from 2pm to 7pm. In late September I noticed that the BOINC-dedicated Linux VM (my own build) was still using 95% CPU at 4pm. nwchem refuses to acknowledge the suspension order from BOINCmgr, although BOINCmgr reports that the WU is suspended. MATE System Monitor showed the nwchem process happily crunching away at 31 cents per kWh... Also, when TBrada had 4 high-priority WUs, nwchem again refused to suspend when BOINCmgr applied its resource-managed suspension logic. So BOINC was allowed 4 cores and was actually running 6 threads, 2 of them belonging to a t2 version of nwchem that refused to suspend. There is a lack of checkpoints, but suspended WUs can remain in RAM and not lose their progress. nwchem uses a maximum (personally measured) of 580 MB per nwchem thread, so suspending in RAM could mean a significant loss of RAM to other projects while suspended. But that is better than losing work progress. I confirmed on Sunday that this behavior still exists in the WU on my machine. |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
It seems that your BOINCmgr loses the ability to manage nwchem. It probably stops our wrapper (run.sh, if I remember correctly, which launches nwchem several times) but leaves the child processes detached. Can you report the hierarchy of the tasks involved (PID and PPID)? |
Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0 |
I noticed on one machine that after I had finished all WUs and they had been turned in, I still had 8 nwchem processes running. I had to kill them using their PIDs. They were apparently orphaned. |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Yes, I noticed this when we worked on the affinity issue. The BOINC wrapper calls mpirun (multiple times), which invokes nwchem, but the nwchem processes are not attached to mpirun. We will investigate this issue shortly after the affinity patch. |
Joined: 29 Aug 19 Posts: 15 Credit: 159,816 RAC: 0 |
"It seems that your BOINCmgr loses the ability to manage nwchem. It probably stops our wrapper (run.sh, if I remember correctly, which launches nwchem several times) but leaves the child processes detached. Can you report the hierarchy of the tasks involved (PID and PPID)?" The system monitor in antiX is MATE's version. Oh, I found a view setting, Dependencies, that shows the hierarchy graphically. This is a new work unit that I have NOT tried to pause or suspend in any way. The process manager shows that BOINC launched: wrapper PID 1100 (launched Tuesday 7:27am) -> bash PID 1102 (launched Tuesday 7:27am) -> bash PID 1124 (launched Tuesday 7:27am) -> mpirun PID 19982 (launched Thursday 12:58am) -> nwchem PID 19987 and nwchem PID 19988 (both launched Thursday 12:58am). Under properties I see: nwchem (BOINC mgr reports 2 days 7 hours 43 minutes), launched: today 12:58am, PID 19988, command line: nwchem TD_singlet.nw |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
OK, the hierarchy is correct (and nwchem is attached to mpirun). If you try to suspend, can you check the state of all the processes involved? |
Joined: 8 Oct 19 Posts: 3 Credit: 18,600 RAC: 0 |
Hi, I've run into the same issue. The tasks don't respect the CPU usage limit either; usage stays at 100% all the time no matter the configuration settings. Balanced CPU usage is critical to BOINC's success: if a task consumes CPU cycles I need for work, it becomes a resource hog, and nobody likes those. I hope you find the problem and correct it. Checkpoints would also be a great feature with such long WUs... but respecting CPU usage seems more important for now. Regards. |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Checkpoints are impossible, but unsuspendable tasks are a concern... A VM solves both issues, but it has a lot of other drawbacks. |
Joined: 29 Aug 19 Posts: 15 Credit: 159,816 RAC: 0 |
OK, so the new WU came down. I received 12 T1 WUs and there are 4 cores available. The new affinity code has all 4 running T1 WUs attached to core 1, ignoring cores 2, 3 and 4. Each nwchem process gets 25% of a single core's time slices. I edited app_config.xml, added <project_max_concurrent>1</project_max_concurrent>, and told BOINC Mgr to read the new config file. After reading the config file, BOINC Mgr sent pause commands to 3 of the 4 running nwchem WUs. BOINC Mgr now shows 3 T1 WUs as waiting, but the Linux process manager shows all 4 nwchem processes running as before (on CPU core 1, using 25% of its time each). If they finish while BOINC Mgr has them in the waiting state, they may end in an error. *This WU version still refuses to acknowledge the suspension command issued by BOINC Mgr. |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
The boinc_wrapper (provided by BOINC) is not working as intended... Once the old WUs are done, the affinity problem should be solved, but suspension is a concern. Once again, the VM doesn't experience this issue. |
Joined: 29 Aug 19 Posts: 15 Credit: 159,816 RAC: 0 |
"Once again, the VM doesn't experience this issue." Personally, I choose not to use your VM because it is inefficient in RAM usage (native nwchem uses only 1 GB of RAM while 2 other projects get the other 1 GB) and it locks my machines into your project for days. The opportunity cost of missing rare, high-priority work from other projects is too steep. A much higher percentage (compared to overall installed market share of OS'es) of the user base that crunch BOINC have native Linux machines doing the work. It appears that each year the percentage of Linux machines increases in the BOINC user base (especially among the high-end equipment) and so, in the future, increasing amount of work done on your project will likely be native Linux WU's. |
Joined: 14 Oct 19 Posts: 7 Credit: 2,614,863 RAC: 0 |
I'm experiencing the same problems: work units will not pause when I order it via boincmgr. They will, however, stop when I abort a task. Two of the computers were installed on Monday the 14th, and one was installed today. I'm also seeing the core affinity problem, with multiple tasks sharing the same CPU core. See the thread "New T1 native nwchem work unit affinity problem" https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=23 for that issue. I have several machines on which I cannot run VMs, so native Linux tasks must work if I am to continue crunching. Have a nice weekend!! //Gunnar |
Joined: 26 Aug 19 Posts: 15 Credit: 1,265,326 RAC: 0 |
"A much higher percentage (compared to overall installed market share of OS'es) of the user base that crunch BOINC have native Linux machines doing the work." BoincStats disagrees. Only Linux is summarized there (173k), but if you sum up just the first 10 rows of Windows versions you are above 300k... (and the RAC of the first Windows row alone is more than twice the global Linux RAC). So I don't know where you get that idea from. "It appears that each year the percentage of Linux machines increases in the BOINC user base (especially among the high-end equipment) and so, in the future, increasing amount of work done on your project will likely be native Linux WU's." And for that I need proof too. I'm not saying that the Linux base is not increasing; I'm saying it will never be bigger than the Windows one. And as for the Mac, oh well, I know, I know... (I'm a Mac user) |
Joined: 29 Aug 19 Posts: 15 Credit: 159,816 RAC: 0 |
"A much higher percentage (compared to overall installed market share of OS'es) of the user base that crunch BOINC have native Linux machines doing the work." I said compared to market share. Linux is at 2.1% of market share while Windows is at 87.4% (source: Net Marketshare). (Mac is actually gaining share; it is Linux with Apple's custom GUI.) Linux RAC in BOINC is ~22% (397k / 1750k [first 10 lines of Windows machines]) https://www.boincstats.com/stats/-5/host/breakdown/os/0/6/0 So 22% actual RAC compared to 2.1% market share means that Linux machines are pulling 10x their demographic in BOINC work. A higher percentage of Linux OS cores than of Windows OS cores are dedicated to BOINC. In other words, people with Linux-dedicated cores are more likely to run BOINC. "It appears that each year the percentage of Linux machines increases in the BOINC user base (especially among the high-end equipment) and so, in the future, increasing amount of work done on your project will likely be native Linux WU's." Many of the Gridcoin devs/help desk are of the opinion that Linux RAC has been increasing. Most of our team's computing power is on Linux cores. They have been trying to get me off Windows for 3 years. Proving it would need a graph of Linux RAC vs Windows RAC over the last 5-10 years. I can ask if someone on the team has evidence. Or I can go through BOINCStats and use a spreadsheet.... |
Joined: 29 Aug 19 Posts: 15 Credit: 159,816 RAC: 0 |
"OK, so the new WU came down." Update: They didn't end in error even though BOINCMgr thought they were suspended (I actually chose the SUSPEND option on each task in BOINCMgr). They completed successfully while bound to a single core. 56497 12761 17 Oct 2019, 3:00:22 UTC 21 Oct 2019, 16:52:02 UTC Completed and validated 107,601.70 107,601.70 506.65 NWChem v0.11 (t1) x86_64-pc-linux-gnu; 56490 12575 17 Oct 2019, 3:00:21 UTC 21 Oct 2019, 14:39:02 UTC Completed and validated 101,876.40 101,876.40 479.69 NWChem v0.11 (t1) x86_64-pc-linux-gnu; 56500 12773 17 Oct 2019, 3:00:07 UTC 21 Oct 2019, 13:42:44 UTC Completed and validated 175,668.19 103,707.20 827.14 NWChem v0.11 (t1) x86_64-pc-linux-gnu; 56488 12561 17 Oct 2019, 2:59:51 UTC 20 Oct 2019, 16:41:26 UTC Completed and validated 298,843.84 78,637.86 1,407.11 NWChem v0.11 (t1) x86_64-pc-linux-gnu |
Joined: 8 Oct 19 Posts: 3 Credit: 18,600 RAC: 0 |
"Checkpoints are impossible, but unsuspendable tasks are a concern..." OK, I've installed VBox 6.0, but BOINC still downloads the native Linux app. How can I force it to download the VM version? Regards |
Joined: 26 Aug 19 Posts: 15 Credit: 1,265,326 RAC: 0 |
Damotbe will confirm, but I think the previous dev (the trainee) completely stopped the VM tasks for Linux once he had come up with the native client; he thought it was a better solution. So I don't think you can do this now; there is no choice. He was also planning to do the same for Windows and macOS, but his training period ended before that :) |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Yes, I confirm. By the way, you can run Quchempedia in a Linux VM on Linux. That way, you can control precisely what you want. |
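
The thread asks twice for the same diagnostic: the PID/PPID hierarchy under the BOINC wrapper, and the state of every process in it after a suspend. A minimal sketch of one way to collect both, assuming a Linux host where `ps -eo pid,ppid,stat,comm` is available. The helper names (`parse_ps`, `subtree`, `not_stopped`, `live_rows`) are hypothetical, and the sample below reuses the PIDs posted in the thread with invented PPID 900 and invented states, purely for illustration.

```python
import subprocess
from collections import defaultdict

# Hypothetical sample: PIDs are the ones posted in the thread; the wrapper's
# parent (900) and every STAT value are invented for this illustration.
SAMPLE = """\
  PID  PPID STAT COMMAND
    1     0 Ss   init
 1100   900 T    wrapper
 1102  1100 S    bash
 1124  1102 S    bash
19982  1124 S    mpirun
19987 19982 R    nwchem
19988 19982 R    nwchem
"""

def parse_ps(text):
    """Parse `ps -eo pid,ppid,stat,comm` output into (pid, ppid, stat, comm) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        pid, ppid, stat, comm = line.split(None, 3)
        rows.append((int(pid), int(ppid), stat, comm.strip()))
    return rows

def subtree(rows, root_pid):
    """Return [(depth, pid, ppid, stat, comm)] for root_pid and all its descendants."""
    children = defaultdict(list)
    info = {}
    for pid, ppid, stat, comm in rows:
        children[ppid].append(pid)
        info[pid] = (ppid, stat, comm)
    out = []
    stack = [(root_pid, 0)]
    while stack:
        pid, depth = stack.pop()
        if pid not in info:
            continue
        ppid, stat, comm = info[pid]
        out.append((depth, pid, ppid, stat, comm))
        # Push in reverse so the lowest child PID is visited first.
        for child in sorted(children[pid], reverse=True):
            stack.append((child, depth + 1))
    return out

def not_stopped(tree):
    """PIDs whose ps state does not start with 'T' (stopped by a signal)."""
    return [pid for _, pid, _, stat, _ in tree if not stat.startswith("T")]

def live_rows():
    """Read the real process table; not called in the demo below."""
    result = subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"],
                            capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)

tree = subtree(parse_ps(SAMPLE), 1100)
for depth, pid, ppid, stat, comm in tree:
    print(f"{'  ' * depth}{comm} PID={pid} PPID={ppid} STAT={stat}")
print("still runnable after suspend:", not_stopped(tree))
```

On a real machine, `subtree(live_rows(), <wrapper PID>)` would replace the sample. If the suspend reached every process, `not_stopped` would be expected to return an empty list; any nwchem PID left in it would match the behavior reported above.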
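
The market-share argument in the thread rests on a few lines of arithmetic. The sketch below only reworks the figures quoted in the posts (397k Linux RAC, 1750k RAC for the first 10 Windows rows, 2.1% and 87.4% market share); the variable names are hypothetical and no new data is introduced. It shows the ratio the poster computed (Linux RAC relative to Windows RAC) next to Linux's share of the combined Linux plus Windows RAC, since the two are easy to confuse.

```python
# Figures as quoted in the thread; nothing here is independently sourced.
linux_rac_k = 397.0        # Linux RAC, in thousands
windows_rac_k = 1750.0     # RAC of the first 10 Windows rows, in thousands
linux_market_pct = 2.1     # quoted desktop market share, percent
windows_market_pct = 87.4  # quoted desktop market share, percent

# Ratio used in the post: Linux RAC relative to Windows RAC.
linux_vs_windows_pct = linux_rac_k / windows_rac_k * 100

# Alternative reading: Linux share of the combined Linux + Windows RAC.
combined = linux_rac_k + windows_rac_k
linux_share_pct = linux_rac_k / combined * 100
windows_share_pct = windows_rac_k / combined * 100

# Over-representation: share of BOINC work divided by market share.
linux_factor_post = linux_vs_windows_pct / linux_market_pct
linux_factor_share = linux_share_pct / linux_market_pct
windows_factor_share = windows_share_pct / windows_market_pct

print(f"Linux RAC / Windows RAC: {linux_vs_windows_pct:.1f}%")
print(f"Linux share of combined RAC: {linux_share_pct:.1f}%")
print(f"Linux factor (post's ratio): {linux_factor_post:.1f}x")
print(f"Linux factor (combined share): {linux_factor_share:.1f}x")
print(f"Windows factor (combined share): {windows_factor_share:.2f}x")
```

Either way of computing it leaves Linux several times over-represented relative to the quoted market share, which is the point the poster was making; it says nothing about the trend over time, which both posters agree would need historical data.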
©2024 Benoit DA MOTA - LERIA, University of Angers, France