Posts by marmot

1) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 199)
Posted 22 Oct 2019 by marmot
Post:
OK, so the new WU came down.
Received 12 T1 WU's and there are 4 cores available.

The new affinity coding has all 4 running T1 WU's attaching to core 1 and ignoring cores 2, 3, and 4. Each nwchem process gets 25% of a single core's time slices.

I edited app_config.xml, added <project_max_concurrent>1</project_max_concurrent>, and told BOINC Mgr to read the new config file. After reading it, BOINC Mgr sent pause commands to 3 of the 4 running nwchem WU's. BOINC Mgr now shows 3 T1 WU's as waiting, but the Linux process manager shows all 4 nwchem processes running as before (on CPU core 1, using 25% of its time each).
If they finish while BOINC Mgr has them in the waiting state, they will possibly end in an error.
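A quick way to verify this from a terminal (a sketch; the `nwchem` process name is taken from the report above):

```shell
# List every nwchem process with its scheduler state and last-used core.
# STAT: R = running, S = sleeping, T = stopped (what a suspended task
# should show). PSR = the CPU core the process last ran on.
ps -o pid,stat,pcpu,psr,comm -C nwchem \
  || echo "no nwchem processes found"
```

If BOINC's suspend had actually reached the processes, the STAT column would show T rather than R or S, and PSR makes the single-core binding visible directly.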


Update:
They didn't end in error, although BOINC Mgr believed they were suspended (I actually chose the SUSPEND option on each task in BOINC Mgr).
They completed successfully while bound to a single core.

56497 12761 17 Oct 2019, 3:00:22 UTC 21 Oct 2019, 16:52:02 UTC Completed and validated 107,601.70 107,601.70 506.65 NWChem v0.11 (t1) x86_64-pc-linux-gnu
56490 12575 17 Oct 2019, 3:00:21 UTC 21 Oct 2019, 14:39:02 UTC Completed and validated 101,876.40 101,876.40 479.69 NWChem v0.11 (t1) x86_64-pc-linux-gnu
56500 12773 17 Oct 2019, 3:00:07 UTC 21 Oct 2019, 13:42:44 UTC Completed and validated 175,668.19 103,707.20 827.14 NWChem v0.11 (t1) x86_64-pc-linux-gnu
56488 12561 17 Oct 2019, 2:59:51 UTC 20 Oct 2019, 16:41:26 UTC Completed and validated 298,843.84 78,637.86 1,407.11 NWChem v0.11 (t1) x86_64-pc-linux-gnu
2) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 198)
Posted 22 Oct 2019 by marmot
Post:
Relative to the overall installed market share of OS'es, a much higher percentage of the user base that crunches BOINC has native Linux machines doing the work.

BoincStats disagrees.

Only Linux is summarized there (173k), but if you sum just the first 10 rows of Windows versions you are already above 300k...
(and the RAC of the first Windows row alone is more than twice the global Linux figure)

So I don't know where you get that idea from.


I said compared to market share.

Linux is at 2.1% of market share while Windows is at 87.4% (Net Marketshare). (Mac is actually gaining share; macOS is a BSD-derived Unix with Apple's custom GUI, not Linux.)

Linux RAC in BOINC is ~22% (397k / 1750k [first 10 lines of Windows machines]) https://www.boincstats.com/stats/-5/host/breakdown/os/0/6/0

So 22% actual RAC compared to 2.1% market share means that Linux machines are pulling roughly 10x their demographic weight in BOINC work.
A higher percentage of Linux OS cores are dedicated to BOINC than Windows OS cores are.
In other words, people with Linux machines are more likely to dedicate cores to BOINC.

It appears that each year the percentage of Linux machines in the BOINC user base increases (especially among the high-end equipment), so in the future an increasing amount of the work done on your project will likely be native Linux WU's.

And for that I need proof too. I'm not saying that the Linux base isn't increasing; I'm saying it will never be bigger than Windows.

And for the Mac, oh well, I know, I know... (I'm a Mac user)


Many of the Gridcoin devs/help desk are of the opinion that Linux RAC has been increasing. Most of our team's computing power is on Linux cores. They have been trying to get me off Windows for 3 years.
Proving it would need a graph of Linux RAC vs Windows RAC over the last 5-10 years. I can ask if someone on the team has evidence, or I can go through BOINCStats and use a spreadsheet...
3) Message boards : Number crunching : Error while computing with windows 10 (Message 197)
Posted 22 Oct 2019 by marmot
Post:
@PDW
There must be others who have Windows Pro and have QuChemPedia successfully run tasks to comment on this ?


My Windows 10 Pro machine fried its motherboard (an actual on-board resistor turned to black carbon soot and took the power supply's 12V line with it).
It'll be a week (maybe longer) before I can attempt this on that machine.

However, I have a spare laptop, a Windows 10 Pro license, and a custom VM that successfully runs nwchem WU's. If I can find the time, I can do a fresh Windows 10 Pro 1903 build on the laptop, copy the custom VM over, and test both the project-downloaded VM's and the known working quantity (the custom VM).

@Byron Leigh Hatch
Alternately, a custom built VM running native Linux nwchem WU's successfully on one of our machines could be tested on your Windows 10 Pro host.
We'd just need to transfer the .vdi file to you: you add a default Linux (Debian?) VM in VBox, attach the drive, fire it up, change the VM host name (the tricky part for a beginner; we could change the name before shipment), then attach to the project in BOINC with your email and ask for work.
4) Message boards : Number crunching : New T1 native nwchem work unit affinity problem. (Message 196)
Posted 22 Oct 2019 by marmot
Post:

<app_config>
   <project_max_concurrent>2</project_max_concurrent>
</app_config>


This was added to my configuration 30+ days ago and does nothing to correct the issue discussed in this thread.

The T1 WU's all attach to a single core, so even max_concurrent = 2 still leaves only a single CPU core in use.
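If the scheduler really is stacking everything on core 1, the binding can be inspected and overridden by hand with taskset (a workaround sketch, not the project's method; the `nwchem` process name and the 4-core layout are taken from this thread):

```shell
# Inspect each nwchem process's CPU affinity, then spread the processes
# round-robin across cores 0-3 so they stop sharing a single core.
core=0
for pid in $(pgrep -x nwchem); do
    taskset -pc "$pid"              # print the current affinity list
    taskset -pc "$core" "$pid"      # re-pin this process to one distinct core
    core=$(( (core + 1) % 4 ))
done
```

This only lasts for the lifetime of the processes, so it would need re-running for each new batch of WU's; the real fix has to come from the project's affinity code.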
5) Message boards : Number crunching : Error while computing with windows 10 (Message 180)
Posted 20 Oct 2019 by marmot
Post:
I doubt that the issue is so esoteric. It was not long ago that VirtualBox did not support more than 4 cores (there used to be a warning on the manager).
Then, they upped it to 8 cores. I don't know what it is now, but you have 48 cores, apparently spread out over two CPUs.

That is probably too many cores, and two CPUs is not what VBox is designed for.


Oh, did not notice the core count.

My servers have 32 cores on 2 CPU's and VBox will only work properly with 8 or fewer cores. A 16-core attempt still only uses 8 (the setting turns yellow with a warning above 8 cores, but VBox Manager won't stop it): the guest sees 16 cores, but the idle process ends up using 50% CPU.

BUT, QuChemPedIA should only send down 8-core VM's at maximum, so his core count shouldn't matter and the WU's should be working.

You could try to build your own Linux VM on 4 cores (it's a very fast install for Linux Mint), install BOINC, attach to the project, and then see if it runs 4x T1 native nwchem WU's.
6) Message boards : Number crunching : Error while computing with windows 10 (Message 176)
Posted 18 Oct 2019 by marmot
Post:
Doing a reset of the situation, in order to avoid some odd VBox error, might be in order.

Use a product like Revo Uninstaller (it can be found on PortableApps.com; there are other products, but I trust Revo most) to completely remove Oracle VirtualBox; do the advanced uninstall to find and remove all of VBox's folders and registry entries. (Of course, you need to be careful to save any custom VM's you've built!!!)
Double-check that the BIOS settings for virtualization are correct (exiting the BIOS without saving settings can happen...).
Then reinstall VBox from scratch. (Oracle VBox has had a few bugs over the years.)

It's a highly unlikely scenario, but if the issue is in the BIOS/hardware, you could move the hard drive to another machine, let Windows 10 go through device reconfiguration on the new machine, ignore any Windows 10 activation errors, and see if the VM's work on that hardware.
Be sure to use hardware that is known to run VM's properly.
7) Message boards : Number crunching : New T1 native nwchem work unit affinity problem. (Message 175)
Posted 18 Oct 2019 by marmot
Post:
run.sh is dated October 3, 2019 as are all *.nw files.

Files in /bin folder are dated 8/30/2019
8) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 174)
Posted 18 Oct 2019 by marmot
Post:
Once again, VM doesn't experience this issue.


Personally, I choose not to use your VM because it is inefficient in RAM usage (native nwchem uses only 1 GB of RAM, leaving the other 1 GB to 2 other projects) and it forces my machines onto your project for days. The opportunity cost of missing high-priority, rare work from other projects is too steep.

Relative to the overall installed market share of OS'es, a much higher percentage of the user base that crunches BOINC has native Linux machines doing the work.
It appears that each year the percentage of Linux machines in the BOINC user base increases (especially among the high-end equipment), so in the future an increasing amount of the work done on your project will likely be native Linux WU's.
9) Message boards : Number crunching : New T1 native nwchem work unit affinity problem. (Message 166)
Posted 18 Oct 2019 by marmot
Post:
On the Linux machine that has 4 cores and runs 4 nwchem T1 WU's at a time, all 4 nwchem processes choose to bind to core 1 and share its time. So the Linux process manager shows each nwchem process using 25% of a core while the other 3 CPU cores sit idle.

The T1 WU's don't interact to coordinate CPU core affinity selection.

Another issue is that I have no way to force the server to send down T4 WU's in order to use all 4 cores.
10) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 165)
Posted 18 Oct 2019 by marmot
Post:
OK, so the new WU came down.
Received 12 T1 WU's and there are 4 cores available.

The new affinity coding has all 4 running T1 WU's attaching to core 1 and ignoring cores 2, 3, and 4. Each nwchem process gets 25% of a single core's time slices.

I edited app_config.xml, added <project_max_concurrent>1</project_max_concurrent>, and told BOINC Mgr to read the new config file. After reading it, BOINC Mgr sent pause commands to 3 of the 4 running nwchem WU's. BOINC Mgr now shows 3 T1 WU's as waiting, but the Linux process manager shows all 4 nwchem processes running as before (on CPU core 1, using 25% of its time each).
If they finish while BOINC Mgr has them in the waiting state, they will possibly end in an error.

*This WU version still refuses to acknowledge the suspend command issued by BOINC Mgr.
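Until the wrapper forwards the suspend to its children, a manual workaround sketch (assuming the detached children are plain nwchem processes, as described above) is to stop and resume them with signals, which is roughly what the client's own suspend does:

```shell
# Pause all nwchem processes (they will show state T in ps) ...
pkill -STOP -x nwchem || echo "no nwchem processes to pause"
# ... and later resume them without losing in-memory progress.
pkill -CONT -x nwchem || echo "no nwchem processes to resume"
```

SIGSTOP cannot be caught or ignored, so it works even on a process that doesn't cooperate with BOINC's suspension protocol.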
11) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 99)
Posted 10 Oct 2019 by marmot
Post:
It seems that your BOINC Mgr loses the ability to manage NWChem. It probably stops our wrapper (run.sh, if I remember correctly, which launches nwchem several times) but detaches the child processes. Can you report the corresponding hierarchy of the involved tasks (PID and PPID)?


The system monitor in antiX is MATE's version. It doesn't clearly represent the parent/child hierarchy.
Oh, I found a view setting, Dependencies, that shows the hierarchy graphically.

This is a new work unit that I have NOT tried to pause or suspend in any way.

Process manager shows BOINC launched:

wrapper PID 1100 (launched Tuesday 7:27am)
-- bash PID 1102 (launched Tuesday 7:27am)
---- bash PID 1124 (launched Tuesday 7:27am)
------ mpirun PID 19982 (launched Thursday 12:58am)
-------- nwchem PID 19987 (launched Thursday 12:58am)
-------- nwchem PID 19988 (launched Thursday 12:58am)
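The same tree can be captured from a terminal, independent of MATE's monitor (a sketch; pstree comes from the psmisc package, and the `nwchem` name is from the listing above):

```shell
# Print the full ancestry of the first nwchem process, if one is running
# (PID annotated next to each process in the chain) ...
pid=$(pgrep -x nwchem | head -n1) && pstree -ps "$pid"
# ... or show PID/PPID relationships for every process as an indented forest.
ps -eo pid,ppid,comm --forest
```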

I see under properties:
nwchem (BOINC Mgr reports 2 days 7 hours 43 minutes)
launched: today 12:58am
PID 19988
command line: nwchem TD_singlet.nw
12) Message boards : Number crunching : Monster wu (Message 97)
Posted 10 Oct 2019 by marmot
Post:
adrianxw's issue with VBox aside.

My two native nwchem tasks (1 t2 and 1 t1) both show 244+ days left after 50+ hours run time.

The previous tasks have always estimated somewhere over 3 days. One task went 5 days, but that machine was downclocked 50%; these should complete within the 3-day period. I'll let them go 4 days.
The estimate used to be fairly accurate.

The problem with such drastic estimates is that all other work units from other projects will stop as BOINC thinks the work cache is completely full.
13) Message boards : Number crunching : measured RAM usage for run life of NWCHEM process (Message 80)
Posted 8 Oct 2019 by marmot
Post:
From discussion about how much RAM the native Linux version of nwchem WU uses over at https://forum.boinc-af.org/index.php?topic=7838.msg490228#msg490228.

The test was run during a late-September local heat wave, so the computer was downclocked severely and suspended for 8-10 hours a day. The first run was lost to a power outage, with 81 hours of work gone for lack of a save point. The 2nd run is on the same WU, which was actually given credit even though it was reported days after the deadline. The WU was run in an antiX small-footprint Debian Linux VM dedicated to BOINC, running on a Windows 7 host laptop.

The WU is a t2-class nwchem, and RAM use was measured for each of its 2 threads. Maximum observed usage per thread is 580 MB.

Work Unit: https://quchempedia.univ-angers.fr/athome/result.php?resultid=34852

Task 34852
Name	dsgdb9nsd_nwchem,bath02,010558591,nwchem,1570181983_0
Workunit	22185
Created	4 Oct 2019, 9:39:46 UTC
Sent	6 Oct 2019, 11:15:35 UTC
Report deadline	20 Oct 2019, 11:15:35 UTC
Received	7 Oct 2019, 18:11:49 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x00000000)
Computer ID	40
Run time	23 hours 30 min 48 sec
CPU time	19 hours 30 min 7 sec
Validate state	Valid
Credit	398.57
Device peak FLOPS	2.03 GFLOPS
Application version	NWChem v0.08 (t1)
x86_64-pc-linux-gnu
Peak working set size	572.46 MB
Peak swap size	1,988.55 MB
Peak disk usage	251.48 MB


Observed usage (HOUR / thread 1 RAM / thread 2 RAM / total RAM used by all processes / virtual RAM claimed by nwchem thread / run 1st or 2nd)
NWChem t2
hour 01: 166, 196mb / tot 0658mb / virt 1.2gb ea (run 2)
hour 02: 166, 174mb / tot 0636mb / virt 1.2gb ea (run 2)
hour 06: 166, 195mb / tot 0657mb / virt 1.2gb ea (run 2)
hour 08: 167, 191mb / tot 0662mb / virt 1.2gb ea (run 2)
hour 14: 166, 188mb / tot 0657mb / virt 1.2gb ea (run 2)

hour 22: 162, 170mb / tot 0641mb / virt 1.2gb ea
hour 22: 167, 191mb / tot 0665mb / virt 1.2gb ea (run 2)

hour 29: 162, 190mb / tot 0661mb / virt 1.2gb ea
hour 29: 570, 576mb / tot 1453mb / virt 1.6/1.5gb (run 2) hour unclear as BOINC declares "time of day suspended", yet it's still running.
hour 31: 350, 357mb / tot 1021mb / virt 1.3/1.3gb (run 2)

hour 34: 162, 184mb / tot 0657mb / virt 1.2gb ea
hour 34: 574, 580mb / tot 1454mb / virt 1.6gb ea (run 2)

hour 41: 450, 534mb / tot 1255mb / virt 1.4/1.5gb (run 2)

hour 47: 162, 171mb / tot 0644mb / virt 1.2gb ea
hour 47: 489, 576mb / tot 1376mb / virt 1.5/1.6gb(run 2)

hour 54: 162, 171mb / tot 0645mb / virt 1.2gb ea
hour 58: 490, 576mb / tot 1378mb / virt 1.5/1.6gb(run 2)

hour 74: 490, 576mb / tot 1378mb / virt 1.5/1.6gb(run 2)

hour 79: 490, 577mb / tot 1380mb / virt 1.5/1.6gb(run 2)

hour 81: 485, 581mb / tot 1380mb / virt 1.5/1.6gb

hour 83: 186, 219mb / tot 0798mb / virt 1.8gb ea(run 2)
hour 85: 197, 232mb / tot 0838mb / virt 1.8gb ea(run 2)
hour 90: 264, 325mb / tot 0998mb / virt 1.9gb ea(run 2)
hour 96: 254, 314mb / tot 0971mb / virt 1.9gb ea(run 2)
hour 100: 330, 339mb / tot 1072mb / virt 1.9/2.0gb (run 2)
hour 104: 260, 319mb / tot 0980mb / virt 1.9gb ea(run 2)
hour 110: 283, 343mb / tot 1022mb / virt 1.9gb ea(run 2)
hour 120: 289, 349mb / tot 1036mb / virt 1.9gb ea(run 2)
hour 126: 283, 336mb / tot 1019mb / virt 1.9gb ea(run 2)
hour 133: 175, 176mb / tot 0747mb / virt 1.8gb ea(run 2)
hour 140: 196, 225mb / tot 0820mb / virt 1.8gb ea(run 2)
hour 148: 202, 232mb / tot 0825mb / virt 1.8/1.9gb (run 2)
hour 155: 215, 237mb / tot 0846mb / virt 1.8/1.9gb (run 2)
hour 160: 192, 209mb / tot 0798mb / virt 1.8/1.9gb (run 2)



So people should dedicate ~600 MB of RAM per thread they plan on running:

2-core runs: 1.2 GB
4-core runs: 2.4 GB
8-core runs: 4.8 GB
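The per-thread figures above can be re-measured at any time with a one-liner (a sketch; RSS is the resident set size reported by ps, and `nwchem` is the process name from the runs above):

```shell
# Sum the resident set size (RSS, reported by ps in kB) of all nwchem
# processes and print the total in MB; prints "0 MB" when none are running.
ps -o rss= -C nwchem | awk '{ sum += $1 } END { printf "total RSS: %.0f MB\n", sum/1024 }'
```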
14) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 78)
Posted 8 Oct 2019 by marmot
Post:
Up until October 1st, my electric bill was on a summer plan with price 31 cents per kWh from 2pm till 7pm and 8 cents the rest of the day and weekends.
All the BOINC installs were set to suspend work from 2pm till 7pm.

Noticed that the BOINC dedicated Linux VM (my own build) was still using 95% CPU at 4pm late September.

nwchem refuses to acknowledge suspension order from BOINCmgr although BOINCmgr reports that the WU is suspended. MATE System Monitor showed nwchem process happily crunching away at 31 cents per kWh...

Also, when TBrada had 4 high-priority WU's, nwchem again refused to suspend when BOINC Mgr applied its resource-management suspension logic. So a situation occurred where BOINC was allowed 4 cores but was actually running 6 threads, 2 of them from a t2 nwchem that refused to suspend.

There is a lack of checkpoints, but suspended WU's can remain in RAM and not lose their progress. nwchem uses a maximum (personally measured) of 580 MB per thread, so suspending in RAM could mean a significant loss of RAM to other projects while suspended. But that's better than losing work progress.

Confirmed this behavior still exists in the WU on my machine Sunday.
15) Message boards : Number crunching : Monster wu (Message 77)
Posted 8 Oct 2019 by marmot
Post:
I've been crunching these in native Linux for 3+ weeks and this is certainly new behavior.

The 2 current WU report 245+ days till completion.
All the prior WU would estimate maximum 3.5 days.

I do not currently have <fraction_done_exact/> in the app_config.xml, but I will see if it repairs the issue.
Doubtful, since the WU is reporting less than 1% complete after 7 hours.
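For reference, that flag goes inside an <app> element of app_config.xml in the project directory. A minimal sketch (the short app name "nwchem" here is an assumption; it must match the name the project's server uses):

```xml
<app_config>
   <app>
      <name>nwchem</name>
      <fraction_done_exact/>
   </app>
</app_config>
```

With the flag set, the client trusts the task's reported fraction done for the remaining-time estimate instead of its own historical statistics.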




©2024 Benoit DA MOTA - LERIA, University of Angers, France