Very little CPU usage

Bryan

Joined: 3 Oct 19
Posts: 14
Credit: 32,908,253
RAC: 0
Message 76 - Posted: 8 Oct 2019, 16:00:06 UTC - in response to Message 73.  

damotbe, the problem is that you are wasting resources. When 64 WUs are launched on a 64-thread machine, only 2 threads are used on a single-CPU machine and 4 threads on a dual-CPU machine. That means the threads are overcommitted by 32:1 or 16:1, respectively. It is a total waste of machine capability. Without your program setting affinity, a single machine could produce a lot more work for the project.

The program continuously launches "child" processes and kills the old processes so an affinity script must continuously run. The best solution is to remove the affinity control from the program's executable. Let Linux decide what threads to use.
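
For readers who want to try the affinity workaround described here on native Linux, a minimal sketch follows. It assumes the worker processes appear under the name `nwchem` (a hypothetical name; check `top` for the real one) and that you have permission to change their affinity. Because the app keeps spawning new child processes, the scan has to be repeated in a loop.

```python
import os
import time

def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm equals `name` (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return pids

def set_affinity(pid, cpus):
    """Allow `pid` to run on any CPU in `cpus`; return False if that fails."""
    try:
        os.sched_setaffinity(pid, cpus)
        return True
    except (ProcessLookupError, PermissionError):
        return False

def widen_forever(name="nwchem", interval=5.0):
    # "nwchem" is an assumed process name; adjust it to what `top` shows.
    all_cpus = os.sched_getaffinity(0)  # CPUs this script itself may use
    while True:
        for pid in pids_by_name(name):
            set_affinity(pid, all_cpus)
        time.sleep(interval)
```

Calling `widen_forever()` from a terminal keeps resetting the affinity until interrupted. Note that on Linux the underlying call applies to one thread at a time, so a multi-threaded worker would also need its entries under `/proc/<pid>/task` handled.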
ID: 76 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 85 - Posted: 9 Oct 2019, 6:19:17 UTC - in response to Message 76.  

Yes, we know, and we are very concerned about this issue. The affinity is auto-tuned by mpirun. We tried multiple parameters of this wrapper, but it is very difficult to assess the changes since multiple versions of the code are running. Secondly, the static compilation of NWChem (not our code) is very tricky, and we tried to bypass MPI. Thirdly, we no longer have an engineer (the reality of public research...), so it is hard to solve this particular issue, but we are trying!
ID: 85 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Bryan

Joined: 3 Oct 19
Posts: 14
Credit: 32,908,253
RAC: 0
Message 89 - Posted: 9 Oct 2019, 14:15:49 UTC - in response to Message 85.  

I don't know if the VBox version actually assigns CPU affinity since I haven't run it. For the native Linux app, which runs well, you should warn people that running more than 1 WU per CPU only increases the time to completion. On my dual-CPU Intel machines, running 64 WUs means it would take 16X more time to complete a WU.

I hope you find a solution because the native Linux app runs very well.
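
The ratios quoted in this thread follow from simple division. The sketch below is only rough arithmetic: it assumes the pinned threads are shared evenly among all running work units and ignores scheduling overhead. The function name is illustrative.

```python
def slowdown(work_units, threads_used):
    """How many work units share each hardware thread that is actually used."""
    if threads_used <= 0:
        raise ValueError("threads_used must be positive")
    return work_units / threads_used

# Figures from this thread: 64 WUs pinned to 2 threads (single CPU)
# or to 4 threads (dual CPU).
print(slowdown(64, 2))  # 32.0
print(slowdown(64, 4))  # 16.0
```

With affinity left to the scheduler, 64 WUs on 64 threads would give a ratio of 1.0 under the same assumptions.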
ID: 89 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 94 - Posted: 10 Oct 2019, 11:53:44 UTC - in response to Message 89.  

We reproduced the problem on my dual-socket Xeon and wrote an affinity script to include in the native app. We will try to commit the change tonight.
ID: 94 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Bryan

Joined: 3 Oct 19
Posts: 14
Credit: 32,908,253
RAC: 0
Message 96 - Posted: 10 Oct 2019, 13:54:33 UTC - in response to Message 94.  

Excellent! That will greatly help the project and the users :)
ID: 96 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
CCPLogibro

Joined: 2 Jan 20
Posts: 1
Credit: 31,106
RAC: 0
Message 390 - Posted: 6 Jan 2020, 17:19:10 UTC

Is this the behavior people are talking about?

I have a bunch of WUs not using any CPU whatsoever but still running.

ID: 390 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
marsinph

Joined: 13 Nov 19
Posts: 21
Credit: 2,596,565
RAC: 0
Message 391 - Posted: 6 Jan 2020, 18:32:03 UTC - in response to Message 390.  

I see I am not alone!!!
ID: 391 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
[VENETO] boboviz

Joined: 13 Sep 19
Posts: 69
Credit: 399,347
RAC: 0
Message 392 - Posted: 6 Jan 2020, 18:58:13 UTC - in response to Message 390.  

Is this the behavior people are talking about?

Yes

I have a bunch of WUs not using any CPU whatsoever but still running.

Kill these WUs.
ID: 392 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 393 - Posted: 7 Jan 2020, 10:16:24 UTC - in response to Message 392.  

If I understand correctly, the problem comes from the VBox Guest Additions needed for the shared directory (to read/write outside the VM). If they fail, the calculation is not launched (no access to data and scripts) but the VM continues to run... And no access to our scripts means we can't tell the VM to stop.

I have added finding a workaround to the to-do list. In the meantime, do not hesitate to abort these work units. If all your work units do that, maybe another version of VirtualBox and the Guest Additions will do the trick.
ID: 393 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 394 - Posted: 7 Jan 2020, 14:38:29 UTC
Last modified: 7 Jan 2020, 15:02:52 UTC

I just thought I should mention that native NWChem is working OK on an Ubuntu 18.04.3 machine.

Looking at BOINCTasks, it seems at first that not all the cores are being used:



However, running the "top" command shows that all 12 cores of my Ryzen 2600 are actually in use (with one on Folding):



So it may be the case that the "core affinity" problem is still there, but it doesn't matter for me.
Even if the work units are jumping between cores, all of them are in use.

It has been running very well for months.
ID: 394 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
vaughan

Joined: 11 Oct 19
Posts: 4
Credit: 1,604,204
RAC: 0
Message 396 - Posted: 8 Jan 2020, 3:12:19 UTC

Do we abort long running tasks stuck at 100% for 2 days?

Seems an absolute waste of time and electricity. Can this issue be fixed?
ID: 396 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
swiftmallard

Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 397 - Posted: 8 Jan 2020, 3:40:33 UTC - in response to Message 396.  

I abort any WU taking longer than 12 hours to complete. Haven't been sorry yet.
ID: 397 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 399 - Posted: 8 Jan 2020, 10:23:49 UTC - in response to Message 397.  

Current work units should be short (<12h) with a mean runtime of 1.5 hours.

However, for the next batches this rule of thumb will no longer apply: the target runtime should be around 12-24 h, with huge variability due to large molecular systems.
ID: 399 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
marsinph

Joined: 13 Nov 19
Posts: 21
Credit: 2,596,565
RAC: 0
Message 401 - Posted: 8 Jan 2020, 14:13:51 UTC - in response to Message 399.  

Hello Damotbe
Thank you for information about the new WU.

About 50% of my "short" WUs are stuck. The others run in about 2 hours.
But if they needed up to one day (monopolizing full resources while the host is doing nothing), it would not be efficient.
Do you think we will wait several days, only to cancel in the end?
ID: 401 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
UBT - Timbo

Joined: 8 Dec 19
Posts: 13
Credit: 652,594
RAC: 0
Message 406 - Posted: 8 Jan 2020, 21:44:01 UTC
Last modified: 8 Jan 2020, 21:44:16 UTC

Hi all

I have one task that is now at 50 hrs of elapsed time (according to BOINC Manager).

I can see very little CPU time being used by the task which is this:

https://quchempedia.univ-angers.fr/athome/result.php?resultid=702088

Here is the output from the task "properties"

Application NWChem 0.11 (vbox64_t1)
Name od9_0_athome_b3lyp-321gd,batch21,dsgdb9nsd_088715,nwchem,1576787625
State Running
Received 05/01/2020 19:49:45
Report deadline 19/01/2020 19:49:46
Estimated computation size 3,500 GFLOPs
CPU time 00:04:31
CPU time since checkpoint 00:00:01
Elapsed time 2d 02:39:48
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 80.89 MB
Working set size 2.00 GB
Directory slots/1
Process ID 4948
Progress rate 1.800% per hour
Executable vboxwrapper_26200_windows_x86_64.exe


So, I am aborting it, as I prefer to run other tasks that actually do good.
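
The properties listed in this post show why the task looks stalled: the CPU time is tiny compared with the elapsed time. A small sketch of that check follows. It parses the duration strings in the format BOINC Manager displays them here; the one-hour minimum and the 5% threshold are illustrative assumptions, not project guidance.

```python
import re

def parse_duration(text):
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' into seconds."""
    m = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    elapsed = parse_duration(elapsed_time)
    return parse_duration(cpu_time) / elapsed if elapsed else 0.0

def looks_stalled(cpu_time, elapsed_time, min_elapsed=3600, threshold=0.05):
    # Both cut-offs are assumptions chosen for illustration only.
    return (parse_duration(elapsed_time) >= min_elapsed
            and cpu_efficiency(cpu_time, elapsed_time) < threshold)

# Values copied from the task properties in this post.
print(looks_stalled("00:04:31", "2d 02:39:48"))  # True
```

For this task the ratio is 271 s of CPU time over 182,388 s elapsed, which is well under 1%.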

Question: Is this project in alpha, beta or just "development" or is it fully operational?

regards
Tim
ID: 406 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Aurum

Joined: 14 Dec 19
Posts: 68
Credit: 45,744,261
RAC: 0
Message 409 - Posted: 9 Jan 2020, 16:59:26 UTC - in response to Message 399.  

...the next batches...target runtime should be around 12h-24h with huge variability due to large molecular systems.
If you increase the run time without implementing checkpointing you will become a BOINC pariah.

From reading all these problems with windows and the Virtual CatBox it's obvious that you're understaffed. Please do not try to be all things to all people.

Why not focus on the Linux app that works great and focus all of your attention on the science???
Can you actually handle 500,000 WUs a day??? Do you have the storage???

It really sounds like you're hellbent to spread yourself too thin.
ID: 409 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
tcauchy

Joined: 4 Aug 19
Posts: 11
Credit: 74,704,720
RAC: 0
Message 412 - Posted: 10 Jan 2020, 10:33:26 UTC

Dear Aurum,

We are indeed understaffed. I am the theoretical chemist and Benoit is the computer scientist. We are both lecturers with a heavy teaching load.
We also depend on internships.

We do not want to spread ourselves too thin. Clearly the Linux app works great, and the Windows VM is not perfect, but some users manage to make it work.
Around 30k-50k running tasks is perfect. And no, our infrastructure cannot handle 500k WUs.

However, what Benoit meant was that right now we are calculating small molecules with at most 9 atoms of C, N, O and F. That is a strong limitation in terms of chemical diversity.
Our scientific goal is to be at least representative of organic chemistry. That means including more elements like B, S, Cl... and slightly increasing the molecule size to see the impact on machine learning predictions. Therefore, calculation times could be longer, and a habit of automatically aborting long tasks could be problematic.

Cordially,
Thomas
ID: 412 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
[VENETO] boboviz

Joined: 13 Sep 19
Posts: 69
Credit: 399,347
RAC: 0
Message 413 - Posted: 10 Jan 2020, 21:15:38 UTC - in response to Message 409.  

Why not focus on the Linux app that works great and focus all of your attention on the science???

I agree about science, but not about Linux, because there are MANY more Windows clients than Linux ones.
ID: 413 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 425 - Posted: 13 Jan 2020, 20:09:37 UTC - in response to Message 409.  

If you increase the run time without implementing checkpointing you will become a BOINC pariah.

From reading all these problems with windows and the Virtual CatBox it's obvious that you're understaffed. Please do not try to be all things to all people.

Why not focus on the Linux app that works great and focus all of your attention on the science???
Can you actually handle 500,000 WUs a day??? Do you have the storage???

It really sounds like you're hellbent to spread yourself too thin.


As Thomas said, focusing on science implies computing larger molecular systems. These calculations are very serious, focused on our goal, and could lead to important discoveries. Of course we don't want to waste resources. Checkpointing is a work in progress. At first, we thought it came naturally with BOINC. As of today, new work units will use NWChem's own checkpointing. Since it is not a BOINC checkpoint, the progress bar restarts at 0%, but internally the computation resumes from the very last step. It's not as convenient and transparent as we'd like, but the important thing is that it works!
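
To illustrate the behavior described here (work resumes from the last saved step even though the visible progress starts over), a generic checkpoint-and-resume sketch is given below. It is not NWChem's or BOINC's actual mechanism; the file format, function name and `stop_after` parameter are all assumptions made for the example.

```python
import json
import os
import tempfile

def run(total_steps, checkpoint_path, stop_after=None):
    """Run steps up to total_steps, resuming from the checkpoint if present.

    `stop_after` simulates an interruption after that many steps in this call.
    Returns the final state dictionary.
    """
    state = {"next_step": 0, "accum": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    done_now = 0  # progress within this run always starts at zero
    while state["next_step"] < total_steps:
        if stop_after is not None and done_now >= stop_after:
            break
        state["accum"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        done_now += 1
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, checkpoint_path)
    return state

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "checkpoint.json")
    first = run(10, path, stop_after=4)   # interrupted after 4 steps
    second = run(10, path)                # resumes at step 4, not step 0
    print(first["next_step"], second["next_step"])  # 4 10
```

The per-run counter `done_now` plays the role of the progress bar that restarts at zero, while `next_step` in the checkpoint file is the internal position that survives a restart.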

For VirtualBox, we followed the recommendations and it works very well on the development machine. Unfortunately, it's not as good as expected in a heterogeneous production environment. The issue is not so much that we're understaffed (of course we are, and underfunded too); it's mostly that the VirtualBox solution with BOINC is a decoy or even a lie... My solution at home was to install Linux Mint in a VM and run BOINC from inside the VM: it works perfectly on my old Windows 7 machine. So the application is already working quite well. In fact, we already focus on the Linux app, since the VM only loads a Linux system to run the Linux app... The problem is the BOINC client on Windows or Mac, which communicates very badly with the VM. But this part is not under our control...

In terms of hardware, we can't handle 500,000 WUs a day and it is not our goal. Larger molecular systems imply longer computation times, which imply fewer WUs a day. At the moment, the peak is approximately 30,000 WUs a day and it works. We've started to see some limitations in the last couple of days. I've been working today to push those limits... In terms of storage, the server is OK (36 TB free) and we moved the server to a network with a large archiving capacity.

In the meantime, I'm trying to understand and internalize all of the trainee's work.
We're few in number, very busy, poorly funded, but we're working hard.
Today, I started working at 8 AM, and it was 9 PM when I wrote this message. ;-)
ID: 425 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

©2024 Benoit DA MOTA - LERIA, University of Angers, France