Very little CPU usage

Bryan Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0	Message 76 - Posted: 8 Oct 2019, 16:00:06 UTC - in response to Message 73.
damotbe, the problem is that you are wasting resources. When 64 WUs are launched on a 64-thread machine, only 2 threads are used on a single-CPU machine and only 4 threads on a dual-CPU machine. That means the threads in use are overcommitted by 32:1 or 16:1, respectively. It is a total waste of machine capability. Without your program setting affinity, a single machine could produce a lot more work for the project. The program continuously launches "child" processes and kills the old ones, so an affinity script would have to run continuously. The best solution is to remove the affinity control from the program's executable and let Linux decide which threads to use.
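
The overcommit figures in this post can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic; the function name `oversubscription_ratio` is made up for illustration and is not part of BOINC or the project's code.

```python
def oversubscription_ratio(running_tasks: int, usable_threads: int) -> float:
    """Number of tasks competing for each hardware thread that is actually used."""
    if usable_threads <= 0:
        raise ValueError("usable_threads must be positive")
    return running_tasks / usable_threads


# Figures from the post: 64 work units on a 64-thread machine.
single_cpu = oversubscription_ratio(64, 2)  # only 2 threads end up being used
dual_cpu = oversubscription_ratio(64, 4)    # only 4 threads end up being used
print(f"single CPU: {single_cpu:.0f}:1, dual CPU: {dual_cpu:.0f}:1")
# prints "single CPU: 32:1, dual CPU: 16:1"
```

If every task gets an equal share of the threads in use, the same ratio is roughly the slowdown per work unit compared with one task per thread.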

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 85 - Posted: 9 Oct 2019, 6:19:17 UTC - in response to Message 76.
> damotbe, the problem is that you are wasting resources. When 64 WU are launched on a 64t machine only 2 threads are used on a single CPU and 4 threads get used on a dual CPU machine. That means that the threads are over committed by either 16:1 or 32:1. It is a total waste of machine capability. Without your program setting affinity a single machine could produce a lot more work for the project. The program continuously launches "child" processes and kills the old processes so an affinity script must continuously run. The best solution is to remove the affinity control from the program's executable. Let Linux decide what threads to use.
Yes, we know that, and we are very concerned about this issue. The affinity is auto-tuned by mpirun. We tried multiple parameters of this wrapper. It is very difficult to assess the changes, since multiple versions of the code are running. Secondly, the static compilation of NWChem (not our code) is very tricky, and we tried to bypass MPI. Thirdly, we no longer have an engineer (the reality of public research...), so it is hard to solve this particular issue, but we are trying!

Bryan Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0	Message 89 - Posted: 9 Oct 2019, 14:15:49 UTC - in response to Message 85.
I don't know if the VBox version actually assigns CPU affinity, since I haven't run it. On native Linux, which runs well, you should warn people that running more than 1 WU per CPU only increases the time to completion. On my dual-CPU Intel machines, running 64 WUs means each WU would take 16 times longer to complete. I hope you find a solution, because the native Linux app runs very well.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 94 - Posted: 10 Oct 2019, 11:53:44 UTC - in response to Message 89.
We reproduced the problem on my dual-socket Xeon and wrote an affinity script to include in the native app. We will try to commit the change tonight.
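
The affinity script itself is not shown in the thread. Purely as an illustration of the idea, on Linux a process can be allowed back onto every CPU with Python's `os.sched_setaffinity`. The helper name `reset_affinity` is hypothetical, the calls are Linux-only, and a real workaround would have to repeat this for each newly spawned child process, as described earlier in the thread.

```python
import os


def reset_affinity(pid: int = 0) -> set[int]:
    """Let `pid` (0 means the current process) run on every CPU it is allowed to use.

    Assumption: Linux only; os.sched_setaffinity and os.sched_getaffinity are not
    available on Windows or macOS.
    """
    all_cpus = set(range(os.cpu_count() or 1))
    # The kernel intersects this mask with the CPUs the process may actually use.
    os.sched_setaffinity(pid, all_cpus)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    print("current process may now run on CPUs:", sorted(reset_affinity()))
```

Changing the affinity of another user's process requires the appropriate privileges, so a script like this would normally run under the same account as the BOINC client.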

Bryan Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0	Message 96 - Posted: 10 Oct 2019, 13:54:33 UTC - in response to Message 94.
Excellent! That will greatly help the project and the users :)

CCPLogibro Joined: 2 Jan 20 Posts: 1 Credit: 31,106 RAC: 0	Message 390 - Posted: 6 Jan 2020, 17:19:10 UTC
Is this the behavior people are talking about? I have a bunch of WUs not using any CPU whatsoever but still running.

marsinph Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0	Message 391 - Posted: 6 Jan 2020, 18:32:03 UTC - in response to Message 390.
I see I am not alone!!!

[VENETO] boboviz Joined: 13 Sep 19 Posts: 69 Credit: 399,347 RAC: 0	Message 392 - Posted: 6 Jan 2020, 18:58:13 UTC - in response to Message 390.
> Is this the behavior people are talking about?
Yes.
> I have a bunch of WUs not using any CPU whatsoever but still running.
Kill these WUs.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 393 - Posted: 7 Jan 2020, 10:16:24 UTC - in response to Message 392.
If I understand correctly, the problem comes from the VirtualBox Guest Additions, which are needed for the shared directory (to read and write outside the VM). If they fail, the calculation is not launched (no access to data and scripts), but the VM continues to run... And without access to our scripts, we cannot tell the VM to stop. I am adding a workaround to the to-do list. In the meantime, do not hesitate to abort these work units. If all your work units do this, another version of VirtualBox and the Guest Additions may do the trick.

Jim1348 Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0	Message 394 - Posted: 7 Jan 2020, 14:38:29 UTC - Last modified: 7 Jan 2020, 15:02:52 UTC
I just thought I should mention that native NWChem is working OK on an Ubuntu 18.04.3 machine. Looking at BOINCTasks, it seems at first that not all the cores are being used. However, running the "top" command shows that all 12 cores of my Ryzen 2600 are actually in use (with one on Folding). So it may be the case that the "core affinity" problem is still there, but it doesn't matter for me. Even if the work units are jumping between cores, all of the cores are in use. It has been running very well for months.

vaughan Joined: 11 Oct 19 Posts: 4 Credit: 1,604,204 RAC: 0	Message 396 - Posted: 8 Jan 2020, 3:12:19 UTC
Do we abort long-running tasks that have been stuck at 100% for 2 days? It seems an absolute waste of time and electricity. Can this issue be fixed?

swiftmallard Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0	Message 397 - Posted: 8 Jan 2020, 3:40:33 UTC - in response to Message 396.
> Do we abort long running tasks stuck at 100% for 2 days? Seems an absolute waste of time and electricity. Can this issue be fixed?
I abort any WU taking longer than 12 hours to complete. Haven't been sorry yet.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 399 - Posted: 8 Jan 2020, 10:23:49 UTC - in response to Message 397.
Current work units should be short (<12h), with a mean runtime of 1.5 hours. However, this rule of thumb will no longer apply to the next batches: the target runtime should be around 12h-24h, with huge variability due to large molecular systems.

marsinph Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0	Message 401 - Posted: 8 Jan 2020, 14:13:51 UTC - in response to Message 399.
> Current work units should be short (<12h) with a mean runtime of 1.5 hours. However, for the next batches this rule of thumb will no longer apply, target runtime should be around 12h-24h with huge variability due to large molecular systems.
Hello Damotbe,
Thank you for the information about the new WUs. About 50% of my "short" WUs are stuck. The others run in about 2 hours. But if a WU needs up to one day (and monopolizes full resources while the host is doing nothing), it is not efficient. Do you think we will wait several days, only to cancel in the end?

UBT - Timbo Joined: 8 Dec 19 Posts: 13 Credit: 652,594 RAC: 0	Message 406 - Posted: 8 Jan 2020, 21:44:01 UTC - Last modified: 8 Jan 2020, 21:44:16 UTC
Hi all,
I have one task that is now at 50 hrs of elapsed time (according to BOINC Manager). I can see very little CPU time being used by the task, which is this one: https://quchempedia.univ-angers.fr/athome/result.php?resultid=702088
Here is the output from the task "properties":
Application: NWChem 0.11 (vbox64_t1)
Name: od9_0_athome_b3lyp-321gd,batch21,dsgdb9nsd_088715,nwchem,1576787625
State: Running
Received: 05/01/2020 19:49:45
Report deadline: 19/01/2020 19:49:46
Estimated computation size: 3,500 GFLOPs
CPU time: 00:04:31
CPU time since checkpoint: 00:00:01
Elapsed time: 2d 02:39:48
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 80.89 MB
Working set size: 2.00 GB
Directory: slots/1
Process ID: 4948
Progress rate: 1.800% per hour
Executable: vboxwrapper_26200_windows_x86_64.exe
So I am aborting it, as I prefer to run other tasks that actually do some good.
Question: is this project in alpha, beta, or just "development", or is it fully operational?
Regards,
Tim
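
The symptom in these task properties (minutes of CPU time against days of elapsed time) can be turned into a simple check. The sketch below is only an illustration: the helper names and the thresholds (one hour minimum elapsed time, 5% CPU efficiency) are arbitrary assumptions, not values used by BOINC or by the project.

```python
import re


def parse_duration(text: str) -> int:
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' (as shown by BOINC Manager) to seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of the elapsed wall-clock time that was spent on the CPU."""
    elapsed = parse_duration(elapsed_time)
    return parse_duration(cpu_time) / elapsed if elapsed else 0.0


def looks_stalled(cpu_time: str, elapsed_time: str,
                  min_elapsed_s: int = 3600, threshold: float = 0.05) -> bool:
    """Heuristic: the task has run for a while but has used almost no CPU."""
    return (parse_duration(elapsed_time) >= min_elapsed_s
            and cpu_efficiency(cpu_time, elapsed_time) < threshold)


# Values from the task properties above.
print(f"{cpu_efficiency('00:04:31', '2d 02:39:48'):.2%}")  # prints "0.15%"
print(looks_stalled("00:04:31", "2d 02:39:48"))            # prints "True"
```

A healthy single-threaded task would normally show CPU time close to elapsed time, so a ratio this low is a reasonable hint that the VM is idling rather than computing.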

Aurum Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0	Message 409 - Posted: 9 Jan 2020, 16:59:26 UTC - in response to Message 399.
> ...the next batches...target runtime should be around 12h-24h with huge variability due to large molecular systems.
If you increase the run time without implementing checkpointing, you will become a BOINC pariah. From reading about all these problems with Windows and the Virtual CatBox, it's obvious that you're understaffed. Please do not try to be all things to all people. Why not focus on the Linux app, which works great, and put all of your attention on the science??? Can you actually handle 500,000 WUs a day??? Do you have the storage??? It really sounds like you're hell-bent on spreading yourselves too thin.

tcauchy Joined: 4 Aug 19 Posts: 11 Credit: 74,704,720 RAC: 0	Message 412 - Posted: 10 Jan 2020, 10:33:26 UTC
Dear Aurum,
We are indeed understaffed. I am the theoretical chemist and Benoit is the computer scientist. We are both lecturers with a heavy teaching load. We also depend on internships. We do not want to spread ourselves too thin. Clearly the Linux app works great, and the Windows VM is not perfect, but some users manage to make it work. Around 30k-50k running tasks is perfect. And no, our infrastructure cannot handle 500k WUs.
However, what Benoit meant was that right now we are calculating small molecules with at most 9 atoms of C, N, O and F. That is a strong limitation in terms of chemical diversity. Our scientific goal is to be at least representative of organic chemistry. That means including more elements, like B, S, Cl..., and slightly increasing the molecule size to see the impact on machine learning predictions. Calculation times could therefore be longer, and a habit of automatically aborting tasks could be problematic.
Cordially,
Thomas

[VENETO] boboviz Joined: 13 Sep 19 Posts: 69 Credit: 399,347 RAC: 0	Message 413 - Posted: 10 Jan 2020, 21:15:38 UTC - in response to Message 409.
> Why not focus on the Linux app that works great and focus all of your attention on the science???
I agree about the science, but not about Linux, because there are MANY more Windows clients than Linux ones.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 425 - Posted: 13 Jan 2020, 20:09:37 UTC - in response to Message 409.
> If you increase the run time without implementing checkpointing you will become a BOINC pariah. From reading all these problems with windows and the Virtual CatBox it's obvious that you're understaffed. Please do not try to be all things to all people. Why not focus on the Linux app that works great and focus all of your attention on the science??? Can you actually handle 500,000 WUs a day??? Do you have the storage??? It really sounds like you're hellbent to spread yourself too thin.
As Thomas said, focusing on science implies computing larger molecular systems. These calculations are very serious, are focused on our goal, and could lead to important discoveries. Of course, we don't want to waste resources.
Checkpointing is a work in progress. At first, we thought it came naturally with BOINC. As of today, new work units will implement (NWChem) checkpointing. Since it is not a BOINC checkpoint, the progress bar restarts at 0%, but internally the computation restarts from the very last step. It's not as convenient and transparent as we'd like, but the important thing is that it works!
For VirtualBox, we followed the recommendations, and it works very well on the development machine. Unfortunately, it's not as good as expected in a heterogeneous production environment. It's not that we're understaffed (of course we are, and we are underfunded); it's mostly that the VirtualBox solution with BOINC is a decoy or even a lie... My solution at home was to install Linux Mint and run BOINC from inside the VM: it works perfectly on my old Windows 7 machine. So the application is already working quite well. In fact, we already focus on the Linux app, since the VM only loads a Linux system to run the Linux app... The problem is the BOINC client on Windows or Mac, which communicates very badly with the VM. But this part is not under our control...
In terms of hardware, we can't handle 500,000 WUs a day, and that is not our goal. Larger molecular systems imply longer computation times, which imply fewer WUs a day. At the moment, the peak is approximately 30,000 WUs a day, and it works. We've started to see some limitations in the last couple of days. I've been working today to push those limits... In terms of storage, the server is OK (36 TB free), and we moved the server to a network with a large archiving capacity.
In the meantime, I'm trying to understand and internalize all of the trainee's work. We're few in number, very busy, and poorly funded, but we're working hard. Today I started working at 8 AM, and it was 9 PM when I wrote this message. ;-)
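
The checkpointing behaviour described in this post (the progress bar restarting at 0% while the computation resumes from the last saved step) can be illustrated with a generic application-level checkpoint. This is a minimal sketch under stated assumptions: it is not NWChem's restart mechanism and not the BOINC checkpoint API, and the file format, function name and `stop_after` parameter are invented for the example.

```python
import json
import os
import tempfile


def run(total_steps: int, checkpoint_path: str, stop_after=None):
    """Run steps, saving progress after each one; resume from the checkpoint if present.

    Returns (first_step_of_this_run, steps_done_in_this_run).
    `stop_after` simulates an interruption after that many steps.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            start = json.load(fh)["next_step"]

    done_this_run = 0
    for step in range(start, total_steps):
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated shutdown or task suspension
        # ... one unit of real work would go here ...
        done_this_run += 1
        # Write to a temporary file first so a crash never leaves a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump({"next_step": step + 1}, fh)
        os.replace(tmp_path, checkpoint_path)
    return start, done_this_run


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    first = run(10, path, stop_after=4)  # interrupted after 4 steps
    second = run(10, path)               # resumes at step 4, not at step 0
    print(first, second)                 # prints "(0, 4) (4, 6)"
```

A wrapper that only counts the steps done in the current run would report 0% again after the restart, even though only the remaining steps are recomputed, which matches the behaviour described above.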