Message boards :
Number crunching :
Very little CPU usage
Author | Message |
---|---|
Send message Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0 |
damotbe, the problem is that you are wasting resources. When 64 WUs are launched on a 64-thread machine, only 2 threads are used on a single-CPU machine and 4 threads on a dual-CPU machine. That means the threads are overcommitted by 32:1 or 16:1, respectively. It is a total waste of machine capability. Without your program setting affinity, a single machine could produce a lot more work for the project. The program continuously launches "child" processes and kills the old ones, so an affinity script must run continuously. The best solution is to remove the affinity control from the program's executable and let Linux decide which threads to use. |
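The overcommit figures in the post above can be checked with a few lines of arithmetic. This is only a sketch: the function name `overcommit_ratio` is made up for illustration, and the thread counts (2 and 4) are the ones reported in the post.

```python
def overcommit_ratio(work_units: int, usable_threads: int) -> float:
    """Number of work units competing for each hardware thread actually in use."""
    if usable_threads <= 0:
        raise ValueError("usable_threads must be positive")
    return work_units / usable_threads


# Figures reported in the post: 64 WUs confined to 2 or 4 threads.
single_cpu = overcommit_ratio(64, 2)
dual_cpu = overcommit_ratio(64, 4)
print(f"single CPU: {single_cpu:.0f}:1, dual CPU: {dual_cpu:.0f}:1")
# prints "single CPU: 32:1, dual CPU: 16:1"
```

If the scheduler shares each thread fairly, the same ratio is roughly the slowdown per work unit compared with one WU per thread.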
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
"damotbe, the problem is that you are wasting resources. When 64 WUs are launched on a 64-thread machine, only 2 threads are used on a single-CPU machine and 4 threads on a dual-CPU machine. That means the threads are overcommitted by 32:1 or 16:1, respectively. It is a total waste of machine capability. Without your program setting affinity, a single machine could produce a lot more work for the project." Yes, we know that, and we are very concerned about this issue. The affinity is auto-tuned by mpirun. We tried multiple parameters of this wrapper. It is very difficult to assess the changes, since multiple versions of the code are running. Secondly, the static compilation of NWChem (not our code) is very tricky, and we tried to bypass MPI. Thirdly, we no longer have an engineer (the reality of public research...), so it is hard to solve this particular issue, but we are trying! |
Send message Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0 |
I don't know whether the VBox version actually assigns CPU affinity, since I haven't run it. For the native Linux app, which runs well, you should warn people that running more than 1 WU per CPU only increases the time to completion. On my dual-CPU Intel machines, running 64 WUs means each WU would take 16 times longer to complete. I hope you find a solution, because the native Linux app runs very well. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
We reproduced the problem on my dual-socket Xeon and wrote an affinity script to include in the native app. We will try to commit the change tonight. |
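The project's affinity script is not shown in this thread. As a rough illustration of the idea only, here is a minimal sketch that widens the CPU affinity of a list of processes so the Linux scheduler can place them on any available thread. It assumes Linux (the `os.sched_*` calls are not available elsewhere), and the function name `reset_affinity` is hypothetical.

```python
import os


def reset_affinity(pids, cpus=None):
    """Allow each PID to run on every CPU in `cpus`; return the PIDs that were updated."""
    if cpus is None:
        # CPUs the calling process itself is allowed to use (Linux only).
        cpus = os.sched_getaffinity(0)
    updated = []
    for pid in pids:
        try:
            os.sched_setaffinity(pid, cpus)
            updated.append(pid)
        except (ProcessLookupError, PermissionError):
            # Short-lived child processes may exit before we reach them,
            # and processes owned by another user cannot be changed.
            continue
    return updated


if __name__ == "__main__":
    # Demonstrate on the current process only.
    print(reset_affinity([os.getpid()]))
```

Because the application keeps spawning new child processes, a script like this would have to be rerun periodically, which is why removing the pinning at the source is the cleaner fix.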
Send message Joined: 3 Oct 19 Posts: 14 Credit: 32,908,253 RAC: 0 |
Excellent! That will greatly help the project and the users :) |
Send message Joined: 2 Jan 20 Posts: 1 Credit: 31,106 RAC: 0 |
Is this the behavior people are talking about? I have a bunch of WUs not using any CPU whatsoever but still running. |
Send message Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0 |
I see I am not alone!!! |
Send message Joined: 13 Sep 19 Posts: 69 Credit: 399,347 RAC: 0 |
"Is this the behavior people are talking about?" Yes. "I have a bunch of WUs not using any CPU whatsoever but still running." Kill these WUs. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
If I understand correctly, the problem comes from the VirtualBox Guest Additions, which are needed for the shared directory (to read and write outside the VM). If they fail, the calculation is not launched (no access to data and scripts), but the VM continues to run... And without access to our scripts, we cannot tell the VM to stop. I am adding a workaround to the to-do list. In the meantime, do not hesitate to abort these work units. If all your work units do this, another version of VirtualBox and the Guest Additions may do the trick. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I just thought I should mention that native NWChem is working OK on an Ubuntu 18.04.3 machine. Looking at BOINCTasks, it seems at first that not all the cores are being used. However, running the "top" command shows that all 12 cores of my Ryzen 2600 are actually in use (with one on Folding). So it may be the case that the "core affinity" problem is still there, but it doesn't matter for me. Even if the work units are jumping between cores, all of them are in use. It has been running very well for months. |
Send message Joined: 11 Oct 19 Posts: 4 Credit: 1,604,204 RAC: 0 |
Do we abort long running tasks stuck at 100% for 2 days? Seems an absolute waste of time and electricity. Can this issue be fixed? |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
"Do we abort long running tasks stuck at 100% for 2 days?" I abort any WU taking longer than 12 hours to complete. Haven't been sorry yet. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Current work units should be short (<12h), with a mean runtime of 1.5 hours. However, this rule of thumb will no longer apply to the next batches: the target runtime should be around 12h-24h, with huge variability due to large molecular systems. |
Send message Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0 |
"Current work units should be short (<12h) with a mean runtime of 1.5 hours." Hello Damotbe, thank you for the information about the new WUs. About 50% of my "short" WUs are stuck; the others run in about 2 hours. But if a WU has to wait (and monopolize full resources while the host is doing nothing) for up to one day, that is not efficient. Do you think we should wait several days, only to cancel in the end? |
Send message Joined: 8 Dec 19 Posts: 13 Credit: 652,594 RAC: 0 |
Hi all, I have one task that is now at 50 hours of elapsed time (according to BOINC Manager). I can see very little CPU time being used by the task, which is this one: https://quchempedia.univ-angers.fr/athome/result.php?resultid=702088 Here is the output from the task "properties": Application NWChem 0.11 (vbox64_t1). So I am aborting it, as I prefer to run other tasks that actually do some good. Question: is this project in alpha, beta, or just "development", or is it fully operational? Regards, Tim |
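The symptom reported in this thread (a long elapsed time with almost no CPU time) can be turned into a simple check before aborting a task by hand. The sketch below is an assumption-laden illustration: the `Task` record, the `looks_stuck` name, and both thresholds are hypothetical, and the numbers would have to be read from BOINC Manager or similar.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    elapsed_hours: float  # wall-clock time since the task started
    cpu_hours: float      # CPU time the task has actually consumed


def looks_stuck(task: Task, max_hours: float = 12.0, min_cpu_fraction: float = 0.05) -> bool:
    """Flag a task only if it is both long-running and nearly idle."""
    if task.elapsed_hours <= 0:
        return False
    too_long = task.elapsed_hours > max_hours
    nearly_idle = task.cpu_hours / task.elapsed_hours < min_cpu_fraction
    return too_long and nearly_idle


tasks = [
    Task("idle_vm", elapsed_hours=50.0, cpu_hours=0.5),     # like the 50 h task above
    Task("big_molecule", elapsed_hours=20.0, cpu_hours=19.0),
    Task("just_started", elapsed_hours=2.0, cpu_hours=0.0),
]
print([t.name for t in tasks if looks_stuck(t)])
# prints "['idle_vm']"
```

Requiring both conditions keeps a busy long-running task from being flagged, which matters if future batches really do take 12 to 24 hours.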
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
"...the next batches...target runtime should be around 12h-24h with huge variability due to large molecular systems." If you increase the run time without implementing checkpointing, you will become a BOINC pariah. From reading about all these problems with Windows and the Virtual CatBox, it's obvious that you're understaffed. Please do not try to be all things to all people. Why not focus on the Linux app that works great and focus all of your attention on the science??? Can you actually handle 500,000 WUs a day??? Do you have the storage??? It really sounds like you're hellbent on spreading yourself too thin. |
Send message Joined: 4 Aug 19 Posts: 11 Credit: 74,704,720 RAC: 0 |
Dear Aurum, We are indeed understaffed. I am the theoretical chemist and Benoit is the computer scientist. We are both lecturers with a heavy teaching load. We also depend on internships. We do not want to spread ourselves too thin. Clearly the Linux app works great, and the Windows VM is not perfect, but some users manage to make it work. Around 30k-50k running tasks is perfect. And no, our infrastructure cannot handle 500k WUs. However, what Benoit meant was that right now we are calculating small molecules with at most 9 atoms of C, N, O and F. That is a strong limitation in terms of chemical diversity. Our scientific goal is to be at least representative of organic chemistry. That means including more elements like B, S, Cl... and slightly increasing the molecule size to see the impact on machine learning predictions. Therefore, calculation times could be longer, and a habit of automatically aborting could be problematic. Cordially, Thomas |
Send message Joined: 13 Sep 19 Posts: 69 Credit: 399,347 RAC: 0 |
"Why not focus on the Linux app that works great and focus all of your attention on the science???" I agree about the science, but not about Linux, because there are MANY more Windows clients than Linux ones. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
"If you increase the run time without implementing checkpointing you will become a BOINC pariah." As Thomas said, focusing on science implies computing larger molecular systems. These calculations are very serious, focused on our goal, and could lead to important discoveries. Of course we don't want to waste resources. Checkpointing is a work in progress. At first, we thought it came naturally with BOINC. As of today, the next work units will implement (NWChem) checkpointing. Since it is not a BOINC checkpoint, the progress bar restarts at 0%, but internally the computation restarts from the very last step. It's not as convenient and transparent as we'd like, but the important thing is that it works! For VirtualBox, we followed the recommendations and it works very well on the development machine. Unfortunately, it's not as good as expected in a heterogeneous production environment. It's not that we're understaffed (of course we are, and we are underfunded), but mostly that the VirtualBox solution with BOINC is a decoy or even a lie... My solution at home was to install Linux Mint in a VM and to run BOINC from inside the VM: it works perfectly on my old Windows 7 machine. So the application is already working quite well. In fact, we already focus on the Linux app, since the VM only loads a Linux system to run the Linux app... The problem is the BOINC client on Windows or Mac, which communicates very badly with the VM. But this part is not under our control... In terms of hardware, we can't handle 500,000 WUs a day, and that is not our goal. Larger molecular systems imply longer computation times, which imply fewer WUs a day. At the moment, the peak is approximately 30,000 WUs a day and it works. We've started to see some limitations in the last couple of days. I've been working today to push those limits... In terms of storage, the server is OK (36 TB free), and we moved the server to a network with a large archiving capacity.
In the meantime, I'm trying to understand and internalize all of the trainee's work. We're few in number, very busy, and poorly funded, but we're working hard. Today I started working at 8 AM, and as I write this message it is 9 PM. ;-) |
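The behavior described above, where the progress display restarts but the computation resumes from its last saved step, is the general pattern of application-level checkpointing. The sketch below illustrates that pattern only; it is not NWChem's or BOINC's mechanism, and the function name, the JSON file, and the toy "work" are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_with_checkpoint(total_steps, path, stop_after=None):
    """Run up to total_steps, saving state after each step; resume if a checkpoint exists."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume from the last completed step
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulate the task being interrupted
        state["acc"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        done_this_run += 1
        # Write to a temporary file and rename it, so an interruption
        # during the write leaves the previous checkpoint intact.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    print(run_with_checkpoint(10, ckpt, stop_after=3))  # {'step': 3, 'acc': 3}
    print(run_with_checkpoint(10, ckpt))                # {'step': 10, 'acc': 45}
```

The second call does not redo steps 0 to 2: it picks up at step 3, which is the property that makes long work units tolerable on hosts that are restarted or suspended.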
©2024 Benoit DA MOTA - LERIA, University of Angers, France