New T1 native nwchem work unit affinity problem.

marmot

Joined: 29 Aug 19
Posts: 15
Credit: 159,816
RAC: 0
Message 166 - Posted: 18 Oct 2019, 7:52:11 UTC

On the Linux machine that has 4 cores and runs 4 nwchem T1 tasks at a time, all 4 nwchem processes bind to core 1 and share its time. The Linux process manager therefore shows each nwchem process using 25% of one core while the other 3 CPU cores sit idle.

The T1 WUs don't coordinate their CPU core affinity selection with one another.

Another issue is that I have no way to force the server to send down T4 WUs in order to use all 4 cores.
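
As a stopgap (not an official fix), the running processes can be re-pinned by hand with taskset. A minimal sketch, assuming a 4-core machine and that the worker processes show up as "nwchem" in pgrep:

# Show the current affinity of every nwchem process.
for pid in $(pgrep nwchem); do taskset -cp "$pid"; done

# Pin each nwchem process to its own core (0..3).
core=0
for pid in $(pgrep nwchem); do
   taskset -cp "$core" "$pid"
   core=$(( (core + 1) % 4 ))
done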
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 171 - Posted: 18 Oct 2019, 14:44:48 UTC - in response to Message 166.  

We stopped generating T4 WUs (they are not efficient, plus the affinity issue).
Perhaps you have old WUs; can you check the bash script executed in your slots directory?
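
For reference, one way to inspect them (assuming the default Linux data directory /var/lib/boinc-client; adjust the path if your install differs):

# Show the date and the first lines of each run.sh in the slot directories.
for f in /var/lib/boinc-client/slots/*/run.sh; do
   ls -l "$f"
   head -n 20 "$f"
done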

Thank you
marmot

Joined: 29 Aug 19
Posts: 15
Credit: 159,816
RAC: 0
Message 175 - Posted: 18 Oct 2019, 18:46:18 UTC

run.sh is dated October 3, 2019, as are all *.nw files.

Files in the /bin folder are dated August 30, 2019.
Gunnar Hjern

Joined: 14 Oct 19
Posts: 7
Credit: 2,614,863
RAC: 0
Message 178 - Posted: 19 Oct 2019, 12:53:37 UTC - in response to Message 171.  

I have the same problem on THREE different computers, one of which I installed yesterday, so all of its files and scripts must be completely fresh.
(All of them seem to be running T1 Linux native tasks.)

I would gladly turn on several more computers if I knew they would work efficiently.
Please let us know when Linux computers can run the tasks efficiently.

Have a nice weekend!!

Kindest regards,
Gunnar
mmonnin

Joined: 8 Oct 19
Posts: 13
Credit: 2,548,714
RAC: 0
Message 194 - Posted: 22 Oct 2019, 0:09:41 UTC

Same for me, for the most part. My 2P (dual-socket) system is using just the first thread of each CPU.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 195 - Posted: 22 Oct 2019, 1:57:56 UTC
Last modified: 22 Oct 2019, 2:04:55 UTC

I find that on Linux, the native work units run much better if you set "Max # CPUs" to 2 on your project preferences page,
https://quchempedia.univ-angers.fr/athome/prefs.php?subset=project
and also run a maximum of two work units at a time.

You can control the maximum number downloaded at a time with the "Max # jobs" setting, but for more control over the number actually running,
you can use an "app_config.xml" file placed in the "quchempedia.univ-angers.fr_athome" projects folder (in /var/lib/boinc-client/projects).

If you are not familiar with app_config.xml, create it with a plain-text editor and save it with an ".xml" extension.
It should contain:
<app_config>
   <project_max_concurrent>2</project_max_concurrent>
</app_config>

Then do a "read config files" from the BOINC Manager, or just reboot your computer, to activate it.
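
The same re-read can be triggered from a terminal with the standard boinccmd tool (it reloads cc_config.xml and any app_config.xml files):

boinccmd --read_cc_config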

Other concurrency settings may work better, but two works relatively well for me.

This has turned work units that were impossible to run into successes for me, both T1 and T2, on several machines.
marmot

Joined: 29 Aug 19
Posts: 15
Credit: 159,816
RAC: 0
Message 196 - Posted: 22 Oct 2019, 5:01:39 UTC - in response to Message 195.  


<app_config>
   <project_max_concurrent>2</project_max_concurrent>
</app_config>


This was added to my configuration 30+ days ago and does nothing to correct the issue discussed in this thread.

The T1 WUs all attach to a single core, so even max_concurrent = 2 would still leave CPU cores sitting unused.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 200 - Posted: 22 Oct 2019, 6:22:13 UTC - in response to Message 196.  
Last modified: 22 Oct 2019, 6:26:12 UTC

The T1 WUs all attach to a single core, so even max_concurrent = 2 would still leave CPU cores sitting unused.

It works for me. Running the "top" command shows four cores fully utilized by "nwchem".
Maybe it is not the same for your CPU architecture?
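
For anyone who wants to repeat this check: press "1" inside top to toggle the per-core view, or print the core each nwchem process last ran on (the psr column) with ps:

ps -o pid,psr,pcpu,comm -C nwchem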
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 201 - Posted: 22 Oct 2019, 6:26:32 UTC - in response to Message 200.  
Last modified: 22 Oct 2019, 6:27:49 UTC

I have 16 virtual cores, by the way, if it makes a difference.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 206 - Posted: 23 Oct 2019, 5:43:44 UTC

I now see some long ones (25-day estimate) in the buffer, so this procedure may not be a fix for them. We will see.
