New T2 affinity problem persistent on Linux

Gunnar Hjern

Joined: 14 Oct 19
Posts: 7
Credit: 2,614,863
RAC: 0
Message 212 - Posted: 24 Oct 2019, 16:26:37 UTC

It now seems possible to run even two T2 tasks in parallel on a 4-core CPU,
at least if the quchempedia project was installed after 20 Oct.
I now have at least two (4-CPU) computers that can run two T2 tasks simultaneously,
with about 98% load on all threads when inspecting with "top" and similar tools.
Good!! :-)

HOWEVER, if the quchempedia project was installed earlier
(typically between 14 and 19 Oct.), it doesn't matter what actions are taken:
the T2 tasks stubbornly continue to run on the same two cores!!! :-(

I have tried to reset the project, and I have tried many times to completely remove the
project from the BOINC Manager, but without any success.

I have also repeatedly removed the whole BOINC software suite from
the computer (using sudo apt-get purge boinc boinc-manager boinc-client),
restarted the machine and reinstalled the BOINC software, but without
any success!

Between removal and reinstallation I searched the whole file system tree
for any file named (or containing) "quchem" or "nwchem", but everything seemed
to have been erased as it should be.

How is this possible????
What magic system setting has been done to control the affinity???


Please give me a hint about which file to erase, or which setting to change, to get
the project to use all four cores without any affinity problem.
(I'm currently using Xubuntu 18.04 on most machines, although a few remain
on Xubuntu 14.04.)

I have at least 6 more 4-core machines and two 8-core Xeons that I'm planning
to hook up to this very interesting project, but I will not do that before this affinity
issue is sorted out, so that the machines can run efficiently and are not tainted
by old installations.

Kindest regards,
Gunnar Hjern

damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 217 - Posted: 25 Oct 2019, 15:39:34 UTC - in response to Message 212.  

That is incredible! We do not change any settings on your system. It clearly looks like magic at the moment.

Perhaps something was not correctly deleted by "sudo apt-get purge boinc boinc-manager boinc-client".
I suggest you try it again and then check
/var/lib/boinc-client/projects
and
/var/lib/boinc-client/slots

The affinity settings are in a script called run.sh (if I remember correctly).
You can try:
sudo find /var/lib/boinc-client/ -name "run.sh"

Gunnar Hjern

Message 221 - Posted: 25 Oct 2019, 20:11:03 UTC - in response to Message 217.  
Last modified: 25 Oct 2019, 20:19:02 UTC

Hi Damotbe!

Thanks for the fast response!

The computer that I was referring to is a bit occupied at the moment, but I will try to reinstall some time this weekend.
Instead I tried to run the project on a 4-core, 8-thread Xeon E3-1245 v2:
https://quchempedia.univ-angers.fr/athome/show_host_detail.php?hostid=526

Apparently the affinities are not yet quite correct as some processes are sharing
the same cores/threads while other threads are idle - see the "top" dump below.

Two T1 and three T2 tasks were running at the time, but they all shared CPU threads 0, 1, 4, and 5,
leaving the other four CPU threads idle. (At least I think that it is those thread numbers, based on
what I could see from the CPU-graph applet that I'm running in the top panel.
If you know of any better way to get detailed stats about each CPU core, please tell me.)
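
One possible way to check this from a terminal, offered only as a sketch: the kernel records the CPUs each process may use in /proc/<pid>/status (the Cpus_allowed_list field), and "ps" can report the CPU a process last ran on (the psr column). The function names allowed_cpus and show_placement are made up for this illustration.

```shell
#!/usr/bin/env bash
# Print the CPUs a process is allowed to run on, as recorded by the kernel.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ {print $2}' "/proc/$1/status"
}

# Show the allowed CPUs and the CPU last used for each process
# whose name matches exactly (hypothetical helper).
show_placement() {
    local name=$1 pid
    for pid in $(pgrep -x "$name"); do
        printf '%s allowed=%s last_cpu=%s\n' "$pid" \
            "$(allowed_cpus "$pid")" "$(ps -o psr= -p "$pid" | tr -d ' ')"
    done
}

# Prints nothing if no nwchem process is running.
show_placement nwchem
```

Inside "top" itself, pressing "1" toggles a per-CPU load display, which may be easier to read than a panel applet.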

I'll return later this weekend if I (re-)install any other computer.

Kindest regards,
Gunnar


Top dump from the Xeon 1245-computer:
-------------------------------------
top - 21:41:32 up 72 days, 21:28,  4 users,  load average: 8,28, 8,79, 8,89
Tasks: 278 total,  11 running, 267 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1,1 us,  0,1 sy, 50,3 ni, 48,4 id,  0,1 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem:   8116444 total,  7775728 used,   340716 free,   668632 buffers
KiB Swap: 24668152 total,      180 used, 24667972 free.  4319840 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                         
 4491 boinc     39  19 1267260 110316  14444 R  69,1  1,4  18:59.11 nwchem                          
 4388 boinc     39  19 1270036  90188  14232 R  65,5  1,1  19:36.30 nwchem                          
 4519 boinc     39  19 1267268 106368  14164 R  65,1  1,3  18:34.82 nwchem                          
 4360 boinc     39  19 1273396 105776  14216 R  40,5  1,3  12:10.74 nwchem                          
 4415 boinc     39  19 1293968 114948  14084 R  40,2  1,4  11:36.14 nwchem                          
 4387 boinc     39  19 1288444 107036  15392 R  39,9  1,3  11:47.09 nwchem                          
 4490 boinc     39  19 1268276 105596  15864 R  39,9  1,3  11:21.98 nwchem                          
 4518 boinc     39  19 1268484 108728  15616 R  39,9  1,3  11:13.75 nwchem                          
29316 gunnar    20   0  585988  98152  32968 R   5,3  1,2   1138:10 boincmgr                        
 1169 root      20   0  355236 116868  66496 R   3,7  1,4 958:53.72 Xorg                            
 4537 gunnar    20   0 1257804 298636 101488 S   1,3  3,7   0:54.04 firefox                         
 1661 boinc     30  10  386976  47880   9736 S   0,7  0,6 236:46.30 boinc                           
 2119 gunnar    20   0  492028  32236  24796 S   0,7  0,4  24:09.86 xfwm4                           
 1014 root      20   0   19208   2208   1980 S   0,3  0,0   6:07.35 irqbalance                      

damotbe
Message 230 - Posted: 27 Oct 2019, 18:00:13 UTC - in response to Message 221.  
Last modified: 27 Oct 2019, 18:00:30 UTC

Hi Gunnar !

I have the same problem on one Linux machine. In a terminal I run (as root, or with sudo just before "taskset"):

while [[ 1 -eq 1 ]]; do for pid in $(pgrep nwchem); do taskset -p 0xffffffff $pid; done; sleep 60; done


It should help while we work on solving the problem.
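
A variant of the one-liner above, as a sketch rather than a tested fix: it builds the CPU list from the number of logical CPUs reported by "nproc" instead of using a fixed mask, and passes "-a" so that taskset applies the new affinity to every thread of each process. The function names cpu_range and widen_nwchem_affinity are invented for this example.

```shell
#!/usr/bin/env bash
# Build a CPU list such as "0-3" covering the first N logical CPUs.
cpu_range() {
    local n=$1
    if (( n <= 1 )); then
        printf '0\n'
    else
        printf '0-%d\n' "$(( n - 1 ))"
    fi
}

# Widen the affinity of every running nwchem process once.
# Needs root (or sudo), because the tasks belong to the boinc user.
widen_nwchem_affinity() {
    local cpus pid
    cpus=$(cpu_range "$(nproc --all)")
    for pid in $(pgrep -x nwchem); do
        # -a: all threads, -c: CPU list instead of a hex mask, -p: existing PID
        taskset -a -c -p "$cpus" "$pid"
    done
}

# To repeat it every minute, as the one-liner does:
#   while true; do widen_nwchem_affinity; sleep 60; done
```

Like the original loop, this only overrides the affinity after a task has started, so it is a workaround and not a fix for whatever sets the restrictive mask in the first place.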

Kind regards
Benoit

Gunnar Hjern
Message 231 - Posted: 27 Oct 2019, 23:23:10 UTC - in response to Message 230.  

Hi Benoit!

Thanks for the script!!!

I've tested it on several different computers and it solved the problem on all of them! :-)
I've appended some of the first outputs below in case they can be of any help to you.
I'll wait until tomorrow to fix the rest of the computers.

Furthermore, it now seems that I was mistaken about that special computer that
I described above. Instead the case seems to be that two T2 tasks can run in parallel
on CPUs with two cores and two hyperthreads on each core, while CPUs with four
physical cores show the affinity problem. The two other computers on which I tested
starting T2 tasks were actually laptops with i5-520M CPUs featuring two cores
and hyperthreading, and I therefore assumed that the T2 affinity problem had been solved.

Yesterday I started up half a dozen new computers, four of which have an ordinary
four-core Intel i5 CPU. They all showed the same affinity problem, with multiple tasks
sharing the first core, or in the case of the T2 tasks, the first two cores.
I'm definitely not an expert on core affinity issues and am therefore totally clueless
as to why there is a difference. :-)

Have a nice evening and a good new week to come!!

Kindest regards,
Gunnar



Some of the mask numbers that showed up when I ran the script on my machines:

Dell Vostro 320 All-in-One with a Core 2 Duo E7400 CPU
(hostid=516)
running two T1 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: 3
pid 17141's current affinity mask: 1
pid 17141's new affinity mask: 3

(same mask numbers on another Core 2 Duo machine)

----------------------------------------------------------

HP Elite 8300 USDT with a Core i5-3470S CPU
(hostid=509)  (The "magic" computer that I mentioned earlier)
running two T2 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: f
pid 17141's current affinity mask: 2
pid 17141's new affinity mask: f
pid 17691's current affinity mask: 1
pid 17691's new affinity mask: f
pid 17692's current affinity mask: 2
pid 17692's new affinity mask: f

--------------------------------------------------------

HP EliteBook 8440p with a hyperthreaded
Core i5 M 540 CPU
(hostid=464)
running one T2 and two T1 tasks
pid 408's current affinity mask: 3
pid 408's new affinity mask: f
pid 409's current affinity mask: c
pid 409's new affinity mask: f
pid 2447's current affinity mask: 3
pid 2447's new affinity mask: f
pid 2727's current affinity mask: 3
pid 2727's new affinity mask: f

PIDs 408 and 409 belong to one T2 task.
PIDs 2447 and 2727 are two different T1 tasks.

-------------------------------------------------------

Two different HP Z220 workstations with hyperthreaded
4 core / 8 thread XEON E3-1245 v2 CPUs
(hostid=526) and (hostid=534)
each of them running four T2 tasks:
pid 10162's current affinity mask: 11
pid 10162's new affinity mask: ff
pid 10163's current affinity mask: 22
pid 10163's new affinity mask: ff
pid 10302's current affinity mask: 11
pid 10302's new affinity mask: ff
pid 10303's current affinity mask: 22
pid 10303's new affinity mask: ff
pid 10850's current affinity mask: 11
pid 10850's new affinity mask: ff
pid 10851's current affinity mask: 22
pid 10851's new affinity mask: ff
pid 10873's current affinity mask: 11
pid 10873's new affinity mask: ff
pid 10874's current affinity mask: 22
pid 10874's new affinity mask: ff

(the pid numbers differ of course on the other one)

(The Linux kernel version doesn't seem to matter as one of them 
is running Xubuntu 14.04, and the other Xubuntu 18.04)
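
For readers unfamiliar with the mask notation: each mask printed by "taskset -p" is a hexadecimal number in which bit N set means the process may run on logical CPU N. The following sketch decodes the masks listed above; the function name mask_to_cpus is made up for this illustration.

```shell
#!/usr/bin/env bash
# Decode a hexadecimal affinity mask (as printed by "taskset -p")
# into a comma-separated list of logical CPU numbers.
mask_to_cpus() {
    local mask=$(( 16#${1#0x} ))   # accept "ff" as well as "0xff"
    local cpu=0 out=""
    while (( mask > 0 )); do
        if (( mask & 1 )); then
            out+="${out:+,}${cpu}"  # add a comma only between entries
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "$out"
}

# Decode the masks that appear in the listings above.
for m in 1 2 3 c f 11 22 ff; do
    printf '%-3s -> CPUs %s\n' "$m" "$(mask_to_cpus "$m")"
done
```

On the Xeon machines, mask 11 decodes to CPUs 0 and 4 and mask 22 to CPUs 1 and 5, which is consistent with the earlier observation that only threads 0, 1, 4, and 5 were busy.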

damotbe
Message 232 - Posted: 30 Oct 2019, 10:30:12 UTC - in response to Message 231.  
Last modified: 30 Oct 2019, 10:30:24 UTC

Hi Gunnar !

I'm happy if it works well. At the moment, I have no idea how to solve the problem for the long term...

Kind regards
Benoit

©2024 Benoit DA MOTA - LERIA, University of Angers, France