New T2 affinity problem persistent on Linux
Author | Message |
---|---|
Joined: 14 Oct 19 Posts: 7 Credit: 2,614,863 RAC: 0 |
It now seems possible to run even two T2 tasks in parallel on a 4-core CPU, at least if the quchempedia project was installed after the 20th of October. I now have at least two 4-CPU computers that can run two T2 tasks simultaneously, at about 98% on all threads when inspected with "top" and similar tools. Good!! :-)

HOWEVER, if the quchempedia project was installed earlier (typically between the 14th and the 19th of October), it doesn't matter what actions are taken: the T2 tasks stubbornly continue to run on the same two cores!!! :-(

I have tried to reset the project, and I have repeatedly tried to remove the project completely from the BOINC Manager, but without any success. I have also repeatedly removed the whole BOINC software suite from the computer (using sudo apt-get purge boinc boinc-manager boinc-client), restarted the machine and reinstalled the BOINC software, again without any success! Between removal and reinstallation I searched the whole file system tree for any file named (or containing) "quchem" or "nwchem", but everything seemed to have been erased as it should be.

How is this possible? What magic system setting has been done to control the affinity? Please give me a hint about which file to erase, or which setting to change, to get the project to use all four cores without any affinity problem. (I'm currently using Xubuntu 18.04 on most machines, although a few remain on Xubuntu 14.04.)

I have at least six more 4-core machines and two 8-core Xeons that I'm planning to hook up to this very interesting project, but I will not do that before this affinity issue is sorted out, so that the machines can run efficiently without being tainted by old installations.

Kindest regards, Gunnar Hjern |
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
It is incredible! We do not set anything on your system. It is clearly magic at the moment. Perhaps something was not correctly deleted by "sudo apt-get purge boinc boinc-manager boinc-client". I suggest you try it again and then check /var/lib/boinc-client/projects and /var/lib/boinc-client/slots. The affinity settings are in a script called run.sh (if I remember correctly). You can try: sudo find /var/lib/boinc-client/ -name "run.sh" |
Joined: 14 Oct 19 Posts: 7 Credit: 2,614,863 RAC: 0 |
Hi Damotbe! Thanks for the fast response!

The computer that I was referring to is a bit occupied at the moment, but I will try to reinstall some time this weekend. Instead I tried to run the project on a 4-core, 8-thread Xeon 1245 v2: https://quchempedia.univ-angers.fr/athome/show_host_detail.php?hostid=526

Apparently the affinities are not yet quite correct, as some processes share the same cores/threads while other threads are idle; see the "top" dump below. Two T1 and three T2 tasks were running at the time, but they all shared CPU-threads # 0, 1, 4, and 5, leaving the other four CPU-threads idle. (At least I think those are the thread numbers, based on what I could see from the CPU-graph applet that I'm running in the top panel. If you know of any better way to get detailed stats about each CPU core, please tell me.)

I'll return later this weekend if I (re-)install any other computer.

Kindest regards, Gunnar

Top dump from the Xeon 1245 computer:
-------------------------------------
top - 21:41:32 up 72 days, 21:28, 4 users, load average: 8,28, 8,79, 8,89
Tasks: 278 total, 11 running, 267 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1,1 us, 0,1 sy, 50,3 ni, 48,4 id, 0,1 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 8116444 total, 7775728 used, 340716 free, 668632 buffers
KiB Swap: 24668152 total, 180 used, 24667972 free. 4319840 cached Mem

  PID USER   PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4491 boinc  39 19 1267260 110316  14444 R 69,1  1,4  18:59.11 nwchem
 4388 boinc  39 19 1270036  90188  14232 R 65,5  1,1  19:36.30 nwchem
 4519 boinc  39 19 1267268 106368  14164 R 65,1  1,3  18:34.82 nwchem
 4360 boinc  39 19 1273396 105776  14216 R 40,5  1,3  12:10.74 nwchem
 4415 boinc  39 19 1293968 114948  14084 R 40,2  1,4  11:36.14 nwchem
 4387 boinc  39 19 1288444 107036  15392 R 39,9  1,3  11:47.09 nwchem
 4490 boinc  39 19 1268276 105596  15864 R 39,9  1,3  11:21.98 nwchem
 4518 boinc  39 19 1268484 108728  15616 R 39,9  1,3  11:13.75 nwchem
29316 gunnar 20  0  585988  98152  32968 R  5,3  1,2   1138:10 boincmgr
 1169 root   20  0  355236 116868  66496 R  3,7  1,4 958:53.72 Xorg
 4537 gunnar 20  0 1257804 298636 101488 S  1,3  3,7   0:54.04 firefox
 1661 boinc  30 10  386976  47880   9736 S  0,7  0,6 236:46.30 boinc
 2119 gunnar 20  0  492028  32236  24796 S  0,7  0,4  24:09.86 xfwm4
 1014 root   20  0   19208   2208   1980 S  0,3  0,0   6:07.35 irqbalance |
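On the question of per-core detail, here is a minimal sketch, assuming a Linux host with the usual procps tools (ps, pgrep). For each nwchem process it prints the CPU the process last ran on (the psr column of ps) and the CPUs it is allowed to use (the Cpus_allowed_list field of /proc/<pid>/status). The function names allowed_cpus and show_nwchem_cpus are made up for this illustration and are not part of BOINC or the project.

```shell
#!/bin/sh
# Hypothetical helper: print the Cpus_allowed_list field of a
# /proc/<pid>/status file, e.g. "0,4" or "0-7".
allowed_cpus() {
    sed -n 's/^Cpus_allowed_list:[[:space:]]*//p' "$1"
}

# Hypothetical helper: one line per nwchem process with its PID, the CPU
# it last ran on (psr) and the CPUs its affinity mask allows.
show_nwchem_cpus() {
    for pid in $(pgrep nwchem); do
        printf '%s psr=%s allowed=%s\n' "$pid" \
            "$(ps -o psr= -p "$pid" | tr -d ' ')" \
            "$(allowed_cpus "/proc/$pid/status")"
    done
}
```

Calling show_nwchem_cpus a few times while tasks are running should show whether several processes keep landing on the same CPU. Pressing 1 inside top also switches its summary to one line per CPU.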
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Hi Gunnar! I have the same problem on one Linux machine. In a terminal I run (as root, or with sudo just before "taskset"):

while true; do for pid in $(pgrep nwchem); do taskset -p 0xffffffff "$pid"; done; sleep 60; done

It should help while we work on solving the problem.

Kind regards, Benoit |
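The loop above hard-codes the mask 0xffffffff, which covers CPUs 0 to 31. As a hedged variation, the sketch below derives a CPU list from the number of installed processors and also applies it to every thread of each process. It assumes taskset from util-linux (options -a, -c and -p) and nproc from coreutils; the function names cpu_list and reset_affinity are hypothetical, chosen only for this example.

```shell
#!/bin/sh
# Hypothetical helper: build a CPU list such as "0-3" for a machine with N CPUs.
# With no argument, N comes from "nproc --all" (installed processors).
cpu_list() {
    n=${1:-$(nproc --all)}
    echo "0-$((n - 1))"
}

# Hypothetical helper: allow every running nwchem process, and all of its
# threads (-a), to use every CPU in the list. Needs root, like the loop above.
reset_affinity() {
    list=$(cpu_list)
    for pid in $(pgrep nwchem); do
        taskset -a -c -p "$list" "$pid"
    done
}

# To repeat once a minute until interrupted with Ctrl-C:
#   while true; do reset_affinity; sleep 60; done
```

Like the original loop, this only works around the symptom: newly started tasks keep their restricted mask until the next pass.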
Joined: 14 Oct 19 Posts: 7 Credit: 2,614,863 RAC: 0 |
Hi Benoit! Thanks for the script!!! I've tested it on several different computers and it solved the problem on all of them! :-) I've appended some of the first outputs below in case they can be of any help to you. I'll wait until tomorrow with fixing the rest of the computers.

Furthermore, it now seems that I was mistaken about that special computer that I described above. Instead, the case seems to be that two T2 tasks can run in parallel on CPUs with two cores and two hyperthreads on each core, while CPUs with four physical cores show the affinity problem. The two other computers on which I tried starting T2 tasks were actually laptops with i5-520M CPUs featuring two cores and hyperthreading, and I therefore assumed that the T2 affinity problem had been solved.

Yesterday I started up half a dozen new computers, four of which have ordinary four-core Intel i5 CPUs. They all showed the same affinity problem, with multiple tasks sharing the first core or, in the case of the T2s, the first two cores. I'm definitely not an expert on core affinity issues and am therefore totally clueless as to why there is a difference. :-)

Have a nice evening and a good new week to come!!

Kindest regards, Gunnar

Some of the mask numbers that showed up when I ran the script on my machines:

Dell Vostro 320 AllInOne with a Core 2 Duo E7400 CPU (hostid=516) running two T1 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: 3
pid 17141's current affinity mask: 1
pid 17141's new affinity mask: 3
(same mask numbers on another Core 2 Duo machine)
----------------------------------------------------------
HP Elite 8300 USDT with a Core i5-3470S CPU (hostid=509) (the "magic" computer that I mentioned earlier) running two T2 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: f
pid 17141's current affinity mask: 2
pid 17141's new affinity mask: f
pid 17691's current affinity mask: 1
pid 17691's new affinity mask: f
pid 17692's current affinity mask: 2
pid 17692's new affinity mask: f
--------------------------------------------------------
HP EliteBook 8440p with a hyperthreaded Core i5 M 540 CPU (hostid=464) running one T2 and two T1 tasks:
pid 408's current affinity mask: 3
pid 408's new affinity mask: f
pid 409's current affinity mask: c
pid 409's new affinity mask: f
pid 2447's current affinity mask: 3
pid 2447's new affinity mask: f
pid 2727's current affinity mask: 3
pid 2727's new affinity mask: f
PIDs 408 and 409 belong to one T2 task.
PIDs 2447 and 2727 are two different T1 tasks.
-------------------------------------------------------
Two different HP Z220 workstations with hyperthreaded 4-core / 8-thread Xeon E3-1245 v2 CPUs (hostid=526 and hostid=534), each of them running four T2 tasks:
pid 10162's current affinity mask: 11
pid 10162's new affinity mask: ff
pid 10163's current affinity mask: 22
pid 10163's new affinity mask: ff
pid 10302's current affinity mask: 11
pid 10302's new affinity mask: ff
pid 10303's current affinity mask: 22
pid 10303's new affinity mask: ff
pid 10850's current affinity mask: 11
pid 10850's new affinity mask: ff
pid 10851's current affinity mask: 22
pid 10851's new affinity mask: ff
pid 10873's current affinity mask: 11
pid 10873's new affinity mask: ff
pid 10874's current affinity mask: 22
pid 10874's new affinity mask: ff
(the pid numbers of course differ on the other one)
(The Linux kernel version doesn't seem to matter, as one of them is running Xubuntu 14.04 and the other Xubuntu 18.04.) |
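For reading the masks listed above: taskset prints the affinity as a hexadecimal bitmask in which the lowest bit stands for CPU 0, the next bit for CPU 1, and so on. The sketch below turns such a mask into a list of CPU numbers. It is only an illustration using plain shell arithmetic; the function name mask_to_cpus is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a hex affinity mask as printed by taskset
# (without the 0x prefix, e.g. "11" or "ff") into a list of CPU numbers.
mask_to_cpus() {
    mask=$((0x$1))
    cpu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        # Bit 0 of the remaining mask tells whether CPU number $cpu is allowed.
        if [ $((mask & 1)) -eq 1 ]; then
            out="$out${out:+ }$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
}

mask_to_cpus 11   # prints "0 4"
mask_to_cpus 22   # prints "1 5"
mask_to_cpus ff   # prints "0 1 2 3 4 5 6 7"
```

Read this way, the masks 11 and 22 on the Xeon workstations allow only CPUs 0, 4, 1 and 5, which matches the four busy CPU-threads reported earlier in the thread for hostid=526.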
Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Hi Gunnar! I'm happy that it works well. At the moment, I have no idea how to solve the problem in the long term... Kind regards, Benoit |
©2024 Benoit DA MOTA - LERIA, University of Angers, France