1)
Message boards :
Number crunching :
No new task sent out when wingman aborted or got a validation error
(Message 878)
Posted 11 Jun 2020 by Gunnar Hjern
Post: Thanks Jim1348! I completely missed that thread. I guess all I can do now is wait, and hope the credits will arrive before Christmas. ;-)
Happy Crunching!!!
//Gunnar
2)
Message boards :
Number crunching :
No new task sent out when wingman aborted or got a validation error
(Message 876)
Posted 11 Jun 2020 by Gunnar Hjern
Post: Hi Damotbe!

I recently discovered that I have several WUs pending and one "validation inconclusive", most of them from early May, where my wingman either aborted or got a validation error, and no new tasks have been sent out, even though it has been several days, and in some cases months, since my wingman and I reported the tasks (as aborted/erroneous etc.).

The pending WUs are:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (2 days since abortion/error)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354680 (1 month and 4 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354050 (1 month and 1 day...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353751 (1 month and 9 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353410 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353359 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353159 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353042 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353066 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353111 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1352934 (1 month...)

and the one in "validation inconclusive":
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (3 days)

Normally (in other BOINC projects) new replicated tasks are sent out only hours after one of the initially replicated tasks is reported unsuccessful. Does this mean that I will never get any credits for those tasks??

Kindest regards,
Gunnar
3)
Message boards :
Number crunching :
New T2 affinity problem persistent on Linux
(Message 231)
Posted 27 Oct 2019 by Gunnar Hjern
Post: Hi Benoit!

Thanks for the script!!! I've tested it on several different computers and it solved the problem on all of them! :-) I've appended some of the first outputs below in case they can be of any help to you. I'll wait until tomorrow with fixing the rest of the computers.

Furthermore, it now seems that I was mistaken about that special computer I described above. Instead, the case seems to be that two T2 tasks can run in parallel on CPUs with two cores and two hyperthreads per core, while CPUs with four physical cores show the affinity problem. The two other computers on which I tested starting T2 tasks were actually laptops with i5-520M CPUs featuring two cores and hyperthreading, and I therefore assumed that the T2 affinity problem had been solved. Yesterday I started up half a dozen new computers, four of which have ordinary four-core Intel i5 CPUs. They all showed the same affinity problem, with multiple tasks sharing the first core, or in the case of the T2s the first two cores. I'm definitely not an expert on core affinity issues and am therefore totally clueless as to why there is a difference. :-)

Have a nice evening and a good new week to come!!
Kindest regards,
Gunnar

Some of the mask numbers that showed up when I ran the script on my machines:

Dell Vostro 320 All-In-One with a Core 2 Duo E7400 CPU (hostid=516), running two T1 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: 3
pid 17141's current affinity mask: 1
pid 17141's new affinity mask: 3
(same mask numbers on another Core 2 Duo machine)
----------------------------------------------------------
HP Elite 8300 USDT with a Core i5-3470S CPU (hostid=509) (the "magic" computer that I mentioned earlier), running two T2 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: f
pid 17141's current affinity mask: 2
pid 17141's new affinity mask: f
pid 17691's current affinity mask: 1
pid 17691's new affinity mask: f
pid 17692's current affinity mask: 2
pid 17692's new affinity mask: f
--------------------------------------------------------
HP EliteBook 8440p with a hyperthreaded Core i5 M 540 CPU (hostid=464), running one T2 and two T1 tasks:
pid 408's current affinity mask: 3
pid 408's new affinity mask: f
pid 409's current affinity mask: c
pid 409's new affinity mask: f
pid 2447's current affinity mask: 3
pid 2447's new affinity mask: f
pid 2727's current affinity mask: 3
pid 2727's new affinity mask: f
Pids 408 and 409 belong to one T2 task.
Pids 2447 and 2727 are two different T1 tasks.
-------------------------------------------------------
Two different HP Z220 workstations with hyperthreaded 4-core / 8-thread Xeon E3-1245 v2 CPUs (hostid=526 and hostid=534), each of them running four T2 tasks:
pid 10162's current affinity mask: 11
pid 10162's new affinity mask: ff
pid 10163's current affinity mask: 22
pid 10163's new affinity mask: ff
pid 10302's current affinity mask: 11
pid 10302's new affinity mask: ff
pid 10303's current affinity mask: 22
pid 10303's new affinity mask: ff
pid 10850's current affinity mask: 11
pid 10850's new affinity mask: ff
pid 10851's current affinity mask: 22
pid 10851's new affinity mask: ff
pid 10873's current affinity mask: 11
pid 10873's new affinity mask: ff
pid 10874's current affinity mask: 22
pid 10874's new affinity mask: ff
(The pid numbers differ, of course, on the other one.)
(The Linux kernel version doesn't seem to matter, as one of them is running Xubuntu 14.04 and the other Xubuntu 18.04.)
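[Editor's note] The lines above are the format printed by the standard taskset tool, so for readers hitting the same problem before an official fix, an affinity-widening workaround could look roughly like the sketch below. This is a reconstruction, not Benoit's actual script; it assumes the stuck processes are the nwchem workers and that the machine has fewer than 64 logical CPUs.
-------------------------------------------------------
#!/bin/bash
# Widen the CPU affinity of every running nwchem process to all
# online logical CPUs. taskset -p prints the "current affinity mask"
# and "new affinity mask" lines shown in the output above.

NCPUS=$(nproc)                                # number of online logical CPUs
MASK=$(printf '%x' $(( (1 << NCPUS) - 1 )))   # e.g. "f" for 4 CPUs, "ff" for 8

for PID in $(pgrep nwchem); do
    taskset -p "$MASK" "$PID"
done
-------------------------------------------------------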
4)
Message boards :
Number crunching :
New T2 affinity problem persistent on Linux
(Message 221)
Posted 25 Oct 2019 by Gunnar Hjern
Post: Hi Damotbe!

Thanks for the fast response! The computer that I was referring to is a bit occupied at the moment, but I will try to reinstall some time this weekend. Instead I tried to run the project on a 4-core / 8-thread Xeon E3-1245 v2:
https://quchempedia.univ-angers.fr/athome/show_host_detail.php?hostid=526

Apparently the affinities are not yet quite correct, as some processes are sharing the same cores/threads while other threads are idle - see the "top" dump below. Running at this time were two T1 and three T2 tasks, but they all shared CPU threads # 0, 1, 4, and 5, leaving the other four CPU threads idle. (At least I think those are the thread numbers, based on what I could see from the CPU-graph applet that I'm running in the top panel. If you know of any better way to get detailed stats about each CPU core, please tell.)

I'll return later this weekend if I (re-)install any other computer.

Kindest regards,
Gunnar

Top dump from the Xeon 1245 computer:
-------------------------------------
top - 21:41:32 up 72 days, 21:28, 4 users, load average: 8,28, 8,79, 8,89
Tasks: 278 total, 11 running, 267 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1,1 us, 0,1 sy, 50,3 ni, 48,4 id, 0,1 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 8116444 total, 7775728 used, 340716 free, 668632 buffers
KiB Swap: 24668152 total, 180 used, 24667972 free. 4319840 cached Mem

  PID USER   PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4491 boinc  39 19 1267260 110316  14444 R 69,1  1,4  18:59.11 nwchem
 4388 boinc  39 19 1270036  90188  14232 R 65,5  1,1  19:36.30 nwchem
 4519 boinc  39 19 1267268 106368  14164 R 65,1  1,3  18:34.82 nwchem
 4360 boinc  39 19 1273396 105776  14216 R 40,5  1,3  12:10.74 nwchem
 4415 boinc  39 19 1293968 114948  14084 R 40,2  1,4  11:36.14 nwchem
 4387 boinc  39 19 1288444 107036  15392 R 39,9  1,3  11:47.09 nwchem
 4490 boinc  39 19 1268276 105596  15864 R 39,9  1,3  11:21.98 nwchem
 4518 boinc  39 19 1268484 108728  15616 R 39,9  1,3  11:13.75 nwchem
29316 gunnar 20  0  585988  98152  32968 R  5,3  1,2   1138:10 boincmgr
 1169 root   20  0  355236 116868  66496 R  3,7  1,4 958:53.72 Xorg
 4537 gunnar 20  0 1257804 298636 101488 S  1,3  3,7   0:54.04 firefox
 1661 boinc  30 10  386976  47880   9736 S  0,7  0,6 236:46.30 boinc
 2119 gunnar 20  0  492028  32236  24796 S  0,7  0,4  24:09.86 xfwm4
 1014 root   20  0   19208   2208   1980 S  0,3  0,0   6:07.35 irqbalance
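[Editor's note] On the question of a better way to see per-core stats: two standard options (a sketch, not project-specific advice) are pressing "1" inside top, which toggles a per-CPU breakdown, and the commands below, which show which logical CPU each nwchem thread last ran on and the per-CPU utilisation (mpstat requires the sysstat package).
-------------------------------------
# PSR column = logical CPU the thread last ran on
ps -eLo pid,tid,psr,pcpu,comm | grep nwchem

# per-CPU utilisation, refreshed every second
mpstat -P ALL 1
-------------------------------------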
5)
Message boards :
Number crunching :
New T2 affinity problem persistent on Linux
(Message 212)
Posted 24 Oct 2019 by Gunnar Hjern
Post: It now seems possible to get even two T2 tasks to run in parallel on a 4-core CPU, at least if the quchempedia project was installed after the 20th of Oct. I now have at least two (4-CPU) computers that can run two T2 tasks simultaneously, getting about 98% on all threads when inspecting with "top" and similar tools. Good!! :-)

HOWEVER, if the quchempedia project was installed earlier (typically between the 14th and 19th of Oct.), it doesn't matter what actions are taken: the T2 tasks stubbornly continue to run on the same two cores!!! :-(

I have tried to reset the project, and I have repeatedly tried to completely remove the project from the BOINC manager, but without any success. I have also repeatedly removed the whole BOINC software suite from the computer (using sudo apt-get purge boinc boinc-manager boinc-client), restarted the machine and re-installed the BOINC software, but still without any success! Between removal and reinstallation I have searched the whole file system tree for any file named (or containing) "quchem" or "nwchem", but everything seemed to be erased as it should be.

How is this possible???? What magic system setting has been made to control the affinity??? Please give me a hint about which file to erase, or which setting to change, to get the project to use all four cores without any affinity problem. (I'm currently using Xubuntu 18.04 on most machines, although a few remain on Xubuntu 14.04.)

I have at least 6 more 4-core machines and two 8-core Xeons that I'm planning to hook up to this very interesting project, but I will not do that before this affinity issue is sorted out, so that the machines can run efficiently and not get tainted by old installations.

Kindest regards,
Gunnar Hjern
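[Editor's note] For readers who want to repeat the leftover-file search described above, a minimal sketch could look like the following. The exact patterns and the optional content search are assumptions; adjust the paths to taste.
-------------------------------------------------------
# find any file or directory whose name mentions quchem or nwchem
sudo find / -iname '*quchem*' -o -iname '*nwchem*' 2>/dev/null

# optionally also search file contents under /etc and the BOINC data dir
# (slower; only worth it if the name search comes up empty)
sudo grep -rl -e quchem -e nwchem /etc /var/lib/boinc* 2>/dev/null
-------------------------------------------------------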
6)
Message boards :
Number crunching :
Native Linux WU refuses to suspend
(Message 179)
Posted 19 Oct 2019 by Gunnar Hjern
Post: I'm experiencing the same problems - work units will not pause when I suspend them via boincmgr. They will, however, stop when I abort a task. Two of the computers were installed Mon. the 14th, and one computer was installed today.

I'm also seeing the core affinity problem, with multiple tasks sharing the same CPU core. See the thread "New T1 native nwchem work unit affinity problem"
https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=23
for that issue.

I have several machines on which I cannot run VMs, so the Linux native tasks must work if I am to continue crunching.

Have a nice weekend!!
//Gunnar
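[Editor's note] One way to check whether the suspend request even reaches the client, independently of the GUI, is the boinccmd command-line tool. A sketch; the project URL and <task_name> below are placeholders to be replaced with what --get_tasks and the client's project list actually report.
-------------------------------------------------------
# list the tasks the client knows about (names and states)
boinccmd --get_tasks

# ask the client to suspend one specific task
boinccmd --task https://quchempedia.univ-angers.fr/athome/ <task_name> suspend
-------------------------------------------------------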
7)
Message boards :
Number crunching :
New T1 native nwchem work unit affinity problem.
(Message 178)
Posted 19 Oct 2019 by Gunnar Hjern
Post: I have the same problem too, on THREE different computers, one of which I installed yesterday and which therefore must be totally fresh and pristine in all its files and scripts. (All of them seem to be running T1 Linux native tasks.)

I would gladly turn on several more computers if I knew that they would be working efficiently. Please let us know when Linux computers can run the tasks efficiently.

Have a nice weekend!!
Kindest regards,
Gunnar