Posts by Gunnar Hjern

1) Message boards : Number crunching : No new task sent out when wingman aborted or got a validation error (Message 878)
Posted 11 Jun 2020 by Gunnar Hjern
Post:
Thanks Jim1348!
I completely missed that thread.
I guess I'll just have to wait then, and hope the credits will arrive before X-mas. ;-)
Happy Crunching!!!
//Gunnar
2) Message boards : Number crunching : No new task sent out when wingman aborted or got a validation error (Message 876)
Posted 11 Jun 2020 by Gunnar Hjern
Post:
Hi Damotbe!

I recently discovered that I have several WUs pending and one marked "validation inconclusive",
most of them from early May, where my wingman either aborted the task or got a validation error,
and that no new tasks have been sent out, even though it has been several days,
and in some cases months, since my wingman and I reported the tasks (as aborted/erroneous etc.).

The pending WUs are:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (2 days since the abort/error)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354680 (1 month and 4 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354050 (1 month and 1 day...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353751 (1 month and 9 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353410 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353359 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353159 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353042 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353066 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353111 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1352934 (1 month...)

and the one marked "validation inconclusive":
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (3 days)

Normally (in other BOINC projects) new replicated tasks are sent out within hours after
one of the initially replicated tasks is reported as unsuccessful.

Does this mean that I will never get any credit for those tasks??

Kindest regards,
Gunnar
3) Message boards : Number crunching : New T2 affinity problem persistent on Linux (Message 231)
Posted 27 Oct 2019 by Gunnar Hjern
Post:
Hi Benoit!

Thanks for the script!!!

I've tested it on several different computers and it solved the problem on all of them! :-)
I've appended some of the first outputs below, in case they can be of any help to you.
I'll wait until tomorrow to fix the rest of the computers.

Furthermore, it now seems that I was mistaken about that special computer that
I described above. Instead, the situation seems to be that two T2 tasks can run in parallel
on CPUs with two cores and two hyperthreads per core, while CPUs with four
physical cores show the affinity problem. The two other computers on which I tested
starting T2 tasks were actually laptops with i5-520M CPUs featuring two cores
and hyperthreading, so I assumed the T2 affinity problem had been solved.

Yesterday I started up half a dozen new computers, four of which have ordinary
four-core Intel i5 CPUs. They all showed the same affinity problem, with multiple tasks
sharing the first core, or in the case of the T2s, the first two cores.
I'm definitely not an expert on core affinity issues and am therefore totally clueless
as to why there is a difference. :-)
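
(For anyone else who runs into this before a permanent fix: the kind of workaround the
script performs could, as far as I understand it, be sketched roughly like this in bash,
using taskset from util-linux and pgrep. The actual script may of course look different.)

#!/bin/bash
# Rough sketch (not the actual script): widen the CPU affinity of all
# running nwchem processes so they may use every available CPU thread.
NCPUS=$(nproc)                                  # number of online CPUs
ALLMASK=$(printf '%x' $(( (1 << NCPUS) - 1 )))  # e.g. "ff" on a 4-core/8-thread machine
for pid in $(pgrep nwchem); do
    taskset -p "$ALLMASK" "$pid"                # prints the old and new affinity mask
done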

Have a nice evening and a good new week to come!!

Kindest regards,
Gunnar



Some of the mask numbers that showed up when I ran the script on my machines:

Dell Vostro 320 All-in-One with a Core 2 Duo E7400 CPU
(hostid=516)
running two T1 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: 3
pid 17141's current affinity mask: 1
pid 17141's new affinity mask: 3

(same mask numbers on another Core 2 Duo machine)

----------------------------------------------------------

HP Elite 8300 USDT with a Core i5-3470S CPU
(hostid=509)  (The "magic" computer that I mentioned earlier)
running two T2 tasks:
pid 17140's current affinity mask: 1
pid 17140's new affinity mask: f
pid 17141's current affinity mask: 2
pid 17141's new affinity mask: f
pid 17691's current affinity mask: 1
pid 17691's new affinity mask: f
pid 17692's current affinity mask: 2
pid 17692's new affinity mask: f

--------------------------------------------------------

HP EliteBook 8440p with a hyperthreaded
Core i5 M 540 CPU
(hostid=464)
running one T2 and two T1 tasks
pid 408's current affinity mask: 3
pid 408's new affinity mask: f
pid 409's current affinity mask: c
pid 409's new affinity mask: f
pid 2447's current affinity mask: 3
pid 2447's new affinity mask: f
pid 2727's current affinity mask: 3
pid 2727's new affinity mask: f

Pids 408 and 409 belong to one T2 task.
Pids 2447 and 2727 belong to two different T1 tasks.

-------------------------------------------------------

Two different HP Z220 workstations with hyperthreaded
4-core / 8-thread Xeon E3-1245 v2 CPUs
(hostid=526) and (hostid=534)
each of them running four T2 tasks:
pid 10162's current affinity mask: 11
pid 10162's new affinity mask: ff
pid 10163's current affinity mask: 22
pid 10163's new affinity mask: ff
pid 10302's current affinity mask: 11
pid 10302's new affinity mask: ff
pid 10303's current affinity mask: 22
pid 10303's new affinity mask: ff
pid 10850's current affinity mask: 11
pid 10850's new affinity mask: ff
pid 10851's current affinity mask: 22
pid 10851's new affinity mask: ff
pid 10873's current affinity mask: 11
pid 10873's new affinity mask: ff
pid 10874's current affinity mask: 22
pid 10874's new affinity mask: ff

(the pid numbers differ of course on the other one)

(The Linux kernel version doesn't seem to matter, as one of them
is running Xubuntu 14.04 and the other Xubuntu 18.04.)
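
As a side note, the hex numbers above are CPU bitmasks: 0x11 is CPUs 0 and 4, 0x22 is
CPUs 1 and 5, and 0xff is all eight threads. On these Intel CPUs, Linux typically numbers
CPU 0 and CPU 4 as the two hyperthreads of the same physical core, which would explain
why the tasks pile up on the first cores. A mask can be decoded with a small bash loop,
for example:

MASK=11    # one of the hex masks printed above
for i in {0..7}; do
    (( (0x$MASK >> i) & 1 )) && echo "CPU $i"   # prints "CPU 0" and "CPU 4" for mask 11
done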

4) Message boards : Number crunching : New T2 affinity problem persistent on Linux (Message 221)
Posted 25 Oct 2019 by Gunnar Hjern
Post:
Hi Damotbe!

Thanks for the fast response!

The computer that I was referring to is a bit occupied at the moment, but I will try to reinstall it some time this weekend.
In the meantime I tried to run the project on a 4-core/8-thread Xeon E3-1245 v2:
https://quchempedia.univ-angers.fr/athome/show_host_detail.php?hostid=526

Apparently the affinities are still not quite correct, as some processes are sharing
the same cores/threads while other threads are idle - see the "top" dump below.

Running at that time were two T1 and three T2 tasks, but they all shared CPU threads 0, 1, 4, and 5,
leaving the other four CPU threads idle. (At least I think those are the thread numbers, based on
what I could see from the CPU-graph applet that I run in the top panel.
If you know of a better way to get detailed stats for each CPU core, please tell me.)
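
(For a per-core breakdown, pressing "1" inside top toggles a per-CPU view, and mpstat
from the sysstat package, assuming it is installed, can log it, e.g.:

mpstat -P ALL 1 3    # per-CPU utilisation, sampled once per second, three samples
)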

I'll return later this weekend if I (re-)install any other computer.

Kindest regards,
Gunnar


Top dump from the Xeon 1245-computer:
-------------------------------------
top - 21:41:32 up 72 days, 21:28,  4 users,  load average: 8,28, 8,79, 8,89
Tasks: 278 total,  11 running, 267 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1,1 us,  0,1 sy, 50,3 ni, 48,4 id,  0,1 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem:   8116444 total,  7775728 used,   340716 free,   668632 buffers
KiB Swap: 24668152 total,      180 used, 24667972 free.  4319840 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                         
 4491 boinc     39  19 1267260 110316  14444 R  69,1  1,4  18:59.11 nwchem                          
 4388 boinc     39  19 1270036  90188  14232 R  65,5  1,1  19:36.30 nwchem                          
 4519 boinc     39  19 1267268 106368  14164 R  65,1  1,3  18:34.82 nwchem                          
 4360 boinc     39  19 1273396 105776  14216 R  40,5  1,3  12:10.74 nwchem                          
 4415 boinc     39  19 1293968 114948  14084 R  40,2  1,4  11:36.14 nwchem                          
 4387 boinc     39  19 1288444 107036  15392 R  39,9  1,3  11:47.09 nwchem                          
 4490 boinc     39  19 1268276 105596  15864 R  39,9  1,3  11:21.98 nwchem                          
 4518 boinc     39  19 1268484 108728  15616 R  39,9  1,3  11:13.75 nwchem                          
29316 gunnar    20   0  585988  98152  32968 R   5,3  1,2   1138:10 boincmgr                        
 1169 root      20   0  355236 116868  66496 R   3,7  1,4 958:53.72 Xorg                            
 4537 gunnar    20   0 1257804 298636 101488 S   1,3  3,7   0:54.04 firefox                         
 1661 boinc     30  10  386976  47880   9736 S   0,7  0,6 236:46.30 boinc                           
 2119 gunnar    20   0  492028  32236  24796 S   0,7  0,4  24:09.86 xfwm4                           
 1014 root      20   0   19208   2208   1980 S   0,3  0,0   6:07.35 irqbalance                      
5) Message boards : Number crunching : New T2 affinity problem persistent on Linux (Message 212)
Posted 24 Oct 2019 by Gunnar Hjern
Post:
It now seems possible to get even two T2 tasks to run in parallel on a 4-core CPU,
at least if the quchempedia project was installed after the 20th of Oct.
I now have at least two (4-CPU) computers that can run two T2 tasks simultaneously,
getting about 98% on all threads when inspecting with "top" and similar tools.
Good!! :-)

HOWEVER, if the quchempedia project was installed earlier
(typically between the 14th and 19th of Oct.), it doesn't matter what actions are taken:
the T2 tasks stubbornly continue to run on the same two cores!!! :-(

I have tried resetting the project, and I have repeatedly removed the
project completely from the BOINC manager, but without any success.

I have also repeatedly removed the whole BOINC software suite from
the computer (using sudo apt-get purge boinc boinc-manager boinc-client),
restarted the machine, and re-installed the BOINC software again, but without
any success!

Between removal and re-installation I searched the whole file system tree
for any file named (or containing) "quchem" or "nwchem", but everything seemed
to be erased as it should be.
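
(A search of that kind can be done with something like the following; the exact command
I used may have differed slightly:

sudo find / \( -iname '*quchem*' -o -iname '*nwchem*' \) 2>/dev/null   # case-insensitive search for leftover project files
)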

How is this possible????
What magic system setting has been done to control the affinity???


Please give me a hint about which file to erase, or which setting to change, to get
the project to use all four cores without any affinity problem.
(I'm currently using Xubuntu 18.04 on most machines, although a few remain
on Xubuntu 14.04.)

I have at least six more 4-core machines and two 8-core Xeons that I'm planning
to hook up to this very interesting project, but I will not do that until this affinity
issue is sorted out, so that the machines can run efficiently and not get tainted
by old installations.

Kindest regards,
Gunnar Hjern
6) Message boards : Number crunching : Native Linux WU refuses to suspend (Message 179)
Posted 19 Oct 2019 by Gunnar Hjern
Post:
I'm experiencing the same problems: work units will not pause when I order it via boincmgr.
They will, however, stop when I abort a task.
Two of the computers were installed on Monday the 14th, and one was installed today.

I'm also seeing the core affinity problem, with multiple tasks sharing the same CPU core.
See the thread "New T1 native nwchem work unit affinity problem"
https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=23
for that issue.
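
(The affinity of the running tasks is easy to check with taskset from util-linux, e.g.:

for pid in $(pgrep nwchem); do taskset -cp "$pid"; done    # prints the list of CPUs each task may run on
)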

I have several machines on which I cannot have VMs running, so Linux native tasks must work
if I am to continue crunching.

Have a nice weekend!!

//Gunnar
7) Message boards : Number crunching : New T1 native nwchem work unit affinity problem. (Message 178)
Posted 19 Oct 2019 by Gunnar Hjern
Post:
I have the same problem too, on THREE different computers, one of which I installed yesterday
and which thus must be totally fresh and pristine in all its files and scripts.
(All of them seem to be running T1 Linux native tasks.)

I would gladly turn on several more computers if I knew that they would work efficiently.
Please let us know when Linux computers can run the tasks efficiently.

Have a nice weekend!!

Kindest regards,
Gunnar



