The aborted and resend WU

marsinph

Joined: 13 Nov 19
Posts: 20
Credit: 2,596,165
RAC: 18
Message 1082 - Posted: 24 Sep 2020, 22:32:50 UTC

Hello Damot.
I have a lot of WUs waiting for a wingman. Some are from May 2020! Four months ago!
And they still have not been resent.
Why not resend these first, before creating new WUs?
Everyone would be happy, certainly users like me who have been crunching for months. Sorry for those who only joined because of the sprint, the users who stockpile WUs.
I know you are alone and have only two hands.

I also suggest setting a limit on the number of WUs per host!
Best regards
ID: 1082 · Rating: 0
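
The per-host limit suggested in the post above can be illustrated with a minimal Python sketch. This is not the BOINC scheduler and does not use its configuration; `assign_tasks` and `max_in_progress_per_host` are hypothetical names chosen for the example. It only shows how a cap stops one host from draining the queue during a rush.

```python
from collections import Counter


def assign_tasks(requests, ready_tasks, max_in_progress_per_host):
    """Hand out ready tasks in request order, capping tasks per host.

    requests: list of (host_id, n_requested) tuples, in arrival order.
    ready_tasks: list of task ids that are ready to send.
    Returns a dict mapping host_id to the list of task ids it received.
    """
    in_progress = Counter()      # tasks already handed to each host
    assigned = {}
    queue = list(ready_tasks)    # copy, so the caller's list is untouched
    for host_id, n_requested in requests:
        room = max_in_progress_per_host - in_progress[host_id]
        n = max(0, min(n_requested, room, len(queue)))
        assigned.setdefault(host_id, []).extend(queue[:n])
        del queue[:n]
        in_progress[host_id] += n
    return assigned
```

With 30 ready tasks, a cap of 8, and requests `[("A", 100), ("B", 5), ("A", 10)]`, host A receives 8 tasks, host B receives its 5, and A's second request gets nothing. With a very large cap, A would take all 30 and B would get none.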
Aurum
Joined: 14 Dec 19
Posts: 56
Credit: 16,404,661
RAC: 5
Message 1086 - Posted: 25 Sep 2020, 1:17:36 UTC

I have a bunch of stale unvalidated WUs as well. I wonder if they're from a bad batch that got cancelled before everything finished?
ID: 1086 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1089 - Posted: 25 Sep 2020, 6:33:33 UTC - in response to Message 1086.  

The scheduler manages both "nwchem" and "nwchem long" WUs. It seems that the mix is not well balanced... I implemented something to accelerate wingman resends, but it seems that too many "nwchem" results were invalid, and that saturates the scheduler. I'll try several other things after the Formula BOINC sprint.
ID: 1089 · Rating: 0
marsinph

Joined: 13 Nov 19
Posts: 20
Credit: 2,596,165
RAC: 18
Message 1095 - Posted: 25 Sep 2020, 14:31:56 UTC - in response to Message 1086.  

Aurum wrote:
I have a bunch of stale unvalidated WUs as well. I wonder if they're from a bad batch that got cancelled before everything finished?

Don't be afraid or surprised, you are not alone. Me too!
About 50% of mine are inconclusive.
Then, of the total returned, 50% are valid and 50% are waiting for a wingman.
Don't cry, I have WUs that have been waiting for a wingman since May. The problem is that those WUs are NOT sent again.
ID: 1095 · Rating: 0
marsinph

Joined: 13 Nov 19
Posts: 20
Credit: 2,596,165
RAC: 18
Message 1096 - Posted: 25 Sep 2020, 14:35:45 UTC - in response to Message 1089.  

damotbe wrote:
The scheduler manages both "nwchem" and "nwchem long" WUs. It seems that the mix is not well balanced... I implemented something to accelerate wingman resends, but it seems that too many "nwchem" results were invalid, and that saturates the scheduler. I'll try several other things after the Formula BOINC sprint.

Thank you.
As far as I know and remember, it is a setting in the scheduler.
I am sure Aurum can help. If I am not wrong, he explained it on another project a long time ago.
ID: 1096 · Rating: 0
Aurum
Joined: 14 Dec 19
Posts: 56
Credit: 16,404,661
RAC: 5
Message 1097 - Posted: 25 Sep 2020, 14:43:16 UTC

Everything I know about BOINC servers is thanks to Doctor Google, i.e. I know nothing as Sergeant Schultz would say.
ID: 1097 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1102 - Posted: 25 Sep 2020, 15:40:01 UTC - in response to Message 1097.  

My tasks:
Waiting for validation (13503)
Validation inconclusive (14374)

So, we need to wait ;)
ID: 1102 · Rating: 0
marsinph

Joined: 13 Nov 19
Posts: 20
Credit: 2,596,165
RAC: 18
Message 1104 - Posted: 27 Sep 2020, 20:46:52 UTC - in response to Message 1102.  

damotbe wrote:
My tasks:
Waiting for validation (13503)
Validation inconclusive (14374)

So, we need to wait ;)

Damot, why do you produce new WUs when there are WUs that have been waiting for a resend for several months?
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1386777
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1387129
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1384893

Now I see you have a huge number of inconclusive results. What is the problem? Your host? The WUs you produce? How they are handled?
You also have the same ratio as everyone else, I think. Look: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1764
Best friendly regards from French-speaking Belgium!
ID: 1104 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1108 - Posted: 29 Sep 2020, 8:21:37 UTC - in response to Message 1104.  

As often, the causes are multiple.

First of all, many hosts produce errors (especially with VirtualBox).

Then, the Formula BOINC sprint generated a large number of requests, but after three days many users cancelled their tasks. I tried to tweak the scheduler, but it is overwhelmed by the situation. I hope to get back to normal in a few weeks. To do so, it would take more computing power than mine to empty the old tasks.

Finally, there is the normal functioning of the project: we explore unstable spaces and the calculations are unstable by definition. An unstable molecule leads almost systematically to an error or a divergence (inconclusive). I set the number of points to compensate for this reality and the non-deterministic aspect of the computation time.

It is a project for passionate and persevering volunteers ;-)
ID: 1108 · Rating: 0
Tullio

Joined: 5 Sep 20
Posts: 44
Credit: 627,200
RAC: 2,814
Message 1111 - Posted: 29 Sep 2020, 16:06:55 UTC

I have installed a Linux virtual machine with SuSE Tumbleweed, a development version, on a Windows 10 PC with plenty of RAM, and it is running nwchem. But since I enlisted it in Science United, I cannot choose the projects it runs; this is done by Science United. I do see it running nwchem with the "top" command. Not to start the Windows vs Linux war, but when a task is completed and validated on my Windows 10 PC using VirtualBox, it is typically faster than on a Linux companion, even though the latter has a faster CPU than my Intel i5. You can check my completed tasks and verify this.
Tullio.
ID: 1111 · Rating: 0
Jim1348

Joined: 3 Oct 19
Posts: 101
Credit: 13,485,973
RAC: 40,839
Message 1112 - Posted: 29 Sep 2020, 17:48:25 UTC - in response to Message 1111.  

When I tried to verify that on my Win7 64-bit machine, I had so many errors that I could not get very far. My conclusion was that so many cores were left idle that the remaining work units of course ran faster on the other cores.

That is always the case with virtual cores. They run faster when lightly loaded.
But there may be other reasons as well, and your experience may be different.
ID: 1112 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1115 - Posted: 30 Sep 2020, 11:54:19 UTC - in response to Message 1112.  

Comparing tasks that already have a lot of variability and some of which are virtualized is very hazardous.
ID: 1115 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1121 - Posted: 2 Oct 2020, 12:13:45 UTC - in response to Message 1115.  

I tweaked the feeder and scheduler again. Normally, retries are now accelerated AND the oldest jobs are preferred. I also doubled the shared-memory capacity. Let's see!
ID: 1121 · Rating: 0
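
The ordering described in the post above (retries first, then oldest jobs, with a limited number of slots) can be sketched in a few lines of Python. This is only an illustration under assumptions: `Job`, `fill_slots`, and `n_slots` are hypothetical names, and the actual feeder and scheduler changes made on the project are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    created: int     # creation time of the workunit (smaller means older)
    is_retry: bool   # True if this is a resend for a missing wingman


def fill_slots(pending, n_slots):
    """Pick the jobs that go into a limited number of dispatch slots.

    Retries come first; within each group, the oldest workunit comes first.
    """
    # False sorts before True, so "not is_retry" puts retries at the front.
    ordered = sorted(pending, key=lambda j: (not j.is_retry, j.created))
    return ordered[:n_slots]
```

For five pending jobs of which two are retries, three slots hold both retries plus the oldest new job. Doubling the slot count lets the remaining new jobs in as well, which is roughly the effect one would hope for from a larger shared-memory capacity.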
xii5ku

Joined: 21 Jun 20
Posts: 17
Credit: 62,575,800
RAC: 30,098
Message 1122 - Posted: 3 Oct 2020, 6:35:11 UTC - in response to Message 1121.  
Last modified: 3 Oct 2020, 7:10:51 UTC

damotbe wrote:
I tweaked the feeder and scheduler again. Normally, retries are now accelerated AND the oldest jobs are preferred. I also doubled the shared-memory capacity. Let's see!

I am noticing that the majority of tasks in my results tables most recently were re-sends of rather old workunits which had been replicated several times by now. My own result statuses are mostly "Completed, validation inconclusive", "Validate error", "Completed, can't validate" (just like those from the computers which ran previous replicas), with a few "Completed, waiting for validation" thrown in, and a very, very rare "Completed and validated".

This change, from mostly successfully completed new workunits to mostly unsuccessful old ones, seems to coincide roughly with the time of your posting. So, this part of your plan evidently worked.

However:
I am no longer receiving any new work now. Yet server_status shows plenty of tasks ready to send.

My results tables at the web site show that I received tasks frequently until 3 Oct 2020, 0:07:33 UTC. After that, I received five more tasks in a time window between 1:56...1:58 UTC. Then nothing more.

Is this because my hosts returned so many inconclusive or invalid results during the last ~half day? I believe so. I started another client which has not been active at QuChemPedIA after September 27, and had a good success rate until then. This client received the full complement of 128 tasks immediately, in the usual batches of 20 tasks per request. (All of these tasks were from old workunits too, which failed several times before.)

Conclusion:
I suspect the new preference for old, bad workunits, which only produce faulty results, makes the scheduler mark one host after another as unreliable. The scheduler then no longer assigns work to them.

Edit:
I think the large number of old workunits which are faulty but still haven't reached their maximum failure count will soon cause this project to run out of trusted hosts to which the server would want to assign replicas of previously failed workunits. What happens then? Will everything come to a standstill, or will the server then begin to send tasks from new workunits to the allegedly unreliable hosts again?
ID: 1122 · Rating: 0
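
The feedback loop suspected in the post above can be reproduced with a toy simulation in Python. Everything here is an assumption made for illustration: the error-rate rule, the `max_error_rate` threshold, and the function names are hypothetical and are not taken from the BOINC server code or from this project's configuration.

```python
def eligible_for_resend(host, max_error_rate):
    """A host may receive resends only while its error rate is within the limit."""
    done = host["valid"] + host["invalid"]
    if done == 0:
        return True  # no history yet: treated as eligible in this sketch
    return host["invalid"] / done <= max_error_rate


def simulate(hosts, rounds, tasks_per_round, max_error_rate):
    """Send only faulty resends each round; return the eligible-host count per round."""
    history = []
    for _ in range(rounds):
        for host in hosts:
            if eligible_for_resend(host, max_error_rate):
                # Worst case: every resend of an old, bad workunit fails.
                host["invalid"] += tasks_per_round
        history.append(sum(eligible_for_resend(h, max_error_rate) for h in hosts))
    return history
```

Starting from three hosts with 100, 20, and 0 valid results, 20 failing resends per round, and a limit of 0.25, the eligible count drops to 1 after the first round and to 0 after the second, so even the host with the best history is locked out. Relaxing the limit to 1.0 keeps all three hosts eligible in every round.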
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1123 - Posted: 3 Oct 2020, 8:39:33 UTC - in response to Message 1122.  

Thank you for this great feedback! Side effects are always unpredictable and deleterious...
I had been very permissive in the configuration of "reliable" hosts for replication. Normally, there is no constraint on initial jobs (the first 2 wingmen), but the new scheduler setting does not allow jobs of this type to be kept in reserve... It's discouraging! I'm going to relax the constraints again (they are already at 1 month of turnaround). I'm also going to check whether a default configuration exists somewhere!
ID: 1123 · Rating: 0
xii5ku

Joined: 21 Jun 20
Posts: 17
Credit: 62,575,800
RAC: 30,098
Message 1125 - Posted: 3 Oct 2020, 16:27:09 UTC - in response to Message 1123.  

Just now, I switched four of my computers which no longer received work back to QuChemPedIA, and they were given new work right away. Thanks for your quick response and adjustment.
ID: 1125 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 267
Credit: 396,970,961
RAC: 31,949
Message 1127 - Posted: 4 Oct 2020, 7:07:07 UTC - in response to Message 1125.  

Thanks to you! You saved my day! You put your finger on THE problem.
ID: 1127 · Rating: 0


©2021 Benoit DA MOTA - LERIA, University of Angers, France