Aborted and resent WUs

Author	Message
marsinph Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0	Message 1082 - Posted: 24 Sep 2020, 22:32:50 UTC Hello Damot. I have a lot of WUs waiting for a wingman, some from May 2020, four months ago, and they have still not been resent. Why not resend those first, before creating new WUs? Everyone would be happy, certainly users like me who have been crunching for months. Sorry for those who only joined because of the sprint, and for users who stockpile WUs. I know you are alone and have only two hands. I also suggest setting a limit on the number of WUs per host. Best regards

Aurum Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0	Message 1086 - Posted: 25 Sep 2020, 1:17:36 UTC I have a bunch of stale unvalidated WUs as well. I wonder if they're from a bad batch that got cancelled before everything finished?

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1089 - Posted: 25 Sep 2020, 6:33:33 UTC - in response to Message 1086. The scheduler manages both "nwchem" and "nwchem long" WUs. It seems that the mix is not well balanced... I implemented something to accelerate wingman resends, but it seems that too many "nwchem" results were invalid, and that saturates the scheduler. I'll try several other things after the Formula BOINC sprint.

marsinph Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0	Message 1095 - Posted: 25 Sep 2020, 14:31:56 UTC - in response to Message 1086. Aurum wrote: "I have a bunch of stale unvalidated WUs as well. I wonder if they're from a bad batch that got cancelled before everything finished?" Don't be afraid or surprised, you are not alone. Me too! About 50% of my results are inconclusive. So, of the total returned, 50% are valid and 50% are waiting for a wingman. Don't cry: I have WUs that have been waiting for a wingman since May. The problem is that those WUs are NOT sent again.

marsinph Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0	Message 1096 - Posted: 25 Sep 2020, 14:35:45 UTC - in response to Message 1089. damotbe wrote: "The scheduler manages both "nwchem" and "nwchem long" WUs. It seems that the mix is not well balanced... I implemented something to accelerate wingman resends, but it seems that too many "nwchem" results were invalid, and that saturates the scheduler. I'll try several other things after the Formula BOINC sprint." Thank you. As far as I know and remember, it is a setting inside the scheduler. I am sure Aurum can help. If I am not wrong, he explained it on another project a long time ago.

Aurum Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0	Message 1097 - Posted: 25 Sep 2020, 14:43:16 UTC Everything I know about BOINC servers is thanks to Doctor Google, i.e. I know nothing, as Sergeant Schultz would say.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1102 - Posted: 25 Sep 2020, 15:40:01 UTC - in response to Message 1097. My tasks: waiting for validation (13503), validation inconclusive (14374). So, we need to wait ;)

marsinph Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0	Message 1104 - Posted: 27 Sep 2020, 20:46:52 UTC - in response to Message 1102. damotbe wrote: "My tasks: waiting for validation (13503), validation inconclusive (14374). So, we need to wait ;)" Damot, why do you produce new WUs when there are WUs that have been waiting to be resent for several months? https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1386777 https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1387129 https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1384893 Now I see you have a huge number of inconclusive results. What is the problem? Your host? The WUs you produce? How they are handled? You also have the same ratio as everyone, I think. Look: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1764 Best friendly regards from French-speaking Belgium!

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1108 - Posted: 29 Sep 2020, 8:21:37 UTC - in response to Message 1104. As often, the causes are multiple. First of all, many hosts produce errors (especially with VirtualBox). Then, the Formula BOINC sprint generated a large number of requests, but after three days many users cancelled their tasks. I tried to tweak the scheduler, but it is overwhelmed by the situation. I hope to get back to normal in a few weeks. To do so, it would take more computing power than mine to clear out the old tasks. Finally, there is the normal functioning of the project: we explore unstable spaces, and the calculations are unstable by definition. An unstable molecule leads almost systematically to an error or a divergence (inconclusive). I set the number of points to compensate for this reality and for the non-deterministic computation time. It is a project for passionate and persevering volunteers ;-)

Tullio Joined: 5 Sep 20 Posts: 103 Credit: 2,142,600 RAC: 0	Message 1111 - Posted: 29 Sep 2020, 16:06:55 UTC I have installed a Linux virtual machine with SuSE Tumbleweed, a development version, on a Windows 10 PC with plenty of RAM, and it is running nwchem. But since I enrolled it in Science United, I cannot choose the projects it runs; that is done by Science United. I can see it running nwchem with the "top" command, though. Not to start a Windows vs Linux war, but when a task is completed and validated on my Windows 10 PC using VirtualBox, it is typically faster than its Linux companion, even when that one has a faster CPU than my Intel i5. You can check my completed tasks and verify this. Tullio.

Jim1348 Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0	Message 1112 - Posted: 29 Sep 2020, 17:48:25 UTC - in response to Message 1111. When I tried to verify that on my Win7 64-bit machine, I had so many errors that I could not get very far. My conclusion was that so many cores were left idle that the remaining work units of course ran faster on the other cores. That is always the case with virtual cores. They run faster when lightly loaded. But there may be other reasons as well, and your experience may be different.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1115 - Posted: 30 Sep 2020, 11:54:19 UTC - in response to Message 1112. Comparing tasks that already have a lot of variability, some of which are virtualized, is very risky.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1121 - Posted: 2 Oct 2020, 12:13:45 UTC - in response to Message 1115. I tweaked the feeder and scheduler again. Normally, retries are now accelerated AND the oldest jobs are preferred. I also doubled the shared-memory capacity. Let's see!
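
As a rough illustration of the ordering described in Message 1121 (retries first, then the oldest jobs), here is a minimal Python sketch. It is not the BOINC feeder's code: the `Job` fields and the `feeder_order` function are hypothetical names chosen for this example, and the real feeder and scheduler settings may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Job:
    wu_id: int       # workunit identifier (made-up values below)
    created: int     # creation time in arbitrary units; smaller means older
    is_retry: bool   # True if this is a resend of an already issued workunit


def feeder_order(jobs):
    """Return jobs with retries first and, within each group, oldest first."""
    # False sorts before True, so "not is_retry" puts the retries at the front.
    return sorted(jobs, key=lambda j: (not j.is_retry, j.created))


jobs = [
    Job(wu_id=3, created=120, is_retry=False),
    Job(wu_id=1, created=10, is_retry=True),
    Job(wu_id=2, created=50, is_retry=True),
    Job(wu_id=4, created=5, is_retry=False),
]
order = [j.wu_id for j in feeder_order(jobs)]
print(order)
```

With such an ordering, new workunits are only reached once the retry queue is drained, which is the kind of side effect discussed in the following posts.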

xii5ku Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0	Message 1122 - Posted: 3 Oct 2020, 6:35:11 UTC - in response to Message 1121. Last modified: 3 Oct 2020, 7:10:51 UTC
damotbe wrote: "I tweaked the feeder and scheduler again. Normally, retries are now accelerated AND the oldest jobs are preferred. I also doubled the shared-memory capacity. Let's see!"
I am noticing that the majority of tasks in my results tables most recently were re-sends of rather old workunits which had been replicated several times by now. My own result statuses are mostly "Completed, validation inconclusive", "Validate error", and "Completed, can't validate" (just like those from the computers which ran previous replicas), with a few "Completed, waiting for validation" thrown in, and very, very rarely "Completed and validated". This change from mostly successfully completed new workunits to mostly unsuccessful old ones seems to coincide roughly with the time of your posting. So, this part of your plan evidently worked.
However: I am no longer receiving any new work now. Yet server_status shows plenty of tasks ready to send. My results tables at the web site show that I received tasks frequently until 3 Oct 2020, 0:07:33 UTC. After that, I received five more tasks in a time window between 1:56 and 1:58 UTC. Then nothing more.
Is this because my hosts returned so many inconclusive or invalid results during the last ~half day? I believe so. I started another client which has not been active at QuChemPedIA after September 27, and which had a good success rate until then. This client received the full complement of 128 tasks immediately, in the usual batches of 20 tasks per request. (All of these tasks were from old workunits too, which had failed several times before.)
Conclusion: I suspect the new preference for old, bad workunits which only produce faulty results lets the scheduler mark one host after another as unreliable. And then the scheduler no longer assigns work to them.
Edit: I think the large number of old workunits which are faulty but still haven't reached their maximum failure count will soon cause this project to run out of trusted hosts to which the server would want to assign replicas of previously failed workunits. What happens then? Will everything come to a standstill, or will the server then begin to send tasks from new workunits to the allegedly unreliable hosts again?
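
The suspicion in Message 1122 can be made concrete with a toy model. The Python sketch below assumes, purely for illustration, that a host's error rate is tracked as an exponential moving average and compared against a fixed threshold. This is not BOINC's actual reliability formula; `update_error_rate`, `is_reliable`, the weight 0.1, and the threshold 0.3 are all hypothetical.

```python
def update_error_rate(rate, failed, weight=0.1):
    """Hypothetical exponential moving average of a host's failures."""
    return (1 - weight) * rate + weight * (1.0 if failed else 0.0)


def is_reliable(rate, max_error_rate=0.3):
    """Hypothetical rule: a host is trusted while its error rate stays low."""
    return rate <= max_error_rate


# A host with a good track record receives five faulty old workunits in a row.
rate = 0.05
history = []
for _ in range(5):
    rate = update_error_rate(rate, failed=True)
    history.append(is_reliable(rate))

print(history)
```

In this model a previously good host loses its trusted status after only a few bad workunits, even though the failures come from the workunits rather than from the host.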

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1123 - Posted: 3 Oct 2020, 8:39:33 UTC - in response to Message 1122. Thank you for this great feedback! Side effects are always unpredictable and harmful... I had been very permissive in the configuration of "reliable" hosts for replication. Normally, there is no constraint for initial jobs (the first 2 wingmen), but the new scheduler setting does not allow jobs of this type to be kept in reserve... It's discouraging! I'm going to relax the constraints again (they are already at 1 month of turnaround). I'm also going to check whether a default configuration exists somewhere!

xii5ku Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0	Message 1125 - Posted: 3 Oct 2020, 16:27:09 UTC - in response to Message 1123. Just now, I switched four of my computers which no longer received work back to QuChemPedIA, and they were given new work right away. Thanks for your quick response and adjustment.

damotbe Volunteer moderator Project administrator Project developer Project tester Project scientist Help desk expert Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0	Message 1127 - Posted: 4 Oct 2020, 7:07:07 UTC - in response to Message 1125. Thanks to you! You saved my day! You put your finger on THE problem.