The aborted and resend WU
Author | Message |
---|---|
Send message Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0 |
Hello Damot. I have a lot of WUs waiting for a wingman. Some are from May 2020! Four months ago!!! And they have still not been resent. Why not resend them first, before creating new WUs? Everyone would be happy, especially users like me who have been crunching for months. Sorry for those who only joined because of the sprint, and for users who stockpile WUs. I know you are alone and have only two hands. I also suggest setting a limit on the number of WUs per host! Best regards |
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
I have a bunch of stale unvalidated WUs as well. I wonder if they're from a bad batch that got cancelled before everything finished? |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
The scheduler manages both "nwchem" and "nwchem long" WUs. It seems that the mix is not well balanced... I implemented something to accelerate wingman resends, but it seems that too many "nwchem" tasks were invalid, and that saturates the scheduler. I'll try several other things after the Formula BOINC sprint. |
Send message Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0 |
"I have a bunch of stale unvalidated WUs as well. I wonder if they're from a bad batch that got cancelled before everything finished?" Don't be afraid or surprised, you are not alone. Me too! About 50% are inconclusive. Then, of the total returned, 50% are valid and 50% are waiting for a wingman. Don't cry, I have WUs from May that are still waiting for a wingman. The problem is that those WUs are NOT sent out again. |
Send message Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0 |
"The scheduler manages both "nwchem" and "nwchem long" WUs. It seems that the mix is not well balanced... I implemented something to accelerate wingman resends, but it seems that too many "nwchem" tasks were invalid, and that saturates the scheduler. I'll try several other things after the Formula BOINC sprint." Thank you. As far as I know and remember, it is a setting inside the scheduler. I am sure Aurum can help. If I am not wrong, he explained it on another project a long time ago. |
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
Everything I know about BOINC servers is thanks to Doctor Google, i.e. I know nothing as Sergeant Schultz would say. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
My tasks: waiting for validation (13503), validation inconclusive (14374). So, need to wait ;) |
Send message Joined: 13 Nov 19 Posts: 21 Credit: 2,596,565 RAC: 0 |
"my tasks :" Damot, why do you produce new WUs when there are WUs that have been waiting to be resent for several months!!! https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1386777 https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1387129 https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1384893 Now I see you have a huge number of inconclusive results. What is the problem? Your host? The WUs you produce? The way they are handled? You also have the same ratio as everyone, I think; look: https://quchempedia.univ-angers.fr/athome/results.php?hostid=1764 Best friendly regards from French-speaking Belgium! |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
As often, the causes are multiple. First of all, many hosts produce errors (especially with VirtualBox). Then, the Formula BOINC sprint generated a large number of requests, but after three days many users cancelled the tasks. I tried to tweak the scheduler, but it is overwhelmed by the situation. I hope to get back to normal in a few weeks. To do so, it would take more computing power than mine to empty the old tasks. Finally, there is the normal functioning of the project: we explore unstable spaces, and the calculations are unstable by definition. An unstable molecule leads almost systematically to an error or a divergence (inconclusive). I set the number of points to compensate for this reality and for the non-deterministic aspect of the computation time. It is a project for passionate and persevering volunteers ;-) |
Send message Joined: 5 Sep 20 Posts: 103 Credit: 2,142,600 RAC: 0 |
I have installed a Linux virtual machine with SuSE Tumbleweed, a development version, on a Windows 10 PC with plenty of RAM, and it is running nwchem. But since I enlisted it in Science United, I cannot choose the projects it runs; this is done by Science United. I can see it running nwchem with the "top" command, though. Not to start the Windows vs Linux war, but when a task is completed and validated on my Windows 10 PC using VirtualBox, it is typically faster than on its Linux companion, even if that one has a faster CPU than my Intel i5. You can check my completed tasks and verify this. Tullio. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
When I tried to verify that on my Win7 64-bit machine, I had so many errors that I could not get very far. My conclusion was that so many cores were left idle that the remaining work units of course ran faster on the other cores. That is always the case with virtual cores. They run faster when lightly loaded. But there may be other reasons as well, and your experience may be different. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Comparing tasks that already have a lot of variability and some of which are virtualized is very hazardous. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
I tweaked the feeder and scheduler again. Normally, retries are now accelerated AND the oldest jobs are preferred. I also doubled the capacity in shared memory. Let's see! |
Send message Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
damotbe wrote: "I tweaked the feeder and scheduler again. Normally, retries are now accelerated AND the oldest jobs are preferred. I also doubled the capacity in shared memory. Let's see!" I am noticing that most of the recent tasks in my results tables are re-sends of rather old workunits which have been replicated several times by now. My own result statuses are mostly "Completed, validation inconclusive", "Validate error", and "Completed, can't validate" (just like those from the computers which ran previous replicas), with a few "Completed, waiting for validation" thrown in and a very, very rare "Completed and validated". This change from mostly successfully completed new workunits to mostly unsuccessful old ones seems to coincide roughly with the time of your posting. So, this part of your plan evidently worked. However: I am no longer receiving any new work now. Yet server_status shows plenty of tasks ready to send. My results tables at the web site show that I received tasks frequently until 3 Oct 2020, 0:07:33 UTC. After that, I received five more tasks in a time window between 1:56 and 1:58 UTC. Then nothing more. Is this because my hosts returned so many inconclusive or invalid results during the last half day or so? I believe so. I started another client which had not been active at QuChemPedIA after September 27 and had a good success rate until then. This client received the full complement of 128 tasks immediately, in the usual batches of 20 tasks per request. (All of these tasks were from old workunits too, which had failed several times before.) Conclusion: I suspect the new preference for old, bad workunits which only produce faulty results lets the scheduler mark one host after another as unreliable. And then the scheduler no longer assigns work to them. 
Edit: I think the large number of old workunits which are faulty but still haven't reached their maximum failure count will soon cause this project to run out of trusted hosts to which the server would want to assign replicas of previously failed workunits. What happens then? Will everything come to a standstill, or will the server then begin to send tasks from new workunits to the allegedly unreliable hosts again? |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Thank you for this great feedback! Side effects are always unpredictable and deleterious... I had been very permissive in the configuration of "reliable" hosts for replication. Normally, there is no constraint for initial jobs (the first 2 wingmen), but the new scheduler setting does not allow jobs of this type to be kept in reserve... It's discouraging! I'm going to relax the constraints again (they are already at 1 month of turnaround). I'm also going to check whether a default configuration exists somewhere! |
Send message Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
Just now, I switched four of my computers which no longer received work back to QuChemPedIA, and they were given new work right away. Thanks for your quick response and adjustment. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Thanks to you! You saved my day! You put your finger on THE problem. |
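
The starvation effect identified in this thread (old, failing workunits are preferred, hosts that crunch them lose their "reliable" status, and new work never reaches anyone) can be illustrated with a toy model. This is only a sketch under stated assumptions, not BOINC's actual scheduler or feeder code: the names `Job`, `Host`, `run`, `max_error_rate` and `cache_slots` are all hypothetical, and the threshold values are made up for the illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    is_retry: bool   # True for a resend of an old workunit
    will_fail: bool  # True if the result ends up invalid or inconclusive


@dataclass
class Host:
    valid: int = 0
    invalid: int = 0

    def reliable(self, max_error_rate: float) -> bool:
        # A host with no history counts as reliable in this toy model.
        total = self.valid + self.invalid
        return total == 0 or self.invalid / total <= max_error_rate


def run(hosts, backlog, rounds, max_error_rate=0.5, cache_slots=4):
    """Serve jobs oldest-first through a bounded cache.

    Returns the number of jobs handed out in each round.
    """
    backlog = deque(backlog)
    cache = []
    sent_per_round = []
    for _ in range(rounds):
        sent = 0
        for host in hosts:
            # "Feeder": refill free cache slots with the oldest pending jobs.
            while len(cache) < cache_slots and backlog:
                cache.append(backlog.popleft())
            # "Scheduler": retries are only given to reliable hosts.
            job = next(
                (j for j in cache
                 if not j.is_retry or host.reliable(max_error_rate)),
                None,
            )
            if job is None:
                continue  # nothing in the cache that this host may receive
            cache.remove(job)
            sent += 1
            if job.will_fail:
                host.invalid += 1
            else:
                host.valid += 1
        sent_per_round.append(sent)
    return sent_per_round


# Oldest first: 20 failing retries, then 10 healthy new jobs.
backlog = [Job(True, True)] * 20 + [Job(False, False)] * 10

strict = run([Host(valid=3), Host(valid=3)], backlog, rounds=6)
relaxed = run([Host(valid=3), Host(valid=3)], backlog, rounds=6,
              max_error_rate=1.0)
print(strict)   # [2, 2, 2, 2, 0, 0]
print(relaxed)  # [2, 2, 2, 2, 2, 2]
```

With the strict threshold, both hosts stop receiving work once their error rate passes 50%, while the cache stays full of retries they are no longer allowed to take, so the healthy new jobs behind them are never reached. Relaxing the threshold keeps work flowing, which matches the behaviour reported after the constraints were loosened.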
©2024 Benoit DA MOTA - LERIA, University of Angers, France