No new task sent out when wingman aborted or got a validation error
Author | Message |
---|---|
Send message Joined: 14 Oct 19 Posts: 7 Credit: 2,614,863 RAC: 0 |
Hi Damotbe! I recently discovered that I have several WUs pending and one marked "validation inconclusive", most of them from early May. In each case my wingman either aborted the task or got a validation error, yet no replacement task has been sent out, although days or even months have passed since my wingman and I reported our tasks (as aborted, erroneous, etc.).

The pending WUs are:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (2 days since abortion/error)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354680 (1 month and 4 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354050 (1 month and 1 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353751 (1 month and 9 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353410 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353359 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353159 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353042 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353066 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353111 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1352934 (1 month...)

And the one in "validation inconclusive":
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (3 days)

Normally (in other BOINC projects) new replicated tasks are sent out only hours after one of the initially replicated tasks is reported unsuccessful. Does this mean that I never will get any credits for those tasks?? Kindest regards, Gunnar |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
"Does this mean that I never will get any credits for those tasks??" See: https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=98&postid=869#869 |
Send message Joined: 14 Oct 19 Posts: 7 Credit: 2,614,863 RAC: 0 |
Thanks Jim1348! I completely missed that thread. Guess it's just to wait then, and hope the creds will arrive before X-mas. ;-) Happy Crunching!!! //Gunnar |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
"Guess it's just to wait then, and hope the creds will arrive before X-mas. ;-)" I am in the same boat. We all are. I just hope it does not sink before getting into port. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Yeah, it's insane that re-sending isn't a priority. On my end, it takes a really long time to complete the batches... |
Send message Joined: 6 Nov 19 Posts: 8 Credit: 156,845 RAC: 0 |
"Yeah, it's insane that re-sending isn't a priority. On my end, it takes a really long time to complete the batches..." I think there is a way to do that. Look at https://boinc.berkeley.edu/trac/wiki/ProjectOptions under the header "Accelerating retries", although you will have to figure out how it needs to be set to work. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
I have already read this documentation several times, and it is still not clear to me how to choose the values (and experimenting in production is a little bit dangerous).

<reliable_on_priority>X</reliable_on_priority>
Results with priority at least reliable_on_priority are treated as "need-reliable". They'll be sent preferentially to reliable hosts.

I don't define reliable hosts, so maybe they just exist... But how do I choose X? No guideline, no example...

<reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround>
Hosts whose average turnaround is at most reliable_max_avg_turnaround and that have at least 10 consecutive valid results are considered 'reliable'. Make sure you set this low enough that a significant fraction (e.g. 25%) of your hosts qualify.

So no, reliable hosts do have to be defined by me, through a turnaround in seconds... I need to choose this value so that 25% of the hosts qualify! Do you know how variable my jobs are? From minutes to days! So the best I can do is to set the threshold to something huge, like 1 month of turnaround, to qualify all hosts? (Not sure about this.) All hosts? No, only those with at least 10 consecutive valid results... I have to pray that some exist!

So I queried the SQL database directly for the hosts' avg_turnaround, and what I can see is that turnaround does not correspond to task runtime... Another observation: hosts without credit have an avg_turnaround of 0.00, which makes them the best in terms of reliability... WTF! After many SQL queries, I decided to look at my own computers (very reliable): their avg_turnaround ranges from 2 days to 10 days. Then I applied a 10-day turnaround threshold to all hosts and... more than 90% of the hosts have a turnaround less than or equal to 10 days! But how will the turnaround evolve with the long nwchem tasks (sometimes very long)? Back to my first intuition: let's say 1 month (2592000 secs, rounded to 2600000).

<reliable_reduced_delay_bound>X</reliable_reduced_delay_bound>
When a need-reliable result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so).

Clear to me and, miracle, an example!

<reliable_priority_on_over>X</reliable_priority_on_over>
<reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error>
If reliable_priority_on_over is nonzero, increase the priority of duplicate jobs by that amount over the job's base priority. Otherwise, if reliable_priority_on_over_except_error is nonzero, increase the priority of duplicates caused by timeout (not error) by that amount. (Typically only one of these is nonzero, and is equal to reliable_on_priority.)

Guidelines! So, I am not sure I understand everything, but the solution I have in mind looks like this:

<reliable_on_priority>42</reliable_on_priority>
<reliable_max_avg_turnaround>2600000</reliable_max_avg_turnaround>
<reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
<reliable_priority_on_over>42</reliable_priority_on_over>

Any advice? Do I understand well? |
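Editor's sketch: the question of which turnaround threshold qualifies enough hosts can be explored offline before touching the production scheduler. The Python below is only an illustration of the two conditions quoted from the documentation (average turnaround at most the threshold, and at least 10 consecutive valid results). The `Host` class, the `consecutive_valid` field name and the sample values are assumptions made up for this example, not real project data or the scheduler's actual code.

```python
from dataclasses import dataclass

DAY = 86400                  # seconds in one day
MIN_CONSECUTIVE_VALID = 10   # streak length quoted from the documentation


@dataclass
class Host:
    avg_turnaround: float    # seconds; 0.0 for hosts that never returned work
    consecutive_valid: int   # hypothetical name for the valid-result streak


def is_reliable(host: Host, max_avg_turnaround: float) -> bool:
    """Both quoted conditions must hold for a host to count as reliable."""
    return (host.avg_turnaround <= max_avg_turnaround
            and host.consecutive_valid >= MIN_CONSECUTIVE_VALID)


def reliable_fraction(hosts, max_avg_turnaround: float) -> float:
    """Share of hosts that would qualify under a candidate threshold."""
    if not hosts:
        return 0.0
    qualified = sum(is_reliable(h, max_avg_turnaround) for h in hosts)
    return qualified / len(hosts)


# Made-up sample covering the cases discussed in the post.
hosts = [
    Host(0.0, 0),          # no credit yet: turnaround 0.00 but no valid streak
    Host(2 * DAY, 25),     # fast and consistent
    Host(10 * DAY, 12),    # slower, still consistent
    Host(10 * DAY, 3),     # fast enough, but streak too short
    Host(45 * DAY, 40),    # consistent, but slower than one month
]

for days in (2, 10, 30):
    frac = reliable_fraction(hosts, days * DAY)
    print(f"{days:>2} days -> {frac:.0%} of hosts qualify")
```

Under these assumptions, a host with a 0.00 turnaround does not automatically count as reliable, because the streak condition excludes it. Whether the real scheduler behaves the same way would need to be checked against the BOINC source.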
Send message Joined: 3 Oct 19 Posts: 43 Credit: 40,548,179 RAC: 0 |
"Any advice? Do I understand well?" 42 is what I would have set them to! You could try asking over here for guidance: https://groups.google.com/a/ssl.berkeley.edu/forum/#!forum/boinc_projects |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
I took advantage of the annual server reboot to change the scheduler policy. I hope it will go in the right direction! Let me know if you notice any problems or strange behaviors. |
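Editor's sketch: to make the intended effect of such a policy change concrete, the retry rules quoted earlier in the thread can be modelled in a few lines of Python. This follows only the wording of the quoted documentation; it is not the BOINC scheduler's implementation. The `SchedulerOptions` class, the function names, the base priority of 0 and the 14-day delay bound are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SchedulerOptions:
    # Defaults mirror the values proposed in the thread.
    reliable_on_priority: int = 42
    reliable_reduced_delay_bound: float = 0.5
    reliable_priority_on_over: int = 42
    reliable_priority_on_over_except_error: int = 0


def retry_priority(base_priority: int, caused_by_timeout: bool,
                   opts: SchedulerOptions) -> int:
    """Priority of a duplicate (retry) job, following the quoted rules."""
    if opts.reliable_priority_on_over:
        # Boost every duplicate, whatever caused it.
        return base_priority + opts.reliable_priority_on_over
    if opts.reliable_priority_on_over_except_error and caused_by_timeout:
        # Boost only duplicates caused by a timeout, not by an error.
        return base_priority + opts.reliable_priority_on_over_except_error
    return base_priority


def needs_reliable(priority: int, opts: SchedulerOptions) -> bool:
    """A result is 'need-reliable' once its priority reaches the threshold."""
    return priority >= opts.reliable_on_priority


def effective_delay_bound(delay_bound: float, priority: int,
                          host_is_reliable: bool,
                          opts: SchedulerOptions) -> float:
    """Shorten the deadline only for need-reliable work on a reliable host."""
    if needs_reliable(priority, opts) and host_is_reliable:
        return delay_bound * opts.reliable_reduced_delay_bound
    return delay_bound


opts = SchedulerOptions()
base = 0                       # assumed base priority of an ordinary job
delay = 14 * 86400             # assumed 14-day delay bound, in seconds

retry = retry_priority(base, caused_by_timeout=False, opts=opts)
print("retry priority:", retry)
print("need-reliable:", needs_reliable(retry, opts))
print("deadline on reliable host (s):",
      effective_delay_bound(delay, retry, True, opts))
```

With these values, a first-time job at the assumed base priority stays below the threshold and keeps its full deadline, while a retry reaches the threshold and gets half the delay bound when sent to a reliable host.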
©2024 Benoit DA MOTA - LERIA, University of Angers, France