No new task sent out when wingman aborted or got a validation error

Gunnar Hjern

Joined: 14 Oct 19
Posts: 7
Credit: 2,614,863
RAC: 0
Message 876 - Posted: 11 Jun 2020, 17:10:39 UTC

Hi Damotbe!

I recently discovered that I have several WUs pending and one "validation inconclusive",
most of them from early May, where my wingman either aborted or got a validation error,
and that still no new tasks have been sent out, although it has been several days,
and in some cases over a month, since my wingman and I reported the tasks (as aborted/erroneous etc.).

The pending WUs are:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (2 days since abortion/error)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354680 (1 month and 4 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1354050 (1 month and 1 day...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353751 (1 month and 9 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353410 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353359 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353159 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353042 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353066 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1353111 (1 month and 2 days...)
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1352934 (1 month...)

and the one in validation inconclusive:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=1377219 (3 days)

Normally (in other BOINC projects) new replicated tasks are sent out only hours after
one of the initially replicated tasks is reported unsuccessful.

Does this mean that I never will get any credits for those tasks??

Kindest regards,
Gunnar
ID: 876 · Rating: 0
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 877 - Posted: 11 Jun 2020, 18:09:16 UTC - in response to Message 876.  

Does this mean that I never will get any credits for those tasks??

https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=98&postid=869#869
ID: 877 · Rating: 0
Gunnar Hjern

Joined: 14 Oct 19
Posts: 7
Credit: 2,614,863
RAC: 0
Message 878 - Posted: 11 Jun 2020, 20:36:54 UTC - in response to Message 877.  

Thanks Jim1348!
I completely missed that thread.
Guess it's just to wait then, and hope the creds will arrive before X-mas. ;-)
Happy Crunching!!!
//Gunnar
ID: 878 · Rating: 0
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 879 - Posted: 11 Jun 2020, 20:38:37 UTC - in response to Message 878.  

Guess it's just to wait then, and hope the creds will arrive before X-mas. ;-)

I am in the same boat. We all are. I just hope it does not sink before getting into port.
ID: 879 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 880 - Posted: 12 Jun 2020, 7:23:31 UTC - in response to Message 879.  

Yeah, it's insane that re-sending isn't a priority. On my end, it takes a really long time to complete the batches...
ID: 880 · Rating: 0
Henk Haneveld

Joined: 6 Nov 19
Posts: 8
Credit: 156,845
RAC: 0
Message 881 - Posted: 12 Jun 2020, 12:24:26 UTC - in response to Message 880.  

Yeah, it's insane that re-sending isn't a priority. On my end, it takes a really long time to complete the batches...

I think there is a way to do that. Look at:

https://boinc.berkeley.edu/trac/wiki/ProjectOptions

under the header "Accelerating retries", although you will have to figure out how it needs to be set to work.
ID: 881 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 882 - Posted: 12 Jun 2020, 13:14:26 UTC - in response to Message 881.  

I have already read this documentation several times... It is not very clear to me how to choose the values (and experimenting in production is a little bit dangerous).

<reliable_on_priority>X</reliable_on_priority>
    Results with priority at least reliable_on_priority are treated as "need-reliable". They'll be sent preferentially to reliable hosts. 

I haven't defined reliable hosts, so maybe the notion just exists by default...
But how do I choose X? No guideline, no example...

<reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround>
    Hosts whose average turnaround is at most reliable_max_avg_turnaround and that have at least 10 consecutive valid results are considered 'reliable'. Make sure you set this low enough that a significant fraction (e.g. 25%) of your hosts qualify. 

No: reliable hosts do have to be defined by me, with a turnaround in seconds... I need to choose this value to be sure that 25% of the hosts qualify! Do you know the variability of my jobs? From minutes to days! So the best I can do is to set the threshold to something huge, like 1 month of turnaround, to qualify all hosts? (I'm not sure about this.) All hosts? No, only those with at least 10 consecutive valid results... I have to pray that some exist!
So I queried the SQL database directly for host avg_turnaround, and what I see is that turnaround does not correspond to task runtime... Another observation: hosts without credit have an avg_turnaround of 0.00, so they look like the best in terms of reliability... WTF! After many SQL queries, I decided to look at my own computers (very reliable): their avg_turnaround ranges from 2 days to 10 days. Then I applied a 10-day turnaround threshold to all hosts and... more than 90% of the hosts have a turnaround of 10 days or less! But how will the turnaround evolve with the long nwchem tasks (sometimes very long)? Back to my first intuition: let's say 1 month (2592000 secs, rounded to 2600000).
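
For anyone who wants to repeat this kind of check, here is a minimal Python sketch of the "what fraction of hosts would qualify" test. It only applies the two conditions quoted from the documentation (average turnaround at most the threshold, and at least 10 consecutive valid results). The host list is made-up illustration data, not values from the project database, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
MIN_CONSECUTIVE_VALID = 10  # "at least 10 consecutive valid results" (quoted documentation)


def is_reliable(avg_turnaround, consecutive_valid, max_avg_turnaround):
    """Apply the two conditions from the quoted documentation."""
    return (avg_turnaround <= max_avg_turnaround
            and consecutive_valid >= MIN_CONSECUTIVE_VALID)


def reliable_fraction(hosts, max_avg_turnaround):
    """Fraction of hosts that would count as 'reliable' for a given threshold."""
    if not hosts:
        return 0.0
    qualifying = sum(
        1 for avg_turnaround, consecutive_valid in hosts
        if is_reliable(avg_turnaround, consecutive_valid, max_avg_turnaround)
    )
    return qualifying / len(hosts)


# Hypothetical (avg_turnaround in seconds, consecutive valid results) pairs.
EXAMPLE_HOSTS = [
    (0.0, 0),                     # assumed: a host that has not returned valid work yet
    (2 * SECONDS_PER_DAY, 25),
    (10 * SECONDS_PER_DAY, 12),
    (20 * SECONDS_PER_DAY, 40),
    (45 * SECONDS_PER_DAY, 15),
]

for days in (10, 30):
    threshold = days * SECONDS_PER_DAY
    share = reliable_fraction(EXAMPLE_HOSTS, threshold)
    print(f"{days} days ({threshold} s): {share:.0%} of hosts qualify")
```

In this sketch the host with a 0.00 avg_turnaround is excluded only by the consecutive-valid condition; whether the real scheduler treats such hosts the same way is something to verify rather than assume.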

<reliable_reduced_delay_bound>X</reliable_reduced_delay_bound>
    When a need-reliable result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so). 

Clear to me, and a miracle: an example!

<reliable_priority_on_over>X</reliable_priority_on_over>
<reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error>
     If reliable_priority_on_over is nonzero, increase the priority of duplicate jobs by that amount over the job's base priority. Otherwise, if reliable_priority_on_over_except_error is nonzero, increase the priority of duplicates caused by timeout (not error) by that amount. (Typically only one of these is nonzero, and is equal to reliable_on_priority.) 

Guidelines! I'm not sure I understand everything, but the solution I have in mind looks like this:

<reliable_on_priority>42</reliable_on_priority>
<reliable_max_avg_turnaround>2600000</reliable_max_avg_turnaround>
<reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
<reliable_priority_on_over>42</reliable_priority_on_over>


Any advices ? Do I understand well ?
ID: 882 · Rating: 0
PDW

Joined: 3 Oct 19
Posts: 43
Credit: 40,548,179
RAC: 0
Message 883 - Posted: 12 Jun 2020, 13:24:04 UTC - in response to Message 882.  
Last modified: 12 Jun 2020, 13:24:19 UTC

Any advices ? Do I understand well ?

42 is what I would have set them to !

You could try asking over here for guidance...
https://groups.google.com/a/ssl.berkeley.edu/forum/#!forum/boinc_projects
ID: 883 · Rating: 0
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1036 - Posted: 24 Aug 2020, 15:30:46 UTC - in response to Message 883.  

I took advantage of the annual server reboot to change the scheduler policy. I hope it will go in the right direction! Let me know if you notice any problems or strange behaviors.
ID: 1036 · Rating: 0


©2024 Benoit DA MOTA - LERIA, University of Angers, France