Some hosts are delaying (long) batch completion

Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 989 - Posted: 2 Aug 2020, 7:33:48 UTC

Some hosts download work and do not crunch anything.

E.g. 143 540 2447 2537

If we exclude the case that they are working offline, there is something wrong here.
Henk Haneveld

Joined: 6 Nov 19
Posts: 8
Credit: 156,845
RAC: 0
Message 990 - Posted: 2 Aug 2020, 13:33:59 UTC - in response to Message 989.  

Some hosts download work and do not crunch anything.

E.g. 143 540 2447 2537

If we exclude the case that they are working offline, there is something wrong here.

On what facts do you base your statement? The fact that they have not returned anything yet does not mean anything.

These results have a very long return deadline. Until they time out, these hosts can still do all the work they have downloaded.
Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 991 - Posted: 2 Aug 2020, 15:27:39 UTC

Just thinking...

It's an uncommon behaviour that causes a high number of pending WUs, and it wouldn't be a problem if the deadlines weren't so long.
I would like to be sure that my completed work will get validated one day. I'm afraid that the resends could go to non-crunching hosts again. After 6 months or more, we don't know whether this project will still be online.

I guess it's someone bunkering for challenges like Formula BOINC sprints; otherwise it would be a waste of time to wait for resends when active users could crunch them as soon as possible.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 993 - Posted: 3 Aug 2020, 14:58:10 UTC - in response to Message 991.  

There are always extreme cases; don't generalize from them. You're whining about points, and on my end, it's about the scientific results I'm waiting for... At the moment, the project is running and we don't plan to shut it down.

The deadlines are too long? In my experience, if they are too short, there are a lot of failures. After a certain number of failures, the workunit is abandoned and I don't get my result.
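
To illustrate the trade-off described above, here is a minimal Python sketch of how a workunit might end up abandoned. It is a toy model, not BOINC server code: the function name, the outcome labels, and the limits `min_quorum=2` and `max_failures=3` are assumptions chosen for illustration, and it treats a missed deadline as a failure, as the post describes. The project's real settings are not stated in this thread.

```python
from typing import Iterable


def workunit_outcome(results: Iterable[str],
                     min_quorum: int = 2,
                     max_failures: int = 3) -> str:
    """Toy model of one workunit's fate (hypothetical, not BOINC code).

    Each entry in `results` is "ok" for a returned result, or anything
    else (e.g. "timeout", "error") for a failed one. A missed deadline
    is counted as a failure here, which is an assumption of this sketch.
    """
    successes = 0
    failures = 0
    for outcome in results:
        if outcome == "ok":
            successes += 1
        else:
            failures += 1
        # Enough good results: the workunit can be validated.
        if successes >= min_quorum:
            return "validated"
        # Too many failures: the workunit is given up.
        if failures >= max_failures:
            return "abandoned"
    return "pending"


# Short deadlines turn slow hosts into timeouts, which pushes the
# workunit toward the failure limit before a quorum is reached.
print(workunit_outcome(["timeout", "ok", "timeout", "timeout"]))
# Longer deadlines give the same slow hosts time to finish.
print(workunit_outcome(["timeout", "ok", "ok"]))
```

The sketch ignores whether the successful results actually agree with each other, which is a separate step on a real server.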
Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 994 - Posted: 3 Aug 2020, 15:16:17 UTC - in response to Message 993.  

There are always extreme cases; don't generalize from them. You're whining about points, and on my end, it's about the scientific results I'm waiting for... At the moment, the project is running and we don't plan to shut it down.

Your point is logically right. I'm not whining; I'm just raising a scenario that, as a volunteer, I would really hate. If the project runs long enough, I have nothing more to say about it.

The deadlines are too long? In my experience, if they are too short, there are a lot of failures.

Good.

After a certain number of failures, the workunit is abandoned and I don't get my result.

Can't you run it locally on a trusted machine to get your results?


Anyway, thanks for your quick response. ;)
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 999 - Posted: 3 Aug 2020, 15:31:23 UTC - in response to Message 994.  

You're welcome. We don't have enough computing power locally. We were hoping to be able to divide the calculation time with BOINC, but it's not that simple and it's very time-consuming.
Luigi R.

Joined: 7 Nov 19
Posts: 31
Credit: 4,245,903
RAC: 0
Message 1120 - Posted: 2 Oct 2020, 8:23:56 UTC - in response to Message 989.  

E.g. 143 540 2447 2537

Host 2537 reported all (completed) tasks on 25 September; hosts 143, 540 and 2447 did not crunch anything before the deadline.
Who was 75% right? :(

P.S. Those hosts, if offline, could still report tasks successfully before the server resends them to other volunteers.
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 1287 - Posted: 24 Dec 2020, 11:37:38 UTC
Last modified: 24 Dec 2020, 11:42:28 UTC

@damotbe,
of the 469 nwchem-long tasks which are currently in progress, 359 are located on two compute servers of mine. I worked on nwchem-long until recently, but have suspended this work until January 1. Then these servers will be fully available again and will complete this work (plus any resends which they might receive in the process).

These last remaining nwchem-long workunits have a disproportionately large number of inconclusive results, together with the occasional aborted or error results. That is, the chances that these last WUs will end up with two valid results are rather slim. Though if two hosts with the very same hardware/software configuration turn in results, would this improve the chance of a valid outcome somewhat?

Besides the slim chances of validation, run times of some of these tasks now easily exceed a week. Nevertheless, the mentioned hosts are allocated for this work, restarting in January, for as long as it takes for these workunits to reach whatever conclusion. (It's very stable hardware too, e.g. with ECC RAM, and therefore suited for long-running work.)

PS,
for the information of users who have the other 110 currently remaining nwchem-long tasks queued: The validation rate which I have seen with nwchem-long went down very sharply during the first half of December, even though I typically was in a position to contribute both of the results needed for validation. If points per day are important to you, then these last tasks are no longer viable from that perspective, because of the high ratio of inconclusive and error results, and because of the sometimes dramatically long runtimes.
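
As a quick check of the figures quoted in this post, the short Python snippet below recomputes the number of tasks held elsewhere and the share held on the two servers. The variable names are made up for illustration; only the counts 469 and 359 come from the post.

```python
# Task counts quoted in the post (24 Dec 2020).
in_progress = 469      # nwchem-long tasks currently in progress
on_two_servers = 359   # of those, located on the poster's two servers

# Tasks queued on everyone else's hosts.
elsewhere = in_progress - on_two_servers

# Fraction of all in-progress tasks held on the two servers.
share = on_two_servers / in_progress

print(elsewhere)         # 110, matching the figure in the PS
print(f"{share:.1%}")    # 76.5%
```

So roughly three quarters of the outstanding nwchem-long work was waiting on those two machines at the time.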
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 1305 - Posted: 4 Jan 2021, 8:49:40 UTC - in response to Message 1287.  

It is highly probable that these molecules are inconclusive. I understand that it's very expensive and not worth the cost. This is really the major disadvantage of chemical space exploration. Whatever you choose to do, thank you for your help.
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 1307 - Posted: 7 Jan 2021, 17:56:59 UTC
Last modified: 7 Jan 2021, 17:57:40 UTC

All the nwchem-long work which I still had left from December, plus a dozen resends which I received in January, is finished by now, except for 1 last task which is still running. This went a lot quicker than I thought, mostly because I received far fewer resends than I anticipated.

(I expected inconclusive/invalid results from one computer of mine to turn into resends to another computer of mine, as still happened in December. But apparently the high ratio of inconclusive/invalid nwchem-long results from my hosts in January caused the scheduler to assign these resends to other hosts.)

So in short, of the 216 nwchem-longs which are in progress at this time, I have only 1 left now, and the rest became somebody else's problem in the meantime. ;-)

©2024 Benoit DA MOTA - LERIA, University of Angers, France