Checkpoint?

Message boards : Number crunching : Checkpoint?
zombie67 [MM]
Joined: 3 Oct 19
Posts: 11
Credit: 5,443,793
RAC: 0
Message 8 - Posted: 4 Oct 2019, 1:31:22 UTC

Do these tasks checkpoint? Both vbox and native?
Reno, NV
Team: SETI.USA
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 11 - Posted: 4 Oct 2019, 6:29:07 UTC - in response to Message 8.  

If I understand correctly, the VBox WUs checkpoint by themselves, while the native app (Linux only) doesn't checkpoint.

Of course, the tasks are very, very long...
Hal Bregg

Joined: 4 Oct 19
Posts: 4
Credit: 36,086
RAC: 0
Message 37 - Posted: 5 Oct 2019, 10:36:56 UTC - in response to Message 11.  

If I understand correctly, the VBox WUs checkpoint by themselves, while the native app (Linux only) doesn't checkpoint.

Of course, the tasks are very, very long...


Like 1 or 2 days? Or more?
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 38 - Posted: 5 Oct 2019, 11:17:18 UTC - in response to Message 37.  
Last modified: 5 Oct 2019, 11:18:19 UTC

I have eight that have now been running for a day on a Ryzen 2700 (Ubuntu), and I have seen run times of up to two days.
Once they get their work units fixed, that will probably go down, though.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 48 - Posted: 6 Oct 2019, 8:06:45 UTC - in response to Message 37.  

If I understand correctly, the VBox WUs checkpoint by themselves, while the native app (Linux only) doesn't checkpoint.

Of course, the tasks are very, very long...


Like 1 or 2 days? Or more?


I have seen jobs run for 1 week... The average is 20 hours.
Hal Bregg

Joined: 4 Oct 19
Posts: 4
Credit: 36,086
RAC: 0
Message 182 - Posted: 20 Oct 2019, 9:47:24 UTC

Would suspending the task and resuming it later continue from the same point?

So far I have been shutting down BOINC with the option to stop running tasks, and every time I did, the last WU lost all the progress it had made. I therefore had to abort the last task, as it was taking forever to complete. I can't keep my host on 24 hours a day at the moment.
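
For readers wondering what "checkpointing" changes here, below is a minimal Python sketch of the general idea, not the project's application and not the BOINC API. The function names, the JSON state file, and the toy "work" are all hypothetical. Progress is written to disk at intervals, and a restarted run loads that file instead of beginning from zero.

```python
import json
import os


def save_checkpoint(state, path):
    """Persist progress so that a later run can pick it up."""
    # Write to a temporary file and rename it over the old checkpoint,
    # so an interruption mid-write cannot leave a truncated file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_with_checkpoints(total_steps, path, checkpoint_every=100, stop_after=None):
    """Run a toy computation, resuming from `path` if a checkpoint exists.

    `stop_after` simulates a shutdown after that many steps in this session.
    Returns the in-memory state at the moment the session ends.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume instead of starting from zero

    steps_this_session = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_session >= stop_after:
            break  # work done since the last checkpoint is lost on restart
        state["acc"] += state["step"]  # stand-in for the real calculation
        state["step"] += 1
        steps_this_session += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

With a 1000-step task checkpointed every 100 steps, stopping the first session after 250 steps leaves step 200 on disk, so the next call resumes from 200 and only repeats 50 steps. An application that never writes such a file has nothing to resume from and restarts at step 0.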
Aurum
Joined: 14 Dec 19
Posts: 68
Credit: 45,744,261
RAC: 0
Message 348 - Posted: 16 Dec 2019, 12:13:46 UTC

Is there any hope that native QCP will get checkpointing added???
Some WUs are running a long time and a lot of electricity may be wasted if one has to start over for any of many reasons.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 351 - Posted: 16 Dec 2019, 21:27:58 UTC - in response to Message 348.  

That's a concern... The VM makes checkpoints, but has several other drawbacks. The native app (Linux) does not checkpoint, but it works better.
One hope is to get a funding grant or an enthusiastic volunteer.
mmonnin

Joined: 8 Oct 19
Posts: 13
Credit: 2,548,714
RAC: 0
Message 353 - Posted: 17 Dec 2019, 11:07:39 UTC - in response to Message 348.  

Is there any hope that native QCP will get checkpointing added???
Some WUs are running a long time and a lot of electricity may be wasted if one has to start over for any of many reasons.


If it runs for a long time then it's probably a bad task. Abort it.
[VENETO] boboviz

Joined: 13 Sep 19
Posts: 69
Credit: 399,347
RAC: 0
Message 362 - Posted: 18 Dec 2019, 20:36:29 UTC - in response to Message 351.  

One hope is to get a funding grant or an enthusiastic volunteer.

I know you are using NWChem (which is open source), but have you modified the code for this project?
If so, please publish the GitHub link, so that volunteers may be able to help you (as happened in the past with Tn-Grid, Poem and other projects).
Tomáš Brada

Joined: 15 Dec 19
Posts: 5
Credit: 88,316
RAC: 0
Message 363 - Posted: 18 Dec 2019, 20:40:46 UTC - in response to Message 351.  

That's a concern... The native app (Linux) does not checkpoint, but it works better. One hope is to get a funding grant or an enthusiastic volunteer.

That's actually a cool project.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 365 - Posted: 19 Dec 2019, 6:56:05 UTC - in response to Message 362.  

@boboviz: We don't modify the NWChem project, but if a modified version exists for BOINC, I am interested.

@tomas: Thank you!
Aurum
Joined: 14 Dec 19
Posts: 68
Credit: 45,744,261
RAC: 0
Message 370 - Posted: 21 Dec 2019, 14:52:50 UTC
Last modified: 21 Dec 2019, 14:54:26 UTC

You really need to make checkpointing work!!!
Is this useful? http://www.nwchem-sw.org/index.php/Release66:RT-TDDFT
NRESTARTS -- Number of restart checkpoints
This sets the number of run-time check points where the time-dependent complex density matrix is saved to file, allowing the simulation to be restarted from that point. By default this is set to 0. There is no significant computational cost to restart checkpointing, but of course there is some disk I/O cost (which may become somewhat significant for larger systems). For example, in the following example there will be 100 restart points, which corresponds to 1 backup every 100 time steps.
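
The arithmetic in that last sentence can be checked with a short Python sketch. It assumes the restart points are spread evenly over the run, and it infers the total of 10,000 time steps from the quoted figures (100 restart points, one every 100 steps). The function name is hypothetical and this is not NWChem code.

```python
def steps_between_restarts(total_steps, nrestarts):
    """Return how many time steps separate consecutive restart points.

    Assumes evenly spaced restart points. Returns None when nrestarts is 0,
    the quoted default, meaning no restart checkpoints are written.
    """
    if nrestarts <= 0:
        return None
    return total_steps // nrestarts


# Figures taken from the quoted text: 100 restart points, one per 100 steps,
# which implies a run of 10,000 time steps.
print(steps_between_restarts(10_000, 100))
```

Under the same even-spacing assumption, a crash would cost at most one interval of work, so a larger NRESTARTS trades extra disk I/O for less lost computation.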
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 371 - Posted: 21 Dec 2019, 16:44:13 UTC - in response to Message 370.  

This option looks nice. We have to run some tests to validate the behaviour!
I will add this to the to-do list.

Thank you for your help.
[VENETO] boboviz

Joined: 13 Sep 19
Posts: 69
Credit: 399,347
RAC: 0
Message 376 - Posted: 24 Dec 2019, 15:26:52 UTC - in response to Message 371.  

A native Windows app will be welcome!! :-)
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 377 - Posted: 24 Dec 2019, 16:17:37 UTC - in response to Message 376.  

A native Windows app will be welcome!! :-)

I agree wholeheartedly!
Michael H.W. Weber
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 779 - Posted: 17 Apr 2020, 18:40:10 UTC - in response to Message 371.  

This option looks nice. We have to run some tests to validate the behaviour!
I will add this to the to-do list.

Has this built-in checkpointing ability been enabled for the Linux native apps by now?

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 783 - Posted: 18 Apr 2020, 9:37:01 UTC - in response to Message 779.  

Yes, and that's usually pretty decent.
Michael H.W. Weber
Joined: 11 Apr 20
Posts: 23
Credit: 442,800
RAC: 0
Message 785 - Posted: 18 Apr 2020, 11:28:00 UTC - in response to Message 783.  

Yes, and that's usually pretty decent.

Excellent.

Michael.
President of Rechenkraft.net - Germany's first and largest distributed computing organization.
ProDigit

Joined: 16 Nov 19
Posts: 44
Credit: 21,290,949
RAC: 0
Message 795 - Posted: 19 Apr 2020, 21:40:12 UTC - in response to Message 351.  
Last modified: 19 Apr 2020, 21:45:30 UTC

That's a concern... The VM makes checkpoints, but has several other drawbacks. The native app (Linux) does not checkpoint, but it works better.
One hope is to get a funding grant or an enthusiastic volunteer.

Why do you say the native Linux app does not checkpoint?
Other projects do checkpoint on Linux, don't they?
In the remote area where I live, the electricity supply is not always stable, and we have outages anywhere from once a month to a few times a week.
This is really problematic for tasks that take 3+ days without a checkpoint!
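
To put rough numbers on that risk, here is a small Python sketch. It assumes outages arrive at random at a constant average rate (a Poisson process), which is a modelling assumption and not something measured on any host. The function name and the example rates are hypothetical.

```python
import math


def chance_of_uninterrupted_run(outages_per_week, task_days):
    """Probability that no outage occurs while a task of `task_days` runs.

    Assumes outages follow a Poisson process with a constant average rate.
    """
    rate_per_day = outages_per_week / 7.0
    return math.exp(-rate_per_day * task_days)


# Roughly one outage a month (about 0.25 per week) versus three a week,
# for a task that needs 3 days without a checkpoint.
print(round(chance_of_uninterrupted_run(0.25, 3), 2))
print(round(chance_of_uninterrupted_run(3, 3), 2))
```

Under this assumption a 3-day task finishes untouched about 90% of the time at one outage a month, but only about 28% of the time at three outages a week.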

FAH used to set checkpoints every 10 minutes, but I find that excessive.
A checkpoint every hour is a much better interval.
An hour of lost crunching costs about $0.01-0.02, which is negligible. For long tasks, I'm OK with a checkpoint every 3 to 5 hours.
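
The cost estimate can be reproduced with a few lines of Python. The 100 W power draw and the $0.15 per kWh price are assumptions chosen for illustration, not figures from this thread, and the function names are hypothetical. The second function additionally assumes an interruption is equally likely at any moment between two checkpoints.

```python
def cost_of_lost_work(hours_lost, watts=100.0, price_per_kwh=0.15):
    """Electricity cost, in dollars, of recomputing `hours_lost` hours.

    The default power draw and price are illustrative assumptions.
    """
    return hours_lost * (watts / 1000.0) * price_per_kwh


def expected_loss_per_interruption(interval_hours, watts=100.0, price_per_kwh=0.15):
    """Average cost of one interruption with a checkpoint every `interval_hours`.

    Assumes the interruption falls uniformly at random within the interval,
    so on average half an interval of work is lost.
    """
    return cost_of_lost_work(interval_hours / 2.0, watts, price_per_kwh)


print(cost_of_lost_work(1))    # one lost hour
print(cost_of_lost_work(72))   # a 3-day task restarted near its end
```

With these assumed values one lost hour costs about $0.015, inside the $0.01-0.02 range mentioned above, while losing a 3-day task near its end costs roughly $1.08.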

Do you see the week-long WU run times on a modern processor (e.g. 4 GHz) or on an older one (e.g. 2.x GHz)?

©2024 Benoit DA MOTA - LERIA, University of Angers, France