Checkpoint?
Author | Message |
---|---|
Send message Joined: 3 Oct 19 Posts: 11 Credit: 5,443,793 RAC: 0 |
|
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
If I understand correctly, the VBox WUs checkpoint by themselves and the native app (Linux only) doesn't checkpoint. Of course, the tasks are very, very long... |
Send message Joined: 4 Oct 19 Posts: 4 Credit: 36,086 RAC: 0 |
"If I understand correctly, the VBox WUs checkpoint by themselves and the native app (Linux only) doesn't checkpoint." Like 1 or 2 days? Or more? |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I have eight that have now been running for a day on a Ryzen 2700 (Ubuntu), and I have seen some take up to two days. Once they get their work units fixed, that will probably go down, though. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
"If I understand correctly, the VBox WUs checkpoint by themselves and the native app (Linux only) doesn't checkpoint." I have seen jobs run for 1 week... The average is 20 hours. |
Send message Joined: 4 Oct 19 Posts: 4 Credit: 36,086 RAC: 0 |
Would suspending the task and resuming it later restart from the same point? So far I have been shutting down BOINC with the option to stop running tasks, and every time I did, the last WU lost all the progress it had made. I therefore had to cancel the last task, as it was taking forever to complete. I can't keep my host on 24 hours a day at the moment. |
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
Is there any hope that native QCP will get checkpointing added??? Some WUs are running a long time and a lot of electricity may be wasted if one has to start over for any of many reasons. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
That's a concern... The VM makes checkpoints, but has several other drawbacks. The native app (Linux) does not, but works better. One hope is to get a funding grant or an enthusiastic volunteer. |
Send message Joined: 8 Oct 19 Posts: 13 Credit: 2,548,714 RAC: 0 |
"Is there any hope that native QCP will get checkpointing added???" If it runs for a long time then it's probably a bad task. Abort it. |
Send message Joined: 13 Sep 19 Posts: 69 Credit: 399,347 RAC: 0 |
"One hope is to get a funding grant or an enthusiastic volunteer." I know you are using NWChem (which is open source), but have you modified the code for this project? If so, please publish the GitHub link, so that volunteers may be able to help you (as happened in the past with Tn-Grid, Poem and other projects). |
Send message Joined: 15 Dec 19 Posts: 5 Credit: 88,316 RAC: 0 |
"That's a concern... The native app (Linux) does not, but works better. One hope is to get a funding grant or an enthusiastic volunteer." That's actually a cool project. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
@boboviz: We don't modify the NWChem project, but if a modified version exists for BOINC, I am interested. @tomas: Thank you! |
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
You really need to make checkpointing work!!! Is this useful? http://www.nwchem-sw.org/index.php/Release66:RT-TDDFT NRESTARTS -- Number of restart checkpoints. This sets the number of run-time checkpoints where the time-dependent complex density matrix is saved to file, allowing the simulation to be restarted from that point. By default this is set to 0. There is no significant computational cost to restart checkpointing, but of course there is some disk I/O cost (which may become somewhat significant for larger systems). For example, in the following example there will be 100 restart points, which corresponds to 1 backup every 100 time steps. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
This option looks nice. We have to run some tests to validate the behaviour! I will add this to the to-do list. Thank you for your help. |
Send message Joined: 13 Sep 19 Posts: 69 Credit: 399,347 RAC: 0 |
A native Windows app will be welcome!! :-) |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
"A native Windows app will be welcome!! :-)" I agree wholeheartedly! |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
"This option looks nice. We have to run some tests to validate the behaviour!" Has this built-in checkpointing ability been enabled for the Linux native apps by now? Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 23 Jul 19 Posts: 289 Credit: 464,119,561 RAC: 0 |
Yes, and that's usually pretty decent. |
Send message Joined: 11 Apr 20 Posts: 23 Credit: 442,800 RAC: 0 |
"Yes, and that's usually pretty decent." Excellent. Michael. President of Rechenkraft.net - Germany's first and largest distributed computing organization. |
Send message Joined: 16 Nov 19 Posts: 44 Credit: 21,290,949 RAC: 0 |
"That's a concern... The VM makes checkpoints, but has several other drawbacks. The native app (Linux) does not, but works better." Why do you say native Linux does not have a checkpoint? Other projects do have a checkpoint on Linux? Yes, in the remote area where I live, the electricity is not always stable, and we have an outage anywhere from once a month to a few times a week. This is really problematic for tasks taking 3+ days without a checkpoint! FAH used to set checkpoints every 10 minutes, but I find that excessive. A checkpoint every hour is a much better interval. An hour of lost crunching costs about $0.01-0.02, which is negligible; for long projects, I'm OK with one every 3 to 5 hours. Do you see week-long WU processing on a modern processor (e.g. 4 GHz), or on an older one (e.g. 2.x GHz)? |
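
The checkpoint arithmetic mentioned in the thread can be sketched in a few lines of Python. This is only a rough illustration, not anything the project runs: the function names are made up here, the `tmax` and `dt` values are a hypothetical run chosen so that 100 restart points give one backup every 100 time steps, and the 100 W power draw and $0.15 per kWh price are assumptions picked to land in the $0.01-0.02 per hour range quoted above.

```python
def restart_interval_steps(tmax: float, dt: float, nrestarts: int) -> int:
    """Number of time steps between restart checkpoints.

    The run has tmax / dt time steps in total, split evenly
    across nrestarts checkpoints.
    """
    if nrestarts <= 0:
        raise ValueError("nrestarts must be positive (0 means no checkpoints)")
    total_steps = round(tmax / dt)
    return total_steps // nrestarts


def cost_of_lost_work(hours_lost: float, watts: float, price_per_kwh: float) -> float:
    """Electricity cost of redoing the work lost since the last checkpoint."""
    return hours_lost * (watts / 1000.0) * price_per_kwh


if __name__ == "__main__":
    # Hypothetical run: 10,000 time steps with 100 restart points.
    print(restart_interval_steps(tmax=1000.0, dt=0.1, nrestarts=100))

    # Assumed host: 100 W draw at $0.15 per kWh.
    # Compare one hour lost (hourly checkpoints) with three days lost (no checkpoint).
    print(round(cost_of_lost_work(1.0, 100.0, 0.15), 3))
    print(round(cost_of_lost_work(72.0, 100.0, 0.15), 2))
```

With these assumed figures, losing a 3-day task costs roughly 72 times as much as losing one hour, which is why the checkpoint interval matters more for long work units than for short ones.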
©2024 Benoit DA MOTA - LERIA, University of Angers, France