Long Wus and checkpoint

[VENETO] boboviz

Joined: 13 Sep 19
Posts: 69
Credit: 399,347
RAC: 0
Message 573 - Posted: 16 Feb 2020, 14:17:26 UTC

I finally have some "long" WUs on my Linux machine.
Is there some way I can save/checkpoint them? After 22 hours I'm at 33%, and I need to turn off the PC during the night...
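For a rough sense of scale, the figures above (22 hours elapsed, 33% done) can be extrapolated. This is only a sketch that assumes progress is linear, which BOINC's reported percentage often is not; the function name is illustrative.

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate total run time, ASSUMING progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Figures quoted in the post: 22 hours elapsed at 33% progress.
total_hours = estimate_total_hours(22, 0.33)
remaining_hours = total_hours - 22

if __name__ == "__main__":
    print(f"estimated total: {total_hours:.1f} h, remaining: {remaining_hours:.1f} h")
```

Under that assumption the task would need roughly 67 hours in total, which is why being able to stop and resume matters for long WUs.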
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 576 - Posted: 16 Feb 2020, 16:16:00 UTC - in response to Message 573.  

There are checkpoints now, so you can safely turn your computer off.
PHILIPPE

Joined: 4 Jan 20
Posts: 60
Credit: 516,736
RAC: 0
Message 578 - Posted: 16 Feb 2020, 16:34:24 UTC - in response to Message 573.  

The safest way to turn off your computer without harming your VM work unit is:

1) In BOINC Manager, check that the option "Leave non-GPU tasks in memory while suspended" is not selected.
2) Suspend your VM.
3) In Windows Task Manager, check that disk I/O activity is at 0%. (That means the VM has had time to suspend correctly.)
4) Close BOINC Manager.
5) Shut down your computer.
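The suspend-then-stop part of this sequence can also be sketched with `boinccmd`, the command-line tool that ships with the BOINC client. This is a minimal sketch, not the project's recommended procedure: the project URL and task name are placeholders, the helper names are illustrative, and step 3 (waiting for disk activity to settle) is still left to the user. By default it only prints the commands.

```python
import subprocess


def shutdown_commands(project_url: str, task_name: str) -> list[list[str]]:
    """Build the boinccmd calls: suspend one task, then ask the client to quit."""
    return [
        ["boinccmd", "--task", project_url, task_name, "suspend"],
        ["boinccmd", "--quit"],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values: substitute the real project URL and task name.
    run_all(shutdown_commands("https://example.org/project/", "example_task_0"))
```

Pausing between the two commands until disk I/O has dropped to zero matches the intent of step 3.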
[VENETO] boboviz

Joined: 13 Sep 19
Posts: 69
Credit: 399,347
RAC: 0
Message 579 - Posted: 16 Feb 2020, 19:35:44 UTC

I reported my first long WU after 25 hours: 1722131
And it got a validation error :-(
<stderr_txt>
17:31:19 (18422): wrapper (7.5.26014): starting
17:31:19 (18422): wrapper: running worker.sh ()
Jobs starts with 1 cores
OPT
FREQ
TD_singlet
Create output archive
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
FREQ.out
OPT.out
TD_singlet.out
Normal termination.
19:06:41 (18422): worker.sh exited; CPU time 79649.304956
19:06:41 (18422): called boinc_finish(0)
worker.sh: riga 6: kill: (18424) - No process corresponding
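The wrapper lines in this log carry only times of day, not dates. As a hedged worked example, assuming the run crossed midnight exactly once (which would agree with the "25 hours" mentioned in the post), the wall-clock and CPU times compare as follows; the function name is illustrative.

```python
from datetime import datetime, timedelta


def wall_clock_elapsed(start: str, end: str, days_rolled_over: int = 0) -> timedelta:
    """Difference between two HH:MM:SS stamps plus an ASSUMED number of midnights."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta + timedelta(days=days_rolled_over)


# Timestamps and CPU time copied from the log; one midnight is an assumption.
elapsed = wall_clock_elapsed("17:31:19", "19:06:41", days_rolled_over=1)
cpu_hours = 79649.304956 / 3600
wall_hours = elapsed.total_seconds() / 3600

if __name__ == "__main__":
    print(f"wall clock: {wall_hours:.2f} h, CPU: {cpu_hours:.2f} h")
```

Under that assumption the task ran for about 25.6 hours of wall-clock time and used about 22.1 hours of CPU time.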
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 580 - Posted: 16 Feb 2020, 20:14:12 UTC - in response to Message 579.  

Normal termination.
19:06:41 (18422): worker.sh exited; CPU time 79649.304956
19:06:41 (18422): called boinc_finish(0)
worker.sh: riga 6: kill: (18424) - No process corresponding

That is comparable to what matsu_pl observed on my Linux invalids.
https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=60&postid=574#574
Steve Dodd

Joined: 21 Oct 19
Posts: 4
Credit: 1,566,995
RAC: 0
Message 584 - Posted: 17 Feb 2020, 7:58:47 UTC

I have a long WU that will not complete by the deadline. Maybe deadlines should be extended (for the time being anyway) for the long WUs.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 587 - Posted: 17 Feb 2020, 11:57:06 UTC - in response to Message 579.  
Last modified: 17 Feb 2020, 11:57:21 UTC

I reported my first long WU after 25 hours: 1722131
And it got a validation error :-(

17:31:19 (18422): wrapper (7.5.26014): starting
17:31:19 (18422): wrapper: running worker.sh ()
Jobs starts with 1 cores
OPT
FREQ
TD_singlet
Create output archive
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
FREQ.out
OPT.out
TD_singlet.out
Normal termination.
19:06:41 (18422): worker.sh exited; CPU time 79649.304956
19:06:41 (18422): called boinc_finish(0)
worker.sh: riga 6: kill: (18424) - No process corresponding


We've identified the problem: the FREQ step does not have enough memory. The beta computations will really help us find the right parameters for our simulations. Thank you for your help!
adrianxw

Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 698 - Posted: 21 Mar 2020, 14:11:18 UTC

I have not yet received a "long" work unit on either of my machines.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.


©2024 Benoit DA MOTA - LERIA, University of Angers, France