WU failures

Bryan

Joined: 3 Oct 19
Posts: 14
Credit: 32,908,253
RAC: 0
Message 5 - Posted: 3 Oct 2019, 15:48:39 UTC

I'm running Linux Mint 19 and I'm seeing immediate failures on the Intel_mt WUs. The t1 and t2 WUs appear to be running, although their completion time estimates vary between 5 minutes and 20 hours.

The Intel_mt failures report:
execv() failed: : Permission denied

I'm trying two different machines, an Intel E5-2684 V4 and an AMD 2990WX, and both fail on the Intel_mt WUs.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 6 - Posted: 3 Oct 2019, 17:28:45 UTC - in response to Message 5.  

It seems that the executable was not found, or that there is a permission issue.

Try detaching and reattaching the project.
Bryan

Joined: 3 Oct 19
Posts: 14
Credit: 32,908,253
RAC: 0
Message 7 - Posted: 3 Oct 2019, 19:11:48 UTC
Last modified: 3 Oct 2019, 20:00:08 UTC

I attached another instance and got the same almost instantaneous failure on 12 intel_mt WUs. I then opened up the permissions on the project folder to rw for everyone, and it failed another 13 WUs. The only executable I see in the folder is the wrapper. There are quite a few tarballs.
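One thing worth checking here: `execv()` needs the execute bit on the target file, and "rw" does not grant it. A filesystem mounted `noexec` can produce the same "Permission denied" error. The following is only a minimal sketch, not something taken from the project. It assumes the application binaries sit directly in the project folder, and the function name `files_without_exec_bit` is made up for illustration. It lists regular files that have no execute bit set for anyone.

```python
import stat
from pathlib import Path


def files_without_exec_bit(directory):
    """Return the names of regular files in `directory` that have no
    execute bit set for owner, group, or others (hypothetical helper)."""
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    missing = []
    for entry in sorted(Path(directory).iterdir()):
        # Skip subdirectories and other non-regular entries.
        if not entry.is_file():
            continue
        if not entry.stat().st_mode & any_exec:
            missing.append(entry.name)
    return missing
```

Calling `files_without_exec_bit("/path/to/project/folder")` (a placeholder path) would list tarballs and data files as well, since those normally carry no execute bit, so only an entry that is supposed to be a program is of interest.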

If it will help, here are my hosts.

I did have one t1 WU complete and validate. I have a few other t1 and t2 WUs that have been running for several hours.
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 14 - Posted: 4 Oct 2019, 7:54:29 UTC
Last modified: 4 Oct 2019, 7:55:21 UTC

>>> dsgdb9nsd_nwchem,bath01,000007288,nwchem,1569028807

That one failed after less than a minute. It has failed on every machine that it has been sent to. The end status is a helpful:

Exit status 0 (0x00000000)

>>>
2019-10-03 19:07:29 (25304): Successfully started VM. (PID = '26680')
2019-10-03 19:07:29 (25304): Reporting VM Process ID to BOINC.
2019-10-03 19:07:34 (25304): Guest Log: BIOS: KBD: unsupported int 16h function 03

2019-10-03 19:07:34 (25304): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000

2019-10-03 19:07:34 (25304): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds

2019-10-03 19:07:34 (25304): Guest Log: vboxguest: misc device minor 59, IRQ 20, I/O port d020, MMIO at 00000000f0000000 (size 0x400000)

2019-10-03 19:07:34 (25304): VM state change detected. (old = 'poweroff', new = 'running')
2019-10-03 19:07:39 (25304): Guest Log: vboxsf: g_fHostFeatures=0x1 g_fSfFeatures=0x0 g_uSfLastFunction=20

2019-10-03 19:07:44 (25304): Preference change detected
2019-10-03 19:07:44 (25304): Setting CPU throttle for VM. (100%)
2019-10-03 19:07:44 (25304): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2019-10-03 19:07:46 (25304): VM is no longer is a running state. It is in 'poweroff'.
2019-10-03 19:07:46 (25304): VM state change detected. (old = 'running', new = 'poweroff')
2019-10-03 19:07:46 (25304): Powering off VM.
<<<
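For anyone comparing several of these logs, the VM state transitions can be pulled out with a short script. This is a rough sketch that assumes the line layout seen in the excerpt above (timestamp, process ID in parentheses, then the message). The names `vm_state_changes` and `seconds_running` are invented for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2019-10-03 19:07:34 (25304): VM state change detected. (old = 'poweroff', new = 'running')
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)


def vm_state_changes(log_text):
    """Return a list of (timestamp, old_state, new_state) tuples."""
    changes = []
    for line in log_text.splitlines():
        match = STATE_CHANGE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            changes.append((when, match.group(2), match.group(3)))
    return changes


def seconds_running(changes):
    """Sum the time spent between entering and leaving the 'running' state."""
    started = None
    total = 0.0
    for when, old, new in changes:
        if new == "running":
            started = when
        elif old == "running" and started is not None:
            total += (when - started).total_seconds()
            started = None
    return total


# Three lines copied from the log above.
SAMPLE = """\
2019-10-03 19:07:34 (25304): VM state change detected. (old = 'poweroff', new = 'running')
2019-10-03 19:07:46 (25304): VM is no longer is a running state. It is in 'poweroff'.
2019-10-03 19:07:46 (25304): VM state change detected. (old = 'running', new = 'poweroff')
"""

print(seconds_running(vm_state_changes(SAMPLE)))  # prints 12.0
```

On this excerpt the VM was in the 'running' state for 12 seconds, from 19:07:34 to 19:07:46, before it powered off.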

Perhaps interestingly, not all of the machines it has been sent to exited in the same way. The one showing "Error while computing" ran the "Intel_mt" version, while those showing "Validate error" ran "vbox_t1".

Other work units have run to completion, returned and validated. I will leave the project enabled on this machine for the time being at least, as it does not seem to be wasting too much crunching time.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 16 - Posted: 4 Oct 2019, 9:39:57 UTC - in response to Message 14.  

Thank you for your help and patience.

It has been 24 hours since we opened to new volunteers, and new problems have appeared. Sad, but not surprising...
We are investigating. Multiple code versions are running simultaneously, and I am a beginner in BOINC project management...

Stability will come... one day ;-)
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 17 - Posted: 4 Oct 2019, 9:53:22 UTC - in response to Message 16.  

Looking at this specific result and WU, it seems that the software runs but the chemistry question crashes it. The chemist confirms that yes, this will happen 10-20% of the time, but such tasks crash quickly!
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 35 - Posted: 4 Oct 2019, 20:00:01 UTC
Last modified: 4 Oct 2019, 20:06:55 UTC

I have not had a reply from Will yet, but he may simply be away. I'll hold that matter open.

The failing tasks fail very quickly. I hope there is something in the error log that helps. Feel free to ask if there is something I can do to help; there are lots of us.
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 61 - Posted: 7 Oct 2019, 13:59:10 UTC - in response to Message 35.  

The failing tasks have, until today, failed after less than a minute, but today, I have had two work units that failed after several hours of crunching.
damotbe
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Help desk expert

Joined: 23 Jul 19
Posts: 289
Credit: 464,119,561
RAC: 0
Message 71 - Posted: 8 Oct 2019, 7:38:24 UTC - in response to Message 61.  

adrianxw wrote: "The failing tasks have, until today, failed after less than a minute, but today, I have had two work units that failed after several hours of crunching."

That is more annoying... Crashes after the start are always possible, but they are quite uncommon.


©2024 Benoit DA MOTA - LERIA, University of Angers, France