High failure rate

Message boards : Number crunching : High failure rate
Aurum
Joined: 14 Dec 19
Posts: 68
Credit: 45,744,261
RAC: 0
Message 1659 - Posted: 21 Feb 2022, 11:18:39 UTC

WUs are failing at the rate of over 11%. That seems high.
Are others getting similar failure rates?
Is anyone looking into reducing the failure rate?
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1661 - Posted: 21 Feb 2022, 16:07:34 UTC - in response to Message 1659.  

I was worried about that too in my long-term statistics.
But I just reattached a machine that I had not used for a while, and it seems OK.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10585

So either it was "fixed", or else it was just the data that was hard to crunch.
Aurum
Joined: 14 Dec 19
Posts: 68
Credit: 45,744,261
RAC: 0
Message 1679 - Posted: 24 Feb 2022, 15:34:56 UTC

My failure rate has shot up to 56% today!!!
Time to move on.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1680 - Posted: 24 Feb 2022, 16:54:38 UTC - in response to Message 1679.  

You don't define "failure" and your computers are hidden.
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 1719 - Posted: 2 Apr 2022, 9:24:23 UTC
Last modified: 2 Apr 2022, 9:25:33 UTC

I re-enabled work from here this morning on one machine (Intel, Windows 8.1 x64), but all the work that arrived crashed after about 15 seconds with...

>>> 1 (0x00000001) Unknown error code
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
computezrmle

Joined: 26 Jan 22
Posts: 4
Credit: 510,400
RAC: 0
Message 1720 - Posted: 2 Apr 2022, 10:08:28 UTC - in response to Message 1719.  

https://quchempedia.univ-angers.fr/athome/result.php?resultid=10404648
VBoxManage.exe: error: Cannot register the hard disk 'C:\ProgramData\BOINC\slots\8\vm_image.vdi' {2c29d1e5-b43d-46fd-b9c5-69a421363472} because a hard disk 'C:\ProgramData\BOINC\slots\9\vm_image.vdi' with UUID {2c29d1e5-b43d-46fd-b9c5-69a421363472} already exists

"... Cannot register the hard disk ... because a hard disk ... already exists"
There are remains from a previous crash.
It's at least an old disk entry in the VirtualBox Media Register.
You need to clean up your BOINC slots and your VirtualBox Media Register (best to use the VirtualBox Media Manager from the menu).
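
The stale entry can also be located from a command line. This is only a minimal sketch: it assumes a POSIX shell with awk is available (on Windows that would mean something like Git Bash or WSL, which is an assumption about the setup) and that `VBoxManage list hdds` prints one `UUID:` and one `Location:` line per registered disk. The function name `boinc_slot_disks` is made up for illustration.

```shell
#!/bin/sh
# Print the UUID of every registered disk whose Location points into a
# BOINC "slots" directory. Reads the output of `VBoxManage list hdds` on stdin.
boinc_slot_disks() {
    awk '
        /^UUID:/ { uuid = $2 }                  # remember the UUID of the current entry
        /^Location:/ && /slots/ { print uuid }  # this entry lives in a slots directory
    '
}

# Typical use (not run here; stop BOINC first):
#   VBoxManage list hdds | boinc_slot_disks
#   VBoxManage closemedium disk <uuid>    # unregister one stale entry
```

`closemedium` only removes the entry from the media register. Adding `--delete` would also delete the image file, which is probably not wanted before the slot directory has been checked.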
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 1721 - Posted: 3 Apr 2022, 13:05:40 UTC

Downloaded another batch today, same result: all but one failed quickly with the same error I mentioned above. One unit was different; it ran for 21:33 and then errored out with -108 (0xFFFFFF94) ERR_FOPEN.
I tried to attach a different machine to see if that helped, but it would not allow me to join that one.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
Erich56

Joined: 23 Feb 22
Posts: 23
Credit: 4,423,400
RAC: 0
Message 1722 - Posted: 6 Apr 2022, 5:16:09 UTC - in response to Message 1721.  

adrianxw wrote:
Downloaded another batch today, same result: all but one failed quickly with the same error I mentioned above. One unit was different; it ran for 21:33 and then errored out with -108 (0xFFFFFF94) ERR_FOPEN.
I tried to attach a different machine to see if that helped, but it would not allow me to join that one.

Did you get your problem solved in the meantime?

I recently faced the same problem on one of my machines, at a time when the server had been having problems for about a day. I suspected that, because of this server problem, one of the downloaded tasks arrived corrupt and damaged the Oracle VM.
I tried to remove remnants of the crashed task in the VM media manager, but nothing was shown there. Still, I always received the same error message that you cited. So I removed and reinstalled the VM, but the error still showed up.
Then I wanted to remove the VM again, but it was somehow damaged and could no longer be removed.
Finally, all I could do was a complete clean reinstallation of Windows 10 :-(
Now everything works well. It was interesting to see what severe damage a corrupt file can cause.
Mr P Hucker

Joined: 25 Apr 22
Posts: 6
Credit: 101,800
RAC: 0
Message 1732 - Posted: 25 Apr 2022, 19:49:28 UTC

They're all working under Windows here. Three of my wingmen failed it in just over 1 second on Linux. Is this a case of missing libraries? All three of these wingman computers have failed thousands of tasks and managed to complete zero. When it hits 8 failures on a task the server will give up.

https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985

What I don't understand is one of the failures I checked says this:

<core_client_version>7.16.16</core_client_version>
<![CDATA[
<stderr_txt>
21:22:29 (2105923): wrapper (7.5.26014): starting
21:22:29 (2105923): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive
OPT.out
Normal termination.
21:22:31 (2105923): worker.sh exited; CPU time 1.217591
21:22:31 (2105923): called boinc_finish(0)

</stderr_txt>
]]>

Which doesn't look like an error to me.
xii5ku

Joined: 21 Jun 20
Posts: 24
Credit: 68,559,000
RAC: 0
Message 1736 - Posted: 28 Apr 2022, 4:23:29 UTC - in response to Message 1732.  
Last modified: 28 Apr 2022, 5:06:52 UTC

Peter Hucker wrote:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985

What I don't understand is one of the failures I checked says this:
[...]
Which doesn't look like an error to me.
There was a longstanding bug in which failures of the application (termination with an error exit code) were not passed through the shell scripts which are wrapped around the application. Looks like this bug still exists.
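
How an exit code gets lost in a wrapper can be shown in a few lines. This is a generic sketch, not the project's actual worker.sh: the two function names are hypothetical stand-ins, and the echoed line merely imitates the "Create output archive" step seen in the log.

```shell
#!/bin/sh
# Hypothetical stand-ins for a wrapper script around the science application.

# Lossy: a shell function (or script) returns the status of its LAST command.
# Here that is echo, so a failure of "$@" is silently replaced by 0.
run_lossy() {
    "$@"
    echo "Create output archive"
}

# Propagating: save the application's exit status before doing anything else,
# then hand it back to the caller.
run_propagating() {
    "$@"
    status=$?
    echo "Create output archive"
    return "$status"
}
```

With `false` standing in for a crashing application, `run_lossy false` reports success while `run_propagating false` reports failure, which would match a task that ends with boinc_finish(0) although nothing was computed.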

The hosts which failed in your WU link have a 0 % success rate. They only return work which terminated after just a few seconds.

One possible reason *could be* that boinc-client's local filesystem permissions are set up such that the application cannot create the OpenMPI files in /tmp/ompi.$HOSTNAME.$UID. It is possible (and in fact good security policy) to prevent boinc-client and its subprocesses from creating any files outside of the BOINC data directory, but this policy breaks QuChem's current application.

Successful computer in WU 3453985:
____ client 7.16.6 on Ubuntu 20.04.4
Failing computers:
____ client 7.18.1 on Ubuntu 18.04.6
____ client 7.16.16 on Debian 11
____ client 7.16.16 on Debian 11

*If* it is really the potential filesystem permission problem, then it is not exactly a problem with the client version itself, but with the startup file (systemd service unit file) which launches the client.
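
To see which sandboxing settings a given unit file carries, the relevant lines can be pulled out with grep. A minimal sketch: the function name `sandbox_directives` is made up, and the set of directives is limited to the three that are mentioned in this thread.

```shell
#!/bin/sh
# List the sandboxing directives in a systemd unit file, including
# commented-out ones, so that "#PrivateTmp=true" is visible as well.
# $1: path to the unit file.
sandbox_directives() {
    grep -E '^[[:space:]]*#?[[:space:]]*(ProtectSystem|PrivateTmp|ReadWritePaths)=' "$1"
}

# Typical use (not run here; the path differs between distributions):
#   sandbox_directives /lib/systemd/system/boinc-client.service
```

Note that this reads only one file. If drop-in overrides exist, `systemctl cat boinc-client` shows the unit together with them.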

I currently have one computer active here myself which runs well. It has got client version 7.16.6 on openSUSE 15.2. My client is permitted to create files outside of its data directory.

- - - - - - - - - - - - - - - -

References for the access permissions issue:

message 1593
On 17 Dec 2021 AF>WildWildWest Sebastien wrote:
To fix this issue, I edited the file /lib/systemd/system/boinc-client.service and replaced ProtectSystem=strict by ProtectSystem=full

systemctl stop boinc-client
sed -i 's/ProtectSystem=strict/ProtectSystem=full/g' /lib/systemd/system/boinc-client.service
systemctl daemon-reload
systemctl start boinc-client

message 1687
On 4 Mar 22 cpuprocess2 wrote:
I have 2 hosts on Debian 11, where one (#10506) works fine and the other (#10563) returns invalid workunits after ~3 seconds. Looks like the difference came down to the BOINC client's systemd service file. 10506 has "PrivateTmp=true" whereas 10563 has "#PrivateTmp=true #Block X11 idle detection". Everything else in the file is the same, including "ProtectSystem=strict". After changing 10563 to use PrivateTmp, it has started returning valid results.

Just checked the boinc-client packages on Debian today. Only 7.16.17+dfsg-2 (no longer available for amd64) included a service file that uses PrivateTmp. The other recent versions (7.16.16+dfsg-1, 7.18.1+dfsg-4) have it commented out and the old version (7.14.2+dfsg-3) doesn't even have that line.

EDIT: Looks like PrivateTmp was commented out to fix idle detection (issue, pull request). Apparently other projects have similar issues. It seems the long-term fix is for the QuChem program to write to the slot folder instead of /tmp.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1759 - Posted: 8 Jul 2022, 1:31:11 UTC

If you want to see high failure rates, you don't have to look far.

Just check your "valids" and look at the people who produce invalids in a few seconds.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=13821
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10191
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10140

And these are just the first three I checked. The list goes on and on.
You wonder how they manage to turn their computer on.
Diplomat
Joined: 7 Feb 20
Posts: 10
Credit: 6,625,400
RAC: 0
Message 1762 - Posted: 21 Jul 2022, 19:26:49 UTC

Rejoined the project ~24 hours ago; all tasks finished in a few seconds and validation is inconclusive.

<core_client_version>7.18.1</core_client_version>
<![CDATA[
<stderr_txt>
00:08:56 (360824): wrapper (7.5.26014): starting
00:08:56 (360824): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive
OPT.out
The command rsautl was deprecated in version 3.0. Use 'pkeyutl' instead.
Normal termination.
00:08:58 (360824): worker.sh exited; CPU time 0.905326
00:08:58 (360824): called boinc_finish(0)

</stderr_txt>
]]>
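
The pattern in this log (a normal exit after about one second of CPU time) can be flagged mechanically. A rough sketch, assuming the stderr text contains a `worker.sh exited; CPU time` line as above; the function name `too_short` and the default threshold of 10 seconds are arbitrary choices for illustration, not project values.

```shell
#!/bin/sh
# Flag a task whose stderr reports suspiciously little CPU time.
# Reads the stderr text on stdin; $1 is the threshold in seconds (default 10).
# Exit status 0: suspiciously short. Exit status 1: looks normal, or no
# CPU time line was found.
too_short() {
    awk -v limit="${1:-10}" '
        /worker\.sh exited; CPU time/ { found = 1; cpu = $NF + 0 }  # last field is the CPU time
        END { exit !(found && cpu < limit) }
    '
}

# Typical use (not run here; stderr.txt is a saved copy of the task output):
#   too_short 10 < stderr.txt && echo "task ended after almost no CPU time"
```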
Diplomat
Joined: 7 Feb 20
Posts: 10
Credit: 6,625,400
RAC: 0
Message 1763 - Posted: 21 Jul 2022, 19:29:32 UTC - in response to Message 1762.  

Apparently the issue is in the BOINC client: https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=166#1644
I need to try downgrading.
adrianxw
Joined: 3 Oct 19
Posts: 33
Credit: 197,169
RAC: 0
Message 1769 - Posted: 2 Aug 2022, 13:02:46 UTC
Last modified: 2 Aug 2022, 13:06:45 UTC

I re-enabled work fetch from the project to see if the earlier issues were just a memory. It downloaded 18 work units. Four jobs failed after a short period (i.e. less than two minutes) with an exit status of a helpful 0x00000000. The remainder started running, but within an hour all had entered the "Postponed: VM job unmanageable, restarting later." state. "Later" appears to be 24 hours. With the long deadline, this appears to be tolerable; however, it simply makes a mess of the BOINC Manager screen. The exit status for these completed units is also 0x00000000, so clearly, failures are not discriminated against... I enabled work fetch again, and since doing so, four more units have arrived. I'll leave it running and see what happens.

Off topic:

This keeps appearing:

Your connection is not private
Attackers might be trying to steal your information from quchempedia.univ-angers.fr (for example, passwords, messages or credit cards). Learn more
NET::ERR_CERT_DATE_INVALID
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 1770 - Posted: 2 Aug 2022, 16:31:21 UTC - in response to Message 1769.  

adrianxw wrote:
I re-enabled work fetch from the project to see if the earlier issues were just a memory. It downloaded 18 work units. Four jobs failed after a short period (i.e. less than two minutes) with an exit status of a helpful 0x00000000. The remainder started running, but within an hour all had entered the "Postponed: VM job unmanageable, restarting later." state. "Later" appears to be 24 hours. With the long deadline, this appears to be tolerable; however, it simply makes a mess of the BOINC Manager screen. The exit status for these completed units is also 0x00000000, so clearly, failures are not discriminated against... I enabled work fetch again, and since doing so, four more units have arrived. I'll leave it running and see what happens.

Off topic:

This keeps appearing:

Your connection is not private
Attackers might be trying to steal your information from quchempedia.univ-angers.fr (for example, passwords, messages or credit cards). Learn more
NET::ERR_CERT_DATE_INVALID

Are you crunching on all 8 processors? If so, freeing one up worked for me. I see far fewer of the "Postponed..." messages now. I also downgraded the VirtualBox version. I don't understand why, but it seemed to help.

I too have seen the certificate invalid message, but BOINC manages to get work after a minute or so.
Jim1348

Joined: 3 Oct 19
Posts: 153
Credit: 32,412,973
RAC: 0
Message 1771 - Posted: 2 Aug 2022, 16:36:37 UTC

If you are on Windows, it is best to run VirtualBox 5.2.44.
https://www.virtualbox.org/wiki/Download_Old_Builds_5_2
It has to do with the COM interface. Not all projects are up to date on 6.x yet.
swiftmallard
Joined: 13 Oct 19
Posts: 87
Credit: 6,026,455
RAC: 0
Message 1772 - Posted: 2 Aug 2022, 16:50:36 UTC - in response to Message 1771.  
Last modified: 2 Aug 2022, 16:54:04 UTC

I am using VB ver 5.2.38, so that makes sense, and with the project no longer having a Windows developer, that's not going to change anytime soon. The part about the COM interface is beyond me. But thank you for confirming it wasn't just my imagination.
bikeaddict

Joined: 29 May 22
Posts: 3
Credit: 6,501,000
RAC: 0
Message 1773 - Posted: 7 Aug 2022, 12:15:22 UTC

My three systems (E5-2690 v4 on the top computers page) suddenly started finishing all tasks in 3-5 seconds today. Nothing unusual in the task stderr output. I've suspended the project and switched to TN-Grid until the cause is determined.
bikeaddict

Joined: 29 May 22
Posts: 3
Credit: 6,501,000
RAC: 0
Message 1774 - Posted: 7 Aug 2022, 20:05:55 UTC - in response to Message 1773.  

Fedora just rolled out BOINC 7.20 and I have auto-updates configured. After editing boinc-client.service with the ProtectSystem and PrivateTmp changes, my machines are processing tasks again.
Fabien

Joined: 27 Jul 22
Posts: 4
Credit: 157,800
RAC: 0
Message 1782 - Posted: 29 Aug 2022, 9:40:27 UTC - in response to Message 1774.  
Last modified: 29 Aug 2022, 9:41:47 UTC

I prefer to keep ProtectSystem set to strict inside /usr/lib/systemd/system/boinc-client.service. So I've just added -/tmp to ReadWritePaths= to allow read/write access to /tmp and it works.
Thanks to bikeaddict and xii5ku for the help :)
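
The same ReadWritePaths change can be kept in a drop-in file instead of the packaged unit, so that a package update is less likely to undo it. A sketch under assumptions: the function name `write_tmp_dropin` is made up, and the drop-in directory shown in the usage comment is the usual systemd location for local overrides rather than something taken from this thread.

```shell
#!/bin/sh
# Write a systemd drop-in that adds /tmp to the writable paths of boinc-client
# while leaving ProtectSystem=strict in the packaged unit untouched.
# $1: drop-in directory (created if missing).
write_tmp_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/override.conf" <<'EOF'
[Service]
ReadWritePaths=-/tmp
EOF
}

# Typical use (not run here; needs root):
#   write_tmp_dropin /etc/systemd/system/boinc-client.service.d
#   systemctl daemon-reload
#   systemctl restart boinc-client
```

ReadWritePaths= is a list setting, so the line in the drop-in is added to the paths already listed in the packaged unit rather than replacing them.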


©2024 Benoit DA MOTA - LERIA, University of Angers, France