Message boards :
Number crunching :
High failure rate
Author | Message |
---|---|
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
WUs are failing at the rate of over 11%. That seems high. Are others getting similar failure rates? Is anyone looking into reducing the failure rate? |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I was worried about that too in my long-term statistics. But I just reattached a machine that I had not used for a while, and it seems OK. https://quchempedia.univ-angers.fr/athome/results.php?hostid=10585 So either it was "fixed", or else it was just the data that was hard to crunch. |
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
My failure rate has shot up to 56% today!!! Time to move on. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
You don't define "failure" and your computers are hidden. |
Send message Joined: 3 Oct 19 Posts: 33 Credit: 197,169 RAC: 0 |
I re-enabled work from here this morning on one machine (Intel, Windows 8.1 x64), but every task that arrived crashed after about 15 seconds with...
>>> 1 (0x00000001) Unknown error code
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 26 Jan 22 Posts: 4 Credit: 510,400 RAC: 0 |
https://quchempedia.univ-angers.fr/athome/result.php?resultid=10404648
VBoxManage.exe: error: Cannot register the hard disk 'C:\ProgramData\BOINC\slots\8\vm_image.vdi' {2c29d1e5-b43d-46fd-b9c5-69a421363472} because a hard disk 'C:\ProgramData\BOINC\slots\9\vm_image.vdi' with UUID {2c29d1e5-b43d-46fd-b9c5-69a421363472} already exists
"... Cannot register the hard disk ... because a hard disk ... already exists"
These are remnants of a previous crash: at the very least, a stale disk entry in the VirtualBox media registry. You need to clean up your BOINC slots and your VirtualBox media registry (best done with the Virtual Media Manager, reachable from the VirtualBox menu). |
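The stale registry entries can also be spotted from the command line. The sketch below greps the per-user VirtualBox configuration file for hard-disk entries pointing into BOINC slot directories; the default registry path is an assumption (on Windows it lives under %USERPROFILE%\.VirtualBox instead), and the function name is mine, not part of VirtualBox.

```shell
# Sketch: list media-registry <HardDisk .../> entries that point into BOINC
# slot directories (stale leftovers from crashed VM tasks) and print their
# UUIDs. The default registry path below is an assumption; adjust as needed.
list_stale_boinc_disks() {
  vbox_xml="${1:-$HOME/.config/VirtualBox/VirtualBox.xml}"
  grep -o '<HardDisk [^>]*/>' "$vbox_xml" \
    | grep 'BOINC.slots' \
    | sed -n 's/.*uuid="{\([^}]*\)}".*/\1/p'
}
```

Each printed UUID can then be deregistered with `VBoxManage closemedium disk <uuid>`, which is what the GUI Virtual Media Manager does under the hood.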
Send message Joined: 3 Oct 19 Posts: 33 Credit: 197,169 RAC: 0 |
Downloaded another batch today, same result, all but one failed quickly with the same error I mentioned above. One unit was different, it ran for 21:33 and then errored out with -108 (0xFFFFFF94) ERR_FOPEN. I tried to attach a different machine to see if that helped, but it would not allow me to join that one. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
"Downloaded another batch today, same result, all but one failed quickly with the same error I mentioned above. One unit was different, it ran for 21:33 and then errored out with -108 (0xFFFFFF94) ERR_FOPEN."
Did you get your problem solved in the meantime? I recently faced the same problem on one of my machines, around the time of the roughly one-day server outage. I suspected that, because of the server problem, one of the downloaded tasks arrived here corrupt and damaged the Oracle VM. I tried to remove remnants of the crashed task in the VM Media Manager, but nothing was shown there. Still, I always received the same error message you cited. So I removed and re-installed the VM, but the error still showed up. Then I wanted to remove the VM again, but it was somehow damaged and could no longer be removed. In the end, all I could do was a complete clean re-installation of Windows 10 :-( Now everything works well. It was interesting to see what severe damage a single corrupt file can cause. |
Send message Joined: 25 Apr 22 Posts: 6 Credit: 101,800 RAC: 0 |
They're all working under Windows here. Three of my wingmen failed it in just over 1 second on Linux. Is this a case of missing libraries? All three of these wingman computers have failed thousands of tasks and managed to complete zero. When a task hits 8 failures, the server gives up on it. https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985
What I don't understand is that one of the failures I checked says this:
<core_client_version>7.16.16</core_client_version>
<![CDATA[
<stderr_txt>
21:22:29 (2105923): wrapper (7.5.26014): starting
21:22:29 (2105923): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive OPT.out
Normal termination.
21:22:31 (2105923): worker.sh exited; CPU time 1.217591
21:22:31 (2105923): called boinc_finish(0)
</stderr_txt>
]]>
Which doesn't look like an error to me. |
Send message Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
Peter Hucker wrote: https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985
There was a long-standing bug in which failures of the application (termination with an error exit code) were not passed through the shell scripts that are wrapped around the application. It looks like this bug still exists. The hosts which failed in your WU link have a 0% success rate; they only return work that terminated after just a few seconds. One possible reason *could be* that boinc-client's local filesystem permissions are set up such that the application cannot create its OpenMPI files in /tmp/ompi.$HOSTNAME.$UID. It is possible (and in fact good security policy) to disallow boinc-client and its subprocesses from creating any files outside the BOINC data directory, but this policy breaks QuChem's current application.
Successful computer in WU 3453985: ____ client 7.16.6 on Ubuntu 20.04.4
Failing computers: ____ client 7.18.1 on Ubuntu 18.04.6; ____ client 7.16.16 on Debian 11; ____ client 7.16.16 on Debian 11
*If* it really is the potential filesystem permission problem, then it is not a problem with the client version itself, but with the startup file (the systemd service unit file) which launches the client. I currently have one computer active here myself which runs well: client 7.16.6 on openSUSE 15.2, with the client permitted to create files outside of its data directory.
- - - - - - - - - - - - - - - -
References for the access permissions issue:
message 1593, 17 Dec 2021, AF>WildWildWest Sebastien wrote: To fix this issue, I edited the file /lib/systemd/system/boinc-client.service and replaced ProtectSystem=strict by ProtectSystem=full
message 1687, 4 Mar 22, cpuprocess2 wrote: I have 2 hosts on Debian 11, where one (#10506) works fine and the other (#10563) returns invalid workunits after ~3 seconds. Looks like the difference came down to the BOINC client's systemd service file. 10506 has "PrivateTmp=true" whereas 10563 has "#PrivateTmp=true #Block X11 idle detection". Everything else in the file is the same, including "ProtectSystem=strict". After changing 10563 to use PrivateTmp, it has started returning valid results. |
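The wrapper-exit-code bug described above can be illustrated with a small shell sketch. The real worker.sh is not shown in this thread, so the function bodies below are hypothetical stand-ins for "run the application, then create the output archive":

```shell
# Buggy pattern: a shell script's exit status is that of its LAST command,
# so a failing application followed by a successful archiving step reports
# success to the BOINC wrapper.
run_buggy() {
  false                                             # stand-in for the app failing
  echo "Create output archive OPT.out" >/dev/null   # archive step succeeds
}                                                   # function returns 0

# Fixed pattern: capture the application's status and return it explicitly,
# so the wrapper sees the failure.
run_fixed() {
  false                                             # stand-in for the app failing
  rc=$?                                             # remember its exit status
  echo "Create output archive OPT.out" >/dev/null
  return "$rc"                                      # propagate the failure
}
```

run_buggy exits 0 even though the "application" failed, which matches the `called boinc_finish(0)` seen in the failing hosts' stderr; run_fixed exits 1, which the wrapper would report as an error.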
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
If you want to see high failure rates, you don't have to look far. Just check your "valids" and look at the people who produce invalids in a few seconds.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=13821
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10191
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10140
And these are just the first three I checked. The list goes on and on. You wonder how they manage to turn their computers on. |
Send message Joined: 7 Feb 20 Posts: 10 Credit: 6,625,400 RAC: 0 |
Rejoined the project ~24 hours ago; all tasks finished in a few seconds and validation was inconclusive.
<core_client_version>7.18.1</core_client_version>
<![CDATA[
<stderr_txt>
00:08:56 (360824): wrapper (7.5.26014): starting
00:08:56 (360824): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive OPT.out
The command rsautl was deprecated in version 3.0. Use 'pkeyutl' instead.
Normal termination.
00:08:58 (360824): worker.sh exited; CPU time 0.905326
00:08:58 (360824): called boinc_finish(0)
</stderr_txt>
]]> |
Send message Joined: 7 Feb 20 Posts: 10 Credit: 6,625,400 RAC: 0 |
Apparently the issue is in the BOINC client: https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=166#1644 I need to try downgrading. |
Send message Joined: 3 Oct 19 Posts: 33 Credit: 197,169 RAC: 0 |
I re-enabled work fetch from the project to see if the earlier issues were just a memory. It downloaded 18 work units. Four jobs failed after a short period (i.e. less than two minutes) with a helpful exit status of 0x00000000. The remainder started running, but within an hour all had entered the "Postponed: VM job unmanageable, restarting later." state. "Later" appears to be 24 hours. With the long deadline this is tolerable; however, it makes a mess of the BOINC Manager screen. The exit status for these completed units is also 0x00000000, so failures clearly aren't being distinguished from successes... I enabled work fetch again, and since doing so four more units have arrived. I'll leave it running and see what happens.
Off topic, this keeps appearing:
Your connection is not private. Attackers might be trying to steal your information from quchempedia.univ-angers.fr (for example, passwords, messages or credit cards). NET::ERR_CERT_DATE_INVALID
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
Quoting the post above: "The remainder started running, but within an hour, all had entered the 'Postponed: VM job unmanageable, restarting later.' state."
Are you crunching on all 8 processors? If so, freeing one up worked for me; I see far fewer of the "Postponed..." messages now. I also downgraded the VirtualBox version. I don't understand why, but it seemed to help. I too have seen the certificate-invalid message, but BOINC manages to get work after a minute or so. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
If you are on Windows, it is best to run VirtualBox 5.2.44. https://www.virtualbox.org/wiki/Download_Old_Builds_5_2 It has to do with the COM interface; not all projects are up to date with 6.x yet. |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
I am using VirtualBox 5.2.38, so that makes sense, and with the project no longer having a Windows developer, that's not going to change anytime soon. The part about the COM interface is beyond me, but thank you for confirming it wasn't just my imagination. |
Send message Joined: 29 May 22 Posts: 3 Credit: 6,501,000 RAC: 0 |
My three systems (E5-2690 v4 on the top computers page) suddenly started finishing all tasks in 3-5 seconds today. Nothing unusual in the task stderr output. I've suspended the project and switched to TN-Grid until the cause is determined. |
Send message Joined: 29 May 22 Posts: 3 Credit: 6,501,000 RAC: 0 |
Fedora just rolled out BOINC 7.20 and I have auto-updates configured. After editing boinc-client.service with the ProtectSystem and PrivateTmp changes, my machines are processing tasks again. |
Send message Joined: 27 Jul 22 Posts: 4 Credit: 157,800 RAC: 0 |
I prefer to keep ProtectSystem set to strict in /usr/lib/systemd/system/boinc-client.service, so I've just added -/tmp to ReadWritePaths= to allow read/write access to /tmp, and it works. Thanks to bikeaddict and xii5ku for the help :) |
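For reference, the three fixes reported in this thread can all be expressed as a systemd drop-in, which survives package upgrades better than editing the unit file in place. The drop-in path and the idea of using one are my suggestion, not something from the thread; pick one option only:

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
# Apply with: systemctl daemon-reload && systemctl restart boinc-client
[Service]
# Option A (this post): keep ProtectSystem=strict but allow writes to /tmp,
# where the application creates its OpenMPI files ("-" ignores a missing path).
ReadWritePaths=-/tmp
# Option B (message 1687): give the service its own private /tmp instead.
#PrivateTmp=true
# Option C (message 1593): relax the filesystem sandbox altogether.
#ProtectSystem=full
```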
©2024 Benoit DA MOTA - LERIA, University of Angers, France