Message boards :
Number crunching :
High failure rate
Author | Message |
---|---|
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
WUs are failing at the rate of over 11%. That seems high. Are others getting similar failure rates? Is anyone looking into reducing the failure rate? |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
I was worried about that too in my long-term statistics. But I just reattached a machine that I had not used for a while, and it seems OK. https://quchempedia.univ-angers.fr/athome/results.php?hostid=10585 So either it was "fixed", or else it was just the data that was hard to crunch. |
Send message Joined: 14 Dec 19 Posts: 68 Credit: 45,744,261 RAC: 0 |
My failure rate has shot up to 56% today!!! Time to move on. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
You don't define "failure" and your computers are hidden. |
Send message Joined: 3 Oct 19 Posts: 33 Credit: 197,169 RAC: 0 |
I re-enabled work from here this morning on one machine (Intel, Windows 8.1 x64), but every task that arrived crashed after about 15 seconds with...
>>> 1 (0x00000001) Unknown error code
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 26 Jan 22 Posts: 4 Credit: 510,400 RAC: 0 |
https://quchempedia.univ-angers.fr/athome/result.php?resultid=10404648
VBoxManage.exe: error: Cannot register the hard disk 'C:\ProgramData\BOINC\slots\8\vm_image.vdi' {2c29d1e5-b43d-46fd-b9c5-69a421363472} because a hard disk 'C:\ProgramData\BOINC\slots\9\vm_image.vdi' with UUID {2c29d1e5-b43d-46fd-b9c5-69a421363472} already exists
"... Cannot register the hard disk ... because a hard disk ... already exists"
These are remnants of a previous crash: at the very least, a stale disk entry in the VirtualBox media registry. You need to clean up your BOINC slots and your VirtualBox media registry (best done with the Virtual Media Manager, reachable from the VirtualBox menu). |
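The stale registry entries can also be spotted from the command line. The sketch below greps the per-user VirtualBox configuration file for hard-disk entries pointing into BOINC slot directories; the default registry path is an assumption (on Windows it lives under %USERPROFILE%\.VirtualBox instead), and the function name is mine, not part of VirtualBox.

```shell
# Sketch: list media-registry <HardDisk .../> entries that point into BOINC
# slot directories (stale leftovers from crashed VM tasks) and print their
# UUIDs. The default registry path below is an assumption; adjust as needed.
list_stale_boinc_disks() {
  vbox_xml="${1:-$HOME/.config/VirtualBox/VirtualBox.xml}"
  grep -o '<HardDisk [^>]*/>' "$vbox_xml" \
    | grep 'BOINC.slots' \
    | sed -n 's/.*uuid="{\([^}]*\)}".*/\1/p'
}
```

Each printed UUID can then be deregistered with `VBoxManage closemedium disk <uuid>`, which is what the GUI Virtual Media Manager does under the hood.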
Send message Joined: 3 Oct 19 Posts: 33 Credit: 197,169 RAC: 0 |
Downloaded another batch today, same result, all but one failed quickly with the same error I mentioned above. One unit was different, it ran for 21:33 and then errored out with -108 (0xFFFFFF94) ERR_FOPEN. I tried to attach a different machine to see if that helped, but it would not allow me to join that one. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 23 Feb 22 Posts: 23 Credit: 4,423,400 RAC: 0 |
"Downloaded another batch today, same result, all but one failed quickly with the same error I mentioned above. One unit was different, it ran for 21:33 and then errored out with -108 (0xFFFFFF94) ERR_FOPEN."
Did you get your problem solved in the meantime? I recently faced the same problem on one of my machines, around the time of the roughly one-day server outage. I suspected that, because of the server problem, one of the downloaded tasks arrived here corrupt and damaged the Oracle VM. I tried to remove remnants of the crashed task in the VM Media Manager, but nothing was shown there. Still, I always received the same error message you cited. So I removed and re-installed the VM, but the error still showed up. Then I wanted to remove the VM again, but it was somehow damaged and could no longer be removed. In the end, all I could do was a complete clean re-installation of Windows 10 :-( Now everything works well. It was interesting to see what severe damage a single corrupt file can cause. |
Send message Joined: 25 Apr 22 Posts: 6 Credit: 101,800 RAC: 0 |
They're all working under Windows here. Three of my wingmen failed it in just over 1 second on Linux. Is this a case of missing libraries? All three of these wingman computers have failed thousands of tasks and managed to complete zero. When a task hits 8 failures, the server gives up on it. https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985
What I don't understand is that one of the failures I checked says this:
<core_client_version>7.16.16</core_client_version>
<![CDATA[
<stderr_txt>
21:22:29 (2105923): wrapper (7.5.26014): starting
21:22:29 (2105923): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive OPT.out
Normal termination.
21:22:31 (2105923): worker.sh exited; CPU time 1.217591
21:22:31 (2105923): called boinc_finish(0)
</stderr_txt>
]]>
Which doesn't look like an error to me. |
Send message Joined: 21 Jun 20 Posts: 24 Credit: 68,559,000 RAC: 0 |
Peter Hucker wrote: https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985
There was a long-standing bug in which failures of the application (termination with an error exit code) were not passed through the shell scripts that are wrapped around the application. It looks like this bug still exists. The hosts which failed in your WU link have a 0% success rate; they only return work that terminated after just a few seconds. One possible reason *could be* that boinc-client's local filesystem permissions are set up such that the application cannot create its OpenMPI files in /tmp/ompi.$HOSTNAME.$UID. It is possible (and in fact good security policy) to disallow boinc-client and its subprocesses from creating any files outside the BOINC data directory, but this policy breaks QuChem's current application.
Successful computer in WU 3453985: ____ client 7.16.6 on Ubuntu 20.04.4
Failing computers: ____ client 7.18.1 on Ubuntu 18.04.6; ____ client 7.16.16 on Debian 11; ____ client 7.16.16 on Debian 11
*If* it really is the potential filesystem permission problem, then it is not a problem with the client version itself, but with the startup file (the systemd service unit file) which launches the client. I currently have one computer active here myself which runs well: client 7.16.6 on openSUSE 15.2, with the client permitted to create files outside of its data directory.
- - - - - - - - - - - - - - - -
References for the access permissions issue:
message 1593, 17 Dec 2021, AF>WildWildWest Sebastien wrote: To fix this issue, I edited the file /lib/systemd/system/boinc-client.service and replaced ProtectSystem=strict by ProtectSystem=full
message 1687, 4 Mar 22, cpuprocess2 wrote: I have 2 hosts on Debian 11, where one (#10506) works fine and the other (#10563) returns invalid workunits after ~3 seconds. Looks like the difference came down to the BOINC client's systemd service file. 10506 has "PrivateTmp=true" whereas 10563 has "#PrivateTmp=true #Block X11 idle detection". Everything else in the file is the same, including "ProtectSystem=strict". After changing 10563 to use PrivateTmp, it has started returning valid results. |
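The wrapper-exit-code bug described above can be illustrated with a small shell sketch. The real worker.sh is not shown in this thread, so the function bodies below are hypothetical stand-ins for "run the application, then create the output archive":

```shell
# Buggy pattern: a shell script's exit status is that of its LAST command,
# so a failing application followed by a successful archiving step reports
# success to the BOINC wrapper.
run_buggy() {
  false                                             # stand-in for the app failing
  echo "Create output archive OPT.out" >/dev/null   # archive step succeeds
}                                                   # function returns 0

# Fixed pattern: capture the application's status and return it explicitly,
# so the wrapper sees the failure.
run_fixed() {
  false                                             # stand-in for the app failing
  rc=$?                                             # remember its exit status
  echo "Create output archive OPT.out" >/dev/null
  return "$rc"                                      # propagate the failure
}
```

run_buggy exits 0 even though the "application" failed, which matches the `called boinc_finish(0)` seen in the failing hosts' stderr; run_fixed exits 1, which the wrapper would report as an error.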
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
If you want to see high failure rates, you don't have to look far. Just check your "valids" and look at the people who produce invalids in a few seconds.
https://quchempedia.univ-angers.fr/athome/results.php?hostid=13821
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10191
https://quchempedia.univ-angers.fr/athome/results.php?hostid=10140
And these are just the first three I checked. The list goes on and on. You wonder how they manage to turn their computers on. |
Send message Joined: 7 Feb 20 Posts: 10 Credit: 6,625,400 RAC: 0 |
Rejoined the project ~24 hours ago; all tasks finished in a few seconds and validation was inconclusive.
<core_client_version>7.18.1</core_client_version>
<![CDATA[
<stderr_txt>
00:08:56 (360824): wrapper (7.5.26014): starting
00:08:56 (360824): wrapper: running worker.sh ()
Jobs starts with 1 cores
STEP OPT : Starting
Create output archive OPT.out
The command rsautl was deprecated in version 3.0. Use 'pkeyutl' instead.
Normal termination.
00:08:58 (360824): worker.sh exited; CPU time 0.905326
00:08:58 (360824): called boinc_finish(0)
</stderr_txt>
]]> |
Send message Joined: 7 Feb 20 Posts: 10 Credit: 6,625,400 RAC: 0 |
Apparently the issue is in the BOINC client: https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=166#1644 I need to try downgrading. |
Send message Joined: 3 Oct 19 Posts: 33 Credit: 197,169 RAC: 0 |
I re-enabled work fetch from the project to see if the earlier issues were just a memory. It downloaded 18 work units. Four jobs failed after a short period (i.e. less than two minutes) with a helpful exit status of 0x00000000. The remainder started running, but within an hour all had entered the "Postponed: VM job unmanageable, restarting later." state. "Later" appears to be 24 hours. With the long deadline this is tolerable; however, it makes a mess of the BOINC Manager screen. The exit status for these completed units is also 0x00000000, so failures clearly aren't being distinguished from successes... I enabled work fetch again, and since doing so four more units have arrived. I'll leave it running and see what happens.
Off topic, this keeps appearing:
Your connection is not private. Attackers might be trying to steal your information from quchempedia.univ-angers.fr (for example, passwords, messages or credit cards). NET::ERR_CERT_DATE_INVALID
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
Quoting the post above: "The remainder started running, but within an hour, all had entered the 'Postponed: VM job unmanageable, restarting later.' state."
Are you crunching on all 8 processors? If so, freeing one up worked for me; I see far fewer of the "Postponed..." messages now. I also downgraded the VirtualBox version. I don't understand why, but it seemed to help. I too have seen the certificate-invalid message, but BOINC manages to get work after a minute or so. |
Send message Joined: 3 Oct 19 Posts: 153 Credit: 32,412,973 RAC: 0 |
If you are on Windows, it is best to run VirtualBox 5.2.44. https://www.virtualbox.org/wiki/Download_Old_Builds_5_2 It has to do with the COM interface; not all projects are up to date with 6.x yet. |
Send message Joined: 13 Oct 19 Posts: 87 Credit: 6,026,455 RAC: 0 |
I am using VirtualBox 5.2.38, so that makes sense, and with the project no longer having a Windows developer, that's not going to change anytime soon. The part about the COM interface is beyond me, but thank you for confirming it wasn't just my imagination. |
Send message Joined: 29 May 22 Posts: 3 Credit: 6,501,000 RAC: 0 |
My three systems (E5-2690 v4 on the top computers page) suddenly started finishing all tasks in 3-5 seconds today. Nothing unusual in the task stderr output. I've suspended the project and switched to TN-Grid until the cause is determined. |
Send message Joined: 29 May 22 Posts: 3 Credit: 6,501,000 RAC: 0 |
Fedora just rolled out BOINC 7.20 and I have auto-updates configured. After editing boinc-client.service with the ProtectSystem and PrivateTmp changes, my machines are processing tasks again. |
Send message Joined: 27 Jul 22 Posts: 4 Credit: 157,800 RAC: 0 |
I prefer to keep ProtectSystem set to strict in /usr/lib/systemd/system/boinc-client.service, so I've just added -/tmp to ReadWritePaths= to allow read/write access to /tmp, and it works. Thanks to bikeaddict and xii5ku for the help :) |
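For reference, the three fixes reported in this thread can all be expressed as a systemd drop-in, which survives package upgrades better than editing the unit file in place. The drop-in path and the idea of using one are my suggestion, not something from the thread; pick one option only:

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
# Apply with: systemctl daemon-reload && systemctl restart boinc-client
[Service]
# Option A (this post): keep ProtectSystem=strict but allow writes to /tmp,
# where the application creates its OpenMPI files ("-" ignores a missing path).
ReadWritePaths=-/tmp
# Option B (message 1687): give the service its own private /tmp instead.
#PrivateTmp=true
# Option C (message 1593): relax the filesystem sandbox altogether.
#ProtectSystem=full
```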
©2024 Benoit DA MOTA - LERIA, University of Angers, France