Posts by xii5ku

1) Message boards : Number crunching : High failure rate (Message 1736)
Posted 28 Apr 2022 by xii5ku
Post:
Peter Hucker wrote:
https://quchempedia.univ-angers.fr/athome/workunit.php?wuid=3453985

What I don't understand is one of the failures I checked says this:
[...]
Which doesn't look like an error to me.
There was a longstanding bug in which failures of the application (termination with an error exit code) were not passed through the shell scripts which are wrapped around the application. It looks like this bug still exists.

The hosts which failed in your linked WU have a 0% success rate; they only return work which terminated after just a few seconds.

One possible reason *could be* that boinc-client's local filesystem permissions are set up such that the application cannot create the OpenMPI files in /tmp/ompi.$HOSTNAME.$UID. It is possible (and in fact good security policy) to disallow boinc-client and its subprocesses from creating any files outside of the boinc data directory, but this policy breaks QuChem's current application.

Successful computer in WU 3453985:
____ client 7.16.6 on Ubuntu 20.04.4
Failing computers:
____ client 7.18.1 on Ubuntu 18.04.6
____ client 7.16.16 on Debian 11
____ client 7.16.16 on Debian 11

*If* it really is this filesystem permission problem, then it is not a problem with the client version itself, but with the startup file (the systemd service unit file) which launches the client.

I currently have one computer of my own active here, and it runs well: client version 7.16.6 on openSUSE 15.2. My client is permitted to create files outside of its data directory.

- - - - - - - - - - - - - - - -

References for the access permissions issue:

message 1593
On 17 Dec 2021 AF>WildWildWest Sebastien wrote:
To fix this issue, I edited the file /lib/systemd/system/boinc-client.service and replaced ProtectSystem=strict by ProtectSystem=full

systemctl stop boinc-client
sed -i 's/ProtectSystem=strict/ProtectSystem=full/g' /lib/systemd/system/boinc-client.service
systemctl daemon-reload
systemctl start boinc-client

message 1687
On 4 Mar 22 cpuprocess2 wrote:
I have 2 hosts on Debian 11, where one (#10506) works fine and the other (#10563) returns invalid workunits after ~3 seconds. Looks like the difference came down to the BOINC client's systemd service file. 10506 has "PrivateTmp=true" whereas 10563 has "#PrivateTmp=true #Block X11 idle detection". Everything else in the file is the same, including "ProtectSystem=strict". After changing 10563 to use PrivateTmp, it has started returning valid results.

Just checked the boinc-client packages on Debian today. Only 7.16.17+dfsg-2 (no longer available for amd64) included a service file that uses PrivateTmp. The other recent versions (7.16.16+dfsg-1, 7.18.1+dfsg-4) have it commented out and the old version (7.14.2+dfsg-3) doesn't even have that line.

EDIT: Looks like PrivateTmp was commented out to fix idle detection (issue, pull request). Apparently other projects have similar issues. It seems the long-term fix is for the QuChem program to write to the slot folder instead of /tmp.
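
To check which variant of the service file a given host actually uses, and to apply the PrivateTmp workaround without editing the packaged unit file (which package upgrades overwrite), a systemd drop-in override can be used. A sketch, assuming a systemd-managed boinc-client:

# show the sandboxing lines of the unit file (including drop-ins) actually in use:
systemctl cat boinc-client | grep -E 'PrivateTmp|ProtectSystem|ReadWritePaths'

# create a drop-in override; this opens an editor, where you add:
#     [Service]
#     PrivateTmp=true
sudo systemctl edit boinc-client

# restart the service so the override takes effect:
sudo systemctl restart boinc-client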
2) Message boards : Number crunching : ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time (Message 1652)
Posted 12 Feb 2022 by xii5ku
Post:
I found the following idea via the Rosetta@home message board, originally posted by @computezrmle at the Cosmology@home message board:
http://www.cosmologyathome.org/forum_thread.php?id=7769&postid=22921

On Dec 5 2021 computezrmle wrote:
Volunteers frequently affected by the postponed issue may try a different vboxwrapper.

BOINC's wiki pages mention communication problems between vboxwrapper and VirtualBox 6.x, especially on Windows.
They offer premade executables that may solve the problems:
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables

It would be the job of the project developers to test those vboxwrappers and distribute them to the clients.
As long as this is not done, volunteers could use the following steps as a workaround:

1. Download an alternative vboxwrapper from the page mentioned above (or use one you got from another project, e.g. LHC@home)
2. Start the BOINC client but suspend computing
3. Change to the project directory, e.g. projects/www.cosmologyathome.org, and replace the vboxwrapper there with the test version; the filename must be the name of the old vboxwrapper
4. Resume computing -> check the logfiles of tasks started after the patch


Each restart of the BOINC client will replace the patch with the original vboxwrapper from the project server.
This can be avoided by setting <dont_check_file_sizes>1</dont_check_file_sizes> in cc_config.xml, but then all other automatic file updates will also stop working.

I haven't tried this myself yet.
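
For reference, the option quoted above belongs in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch:

<cc_config>
    <options>
        <dont_check_file_sizes>1</dont_check_file_sizes>
    </options>
</cc_config>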
3) Message boards : Number crunching : ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time (Message 1649)
Posted 11 Feb 2022 by xii5ku
Post:
I am running QuChem on Linux and therefore am not observing this here. But I got the same thing occasionally at Cosmology@home with the "camb_boinc2docker" application, and very frequently at Rosetta@home with the "rosetta python projects" application. (I've got VirtualBox 6.1.28; that version is apparently a factor in the frequency of such events.)

I suspect that vboxwrapper simply doesn't cope with the large latencies which a Vbox VM can sometimes exhibit. IOW my guess is that someone set a timeout too small somewhere.

I am currently running "rosetta python projects" (merely 16 or fewer tasks at once on a computer with plenty of cores and 256 GB RAM) and am restarting the boinc client twice a day. Otherwise the client would eventually run out of work, since it does not request new work as long as there are one or more "postponed" tasks in the buffer. :-(
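
In case somebody wants to automate such restarts: a hypothetical root crontab entry (the times are arbitrary, and it assumes the client runs as a systemd service):

# restart the BOINC client at 06:00 and 18:00 every day
0 6,18 * * * systemctl restart boinc-client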
4) Message boards : Number crunching : Stuck tasks (Message 1648)
Posted 11 Feb 2022 by xii5ku
Post:
Yogurt789 wrote:
I seem to be getting tasks that get stuck and don't really go anywhere. For example, I have one task that has been running for 8 days and 6 hours, but has only registered 01:24:00 of CPU time. These tasks also never seem to exceed 3 seconds of CPU time since the last checkpoint.

Does anybody know what exactly is going on here? Should I just abort these tasks?
I am running only Linux, hence am not observing this here at QuChemPedIA@home. But I occasionally get this phenomenon with the VirtualBox-based "rosetta python projects" application of Rosetta@home. I am not aware of any way to deal with those stuck tasks other than to abort them.

Here is a script which periodically checks for the presence of tasks with CPU time << elapsed time and aborts them. You need to edit the project URL in the script to adapt it from Rosetta@home to QuChemPedIA@home. Alas, the script interpreter is 'bash', so it is not entirely straightforward to run on Windows: Cygwin should work, WSL might work. Furthermore, the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works and 7.16.6 does not.

#!/bin/bash

# Edit this:
#    a list of hosts, each optionally with GUI port number appended
#    (may be just a single host, or dozens of hosts)
hosts=(
	"localhost"
	"computer_a"
	"computer_b:31420"
)

# Edit this:
#    the password from gui_rpc_auth.cfg
#    This script expects the same password on all hosts.
#    Can be set to "" if you have empty gui_rpc_auth.cfg's.
password="$(cat /var/lib/boinc/gui_rpc_auth.cfg)"

# Edit this if you want to apply this to a different project.
project_url="https://boinc.bakerlab.org/rosetta/"

# Change this from "abort" to "suspend" if you prefer.
task_op="abort"

# Until a task has been executing for some time, its other stats
# may still be imprecise.  The script therefore does not touch any
# tasks which haven't been executing for at least this many seconds.
# You can use integer numbers here, but not floating point numbers.
# E.g.: 5 * 60 for 5 minutes.
min_elapsed_time=$((5 * 60))

# After tasks were aborted, boinc-client may cease to request
# new work due to "Communication deferred". To avoid this, should a
# project update be forced after one or more tasks were aborted?
# Set to 1 for yes, 0 for no.
force_project_update=1

# Loop intervals.
# You probably don't need to edit these.
check_every_n_minutes=10
timestamp_every_n_minutes=120

# That's it; there is probably no need to edit anything from here on.
delay=$((${check_every_n_minutes}*60/${#hosts[*]}+1))
ts=${timestamp_every_n_minutes}

echo "Monitoring ${hosts[*]}."
for ((;;))
do
	(( (ts += check_every_n_minutes) >= timestamp_every_n_minutes )) && { date; ts=0; }

	for host in ${hosts[*]}
	do
		# Edit this if you run on Cygwin:
		#    boinccmd="/cygdrive/c/Program*Files/BOINC/boinccmd --host ${host} --passwd ${password}"
		if [ -n "${password}" ]
		then
			boinccmd="boinccmd --host ${host} --passwd ${password}"
		else
			boinccmd="boinccmd --host ${host}"
		fi

		tasks=$(${boinccmd} --get_tasks) || { sleep ${delay}; continue; }

		unset name url state ett cct
		# parse the boinccmd output into arrays indexed by task number
		while read -r line
		do
			case ${line} in
				[1-9]* )                 i=${line%)*};;
				"name: "* )              name[$i]=${line#*"name: "};;
				"project URL: "* )       url[$i]=${line#*"project URL: "};;
				"active_task_state: "* ) state[$i]=${line#*"active_task_state: "};;
				"elapsed task time: "* ) tmp=${line#*"elapsed task time: "}; ett[$i]=${tmp%.*};;
				"current CPU time: "* )  tmp=${line#*"current CPU time: "};  cct[$i]=${tmp%.*};;
			esac
		done <<< "${tasks}"

		n=0
		for j in ${!name[*]}
		do
			# Skip tasks
			#   - which do not belong to this project,
			#   - which are not currently running,
			#   - which have been running for less than $min_elapsed_time seconds,
			#   - which have a CPU time of more than 50% of elapsed time.
			[ "${url[$j]}"   != "${project_url}" ] && continue
			[ "${state[$j]}" != "EXECUTING"      ] && continue
			e=${ett[$j]}; ((e < min_elapsed_time)) && continue
			c=${cct[$j]}; ((e < 2*c)) && continue

			printf "${host}: ${task_op} ${name[$j]}\t"
			printf "(elapsed: %02d:%02d:%02d," $((e/3600)) $((e%3600/60)) $((e%60))
			printf " CPU: %02d:%02d:%02d)\n"   $((c/3600)) $((c%3600/60)) $((c%60))
			${boinccmd} --task "${project_url}" "${name[$j]}" "${task_op}"
			((n++))
		done

		((force_project_update && n)) && { sleep 1; ${boinccmd} --project "${project_url}" update; }

		sleep ${delay}
	done
done

Source: AnandTech forum
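
To use it: save the script under any name (the filename below is just an example), make it executable, and leave it running, e.g. detached in a screen session:

# save as e.g. watch_stuck_tasks.sh, then:
chmod +x watch_stuck_tasks.sh
screen -dmS boincwatch ./watch_stuck_tasks.sh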
5) Message boards : Number crunching : Host ID 1388 corrupted (Message 1647)
Posted 5 Feb 2022 by xii5ku
Post:
9890 — Ubuntu 18.04.5, BOINC version 7.16.16, anonymous owner, currently not requesting work
9904 — Debian 11, BOINC version 7.16.16, anonymous owner, currently not requesting work
10176 — Debian Sid, BOINC version 7.18.1, anonymous owner, currently not requesting work

9617 — Windows+vbox, 100% error rate, @TribbleRED (PM sent)

Anyway, workunits generally succeed eventually even if some of their tasks go through such fast-failing hosts, thanks to the workunit configuration of "max # of error/total/success tasks" = 8, 10, 6.
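
For reference, these limits map to the standard BOINC workunit parameters on the server side; a sketch:

max_error_results   = 8
max_total_results   = 10
max_success_results = 6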
6) Message boards : Number crunching : Host ID 1388 corrupted (Message 1646)
Posted 5 Feb 2022 by xii5ku
Post:
@damotbe,
perhaps post to the News feed ( = create a boinc notice) asking Linux users who run a 3rd-party boinc-client package to check whether they produce any valid results at QuChemPedIA at all. If not, they should revert to their distribution's stock boinc-client package.

Edit:
10053 — Debian 11, BOINC version 7.16.16, anonymous owner

8589 — Windows+vbox, 100% error rate, @raddoc
7) Message boards : Number crunching : Host ID 1388 corrupted (Message 1645)
Posted 5 Feb 2022 by xii5ku
Post:
9800 — Debian 11, BOINC version 7.16.16, anonymous owner
8) Message boards : Number crunching : Some hosts are delaying (long) batch completion (Message 1307)
Posted 7 Jan 2021 by xii5ku
Post:
All the nwchem-long work which I still had left from December, plus a dozen resends which I received in January, is finished by now, except for one last task which is still running. This went a lot quicker than I thought, mostly because I received far fewer resends than I anticipated.

(I expected inconclusive or invalid results from one of my computers to turn into resends to another of my computers, as still happened in December. But apparently the high ratio of inconclusive/invalid nwchem-long results from my hosts in January caused the scheduler to assign these resends to other hosts.)

So in short, of the 216 nwchem-longs which are in progress at this time, I have got only 1 left now, and the rest became somebody else's problem in the meantime. ;-)
9) Message boards : Number crunching : Some hosts are delaying (long) batch completion (Message 1287)
Posted 24 Dec 2020 by xii5ku
Post:
@damotbe,
of the 469 nwchem-long tasks which are currently in progress, 359 are located on two compute servers of mine. I worked on nwchem-long until recently, but have this work suspended until January 1. Then these servers will be fully available again and will complete this work (plus any resends which they might receive in the process).

These last remaining nwchem-long workunits have a disproportionately large number of inconclusive results, together with the occasional aborted or error results. That is, the chances that these last WUs will end up with two valid results are rather slim. Though if two hosts with the very same hardware/software configuration turn in results, might this improve the chance of a valid outcome somewhat?

Besides the slim chances of validation, run times of some of these tasks easily exceed a week now. Nevertheless, the mentioned hosts are allocated to this work, restarting in January, for as long as it takes to bring these workunits to whatever conclusion they reach. (It's very stable hardware too, e.g. with ECC RAM, and therefore suited for long-running work.)

PS,
as information for users who have the other 110 currently remaining nwchem-long tasks queued: the validation rate which I have seen with nwchem-long dropped very sharply during the first half of December, even though I was typically in a position to contribute both of the results needed for validation. If points per day are important to you, then these last tasks are no longer viable from that perspective, because of the high ratio of inconclusive and error results and because of the sometimes dramatically long runtimes.
10) Message boards : Number crunching : The aborted and resend WU (Message 1125)
Posted 3 Oct 2020 by xii5ku
Post:
Just now, I switched four of my computers which no longer received work back to QuChemPedIA, and they were given new work right away. Thanks for your quick response and adjustment.
11) Message boards : Number crunching : The aborted and resend WU (Message 1122)
Posted 3 Oct 2020 by xii5ku
Post:
damotbe wrote:
I tweak feeder and scheduler again. Normally, retries are accelerated AND the oldest jobs are preferred. I also doubled the capacity in shared memory. Let's see !
I am noticing that most of the recent tasks in my results tables were resends of rather old workunits which have been replicated several times by now. My own result statuses are mostly "Completed, validation inconclusive", "Validate error", and "Completed, can't validate" (just like those of the computers which ran the previous replicas), with a few "Completed, waiting for validation" thrown in, and the very rare "Completed and validated".

This change, from mostly successfully completed new workunits to mostly unsuccessful old ones, seems to coincide roughly with the time of your posting. So this part of your plan evidently worked.

However:
I am no longer receiving any new work now. Yet server_status shows plenty of tasks ready to send.

My results tables at the web site show that I received tasks frequently until 3 Oct 2020, 0:07:33 UTC. After that, I received five more tasks in a time window between 1:56...1:58 UTC. Then nothing more.

Is this because my hosts returned so many inconclusive or invalid results during the last ~half day? I believe so. I started another client which had not been active at QuChemPedIA since September 27 and had a good success rate until then. This client immediately received the full complement of 128 tasks, in the usual batches of 20 tasks per request. (All of these tasks were from old workunits too, which had failed several times before.)

Conclusion:
I suspect the new preference for old, bad workunits which only produce faulty results causes the scheduler to mark one host after another as unreliable, and then it no longer assigns work to them.

Edit,
I think the large number of old workunits which are faulty but still haven't reached their maximum failure count will soon cause this project to run out of trusted hosts to which the server would want to assign replicas of previously failed workunits. What happens then? Will everything come to a standstill, or will the server then begin to send tasks from new workunits to the allegedly unreliable hosts again?
12) Message boards : Number crunching : Quchem multithreading (Message 927)
Posted 11 Jul 2020 by xii5ku
Post:
"NWChem long" is not a multithreaded application in the stricter sense, in which one process maintains several threads which operate on shared data. Instead, the application spawns one or more separate processes, each operating on own data. These sub-processes are synchronizing with each other only occasionally, via message passing. Therefore I presume that processor cache does *not* make a difference for what little or large overhead there might be for running "NWChem long" with more than a single thread.

Architecturally, a single job of this application could even be spread over an Ethernet or InfiniBand cluster of computers. When the message passing has to go over such a network, though, there might be a bigger performance impact from communication latency.
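
For illustration: this process model is typically driven by an MPI launcher rather than by threads. A purely hypothetical launch of four cooperating NWChem processes with OpenMPI would look roughly like this (the input filename is made up):

mpirun -np 4 nwchem molecule.nw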
13) Message boards : Number crunching : Long work units. (Message 926)
Posted 11 Jul 2020 by xii5ku
Post:
adrianxw wrote:
I have added the config file and set my preference to only crunch the longs, but still have not received a work unit.
There is only "NWChem long" work available (server_status.php), which requires either Linux, or Windows with VirtualBox in beta testing (apps.php).
Beta applications require "Run test applications?" switched on in the project preferences.
14) Message boards : Number crunching : "Multithreading" in prefs (Message 925)
Posted 11 Jul 2020 by xii5ku
Post:
Sometimes it is convenient to be able to finish the work within less than half a day after download, rather than in two or three days.
15) Message boards : Number crunching : "Multithreading" in prefs (Message 911)
Posted 27 Jun 2020 by xii5ku
Post:
xii5ku wrote:
it seems the thread count does not have to be a power of two. I started a 14-threaded task successfully, but it has yet to complete, and to validate of course.
This 14-threaded task validated by now. I shall try more of this, perhaps next week.
16) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 909)
Posted 24 Jun 2020 by xii5ku
Post:
@crashtech:
It looks like you have three "good" hosts with Mint 19.3 and boinc version 7.9.3,
and two "bad" hosts with Mint 19.3 and boinc version 7.17.0.
Right?

(On the other hand, when I look at the wingmen of my own results, there are a couple of hosts which have recently been persistently spamming the project with bogus few-second results, and these two hosts have Mint 19.3 and boinc version 7.9.3. Their owner is anonymous, hence we have no way to wake up the pilot.)
17) Message boards : Number crunching : "Multithreading" in prefs (Message 907)
Posted 22 Jun 2020 by xii5ku
Post:
* * * Wish list * * *

1. Allow fixing the thread count per task to a single value, instead of a range.

Currently, Linux users can choose between exactly 1 thread/task, or randomly 1...2 threads/task, or randomly 1/2/4 threads per task, or randomly 1/2/4/8 threads per task. Frankly, the random options are completely bogus.

Whenever I run projects with multithreaded applications on my hosts, I always configure the client to run tasks with one uniform thread count per task. Not doing this will soon confuse the work queue management, and leave me with an under-utilized host, which I detest to no end.


2. Allow other thread counts than just 1, 2, 4, or 8.

I have yet to run proper tests, but it seems to me that while there is a throughput loss when going from 1 to >1 thread per task (as expected, due to inter-process synchronization overhead), there is very good scaling from a few to several more threads per task (as expected from such a long-running workload).

And what's more, it seems the thread count does not have to be a power of two. I started a 14-threaded task successfully, but it has yet to complete, and to validate of course.

Thread counts which are not a power of two will be good to have on 6-, 12-, 14-, ...-core hosts.


3. Allow more than 8 threads per task.

I have run several 16-threaded tasks by now. The few which have validated appear to show reasonable scaling compared with 4- and 8-threaded tasks.

I tried to start one 32-threaded task, but it exited after a few seconds. I.e., there must have been an error which was not reported upwards to the boinc wrapper. I need to do some offline tests to check this out.


4. As a band-aid until some of the above can be implemented: Allow the anonymous platform.

I set up clients with an app_info.xml and the necessary project files in order to implement the above items locally. But when these clients requested work, the request failed with "HTTP internal server error".

For now, I am working around the lack of app_info.xml support by manually editing the nwchem_t1_worker_0.19.sh file. The modification does not persist across a client restart though, of course.
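
For context, the skeleton of such an anonymous-platform setup looks roughly like the following. This is only a sketch: a real app_info.xml must list every file of the app version, the version number and CPU count here are my assumptions, and the project's scheduler rejects it anyway at present.

<app_info>
    <app>
        <name>nwchem_long</name>
    </app>
    <file_info>
        <name>nwchem_t1_worker_0.19.sh</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>nwchem_long</app_name>
        <version_num>19</version_num>
        <avg_ncpus>14</avg_ncpus>
        <file_ref>
            <file_name>nwchem_t1_worker_0.19.sh</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>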


PS,
while it is certain that the single-threaded jobs give the best throughput, I am personally not keen on tasks which might take several days to complete.
18) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 906)
Posted 22 Jun 2020 by xii5ku
Post:
@crashtech, in addition to ProtectSystem=full, you could try: PrivateTmp=false
19) Message boards : Number crunching : Suspicious near-instant results with NWChem long t4 (Message 903)
Posted 21 Jun 2020 by xii5ku
Post:
@crashtech, "df" reports "file system disk space usage", i.e. the used space and available space in the filesystem in which the optionally given file or directory resides. My main intention was to verify how much free space is left in your /tmp. We now know that there is plenty of space left in it. (There are 180 GBytes available in /tmp.)

As for the boinc-client.service unit file: Compared with the boinc-client.service file on my computers, yours has several extra lines. The following four, explained in "man systemd.exec", stick out to me:

ProtectHome=true

    Most likely harmless to the NWChem (...long) application.


PrivateTmp=true

    In theory this should be OK for NWChem long.


ProtectSystem=strict

    This is probably the culprit! As I understand the documentation, this will make /tmp read-only.
    Either relax this from strict to full, or append
      -/tmp
    to the ReadWritePaths line. (See also the drop-in sketch below.)
    Then restart the boinc-client service; or maybe you even need to reboot, I don't know.
    Then fetch one QuChemPedIA task and see if it runs normally.


ProtectControlGroups=true

    In theory this should be OK.


(Documentation of the systemd service file format is spread over "man systemd.unit", "man systemd.service", and "man systemd.exec".)
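
If editing the packaged unit file is undesirable (package upgrades overwrite it), the same changes can be made with a systemd drop-in, e.g. created via "sudo systemctl edit boinc-client". A sketch of the override contents:

[Service]
# Option A: relax the sandbox; a drop-in overrides the stock ProtectSystem value.
ProtectSystem=full
# Option B, instead of A: keep strict but allow writing to /tmp.
# ReadWritePaths entries from drop-ins are appended to the stock list;
# the leading "-" means "ignore if the path does not exist".
#ReadWritePaths=-/tmp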

20) Message boards : Number crunching : Long work units. (Message 901)
Posted 21 Jun 2020 by xii5ku
Post:
adrianxw wrote:
I WANT a long work unit, but just one. I have already set 1 CPU.

This "projects/quchempedia.univ-angers.fr_athome/app_config.xml" file limits boinc-client to start at most one "NWChem long" task at any time:
<app_config>
    <app>
        <name>nwchem_long</name>
        <max_concurrent>1</max_concurrent>
    </app>
</app_config>


The following simpler "projects/quchempedia.univ-angers.fr_athome/app_config.xml" file limits boinc-client to start at most one QuChemPedIA task of any of the available applications:
<app_config>
    <project_max_concurrent>1</project_max_concurrent>
</app_config>
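
After creating or changing an app_config.xml, the client has to reread it; restarting the client works, and so should this (assuming a reasonably recent boinccmd):

boinccmd --read_cc_config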


The following setting in your project preferences limits you to one QuChemPedIA task in progress on each of your hosts:

    Max # jobs: 1


The limit on tasks in progress is enforced by the server, i.e. it works independently of any client-side settings.


