Posts by niarbeht

1) Questions and Answers : Unix/Linux : Any alternative to the current taskset clobbering? (Message 1637)
Posted 13 Jan 2022 by niarbeht
Post:
https://imgur.com/a/LYC3k7n

Maybe step out of the thread, Jim.
2) Questions and Answers : Unix/Linux : Any alternative to the current taskset clobbering? (Message 1633)
Posted 13 Jan 2022 by niarbeht
Post:
Here's what a new run.sh might look like:

#!/usr/bin/env bash



set -m

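# Re-apply the inherited CPU mask to every nwchem process (all threads, -a):
# once 2 seconds after launch, and again 30 seconds later.
_affinity() {
    sleep 2
    for pid in $(pgrep nwchem); do taskset -a -p "$INHERITEDMASK" "$pid" >/dev/null; done
    sleep 30
    for pid in $(pgrep nwchem); do taskset -a -p "$INHERITEDMASK" "$pid" >/dev/null; done
}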

_step_check_run() {
    step_name=$1
    step_nw=$2
    step_out=$3
    dep_out=$4
    if [[ $dep_out == None ]] || ( [[ -e $dep_out ]] && grep -q "Total *times *cpu" "$dep_out" ); then
        echo -n "STEP $step_name : "
        if [[ -e $step_out ]] && grep -q "Total *times *cpu" "$step_out"; then
            echo "already done"
        else
            if [[ -e $step_out ]]; then
                echo "Continue from last checkpoint"
            else
                echo "Starting"
            fi
            _affinity &
            mpirun --allow-run-as-root -np $NP nwchem "$step_nw" >& "$step_out"
        fi
    fi
}

echo "STEP OPT : Starting"

INHERITEDMASK="0x$(taskset -p $$ | awk '{print $NF}')"

echo "Inherited mask: $INHERITEDMASK"

if [[ ! -e OPT.out ]]; then
    _affinity &
    mpirun --allow-run-as-root -np $NP nwchem OPT.nw >& OPT.out
else
    _step_check_run "OPT" OPT-restart.nw OPT.out None
fi
_step_check_run "SINGLE POINT" SP-restart.nw SP-restart.out OPT.out

_step_check_run "FREQ" FREQ.nw FREQ.out SP-restart.out

_step_check_run "TD (SINGLET)" TD_singlet.nw TD_singlet.out SP-restart.out



# fin OPT
# fin Solver


echo "Create output archive"
ALL=$(openssl enc -aes-256-ctr -k secret -P -md sha256 2>/dev/null)
KEY=$(echo $ALL | grep "key=[0-9A-F]*" -o | cut -d '=' -f 2)
SALT=$(echo $ALL | grep "salt=[0-9A-F]*" -o | cut -d '=' -f 2)
IV=$(echo $ALL | grep "iv =[0-9A-F]*" -o | cut -d '=' -f 2)

tar zcvf output.tar.gz  *.out 
openssl enc -aes-256-ctr -K "$KEY" -S "$SALT" -iv "$IV" -e -in output.tar.gz -out output.inc
rm -rf  *.out 
echo "$ALL" | openssl rsautl -encrypt -pubin -inkey pub.pem -out key.inc


DO NOTE that I haven't tested this yet, as it turns out that injecting changes to the run.sh is... really hard, actually. Whoever your software developer was did a really good job ensuring the environment was not something an outside user could easily modify.
3) Questions and Answers : Unix/Linux : Any alternative to the current taskset clobbering? (Message 1632)
Posted 13 Jan 2022 by niarbeht
Post:
A lead on where this might go:

#!/bin/bash

INHERITEDMASK="0x$(taskset -p $$ | awk '{print $NF}')"

echo $INHERITEDMASK


This should display the mask of the currently-running process. I'll see about shoving this into the run.sh later.

EDIT: I don't know how widespread the "awk" command is, but I suspect it should be on effectively every UNIX system that can run BOINC, so it's probably safe to use. Probably. I mean, I'm pretty sure awk is older than I am.
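
For anyone who finds a raw hex mask hard to read, here is a rough sketch of decoding it into a list of CPU numbers. The helper name mask_to_cpus is made up for illustration and is not part of run.sh; it only assumes the mask is plain hex with an optional 0x prefix.

```bash
#!/usr/bin/env bash
# Sketch: decode a hex affinity mask into the CPU numbers it allows.
# mask_to_cpus is a hypothetical helper, not something the project ships.
mask_to_cpus() {
    local hex=${1#0x}          # drop an optional 0x prefix
    local cpus=() cpu=0 i digit bit
    # Walk the hex digits from least significant (rightmost) to most significant.
    for (( i=${#hex}-1; i>=0; i-- )); do
        digit=$(( 16#${hex:i:1} ))
        # Each hex digit covers four CPUs.
        for bit in 0 1 2 3; do
            if (( digit & (1 << bit) )); then cpus+=("$cpu"); fi
            cpu=$(( cpu + 1 ))
        done
    done
    echo "${cpus[*]}"
}

# Usage on a live system (needs taskset and awk):
#   mask_to_cpus "0x$(taskset -p $$ | awk '{print $NF}')"
```

Working digit by digit avoids overflowing bash's 64-bit arithmetic on machines with a lot of threads. If all you want is the list, taskset -c -p $$ prints the affinity in list form directly.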
4) Questions and Answers : Unix/Linux : Any alternative to the current taskset clobbering? (Message 1631)
Posted 13 Jan 2022 by niarbeht
Post:
I just looked at things again today. It might be possible to use taskset to retrieve the affinity mask of the run script's own process, which should be the same as whatever the user decided to limit BOINC to in their systemd service unit, and then apply that mask in place of the current affinity fix. I might take a poke at it later and see if I can make an example script.

As a side note, the difference between letting units run on any core while using only 75% of processor threads, and restricting units to a specific subset of processor threads, is the difference between 1.3 million PPD and 1.5 million PPD in F@H. This is with AMD OpenCL units. If I remember right, the processor-time requirements are more pronounced for Nvidia units.

I'll see if I can't figure out that taskset stuff.
5) Message boards : Number crunching : Fast wu,s invalid (Message 1623)
Posted 9 Jan 2022 by niarbeht
Post:
it works!
6) Questions and Answers : Unix/Linux : Workunits failure after upgrading Debian to 11 (bullseye) (Message 1622)
Posted 9 Jan 2022 by niarbeht
Post:
Never mind, the Arch boxes run fine most of the time.

This is horribly confusing.
7) Questions and Answers : Unix/Linux : Any alternative to the current taskset clobbering? (Message 1619)
Posted 7 Jan 2022 by niarbeht
Post:
To reserve a core for the GPU (I do Folding on all my GPUs), just set the BOINC Preference to use less than 100% of the CPUs.
For example, on a 12 core Ryzen 3600, I set it to use 95%, which leaves one virtual core free.

To run only one task per core, just set QuChemPedIA preferences "Max # CPUs 1".


Wow. That is not how any of that works. At all.

Setting the usage limits in the BOINC client doesn't restrict which cores get used; it restricts the number of threads that work units can use. As such, if you set that number to 75% on a 12-core, 24-thread system, you wind up with work units running on a total of eighteen threads. If the operating system scheduler doesn't keep those threads contained to specific cores, it results in latency issues for the F@H work units, as context switching takes time.
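
To put numbers on that, a quick sketch of the arithmetic. The helper name allowed_threads is made up, and it simply rounds down; how BOINC itself rounds a fractional result is not something checked here.

```bash
#!/usr/bin/env bash
# Sketch: how many threads a percentage cap leaves for work units.
# allowed_threads is a hypothetical helper; it uses integer arithmetic
# (rounds down), which is an assumption rather than BOINC's documented rule.
allowed_threads() {
    local total_threads=$1 percent=$2
    echo $(( total_threads * percent / 100 ))
}

# 12 cores / 24 threads capped at 75%:
allowed_threads 24 75   # prints 18
```

Note that the result is only a count. Nothing in it says which eighteen of the twenty-four threads get used, which is the part that matters for F@H.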

As for "To run only one task per core", that's not the issue. The issue is the project's run script using taskset to work around how mpirun handles unit execution.
8) Questions and Answers : Unix/Linux : Any alternative to the current taskset clobbering? (Message 1617)
Posted 6 Jan 2022 by niarbeht
Post:
I've got my boinc systemd unit file set up so that only certain processor cores will be used. This keeps F@H GPU units from stalling while they wait on the CPU, for whatever reason F@H OpenCL units need it.

Anyway, so I noticed that when I'm running this project's units, they run on every core and thread on my system. Which is, as noted in the first paragraph, not desired behavior, as it results in my F@H GPU units choking.
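
A minimal sketch of that kind of setup, for anyone who wants to try it. CPUAffinity= is a real systemd [Service] setting; the unit name boinc-client.service, the drop-in path, and the CPU range 0-17 are assumptions for illustration, and the helper name affinity_dropin is made up.

```bash
#!/usr/bin/env bash
# Sketch: generate a systemd drop-in that pins a service to a set of CPUs.
# affinity_dropin is a hypothetical helper; the CPU list is whatever you pass in.
affinity_dropin() {
    local cpus=$1
    printf '[Service]\nCPUAffinity=%s\n' "$cpus"
}

# Example (assumed unit name and path, adjust for your distro):
#   sudo mkdir -p /etc/systemd/system/boinc-client.service.d
#   affinity_dropin "0-17" | sudo tee /etc/systemd/system/boinc-client.service.d/affinity.conf
#   sudo systemctl daemon-reload && sudo systemctl restart boinc-client.service
```

Everything the service starts inherits that affinity, unless a child process resets it with taskset.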

I looked into things, and it appears that you guys are clobbering my processor limitations by using taskset to let your units run on every CPU. I poked and prodded things a bunch, and figured out that this is a workaround for a limitation of working with MPI. I can understand and appreciate this, but...

Well, it'd be nice if there were a good way for me to control what taskset is being fed. I don't know how you could set it up, but it sure would be nice if your run.sh files checked for a variable and, if present, tried to feed that to taskset instead. I dunno.
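
Something along these lines is what I have in mind, as a rough sketch only. The variable name AFFINITY_MASK_OVERRIDE and the helper choose_mask are both made up; nothing in the project defines them.

```bash
#!/usr/bin/env bash
# Sketch: prefer a user-supplied mask over the script's default.
# AFFINITY_MASK_OVERRIDE is a hypothetical environment variable.
choose_mask() {
    local default_mask=$1
    if [[ -n "${AFFINITY_MASK_OVERRIDE:-}" ]]; then
        echo "$AFFINITY_MASK_OVERRIDE"
    else
        echo "$default_mask"
    fi
}

# In a run script this could feed taskset, for example:
#   taskset -a -p "$(choose_mask "$default_mask")" "$pid"
# where default_mask is whatever mask the script applies today.
```

If the variable is unset or empty, behavior stays exactly as it is now, so nobody who is happy with the current setup would notice a difference.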

As it is, I'm gonna let my current units for this project finish and not get any new units on my desktop for now, I guess. Luckily, I can still let your units run on my server, since it doesn't run F@H GPU units at all.
9) Questions and Answers : Unix/Linux : Workunits failure after upgrading Debian to 11 (bullseye) (Message 1613)
Posted 3 Jan 2022 by niarbeht
Post:
Never mind, both don't work, they're just failing in different ways.
10) Questions and Answers : Unix/Linux : Workunits failure after upgrading Debian to 11 (bullseye) (Message 1612)
Posted 3 Jan 2022 by niarbeht
Post:
So, the whole "It's because Debian has a newer kernel" thing doesn't quite pan out, I think, because my Arch Linux boxes run work units just fine, but my Debian (Proxmox) box doesn't.

I dunno. I don't know anything about the build environment.

Doesn't work:
uname -a
Linux serverofpie 5.13.19-1-pve #1 SMP PVE 5.13.19-3 (Tue, 23 Nov 2021 13:31:19 +0100) x86_64 GNU/Linux


Works:
uname -a
Linux HolyPie 5.15.10-arch1-1 #1 SMP PREEMPT Fri, 17 Dec 2021 11:17:37 +0000 x86_64 GNU/Linux


Both have python2 and python3 installed.




©2024 Benoit DA MOTA - LERIA, University of Angers, France