Receive Email Notifications of Your Slurm Jobs
When the resources you need for a computing job are not available, you can still submit your resource allocation request; it will be placed in the corresponding first-come, first-served queue (Slurm partition), and you can receive an E-mail notification once your request has been assigned the requested resources. This post explains how.
1) Keep your Slurm sessions running: to keep your cluster sessions running even if your network connection to the cluster is interrupted, please use X2Go for GUI sessions (with “--x11”), and X2Go or tmux for non-GUI sessions (without “--x11”). Here is an example of starting a tmux session:
tmux new -s tmuxSESSIONname
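If your connection drops, the tmux session keeps running on the server. As a reminder (standard tmux usage, not specific to this cluster), you can list your sessions and reattach to one like this:
tmux ls
tmux attach -t tmuxSESSIONname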
2) Submit your request using srun or sbatch: if you have already debugged your code, we suggest that you use sbatch. The following “gpudebug” examples each ask for one GPU. For CPU-only jobs, please use the “cpu” (default) or “gpu” partition.
2.1) Submit a Slurm job using srun and receive E-mail notifications. For example,
srun -p gpudebug -c 10 --gpus=1 --mail-type=ALL --mail-user=YOUR-EMAIL-ADDRESS --pty bash -i
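The “--mail-type=ALL” option asks Slurm to send an E-mail on every job state change (begin, end, failure, requeue, and so on). If you prefer fewer messages, Slurm also accepts a comma-separated list of events, for example:
srun -p gpudebug -c 10 --gpus=1 --mail-type=BEGIN,END,FAIL --mail-user=YOUR-EMAIL-ADDRESS --pty bash -i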
2.2) Submit a Slurm job using sbatch and receive E-mail notifications.
Below is an example sbatch script (a line starting with “#SBATCH” is a Slurm directive; putting two number signs “##” at the beginning of such a line comments it out so it is ignored):
#!/bin/bash

## job resources and settings
#SBATCH --job-name="USERNAME_JOBNAME"
#SBATCH --partition=gpudebug
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=20000
#SBATCH --gpus-per-node=1
#SBATCH --output=USERNAME_JOBNAME.out
#SBATCH --mail-user=YOUR-EMAIL-ADDRESS
#SBATCH --mail-type=ALL

## commands to run
hostname
date +'%y-%m-%d %H:%M:%S'
which python
source /opt/scripts/user/anaconda3-2023.03.init.sh
which python
python --version
conda activate emostyle
which python
python --version
python -c "import torch; print(torch.cuda.is_available());"
date +'%y-%m-%d %H:%M:%S'
Note: the above example asks for:
- resources from the “gpudebug” partition (#SBATCH --partition=gpudebug)
- 1x node (#SBATCH --nodes=1)
- 10x CPU cores (#SBATCH --ntasks=1, #SBATCH --cpus-per-task=10)
- 20GB RAM (#SBATCH --mem=20000)
- 1x GPU (#SBATCH --gpus-per-node=1)
- redirecting standard output / error to the “USERNAME_JOBNAME.out” file (#SBATCH --output=USERNAME_JOBNAME.out)
- E-mail notifications for all job state changes, sent to YOUR-EMAIL-ADDRESS (#SBATCH --mail-user=YOUR-EMAIL-ADDRESS, #SBATCH --mail-type=ALL)
To submit the example script (saved as SBATCH_EXAMPLE.sh, for example), sign into the cluster head node, open a terminal, and run:
sbatch SBATCH_EXAMPLE.sh
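After submitting, you can check whether the job is still waiting in the queue (state PD) or already running (state R) with the standard “squeue” command, for example:
squeue -u $USER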
To check the latest output of the job, we can use the “tail” command, for example:
tail USERNAME_JOBNAME.out
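While the job is still running, “tail -f” can be used instead to follow the output file as new lines are appended (press Ctrl-C to stop following):
tail -f USERNAME_JOBNAME.out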
And below is an example output file “USERNAME_JOBNAME.out”:
neurocomp12
24-10-19 18:01:10
/usr/bin/python
/pkgsGPU/anaconda3-2023.03/bin/python
Python 3.10.12
/pkgsGPU/anaconda3-2023.03/envs/emostyle/bin/python
Python 3.9.19
True
24-10-19 18:01:19
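Once the job ends, Slurm sends the END (or FAIL) notification E-mail. If Slurm accounting is enabled on the cluster (an assumption, not confirmed in this post), the job’s final state can also be checked on the command line with “sacct”, for example:
sacct -j JOBID
where JOBID is the job ID printed by sbatch at submission time.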