SLURM Topics

How do I submit jobs to the new SLURM compute cluster?

Slurm is the new scheduler on the NCF and the larger Odyssey cluster. For some basic information about it, see the Research Computing (RC) SLURM page; they also have a helpful FAQ.

There are three ways to submit things to the cluster: (1) via the command line (similar to bsub), (2) via a batch script (how RC wants you to do it), and (3) interactively, which is great for testing things out before running all of your subjects, or for anything that needs graphics while crunching numbers. You can submit a job from the command line either from a workstation (being phased out) or from a VNC session.

Which method should you choose? That depends on your situation. If you have a script that you built previously using bsub, then the one-line call, which is most similar to bsub, will probably be what you want. However, less information about the flags you used to submit your script is saved, and the command can get tricky when you want to include a lot of options. The batch script is useful when you want to do a few things in addition to running your computationally intensive script, like change directories or make a directory, and then run your script. It is essentially a bash script, and can be very powerful. The interactive session is great for testing things out, or for computations that also require graphics. For instance, if you want to make sure your SPM script works, you can launch MATLAB and run it, but it runs on the compute cluster instead of a workstation, so you are sure the environment is the same. Hopefully our recent upgrade has alerted people to the issue of keeping things as consistent as possible. If you just want to do graphics, i.e., look at your data without computing, you should use VNC and vglrun.

Slurm Flags: First, I will go over some of the basic flags/variables you will have to set regardless of which way you decide to use the cluster.

-p the queue you want to use (ncf, ncf_interact, ncf_bigmem)

--mem is the max amount of memory reserved for your job, in MB. 4000 is probably more than most jobs will take, but it is a good starting point. If you have a TR less than 2 seconds, or 1.5mm data, you will most likely need more memory than usual. See below for how to figure out how much your job actually took, so you can be more reasonable in future calls of the same or similar scripts. Your script will get killed if it exceeds the memory requested, and if you consistently over-request, your priority for submitting will be lowered. If you find you need a lot of memory (~50 gigs or more), you might need the bigmem queue.

-t is the max amount of time your script will be allowed to run, in minutes. If it goes longer, it will be killed. If you are unsure, be generous here; overestimating time doesn't hurt your priority. See below for how to tell how long it actually took.

If your job will take a long time, you can use the format: D-HH:MM, so if you wanted to request 4 days, 2 hours and 15 min, it would be:

-t 4-02:15.

Alternatively, you can provide one number that represents the minutes, so if you wanted to allow 2 hours, it would be

-t 120.

-o specifies the output file, where things that would normally be written to the screen go. The %j will be replaced by the job number. If you don't include %j in the output file name, output can be lost or overwritten, which makes checking how much memory and time your job took difficult.

-o /ncf/mri/01/users/mcmains/myscript_%j_output.out

If you don't specify this, an output file will automatically get created in the directory it was run from, called:

slurm-jobnumber.out

Where jobnumber is the number your job received.

To see the progress of your script, you can 'more' the .out file:

more slurm-jobnumber.out

--mail-type Include this if you want SLURM to send you an email when your job is done. The email will have the jobid in the subject line.

--mail-type=END

This unfortunately won't contain the output of your script, but it will tell you if it completed successfully or failed, at least in the execution of your script. It won't, for instance, know if fcfast failed for some reason, but if you have an error in your homemade batch script, it will come back as FAILED. Here is an example email subject line: SLURM Job_id=42754682 Name=sbatch Ended, Run time 00:00:06, FAILED, ExitCode 1
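
Putting the flags above together, a full one-line submission might look like this (the partition, paths, and script name are placeholders to adapt to your own setup):

sbatch -p ncf --mem=4000 -t 120 -o /ncf/mylab/myspace/myjob_%j.out --mail-type=END --wrap="/ncf/mylab/myspace/myscript.sh"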

Submitting via the command line

From a workstation or login node you can submit a job via the sbatch command using the --wrap flag. An example is:

sbatch -p ncf --mem=1024 -t 240 -o /ncf/mylab/mysubjects/outfiles/reconall_mysubj1_%j --wrap="recon-all -all -subjid mysubj1"

--wrap takes the executable script you want to run, followed by any flags it takes, all in double quotes.

If the script you want to submit is not in your path, such as when you make your own script to run, you need to make sure you give the full pathname to the script and that it is executable (chmod u+x scriptname). If it is a standard script like procfast or recon-all, this is not necessary.
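
For example, to submit a homemade script by its full path (the path and script name here are made up):

chmod u+x /ncf/mylab/myscripts/preprocess.sh
sbatch -p ncf --mem=2000 -t 60 -o preprocess_%j.out --wrap="/ncf/mylab/myscripts/preprocess.sh"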

Submitting via a batch script

For this method, you will create a file, say via gedit (gedit my_first_script.sh), that contains the flags discussed above. This is a bash script (our cluster uses bash, as opposed to tcsh or csh). Therefore, the first line of your file will always be the line below, which specifies it is a bash script. The next several lines specify the flags talked about above, and the last lines are what you want to run. This example generates (via $RANDOM) a bunch of random numbers, puts them in a file, and then sorts them. You could also simply call recon-all, by putting recon-all -all -subjid mysubj1 after the last #SBATCH line.


 

#!/bin/bash
#
#SBATCH -p ncf # partition (queue)
#SBATCH --mem 100 # memory
#SBATCH -t 0-2:00 # time (D-HH:MM)
#SBATCH -o myscript_%j_output.out # STDOUT
#SBATCH --mail-type=END # notifications for job done

for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done

sort SomeRandomNumbers.txt


To run your script, type the line below from the directory where the script is located, or specify the full path to it.

sbatch my_first_script.sh
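
sbatch confirms the submission by printing the job number, something like Submitted batch job 4283527 (the number here is made up; yours will differ). You can then follow the job's progress in the output file named by the -o line:

more myscript_4283527_output.out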

 

Running interactively

This is launched via the command srun. For example:

srun -p ncf_interact --pty --x11=first --mem 4000 -t 0-06:00 /bin/bash

This will launch a command line shell on the interactive queue with 4,000MB of RAM for 6 hours. When it starts, you will notice the prompt changes from your_user_name@ncf_something to @ncf_something_else. From here, you can test scripts, or launch Matlab.

Importantly, the interactive session assumes you want to be interacting with it. So if you go more than an hour without any kind of input, it will assume you have left the session and will terminate it.
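
For example, to check that the recon-all call from earlier behaves as expected before submitting it for every subject (the subject ID is a placeholder):

srun -p ncf_interact --pty --x11=first --mem 4000 -t 0-02:00 /bin/bash
recon-all -all -subjid mysubj1
exit

The exit at the end closes the shell and releases the interactive slot for others.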

Useful commands for interacting with the slurm cluster (e.g., how to cancel a job)

Research computing has an extensive page discussing useful slurm commands. I have provided some of the most commonly needed ones below.

sacct

This will show you all of your recent jobs: running, pending, and completed. This can return a lot of output if you are running or have run many jobs. If you want to get information about a particular job:

sacct -j jobnumber

To cancel a job:

scancel jobnumber

To cancel all jobs you have running:

scancel -u your_user_name

To cancel all jobs you have running on a particular queue:

scancel -u your_user_name -p queue_name

To cancel all pending jobs:

scancel -t PENDING -u your_user_name

How do I run my matlab script via sbatch?

This is only complicated because of all the quotes. Here is an example script that takes four inputs, two strings followed by two numbers.

sbatch -p ncf -n 1 -t 04-15:01 --mem=2000 --wrap="matlab -nodisplay -nosplash -nojvm -r $'myscript(\'${subject}\',\'test\',9,0);exit'"

Generally, everything following the -r gets put in quotes. Since the --wrap command is already in quotes, we enter quote unhappiness. The first thing that will come after the -r is a $', followed by your script name. If you need to use quotes in your function call, such as around strings, you need to put a \ before the single quote. The whole thing then ends with a single quote followed by a double quote. In this example, the first input passed is a bash variable that is set somewhere else to be a string, hence the quotes, followed by a string (test) and two numbers (9, 0). The exit makes sure that matlab closes after it finishes.
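
If it helps to see the pieces, here is the same call broken apart (myscript and the subject value are hypothetical; the backslash just continues the command on a second line):

subject=subj01 # bash variable, expanded at submit time inside the double quotes
# the string matlab eventually receives after -r is: myscript('subj01','test',9,0);exit
# the $'...' quoting is what turns each \' into a literal single quote
sbatch -p ncf -n 1 -t 04-15:01 --mem=2000 \
    --wrap="matlab -nodisplay -nosplash -nojvm -r $'myscript(\'${subject}\',\'test\',9,0);exit'"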

How do I figure out how much memory and time my script took?

This is useful to know so that you can request the appropriate amount of time and memory when you run your script, or something similar, again. You want to be as accurate as possible so that resources can be fairly spread across multiple jobs. In addition, if the cluster is being heavily used and you request a lot of memory, it may take a while for the requested memory to become available; your script will stay pending until it is.

sacct -j jobnumber --format=MaxRSS,elapsed,reqmem,timelimit

This will return something that looks like:

    MaxRSS    Elapsed     ReqMem  Timelimit
---------- ---------- ---------- ----------
             00:01:40
     8820K   00:01:45     1024Mn   02:00:00
This will usually return two lines; you want to use the second one. So if this was your script, it took 1 min and 45 seconds and 8.82 MB of memory. When you ran it, you had requested 2 hours and 1024MB: way too long and too much memory. If you were to run it again, you might use a command that requests slightly more memory (~20%) and time than it needed, which would look like:

sbatch -p ncf --mem=11 -t 4 --wrap="/ncf/mylab/myspace/myscript.sh"

 

My job needs A LOT of memory!!

The max amount of memory you can request on the regular cluster is about 250 gigs. However, it might take a long time for your job to start because this is a substantial portion of the available memory on the cluster. If you find yourself needing a lot of memory (> 30 gigs) you might consider using the big memory node, which has 3TB of available memory.

sbatch -p ncf_bigmem --mem=30000 -t 0-02:00 -o ./myoutput_%j --wrap="/ncf/mylab/myspace/myscript.sh"

The key is using the ncf_bigmem queue. To submit a job to this queue you must request at least 30 gigs. Also, please keep in mind that this is not the exact same hardware as the regular compute cluster, so the numbers you get back might be slightly different than the ones you would get if you ran it on the regular cluster. Make sure not to, say, run all your controls here and all your patients on the regular compute nodes. That being said, if you need it for one subject but don't need 50 gigs for all subjects, you could request 50 for the others anyway just so they end up on this node. Keep in mind that repeated over-requesting will hurt your priority and might land you on the monthly 'bad' list, which will result in very friendly people from RC contacting you to make sure you know what you are doing.

Troubleshooting and common slurm errors.

A variety of problems can arise when running jobs on the NCF. Many are related to resource mis-allocation, but there are other common problems as well. Even if your script seems to have finished successfully, you should look at the last line of the output file to make sure it wasn't killed by the job handler.

tail myscript_71827678.out

This will show you the last 10 lines of the output file (tail's default). You don't want the last line to say something like:

slurmstepd: error: Exceeded step memory limit at some point.

Error: JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT
Likely cause: You did not specify enough time in your batch submission script. The -t option sets time in minutes, or can also take D-HH:MM form (0-12:30 for 12.5 hours).

Error: Job <jobid> exceeded <mem> memory limit, being killed
Likely cause: Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced.

Error: slurm_receive_msg: Socket timed out on send/recv operation
Likely cause: This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. Try waiting a bit and resubmitting. If the problem persists, email RC (rchelp [at] fas [dot] harvard [dot] edu).

Error: JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE
Likely cause: This message may arise for a variety of reasons, but it indicates that the host on which your job was running can no longer be contacted by SLURM.

 

What is my priority?

To see your priority score:

sshare -U

Your priority score is the last number that comes up (the FairShare column). Larger is better. Generally a priority score above .5 is considered good, below .5 is bad. This score is currently based on the combined usage of all the members of your group.
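
The output looks roughly like this (the account, user, and numbers are made up; the FairShare column at the far right is the score in question):

             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
mylab                   mcmains          1    0.000800      123456      0.000500   0.612300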

Why is my job pending?

To see your pending jobs, you can type:

squeue

This should return something like:

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
73271197 ncf        myscript  mcmains  R   0:30      1  ncfc22

If your job is pending, ST will be PD and the reason will usually be Resources or Priority. If it is Resources, it means there aren't enough free nodes/cores or enough memory to run your job. If it is Priority, it means there are people above you in line. See the priority section above for more.
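
A pending job would look something like this (the job number is made up):

73271198 ncf        myscript  mcmains  PD  0:00      1  (Priority)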

To see how many people are ahead of you in the queue:

showq-slurm -p ncf -o

Jobs listed at the top are next in line.

How do I submit a job that uses more than 1 core (runs in parallel)?

If you have a script that can take advantage of multiple cores, you can request them via sbatch. There are several important flags. Keep in mind that requesting more than 1 core only helps you if your script utilizes some kind of parallelization.

-n the number of compute cores you want.   8 is the polite max.

-N 1 this requests that the cores are all on one node. Only change this to >1 if you know your code uses a message passing protocol like MPI. SLURM makes no assumptions on this parameter: if you request more than one core (-n > 1) and you forget this parameter, your job may be scheduled across multiple nodes, which you don't want.

--mem When requesting multiple cores, this is the amount of memory shared by all your cores. If your cores are spread out over multiple nodes (using something like MPI), you want to use the flag --mem-per-cpu, which requests memory for each core. An example multi-core request is sketched below.
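
Here is a sketch of a single-node, 8-core request (the script path is a placeholder, and your script must actually run in parallel to benefit):

sbatch -p ncf -n 8 -N 1 --mem=8000 -t 120 -o parallel_%j.out --wrap="/ncf/mylab/myspace/my_parallel_script.sh"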

How do I submit a script for a bunch of subjects - loop over subjects?

This partly depends on how you are submitting your script: via the --wrap flag, or via a batch script. I will cover both here. Regardless of how you do it, you want to practice good etiquette: don't submit a bunch of jobs that run for less than 5 min, and pause between submitting each script.

Via the --wrap flag

You want to create a script, i.e., a text file that will contain your loop, using a text editor like gedit. This can be in any language you want; here I will demonstrate it with bash.


#!/bin/bash
# set your subjects
subjects=(150101_subj1 150102_subj2 150102_subj3)
# loop over your subjects
for subj in ${subjects[*]}; do
    echo $subj
    sbatch -p ncf -t 2-0:00 --mem=1024 -o ${subj}_%j.out --wrap="recon-all -subjid ${subj} -all"
    sleep 1 # pause to be kind to the scheduler
done


 

You would then want to make this executable (chmod u+x my_script.sh) and run it from the command line:
./my_script.sh

Via a batch script

This requires creating two scripts: one similar to what we showed above, that loops over your subjects, and one that contains the sbatch flags (your batch script), which is like what is described above under the heading Submitting via a batch script. First, let's make our script to loop over subjects, my_script.sh:


#!/bin/bash
# set your subjects
subjects=(150101_subj1 150102_subj2 150102_subj3)
# loop over your subjects
for subj in ${subjects[*]}; do
    echo $subj
    # you can follow your batch script call with any number of inputs it needs;
    # in this case we are passing it one, the subject ID
    sbatch -o ${subj}_%j.out mybatch_script.sh ${subj}
    sleep 1 # pause to be kind to the scheduler
done


 

Now we can write our batch script:


#!/bin/bash
#SBATCH -p ncf # partition (queue)
#SBATCH --mem 1024 # memory
#SBATCH -t 2-0:00 # time (D-HH:MM)
recon-all -subjid ${1} -all

This will take an argument (the subject ID). As it is a bash script, it will automatically parse the inputs you give it when it is called and place them in variables ($1, $2, $3, ...), reflecting the order in which they followed the script name. In this case, $1 gets assigned the subject ID. After making sure your loop script is executable (chmod u+x my_script.sh), you can run it:
./my_script.sh
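
If your batch script needs more than one input, each additional argument simply becomes the next numbered variable. A minimal sketch (the script contents are illustrative only):

#!/bin/bash
#SBATCH -p ncf # partition (queue)
#SBATCH --mem 100 # memory
#SBATCH -t 0-00:10 # time (D-HH:MM)
# called as: sbatch mybatch_script.sh subj01 2
# so ${1} is subj01 and ${2} is 2
echo "subject: ${1}, run: ${2}"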