Slurm: Important Commands
Start time of job or resources : squeue --start¶
The command displays the estimated start time of a job. The estimate is based on several kinds of analysis: historical usage, the earliest available reservable resources, and the priority-based backlog. The command squeue is explained in detail on the squeue-manpage.
Access¶
By default, this command can be run by any user.
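A minimal sketch of how the command might be called. The helper name start_estimate is hypothetical and not part of Slurm; --start, --user and --jobs are standard squeue options.

```shell
# Hypothetical helper: show estimated start times of pending jobs.
# Without an argument it queries all of your jobs, with a job ID only that job.
start_estimate() {
    if [ "$#" -eq 0 ]; then
        squeue --start --user "${USER:-$(id -un)}"
    else
        squeue --start --jobs "$1"
    fi
}

# Usage on a login node (not executed here):
#   start_estimate
#   start_estimate <jobid>
```

The reported start time is a forecast and can change as other jobs finish early or new jobs are submitted.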
List of your submitted jobs : squeue¶
squeue displays information about active, pending and/or recently completed jobs. By default it lists all of your own active and pending jobs. The command squeue is explained in detail on the squeue-manpage.
Access¶
By default, this command can be run by any user.
Flags¶
Flag | Description |
---|---|
-l, --long | Report more of the available information for the selected jobs or job steps, subject to any constraints specified. |
Examples¶
squeue
Example on HoreKa (only your own jobs are displayed):
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1760728 normal sh vu1498 R 33:43 1 fh2n1274
1760673 normal sh vu1498 R 2:56:48 1 fh2n0501
1760585 normal sh vu1498 R 5:37:21 1 fh2n1402
1760583 normal sh vu1498 R 5:37:38 1 fh2n1396
1760065 normal fh2_400 vu1498 R 11:28:57 20 fh2n[0481-0483,1433-1448,1451]
1756096 normal fh2_256R vu1498 R 12:16:45 16 fh2n[0217-0232]
1759242 normal fh2_256R vu1498 R 1-07:37:46 16 fh2n[0145,0149-0153,0155-0156,0165,0209-0215]
1756685 normal fh2_64R vu1498 R 1-16:01:52 4 fh2n[1518-1520,1526]
1756683 normal fh2_64R vu1498 R 1-16:02:18 4 fh2n[1466,1469-1470,1479]
$ squeue -l
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1760728 normal sh vu1498 RUNNING 34:25 12:00:00 1 fh2n1274
1760673 normal sh vu1498 RUNNING 2:57:30 12:00:00 1 fh2n0501
1760585 normal sh vu1498 RUNNING 5:38:03 12:00:00 1 fh2n1402
1760583 normal sh vu1498 RUNNING 5:38:20 12:00:00 1 fh2n1396
1760065 normal fh2_400 vu1498 RUNNING 11:29:39 2-00:00:00 20 fh2n[0481-0483,1433-1448,1451]
1756096 normal fh2_256R vu1498 RUNNING 12:17:27 2-00:00:00 16 fh2n[0217-0232]
1759242 normal fh2_256R vu1498 RUNNING 1-07:38:28 2-00:00:00 16 fh2n[0145,0149-0153,0155-0156,0165,0209-0215]
1756685 normal fh2_64R vu1498 RUNNING 1-16:02:34 2-00:00:00 4 fh2n[1518-1520,1526]
1756683 normal fh2_64R vu1498 RUNNING 1-16:03:00 2-00:00:00 4 fh2n[1466,1469-1470,1479]
- The output of squeue shows how many of your jobs are running or pending and how many nodes your jobs are using.
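The counts described above can also be computed from the squeue output with awk. This is a sketch only: the helper name squeue_summary is hypothetical, and it assumes the default column layout shown in the example (state in column 5, node count in column 7) and job names without spaces.

```shell
# Hypothetical helper: summarise default-format squeue output read from stdin.
# Assumes column 5 is the state (ST) and column 7 the node count (NODES),
# and that job names contain no spaces.
squeue_summary() {
    awk 'NR > 1 && $5 == "R"  { running++; nodes += $7 }
         NR > 1 && $5 == "PD" { pending++ }
         END { printf "running=%d pending=%d nodes_in_use=%d\n", running, pending, nodes }'
}

# Demo with three lines copied from the example above.
# On the cluster you would run:  squeue | squeue_summary
squeue_summary <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1760728 normal sh vu1498 R 33:43 1 fh2n1274
1760065 normal fh2_400 vu1498 R 11:28:57 20 fh2n[0481-0483,1433-1448,1451]
1756685 normal fh2_64R vu1498 R 1-16:01:52 4 fh2n[1518-1520,1526]
EOF
```

For the three sample lines this prints running=3 pending=0 nodes_in_use=25. Only running jobs contribute to the node total, since pending jobs do not occupy nodes yet.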
Shows free resources : sinfo_t_idle¶
The Slurm command sinfo is used to view partition and node information for a system running Slurm. It incorporates down time, reservations, and node state information in determining the available backfill window. The sinfo command can only be used by the administrator.
SCC has prepared a special script (sinfo_t_idle) to find out how many processors are available for immediate use on the system. It is anticipated that users will use this information to submit jobs that meet these criteria and thus obtain quick job turnaround times.
Access¶
By default, this command can be used by any user (sinfo can only be used by the administrator).
Example¶
The following command displays the resources that are available for immediate use in each partition.
$ sinfo_t_idle
Partition develop : 2 nodes idle
Partition normal : 0 nodes idle
Partition long : 0 nodes idle
Partition xnodes : 0 nodes idle
Partition visu : 9 nodes idle
In the example above, a job requesting 1 node in the partition visu can start immediately.
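To pick a partition for a job of a given size, the sinfo_t_idle output can be filtered with awk. This is a sketch under the assumption that the output has exactly the whitespace-separated layout shown above; the helper name idle_partitions is hypothetical.

```shell
# Hypothetical helper: print partitions with at least MIN idle nodes (default 1).
# Reads lines of the form "Partition <name> : <n> nodes idle" on stdin.
idle_partitions() {
    awk -v min="${1:-1}" '$1 == "Partition" && $4 + 0 >= min { print $2, $4 }'
}

# Demo with the sample output from above.
# On the cluster you would run:  sinfo_t_idle | idle_partitions 2
idle_partitions 2 <<'EOF'
Partition develop : 2 nodes idle
Partition normal : 0 nodes idle
Partition long : 0 nodes idle
Partition xnodes : 0 nodes idle
Partition visu : 9 nodes idle
EOF
```

With a minimum of 2 nodes, the sample yields the lines "develop 2" and "visu 9".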
Detailed job information : scontrol show job¶
scontrol show job displays detailed job state information and diagnostic output for all of your jobs or for one specified job. Detailed information is available for active, pending and recently completed jobs. The command scontrol is explained in detail on the scontrol-manpage.
Display the state of all your jobs in normal mode: scontrol show job
Display the state of a job with <jobid> in normal mode: scontrol show job <jobid>
Access¶
End users can use scontrol show job to view the status of their own jobs only.
Arguments¶
Option | Example |
---|---|
-d | Display the state of the job with jobid 8370992 in detailed mode: scontrol -d show job 8370992 |
Example for scontrol show job¶
Here is an example from HoreKa.
$ squeue # show my own jobs (here the userid is replaced!)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1760162 normal 15grad ab1234 R 11:37:33 28 fh2n[0485-0487,0537-0544,1481-1488,1553-1558,1561-1562,1568]
$
$ # now, see what's up with my job with jobid 1760162
$
$ scontrol show job 1760162
JobId=1760162 JobName=15grad
UserId=ab1234(17105) GroupId=fh2-project-cxyz(500411) MCS_label=N/A
Priority=1 Nice=0 Account=fh2-project-hivxyz QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=11:39:27 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2020-04-13T22:14:11 EligibleTime=2020-04-13T22:14:11
AccrueTime=2020-04-13T22:14:11
StartTime=2020-04-14T04:17:01 EndTime=2020-04-17T04:17:01 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2020-04-14T04:17:01
Partition=normal AllocNode:Sid=fh2n1992:530
ReqNodeList=(null) ExcNodeList=(null)
NodeList=fh2n[0485-0487,0537-0544,1481-1488,1553-1558,1561-1562,1568]
BatchHost=fh2n0485
NumNodes=28 NumCPUs=560 NumTasks=560 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=560,mem=875G,node=28,billing=560
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=32000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/work5/fh2-project-cxyz/ab1234/15grad/runVOF_Fh1_collated.sh
WorkDir=/pfs/work5/fh2-project-cxyz/ab1234/15grad
StdErr=/pfs/work5/fh2-project-cxyz/ab1234/15grad/slurm-1760162.out
StdIn=/dev/null
StdOut=/pfs/work5/fh2-project-cxyz/ab1234/15grad/slurm-1760162.out
Power=
You can use standard Linux pipe commands to filter the very detailed scontrol show job output.
- Is the job already running?
$ scontrol show job 1760162 | grep -i state
JobState=RUNNING Reason=None Dependency=(null)
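When several fields are of interest, splitting the output into one KEY=VALUE pair per line makes filtering easier. The following is a sketch only: job_fields is a hypothetical helper, and it assumes that the values of the requested fields contain no spaces.

```shell
# Hypothetical helper: print selected KEY=VALUE fields from
# `scontrol show job` output read from stdin.
job_fields() {
    # Join the requested keys into a regex alternation, e.g. "JobState|RunTime".
    keys=$(IFS='|'; printf '%s' "$*")
    # Put every space-separated token on its own line, then keep matching keys.
    tr -s ' ' '\n' | grep -E "^(${keys})="
}

# Demo with two lines copied from the example above.
# On the cluster you would run:
#   scontrol show job <jobid> | job_fields JobState RunTime TimeLimit
job_fields JobState RunTime TimeLimit <<'EOF'
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=11:39:27 TimeLimit=3-00:00:00 TimeMin=N/A
EOF
```

For the sample lines this prints JobState=RUNNING, RunTime=11:39:27 and TimeLimit=3-00:00:00, each on its own line.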
Cancel Slurm Jobs¶
The scancel command is used to cancel jobs. The command scancel is explained in detail on the scancel-manpage.
Canceling own jobs : scancel¶
scancel is used to signal or cancel jobs, job arrays or job steps. The syntax is:
$ scancel [-i] <job-id>
$ scancel -t <job_state_name>
Flag | Description | Example |
---|---|---|
-i, --interactive | Interactive mode | Cancel the job 987654 interactively. scancel -i 987654 |
-t, --state | Restrict the scancel operation to jobs in a certain state. "job_state_name" may have a value of either "PENDING", "RUNNING" or "SUSPENDED". | Cancel all jobs in state "PENDING". scancel -t "PENDING" |
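Because scancel -t acts on every matching job at once, it can help to preview the affected job IDs first. The sketch below only prints the scancel commands instead of running them. The helper name preview_cancel is hypothetical, and it assumes the default squeue column layout with the short state code (for example PD or R) in column 5.

```shell
# Hypothetical helper: read default-format squeue output on stdin and print
# (not run) one scancel command per job whose ST column equals $1 (default PD).
preview_cancel() {
    awk -v st="${1:-PD}" 'NR > 1 && $5 == st { print "scancel", $1 }'
}

# Demo with two lines copied from the squeue example on this page.
# On the cluster you would run:  squeue | preview_cancel PD
preview_cancel R <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1760728 normal sh vu1498 R 33:43 1 fh2n1274
1760673 normal sh vu1498 R 2:56:48 1 fh2n0501
EOF
```

For the sample lines this prints "scancel 1760728" and "scancel 1760673". Once the list looks right, the printed commands can be run by hand, or scancel -t can be used directly.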