[HPC 101] Job Debugging: Why Did My Job Fail?
In the real world, hitting “Submit” is just the beginning.
Welcome to the finale of the HPC 101 series.
So far, we have covered the essentials: Logging in, Moving Data, and Managing Environments. Finally, you submitted your job.
But sometimes, things go wrong.
- Your job stays “Pending” forever.
- It crashes 2 seconds after starting.
- It runs for 3 days but produces empty files.
Today, we will learn the “Survival Skills” for HPC. We will cover how to debug failed jobs, how to check your resource efficiency, and why you are stuck in the queue.
Table of Contents
- 1. In-depth Monitoring (scontrol)
- 2. The Emergency Button (scancel)
- 3. The Detective Work (sacct & Logs)
- 4. Resource Efficiency (seff)
- 5. Why is my job pending? (Fairshare)
- 6. Summary & Cheatsheet
> 1. In-depth Monitoring (scontrol)
You submitted a job. You type squeue --me. It says PD (Pending). OK, but after 10 minutes, it’s still pending. Or maybe it’s running, but you don’t know where.
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 cpu bash user123 PD 0:00 1 (Priority)
12346 gpu bash user123 PD 0:00 1 (Resources)
squeue gives you a quick summary, but sometimes you need the Full Report.
Use the command scontrol show job <JOBID>.
$ scontrol show job 12345
JobId=12345 JobName=bash
UserId=user123(123456) GroupId=users(1000)
...
JobState=PENDING Reason=Priority
...
StartTime=2026-01-25T21:06:11 EndTime=Unknown
NodeList=(null)
WorkDir=/home/user123/my_project
Command=/bin/bash
...
Key fields to look for:
- JobState & Reason: Tells you exactly why it is waiting (e.g., Resources, Priority).
- StartTime: The scheduler’s estimated start time. (Note: This can change if higher-priority jobs enter the queue.)
- NodeList: If running, this shows which specific compute node you are using (e.g., compute-node-01).
- WorkDir: Confirms where your script is running and where output files will be saved.
Linux Tip: What is grep?
The output of scontrol is very long. We can filter it using a pipe | and grep.
- | (Pipe): Takes the output of the left command and passes it to the right command.
- grep: Think of it as “Ctrl + F” for the terminal. It prints only the lines containing your keyword.
# Show me ONLY the StartTime line
$ scontrol show job 12345 | grep StartTime
StartTime=2026-01-25T22:00:00 EndTime=2026-01-25T23:00:00
> 2. The Emergency Button (scancel)
Oops! You just realized you requested 100 nodes instead of 1 node. Or maybe your code is stuck in an infinite loop.
Don’t just let it fail. Kill it immediately.
# Cancel a specific job
$ scancel 12345
# Cancel ALL jobs by user
$ scancel -u user123
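Two more targeting options worth knowing (a sketch; my_big_job is a made-up job name, and --state / --name are standard scancel filters):
# Cancel only your PENDING jobs, leave running ones alone
$ scancel --state=PENDING -u user123
# Cancel every job with a specific name
$ scancel --name=my_big_job -u user123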
If your cluster uses PBS/Torque instead of Slurm, the equivalents are:
# Cancel a specific job
$ qdel 12345
# Cancel ALL jobs by user (depends on the system; usually a manual loop or a site-specific command)
$ qselect -u user123 | xargs qdel
> 3. The Detective Work (sacct & Logs)
You came back from coffee, and your job is gone from the queue. Did it finish? Or did it fail?
Since it is not in the queue (squeue), we need to check the History.
Step 1: Check the State (sacct)
The command is sacct (Slurm Accounting). By default, the output is messy, so we use format options.
$ sacct -j 12345 --format=JobID,State,AllocCPUS,ReqMem,MaxRSS,Elapsed,ExitCode
JobID State AllocCPUS ReqMem MaxRSS Elapsed ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12345 FAILED 1 2G 00:00:11 137:0
12345.batch FAILED 1 00:00:11 137:0
Common States:
- COMPLETED: Success! (Exit Code 0:0)
- CANCELLED: The job was killed (e.g., with scancel).
- TIMEOUT: The job ran longer than the requested --time.
- FAILED: The code crashed (non-zero exit code).
In the example above, ExitCode 137:0 means the process exited with status 137 = 128 + 9, i.e., it was killed with signal 9 (SIGKILL). That usually points to the out-of-memory killer.
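If you have lost track of the job ID, you can also list everything you ran recently instead of querying one job (a sketch; replace the date with your own):
# All of your jobs since a given date
$ sacct --starttime=2026-01-25 --format=JobID,JobName,State,Elapsed,ExitCode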
Step 2: Read the Logs
sacct tells you what happened, but not why. To find the “why”, look at the output file you defined in your script (e.g., #SBATCH -o result.out).
# Look at the END of the file first
$ tail -n 20 result.out
Common Error Messages:
- command not found: Did you module load?
- ModuleNotFoundError: Did you conda activate or install the package?
- killed / oom-kill: You ran out of memory.
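Two plain-Linux tricks that help here (nothing Slurm-specific; result.out is the log file from the example above):
# Follow the log live while the job is still running
$ tail -f result.out
# Search the whole log for the usual suspects (case-insensitive)
$ grep -iE "error|killed|not found" result.out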
Step 3: Get Notified (Pro Tip)
Jobs often fail when you are not watching. Let Slurm email you. Add this to your job script:
#SBATCH --mail-type=FAIL,END
#SBATCH [email protected]
- FAIL: Notify only when it crashes.
- END: Notify when it finishes (success or failure).
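For context, here is roughly where those lines sit in a full job script (a minimal sketch; the job name, partition, time limit, and Python commands are placeholders, not part of this series):
#!/bin/bash
#SBATCH --job-name=my_job            # placeholder name
#SBATCH --partition=cpu              # adjust to your cluster
#SBATCH --time=01:00:00
#SBATCH -o result.out                # the log file we read with tail
#SBATCH --mail-type=FAIL,END         # email on crash or completion
#SBATCH [email protected]

module load python                   # placeholder; load whatever your job needs
python my_script.py                  # placeholder script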
> 4. Resource Efficiency (seff)
This is the most important part for becoming a “Power User”.
Imagine you reserved a banquet table for 40 people, but you ate dinner alone. The restaurant manager (Scheduler) would be angry.
In HPC, this happens when you request --cpus-per-task=40 but your python script only uses 1 core.
How do you check your efficiency? Use seff.
$ seff 12345
Job ID: 12345
Cluster: cluster
User/Group: user123/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 00:01:14
CPU Efficiency: 10.23% of 00:01:22 core-walltime
Job Wall-clock time: 00:01:22
Memory Utilized: 12.09 MB
Memory Efficiency: 0.15% of 8.00 GB (8.00 GB/node)
Note: Some clusters may not have seff enabled. In that case, use sacct with AveCPU, MaxRSS.
$ sacct -j 12345 --format=JobID,State,AveCPU,MaxRSS
JobID State AveCPU MaxRSS
------------ ---------- ---------- ----------
12345 COMPLETED
12345.batch COMPLETED 00:01:14 12384K
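By default MaxRSS is printed in raw kilobytes, which is hard to read at a glance. Reasonably recent Slurm versions let sacct convert the units for you (a sketch; check sacct --help if the flag is not available on your cluster):
# Report memory in gigabytes instead of raw kilobytes
$ sacct -j 12345 --units=G --format=JobID,State,AveCPU,MaxRSS,ReqMem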
How to interpret the output:
- CPU Efficiency:
  - Bad (< 50%): You requested too many cores. If your code is not parallelized, request only 1 core.
  - Good (~ 90%): You are utilizing resources well.
- Memory Efficiency:
  - Bad (< 10%): You requested too much RAM. Reduce --mem next time.
  - Dangerous (> 95%): You are on the edge of crashing (OOM). Increase --mem slightly (e.g., by 20%).
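Applied to the seff output above (8 cores and 8 GB requested, but only ~10% CPU and ~12 MB of memory used), a right-sized request might look like this (a sketch; exact values depend on your workload, and it is wise to keep some memory headroom):
# Before: over-provisioned
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
# After: matched to actual usage, with headroom on memory
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G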
Why does this matter? Smaller jobs fit into “gaps” in the cluster easier. By requesting only what you need, your jobs will start faster!
> 5. Why is my job pending? (Fairshare)
Sometimes, your job stays in PD (Pending) state with reason Priority or Resources, even though there seem to be empty nodes.
This is likely due to Fairshare. Think of it as a “Karma System”.
- The cluster is a shared resource.
- If you ran thousands of heavy jobs last week, your “Karma” goes down. You wait in line.
- If you haven’t used the cluster for a while, your “Karma” is high. You jump the queue.
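If you are curious about your own standing, Slurm ships two commands for this (a sketch; some sites restrict them, and user123 is the example user from above): sshare reports your fairshare factor, and sprio breaks a pending job’s priority into its components.
# Your fairshare factor (closer to 1.0 means higher priority)
$ sshare -u user123
# Breakdown of the priority of a pending job
$ sprio -j 12345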
Checking the Reason Explicitly
Instead of guessing, you can ask Slurm exactly why you are waiting:
$ squeue -j 12345 -o "%.18i %.9T %.30R"
JOBID STATE NODELIST(REASON)
12345 PENDING (Priority)
This reveals the specific REASON code:
- Priority: Just wait. It’s Fairshare logic.
- Resources: The cluster is busy, or you requested a specific node that is busy.
- QOSMaxJobsLimit: You hit the limit of allowed running jobs.
- Dependency: It’s waiting for another job to finish.
Don’t panic. Usually, you just need to wait.
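If you want a rough idea of when a pending job might start, you can also ask the scheduler directly (a sketch; the estimate changes as other jobs come and go):
# Estimated start time for a pending job
$ squeue --start -j 12345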
> 6. Summary & Cheatsheet
Debugging Mindset (Read This Once)
If a job fails, always ask these questions in order:
- Did it start? (squeue, scontrol) -> If not, check your script syntax.
- Did it finish or crash? (sacct) -> Check the State.
- Why did it crash? (logs) -> Read the .out / .err log files.
- Did I request the right resources? (seff) -> Check memory usage.
- Can I make it smaller? -> Smaller jobs run faster.
Congratulations! You have officially graduated from HPC 101. You are no longer just a guest; you are a resident of the cluster.
| Goal | Command |
|---|---|
| Check Details | scontrol show job <JOBID> |
| Kill Job | scancel <JOBID> |
| Check History | sacct -j <JOBID> |
| Check Efficiency | seff <JOBID> |
What’s Next? In the next series, we will change gears completely. We will stop being a “User” and start thinking like an “Engineer”. I will start a new series on How to Build an HPC Cluster from scratch.
See you in the next series!