[HPC 101] Job Debugging: Why Did My Job Fail?


In the real world, hitting “Submit” is just the beginning.

Welcome to the finale of the HPC 101 series.

So far, we have covered the essentials: Logging in, Moving Data, and Managing Environments. Finally, you submitted your job.

But sometimes, things go wrong.

  • Your job stays “Pending” forever.
  • It crashes 2 seconds after starting.
  • It runs for 3 days but produces empty files.

Today, we will learn the “Survival Skills” for HPC. We will cover how to debug failed jobs, how to check your resource efficiency, and why you are stuck in the queue.



> 1. In-depth Monitoring (scontrol)

You submitted a job. You type squeue --me. It says PD (Pending). OK, but after 10 minutes, it’s still pending. Or maybe it’s running, but you don’t know where.

$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             12345       cpu     bash  user123 PD       0:00      1 (Priority)
             12346       gpu     bash  user123 PD       0:00      1 (Resources)

squeue gives you a quick summary, but sometimes you need the Full Report. Use the command scontrol show job <JOBID>.

$ scontrol show job 12345
JobId=12345 JobName=bash
   UserId=user123(123456) GroupId=users(1000)
   ...
   JobState=PENDING Reason=Priority
   ...
   StartTime=2026-01-25T21:06:11 EndTime=Unknown
   NodeList=(null)
   WorkDir=/home/user123/my_project
   Command=/bin/bash
   ...

Key fields to look for:

  1. JobState & Reason: Tells you exactly why it is waiting (e.g., Resources, Priority).
  2. StartTime: The scheduler’s estimated start time. (Note: This can change if higher priority jobs enter the queue).
  3. NodeList: If running, this shows which specific compute node you are using (e.g., compute-node-01).
  4. WorkDir: Confirms where your script is running and where output files will be saved.

Linux Tip: What is grep? The output of scontrol is very long. We can filter it using a pipe | and grep.

  • | (Pipe): Takes the output of the left command and passes it to the right command.
  • grep: Think of it as “Ctrl + F” for the terminal. It prints only the lines containing your keyword.
# Show me ONLY the StartTime line
$ scontrol show job 12345 | grep StartTime
StartTime=2026-01-25T22:00:00 EndTime=2026-01-25T23:00:00
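
If you want the scheduler’s current estimate for all of your pending jobs at once, newer Slurm versions also provide squeue --start (a small sketch; --me requires a recent Slurm, otherwise use -u <your_username>):

# Ask Slurm for its estimated start times of your pending jobs
$ squeue --me --start

# Note: the START_TIME column may show N/A if the backfill scheduler
# has not produced an estimate yet.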


> 2. The Emergency Button (scancel)

Oops! You just realized you requested 100 nodes instead of 1 node. Or maybe your code is stuck in an infinite loop.

Don’t just let it fail. Kill it immediately.

# --- Slurm ---
# Cancel a specific job
$ scancel 12345

# Cancel ALL of your jobs
$ scancel -u user123

# --- PBS/Torque equivalents ---
# Cancel a specific job
$ qdel 12345

# Cancel ALL of your jobs (depends on the system; often done by piping qselect into qdel)
$ qselect -u user123 | xargs qdel
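
Back on Slurm, a couple of extra scancel filters can keep you from killing the wrong thing (a quick sketch; the job name here is just a placeholder):

# Cancel only your PENDING jobs, leaving running ones alone
$ scancel -u user123 --state=PENDING

# Cancel by job name instead of job ID
$ scancel --name=my_experiment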


> 3. The Detective Work (sacct & Logs)

You came back from coffee, and your job is gone from the queue. Did it finish? Or did it fail? Since it is not in the queue (squeue), we need to check the History.

Step 1: Check the State (sacct)

The command is sacct (Slurm Accounting). By default, the output is messy, so we use format options.

$ sacct -j 12345 --format=JobID,State,AllocCPUS,ReqMem,MaxRSS,Elapsed,ExitCode
JobID             State  AllocCPUS     ReqMem     MaxRSS    Elapsed ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12345            FAILED          1         2G              00:00:11    137:0
12345.batch      FAILED          1                         00:00:11    137:0

Common States:

  • COMPLETED: Success! (Exit Code 0:0)
  • CANCELLED: The job was killed.
  • TIMEOUT: The job ran longer than the requested --time.
  • FAILED: The code crashed (Non-zero exit code).
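
If you no longer remember the job ID, sacct can also list everything you ran since a given point in time (a sketch; the date is just an example):

# List all of your jobs since a given date, not just one job ID
$ sacct --starttime=2026-01-25 --format=JobID,JobName,State,Elapsed,ExitCode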

Step 2: Read the Logs

sacct tells you what happened, but not why. To find the “why”, look at the output file you defined in your script (e.g., #SBATCH -o result.out).

# Look at the END of the file first
$ tail -n 20 result.out

Common Error Messages:

  • command not found: Did you module load?
  • ModuleNotFoundError: Did you conda activate or install the package?
  • killed / oom-kill: You ran out of memory.
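
To find these quickly in a long log, you can combine the grep trick from Section 1 with tail (a minimal sketch; result.out is the log file from the example above):

# Scan the whole log for the usual suspects (case-insensitive)
$ grep -iE "error|killed|not found|traceback" result.out

# Then inspect the last lines around the crash
$ tail -n 50 result.out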

Step 3: Get Notified (Pro Tip)

Jobs often fail when you are not watching. Let Slurm email you. Add this to your job script:

#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=your_email@example.com
  • FAIL: Notify only when it crashes.
  • END: Notify when it finishes (success or failure).
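
Putting it together, a minimal job script with notifications might look like this (a sketch; the job name, resources, module, and email address are placeholders for your own values):

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --output=result.out
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=your_email@example.com

module load python    # load whatever your code needs first
python my_script.py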


> 4. Resource Efficiency (seff)

This is the most important part for becoming a “Power User”.

Imagine you reserved a banquet table for 40 people, but you ate dinner alone. The restaurant manager (Scheduler) would be angry. In HPC, this happens when you request --cpus-per-task=40 but your Python script only uses 1 core.

How do you check your efficiency? Use seff.

$ seff 12345
Job ID: 12345
Cluster: cluster
User/Group: user123/users
State: COMPLETED (exit code 0)
Cores: 8
CPU Utilized: 00:01:14
CPU Efficiency: 11.28% of 00:10:56 core-walltime
Job Wall-clock time: 00:01:22
Memory Utilized: 12.09 MB
Memory Efficiency: 0.15% of 8.00 GB (8.00 GB/node)

Note: Some clusters may not have seff installed. In that case, use sacct with the AveCPU and MaxRSS fields.

$ sacct -j 12345 --format=JobID,State,AveCPU,MaxRSS
JobID             State     AveCPU     MaxRSS 
------------ ---------- ---------- ---------- 
12345         COMPLETED                       
12345.batch   COMPLETED   00:01:14     12384K

How to interpret the output:

  • CPU Efficiency:
      • Bad (< 50%): You requested too many cores. If your code is not parallelized, request only 1 core.
      • Good (~ 90%): You are utilizing resources well.

  • Memory Efficiency:
      • Bad (< 10%): You requested too much RAM. Reduce --mem next time.
      • Dangerous (> 95%): You are on the edge of crashing (OOM). Increase --mem slightly (e.g., by 20%).

Why does this matter? Smaller jobs fit into “gaps” in the cluster more easily. By requesting only what you need, your jobs will start faster!
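
For example, the seff report above (roughly one busy core and ~12 MB of RAM) suggests a much leaner request next time (a sketch; adjust the numbers to your own seff output):

# Before: 8 cores and 8 GB -> ~11% CPU and 0.15% memory efficiency
# After: request roughly what the job actually used, plus some headroom
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G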


> 5. Why is my job pending? (Fairshare)

Sometimes, your job stays in PD (Pending) state with reason Priority or Resources, even though there seem to be empty nodes.

This is likely due to Fairshare. Think of it as a “Karma System”.

  • The cluster is a shared resource.
  • If you ran thousands of heavy jobs last week, your “Karma” goes down. You wait in line.
  • If you haven’t used the cluster for a while, your “Karma” is high. You jump the queue.
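
On most Slurm clusters you can check your own “Karma” with sshare (a sketch; the exact columns vary slightly between Slurm versions):

# Show your fairshare standing: a FairShare value near 1.0 means high priority,
# near 0.0 means you have been using more than your share recently
$ sshare -u user123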

Checking the Reason Explicitly

Instead of guessing, you can ask Slurm exactly why you are waiting:

$ squeue -j 12345 -o "%.18i %.9T %.30R"
             JOBID     STATE               NODELIST(REASON)
             12345   PENDING               (Priority)

This reveals the specific REASON code:

  • Priority: Just wait. It’s Fairshare logic.
  • Resources: The cluster is busy, or you requested a specific node that is busy.
  • QOSMaxJobsLimit: You hit the limit of allowed running jobs.
  • Dependency: It’s waiting for another job to finish.
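
To see how much of the wait is really Fairshare, sprio breaks a pending job’s priority into its components (a sketch; which factors appear depends on your cluster’s priority configuration):

# Show the priority factors (age, fairshare, partition, QOS, ...) for a pending job
$ sprio -j 12345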

Don’t panic. Usually, you just need to wait.


> 6. Summary & Cheatsheet

Debugging Mindset (Read This Once)

If a job fails, always ask these questions in order:

  1. Did it start? (squeue, scontrol) -> If it never appeared in the queue, check your submission script; if it is still pending, check the Reason.
  2. Did it finish or crash? (sacct) -> Check the State.
  3. Why did it crash? (logs) -> Read the output/error file (#SBATCH -o / -e).
  4. Did I request the right resources? (seff) -> Check CPU and memory efficiency.
  5. Can I make it smaller? -> Smaller requests start (and often finish) sooner.

Congratulations! You have officially graduated from HPC 101. You are no longer just a guest; you are a resident of the cluster.

Goal               Command
-----------------  -------------------------
Check Details      scontrol show job <JOBID>
Kill Job           scancel <JOBID>
Check History      sacct -j <JOBID>
Check Efficiency   seff <JOBID>

What’s Next? In the next series, we will change gears completely. We will stop being a “User” and start thinking like an “Engineer”. I will start a new series on How to Build an HPC Cluster from scratch.

See you in the next series!