Skip to main content
  1. Portfolio/

pkg_audit: Cluster Package Consistency Audit Tool

Will Paik
Author
Will Paik
I optimize large-scale GPU clusters for AI/ML workloads. Outside of work, I build a mini-supercomputer from consumer hardware and document every step of it here.

A command-line utility for detecting and remediating RPM package inconsistencies across HPC cluster nodes. Built to address a real operational problem: nodes that silently diverge over time cause hard-to-diagnose job failures, and tracking down the cause by hand does not scale.

What it does
#

The tool compares installed packages between a baseline node and one or more target nodes, separating results into three distinct categories:

Missing packages are present in the baseline but absent on the target. A dnf install quick-fix command is included in the report.

Extra packages exist on the target but not in the baseline. These are reported separately and left untouched by default, since they are often installed intentionally (GPU-specific tools, local debugging utilities).

Version mismatches are packages present on both sides but at different versions. Each mismatch includes an action field (upgrade or downgrade) derived from RPM’s own version comparison logic, so downstream automation knows exactly what to do.

In partition sweep mode, the tool queries Slurm for all active nodes, prompts for an interactive baseline selection, audits every target node, and writes per-node report files only for nodes with differences. A separate extras_summary.txt groups all extra packages by node across the full sweep.

Why it was built
#

Managing a multi-node HPC cluster means nodes drift. A one-off dnf install here, a skipped update there, and the environment across nodes is no longer consistent. The existing approach of SSHing into nodes individually and comparing rpm -qa output by hand does not work at scale and misidentifies version differences as missing packages. This tool was built to replace that workflow with something repeatable and automation-friendly.

pkg_audit workflow

Sample output
#

Partition sweep
#

[INFO] Fetching node list for partition: cpu
       Found 4 up node(s): compute-[01-04]
       SSH      : ssh
       Format   : text

Select a baseline node:
  [  1] compute-01
  [  2] compute-02
  [  3] compute-03
  [  4] compute-04

Enter node number or hostname: 1

[INFO] Starting Partition Sweep
       Partition : cpu
       Baseline  : compute-01
       Targets   : 3 node(s)
       Parallel  : 1 job(s)
======================================================
Summary:
  [OK]  compute-02
  [DIFF] compute-03  (2 issue(s))
  [DIFF] compute-04  (1 issue(s))
======================================================
Results:
  Clean : 1 / 4
  Diffs : 2 / 4

Reports saved to: ./pkg_audit_reports/
  ./pkg_audit_reports/audit_compute-03.txt
  ./pkg_audit_reports/audit_compute-04.txt

Extras summary: ./pkg_audit_reports/extras_summary.txt
======================================================

Per-node report (text)
#

======================================================
  Package Audit Report
  Baseline : compute-01
  Target   : compute-03
  Generated: Thu May 22 10:30:01 EDT 2026
======================================================

[MISSING] 1 package(s) in baseline but NOT in compute-03:
------------------------------------------------------
  nvtop                                     (baseline: 3.3.1-2.el10_1)

  >> Quick Fix:
  ssh compute-03 'sudo dnf install -y nvtop'

[VERSION MISMATCH] 1 package(s) with different versions:
------------------------------------------------------
  curl          baseline: 8.12.1-2.el10_1.2    target: 8.12.1-1.el10_1    action: upgrade

======================================================

Per-node report (JSON)
#

{
  "node": "compute-03",
  "baseline": "compute-01",
  "generated": "2026-05-22T14:30:01Z",
  "missing": [
    {"name": "nvtop", "baseline_ver": "3.3.1-2.el10_1", "action": "install"}
  ],
  "extra": [],
  "version_mismatch": [
    {
      "name": "curl",
      "baseline_ver": "8.12.1-2.el10_1.2",
      "target_ver": "8.12.1-1.el10_1",
      "action": "upgrade"
    }
  ]
}

Ansible integration
#

JSON output is structured for direct use with the included Ansible playbooks:

  • remediate.yml reads each node’s JSON report and installs missing packages, upgrading or downgrading version mismatches as the action field specifies.
  • remove_extra.yml is kept as a separate file to require a deliberate choice before removing anything. It supports --check dry runs and is designed to be reviewed against extras_summary.txt before execution.

Technical details
#

  • Language: Bash 4+
  • Package query: rpm --queryformat for clean name/version separation
  • Version comparison: python3-rpm for RPM-native version ordering
  • Scheduler integration: Slurm (sinfo, scontrol) for partition sweep mode
  • Output formats: text (human-readable), JSON (Ansible-ready), CSV (scripting/spreadsheet)
  • Parallelism: GNU Parallel with xargs -P fallback; sequential by default for login node safety
  • SSH: plain ssh by default, optional sudo mode via -s flag for clusters with restricted inter-node access
  • Target platform: RPM-based Linux (Rocky Linux 9/10, RHEL, CentOS)

Repository
#

github.com/willgpaik/pkg_audit