A command-line utility for detecting and remediating RPM package inconsistencies across HPC cluster nodes. Built to address a real operational problem: nodes that silently diverge over time cause hard-to-diagnose job failures, and tracking down the cause by hand does not scale.
What it does #
The tool compares installed packages between a baseline node and one or more target nodes, separating results into three distinct categories:
Missing packages are present in the baseline but absent on the target. A dnf install quick-fix command is included in the report.
Extra packages exist on the target but not in the baseline. These are reported separately and left untouched by default, since they are often installed intentionally (GPU-specific tools, local debugging utilities).
Version mismatches are packages present on both sides but at different versions. Each mismatch includes an action field (upgrade or downgrade) derived from RPM’s own version comparison logic, so downstream automation knows exactly what to do.
In partition sweep mode, the tool queries Slurm for all active nodes, prompts for an interactive baseline selection, audits every target node, and writes per-node report files only for nodes with differences. A separate extras_summary.txt groups all extra packages by node across the full sweep.
Why it was built #
Managing a multi-node HPC cluster means nodes drift. A one-off dnf install here, a skipped update there, and the environment across nodes is no longer consistent. The existing approach of SSHing into nodes individually and comparing rpm -qa output by hand does not work at scale and misidentifies version differences as missing packages. This tool was built to replace that workflow with something repeatable and automation-friendly.
Sample output #
Partition sweep #
[INFO] Fetching node list for partition: cpu
Found 4 up node(s): compute-[01-04]
SSH : ssh
Format : text
Select a baseline node:
[ 1] compute-01
[ 2] compute-02
[ 3] compute-03
[ 4] compute-04
Enter node number or hostname: 1
[INFO] Starting Partition Sweep
Partition : cpu
Baseline : compute-01
Targets : 3 node(s)
Parallel : 1 job(s)
======================================================
Summary:
[OK] compute-02
[DIFF] compute-03 (2 issue(s))
[DIFF] compute-04 (1 issue(s))
======================================================
Results:
Clean : 1 / 4
Diffs : 2 / 4
Reports saved to: ./pkg_audit_reports/
./pkg_audit_reports/audit_compute-03.txt
./pkg_audit_reports/audit_compute-04.txt
Extras summary: ./pkg_audit_reports/extras_summary.txt
======================================================Per-node report (text) #
======================================================
Package Audit Report
Baseline : compute-01
Target : compute-03
Generated: Thu May 22 10:30:01 EDT 2026
======================================================
[MISSING] 1 package(s) in baseline but NOT in compute-03:
------------------------------------------------------
nvtop (baseline: 3.3.1-2.el10_1)
>> Quick Fix:
ssh compute-03 'sudo dnf install -y nvtop'
[VERSION MISMATCH] 1 package(s) with different versions:
------------------------------------------------------
curl baseline: 8.12.1-2.el10_1.2 target: 8.12.1-1.el10_1 action: upgrade
======================================================Per-node report (JSON) #
{
"node": "compute-03",
"baseline": "compute-01",
"generated": "2026-05-22T14:30:01Z",
"missing": [
{"name": "nvtop", "baseline_ver": "3.3.1-2.el10_1", "action": "install"}
],
"extra": [],
"version_mismatch": [
{
"name": "curl",
"baseline_ver": "8.12.1-2.el10_1.2",
"target_ver": "8.12.1-1.el10_1",
"action": "upgrade"
}
]
}Ansible integration #
JSON output is structured for direct use with the included Ansible playbooks:
remediate.ymlreads each node’s JSON report and installs missing packages, upgrading or downgrading version mismatches as theactionfield specifies.remove_extra.ymlis kept as a separate file to require a deliberate choice before removing anything. It supports--checkdry runs and is designed to be reviewed againstextras_summary.txtbefore execution.
Technical details #
- Language: Bash 4+
- Package query:
rpm --queryformatfor clean name/version separation - Version comparison:
python3-rpmfor RPM-native version ordering - Scheduler integration: Slurm (
sinfo,scontrol) for partition sweep mode - Output formats: text (human-readable), JSON (Ansible-ready), CSV (scripting/spreadsheet)
- Parallelism: GNU Parallel with
xargs -Pfallback; sequential by default for login node safety - SSH: plain
sshby default, optional sudo mode via-sflag for clusters with restricted inter-node access - Target platform: RPM-based Linux (Rocky Linux 9/10, RHEL, CentOS)