[{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/ansible/","section":"Tags","summary":"","title":"Ansible","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/bash/","section":"Tags","summary":"","title":"Bash","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/benchmarking/","section":"Tags","summary":"","title":"Benchmarking","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/distributed-training/","section":"Tags","summary":"","title":"Distributed-Training","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/hpc/","section":"Tags","summary":"","title":"Hpc","type":"tags"},{"content":"A command-line utility for detecting and remediating RPM package inconsistencies across HPC cluster nodes. Built to address a real operational problem: nodes that silently diverge over time cause hard-to-diagnose job failures, and tracking down the cause by hand does not scale.\nWhat it does # The tool compares installed packages between a baseline node and one or more target nodes, separating results into three distinct categories:\nMissing packages are present in the baseline but absent on the target. A dnf install quick-fix command is included in the report.\nExtra packages exist on the target but not in the baseline. These are reported separately and left untouched by default, since they are often installed intentionally (GPU-specific tools, local debugging utilities).\nVersion mismatches are packages present on both sides but at different versions. 
Each mismatch includes an action field (upgrade or downgrade) derived from RPM\u0026rsquo;s own version comparison logic, so downstream automation knows exactly what to do.\nIn partition sweep mode, the tool queries Slurm for all active nodes, prompts for an interactive baseline selection, audits every target node, and writes per-node report files only for nodes with differences. A separate extras_summary.txt groups all extra packages by node across the full sweep.\nWhy it was built # Managing a multi-node HPC cluster means nodes drift. A one-off dnf install here, a skipped update there, and the environment across nodes is no longer consistent. The existing approach of SSHing into nodes individually and comparing rpm -qa output by hand does not work at scale and misidentifies version differences as missing packages. This tool was built to replace that workflow with something repeatable and automation-friendly.\nSample output # Partition sweep # [INFO] Fetching node list for partition: cpu Found 4 up node(s): compute-[01-04] SSH : ssh Format : text Select a baseline node: [ 1] compute-01 [ 2] compute-02 [ 3] compute-03 [ 4] compute-04 Enter node number or hostname: 1 [INFO] Starting Partition Sweep Partition : cpu Baseline : compute-01 Targets : 3 node(s) Parallel : 1 job(s) ====================================================== Summary: [OK] compute-02 [DIFF] compute-03 (2 issue(s)) [DIFF] compute-04 (1 issue(s)) ====================================================== Results: Clean : 1 / 4 Diffs : 2 / 4 Reports saved to: ./pkg_audit_reports/ ./pkg_audit_reports/audit_compute-03.txt ./pkg_audit_reports/audit_compute-04.txt Extras summary: ./pkg_audit_reports/extras_summary.txt ====================================================== Per-node report (text) # ====================================================== Package Audit Report Baseline : compute-01 Target : compute-03 Generated: Thu May 22 10:30:01 EDT 2026 ====================================================== 
[MISSING] 1 package(s) in baseline but NOT in compute-03: ------------------------------------------------------ nvtop (baseline: 3.3.1-2.el10_1) \u0026gt;\u0026gt; Quick Fix: ssh compute-03 \u0026#39;sudo dnf install -y nvtop\u0026#39; [VERSION MISMATCH] 1 package(s) with different versions: ------------------------------------------------------ curl baseline: 8.12.1-2.el10_1.2 target: 8.12.1-1.el10_1 action: upgrade ====================================================== Per-node report (JSON) # { \u0026#34;node\u0026#34;: \u0026#34;compute-03\u0026#34;, \u0026#34;baseline\u0026#34;: \u0026#34;compute-01\u0026#34;, \u0026#34;generated\u0026#34;: \u0026#34;2026-05-22T14:30:01Z\u0026#34;, \u0026#34;missing\u0026#34;: [ {\u0026#34;name\u0026#34;: \u0026#34;nvtop\u0026#34;, \u0026#34;baseline_ver\u0026#34;: \u0026#34;3.3.1-2.el10_1\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;install\u0026#34;} ], \u0026#34;extra\u0026#34;: [], \u0026#34;version_mismatch\u0026#34;: [ { \u0026#34;name\u0026#34;: \u0026#34;curl\u0026#34;, \u0026#34;baseline_ver\u0026#34;: \u0026#34;8.12.1-2.el10_1.2\u0026#34;, \u0026#34;target_ver\u0026#34;: \u0026#34;8.12.1-1.el10_1\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;upgrade\u0026#34; } ] } Ansible integration # JSON output is structured for direct use with the included Ansible playbooks:\nremediate.yml reads each node\u0026rsquo;s JSON report and installs missing packages, upgrading or downgrading version mismatches as the action field specifies. remove_extra.yml is kept as a separate file to require a deliberate choice before removing anything. It supports --check dry runs and is designed to be reviewed against extras_summary.txt before execution. 
Technical details # Language: Bash 4+ Package query: rpm --queryformat for clean name/version separation Version comparison: python3-rpm for RPM-native version ordering Scheduler integration: Slurm (sinfo, scontrol) for partition sweep mode Output formats: text (human-readable), JSON (Ansible-ready), CSV (scripting/spreadsheet) Parallelism: GNU Parallel with xargs -P fallback; sequential by default for login node safety SSH: plain ssh by default, optional sudo mode via -s flag for clusters with restricted inter-node access Target platform: RPM-based Linux (Rocky Linux 9/10, RHEL, CentOS) Repository # github.com/willgpaik/pkg_audit\n","date":"25 May 2026","externalUrl":null,"permalink":"/portfolio/pkg-audit/","section":"Portfolio","summary":"","title":"pkg_audit: Cluster Package Consistency Audit Tool","type":"portfolio"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/portfolio/","section":"Portfolio","summary":"","title":"Portfolio","type":"portfolio"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/pytorch/","section":"Tags","summary":"","title":"Pytorch","type":"tags"},{"content":"A reproducible benchmark suite for characterizing PyTorch Distributed Data Parallel (DDP) training performance on NVIDIA GPU clusters. Built for pre-production validation of large-scale HPC infrastructure, and generalized for broader use on any Slurm-based system.\nWhat it does # The benchmark runs ResNet training on synthetic on-device data to isolate GPU compute and NCCL communication from storage I/O. It measures two complementary scaling modes:\nWeak scaling keeps per-GPU batch size fixed while adding GPUs. This answers whether each GPU stays productive as the system grows. Throughput per GPU should stay flat; a drop indicates communication overhead.\nStrong scaling keeps the global batch fixed while adding GPUs. This answers how much faster the same workload runs with more resources. 
The result is expressed as speedup and parallel efficiency relative to a single-GPU baseline.\nResults are collected as JSON per run and aggregated into tables (text) and plots (PNG). GPU activity is sampled during measurement via nvidia-smi so low-utilization configurations are flagged automatically.\nWhy it was built # Statewide AI research infrastructure serving hundreds of researchers needs to be validated before opening to users. This benchmark was developed to stress-test GPU compute and inter-node communication on B200 and RTX Pro 6000 Blackwell hardware before cluster launch, and to produce numbers that can be reported to stakeholders and researchers in a reproducible way.\nTechnical details # Language: Python 3.11+, Bash Framework: PyTorch 2.7+ with torch.distributed / NCCL Scheduler: Slurm (torchrun + srun pattern for multi-node) Precision: BF16 by default (FP32 and FP16 supported) Measurement: time-based with synchronized step counting across ranks, warmup phase to absorb cuDNN autotuning and optional torch.compile JIT cost Outputs: per-run JSON, terminal tables, matplotlib PNG plots Repository # github.com/willgpaik/pytorch-ddp-scaling-benchmark\n","date":"25 May 2026","externalUrl":null,"permalink":"/portfolio/pytorch-ddp-bench/","section":"Portfolio","summary":"","title":"PyTorch DDP Scaling Benchmark","type":"portfolio"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/sysadmin/","section":"Tags","summary":"","title":"Sysadmin","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"The cluster has storage and authentication. Now it needs a brain.\nWelcome back to HPC From Scratch. In Episode 4, we set up NFS shared storage, FreeIPA centralized authentication, and Ansible for cluster management. 
Every node shares the same home directory and user accounts work everywhere.\nBut right now, if you want to run a job, you SSH into a compute node and run it directly. That is fine for one person on one node. It falls apart the moment two people try to use the same node at the same time, or when you need to coordinate work across multiple nodes. That is what a job scheduler solves.\nThis episode covers Slurm: why we build it from source, how Munge handles authentication between nodes, what slurm.conf actually controls, and how to submit your first real cluster job.\n\u0026gt; 1. What Slurm Actually Does # Without a job scheduler, a shared cluster works like a kitchen with no coordination. Everyone grabs resources when they want them. One person\u0026rsquo;s job starves another. There is no way to ask for two nodes at once and have them guaranteed to be free at the same time.\nSlurm is the receptionist from the HPC 101 series, at scale. It tracks every CPU, every gigabyte of memory, and every GPU across all nodes. When you submit a job, Slurm holds it in a queue until the requested resources are available, then assigns it to the right nodes and runs it.\nThe three components we need:\nslurmctld runs on the management node (arbiter). It is the controller: maintains the queue, makes scheduling decisions, and talks to the compute nodes.\nslurmd runs on each compute node. It receives job assignments from the controller, runs the actual work, and reports back.\nslurmdbd also runs on arbiter. It connects Slurm to a MariaDB database and records every job: who ran it, how long it took, how much CPU and memory it used. This powers seff, sacct, and fair share scheduling.\n[Figure: cluster layout]\n\u0026gt; 2. 
Why Build from Source # The obvious question is why not just dnf install slurm. There are two reasons.\nVersion control. When you run dnf upgrade on all nodes, Slurm gets upgraded too. A version mismatch between slurmctld and slurmd breaks the cluster. The controller and compute nodes must run identical versions. Building from source and distributing RPMs means you control exactly when Slurm gets updated, separate from the rest of the system.\nFeature support. Rocky Linux 10 runs cgroup v2 by default. Older Slurm builds default to cgroup v1, which causes job accounting and memory tracking to fail silently. Building from source lets you pass --with cgroupv2 explicitly. Similarly, PMIx support for MPI job launching requires build flags that are not included in the standard distribution packages.\nThe build process compiles Slurm on the management node (arbiter) and packages it as RPMs, which then get distributed to all other nodes via Ansible.\n# Build on arbiter, targeting Slurm 25.11.1 rpmbuild -ta slurm-25.11.1.tar.bz2 \\ --define \u0026#34;_slurm_sysconfdir /etc/slurm\u0026#34; \\ --with cgroupv2 \\ --with pmix EPEL for runtime dependencies # The build pulls in gtk2-devel as a development dependency, which causes the resulting slurm base RPM to depend on the GTK2 runtime libraries libgdk-x11-2.0.so.0 and libgtk-x11-2.0.so.0 (used by sview, Slurm\u0026rsquo;s GUI viewer). On Rocky Linux 10 these libraries are not in the default repositories. They live in EPEL, so EPEL must be enabled on every node before the install step in section 4, or dnf rejects the local RPMs with a depsolve error.\n[wpaik@arbiter ansible]$ ansible all_nodes -b -m dnf -a \u0026#34;name=epel-release state=present\u0026#34; If you prefer to avoid the GTK2 dependency entirely, pass --without gtk to rpmbuild and sview gets dropped from the build. 
HPC compute nodes never run sview anyway, so this is the cleaner option for a headless cluster.\nAll build dependencies, the full build playbook, and the RPM distribution playbook are in the GitHub repository.\n\u0026gt; 3. Munge: The Authentication Layer # Before Slurm can communicate between nodes, it needs a way to verify that messages are actually coming from the cluster and not from somewhere else. That is Munge\u0026rsquo;s job.\nMunge generates encrypted tokens using a shared secret key. Every node in the cluster has the same key at /etc/munge/munge.key. When slurmctld sends a message to slurmd, it attaches a Munge token. The compute node decrypts it with the shared key and verifies the message is legitimate.\nThe key is generated once on arbiter and distributed to all nodes by Ansible:\n# Generate key on arbiter dd if=/dev/urandom bs=1 count=1024 \u0026gt; /etc/munge/munge.key chmod 400 /etc/munge/munge.key chown munge:munge /etc/munge/munge.key Critical: Slurm UID must match across all nodes.\nMunge verifies not just the key but also the UID of the process that created the token. If the slurm user has UID 386 on arbiter and UID 990 on interceptor-01, Munge will reject the token with a security violation error. 
The cluster will appear to start but jobs will never run.\nWe set a fixed UID of 1111 for the Slurm user on every node before installing Slurm:\ngroupadd -g 1111 slurm useradd -u 1111 -g slurm -s /bin/bash -d /var/lib/slurm slurm Verify all nodes have matching UIDs:\n[wpaik@arbiter ansible]$ ansible all_nodes -m shell -a \u0026#34;id slurm\u0026#34; -b arbiter.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) interceptor-01.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) interceptor-02.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) corsair-01.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) carrier.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) All matching. Verify Munge is running and the shared key works:\n# Test Munge authentication locally $ munge -n | unmunge # Test across nodes $ munge -n | ssh interceptor-01.cluster.local unmunge STATUS: Success (0) ENCODE_HOST: arbiter.cluster.local (192.168.50.50) DECODE_HOST: interceptor-01.cluster.local (192.168.50.15) MUNGE_UID: slurm (1111) Note on firewall: Worker nodes have firewalld disabled. The login node (carrier) has its internal interface in the trusted zone. If you are running firewalld on compute nodes, open ports 6817 (slurmctld), 6818 (slurmd), and 6819 (slurmdbd).\n\u0026gt; 4. Installing Slurm # After building the RPMs on arbiter, Ansible distributes and installs them across the cluster. Each node gets a different set of packages depending on its role.\nNode type Packages Management (arbiter) slurm, slurmctld, slurmdbd, mariadb Compute (interceptor, corsair) slurm, slurmd, slurm-libpmi Login (carrier) slurm, slurm-contribs (includes seff) slurm-libpmi on the compute nodes provides the PMI2 and PMIx libraries that MPI implementations use to launch parallel processes via srun. 
Without it, MPI jobs fail with PMI version errors when trying to use srun as the launcher.\nslurm-contribs on the login node includes seff, the job efficiency tool. It reads accounting data from slurmdbd and shows you exactly how much CPU and memory your job actually used versus what you requested.\nThe install playbook expects two things to already be true: EPEL is enabled on every node (section 2), and the Ansible controller\u0026rsquo;s remote_tmp points to a local path on the target nodes (set in Episode 4\u0026rsquo;s ansible.cfg). The second one matters because the install copies RPMs through Ansible\u0026rsquo;s staging directory. If that directory lives on NFS (the default location on this cluster, since /home is NFS-mounted), the RPMs inherit the nfs_t SELinux context, and dnf rejects them with a confusing No match for argument error even though the file is plainly on disk. The remote_tmp = /var/tmp/.ansible-${USER}/tmp line in ansible.cfg keeps the staging area on local disk and avoids the trap.\nAfter installation completes successfully, pin the Slurm version in dnf so a future dnf upgrade does not pull a different build (most notably from EPEL, which ships its own slurm packages without our cgroup v2 and PMIx flags). The install playbook handles this as its last step:\nansible all_nodes -b -m shell -a \u0026#34;echo \u0026#39;exclude=slurm*\u0026#39; \u0026gt;\u0026gt; /etc/dnf/dnf.conf\u0026#34; # Verify ansible all_nodes -b -m shell -a \u0026#34;grep slurm /etc/dnf/dnf.conf\u0026#34; The order matters: pin after the install succeeds, never before. Pinning before install causes dnf to refuse to install slurm at all, again with a No match for argument error. When you eventually need to upgrade Slurm, remove the line first, rebuild, reinstall, and the playbook re-adds the pin at the end.\nThe complete installation playbooks are in the GitHub repository under ep05-slurm/playbooks/.\n\u0026gt; 5. 
Configuring Slurm # All Slurm configuration lives in /etc/slurm/slurm.conf on every node. The file must be identical across the cluster. We generate it on arbiter and distribute it via Ansible.\nHere is the complete slurm.conf for this cluster:\n# Cluster identity ClusterName=cluster SlurmctldHost=arbiter SlurmUser=slurm AuthType=auth/munge # Scheduling SchedulerType=sched/backfill SelectType=select/cons_tres SelectTypeParameters=CR_Core_Memory # Logging SlurmctldDebug=info SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdDebug=debug SlurmdLogFile=/var/log/slurm/slurmd.log # State and PID files StateSaveLocation=/var/spool/slurmctld SlurmdSpoolDir=/var/spool/slurmd SlurmctldPidFile=/run/slurm/slurmctld.pid SlurmdPidFile=/run/slurm/slurmd.pid # Cgroup (v2) ProctrackType=proctrack/cgroup TaskPlugin=task/cgroup,task/affinity # Job accounting JobAcctGatherType=jobacct_gather/cgroup JobAcctGatherFrequency=30 AccountingStorageType=accounting_storage/slurmdbd AccountingStorageHost=arbiter.cluster.local AccountingStoragePort=6819 JobCompType=jobcomp/none AccountingStorageTRES=gres/gpu AccountingStoreFlags=job_comment,job_env,job_script # GPU support ReturnToService=1 GresTypes=gpu # MPI default MpiDefault=pmix # Nodes NodeName=interceptor-01 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15413 State=UNKNOWN NodeName=interceptor-02 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15413 State=UNKNOWN NodeName=corsair-01 CPUs=16 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30802 Gres=gpu:nvidia_geforce_gtx_1660_super:1 State=UNKNOWN # Partitions PartitionName=cpu Nodes=interceptor-01,interceptor-02 Default=YES MaxTime=INFINITE State=UP PartitionName=gpu Nodes=corsair-01 Default=NO MaxTime=INFINITE State=UP A few things worth noting:\nRealMemory values come from running free -m on each node, same as in Episode 2 for the iGPU memory trap. The values here reflect what the OS actually reports after hardware reservations. 
Do not use the installed RAM number.\nThe M715q nodes each have 16GB installed, but the integrated Vega GPU reserves a portion as VRAM. The exact amount depends on the BIOS UMA Frame Buffer Size setting. If this is left on Auto, different nodes may end up with slightly different values even with identical hardware. In Episode 2 we pinned arbiter\u0026rsquo;s UMA setting to 256MB explicitly. If your compute nodes still show different free -m totals, check the UMA setting in each node\u0026rsquo;s BIOS and pin them to the same value. The slurm.conf RealMemory for each node should match that node\u0026rsquo;s actual free -m total output.\nMpiDefault=pmix sets PMIx as the default MPI process management interface for srun. Without this, srun falls back to Slurm\u0026rsquo;s built-in default MPI type (none unless configured otherwise), which causes compatibility errors with OpenMPI when launching parallel jobs. If you see MPI jobs hanging or failing with PMI version errors, this is the first thing to check.\nSelectTypeParameters=CR_Core_Memory tells Slurm to track both cores and memory when allocating resources. This is required for seff to report memory usage accurately.\nThe cgroup configuration lives in a separate file:\n# /etc/slurm/cgroup.conf ConstrainCores=yes ConstrainRAMSpace=yes ConstrainSwapSpace=no ConstrainDevices=yes ConstrainCores and ConstrainRAMSpace enforce the resource limits you request in your job script. If your job tries to use more memory than requested, Slurm kills it with an out-of-memory error rather than letting it consume resources silently. This requires cgroup v2, which is confirmed on this cluster:\n$ stat -fc %T /sys/fs/cgroup cgroup2fs MariaDB and slurmdbd store accounting data. The setup creates a slurm_acct_db database and a slurm database user, then configures slurmdbd to connect to it. The slurmdbd configuration in /etc/slurm/slurmdbd.conf must have mode 600 and be owned by the slurm user, or slurmdbd will refuse to start.\n\u0026gt; 6. 
Disabling Swap on Compute Nodes # Swap needs to be disabled on compute nodes before running Slurm jobs. When ConstrainRAMSpace=yes is set in cgroup.conf, Slurm enforces memory limits via cgroup. If swap is active, a process that hits the RAM limit can spill into swap instead of being killed, which defeats the memory constraint and makes seff memory reporting inaccurate.\nThe login node (carrier) and management node (arbiter) can keep swap enabled since they do not run compute jobs.\nDisable swap permanently on compute nodes via systemd:\nansible workers,gpu -b -m systemd \\ -a \u0026#34;name=swap.target state=stopped enabled=no\u0026#34; Verify after the next reboot:\n$ cat /proc/swaps Filename Type Size Used Priority # Empty output means swap is off Note: The swap UUID may still appear in /etc/fstab. This is fine as long as swap.target is disabled in systemd. The unit will fail to activate on boot with a dependency error, which is the expected behavior.\n\u0026gt; 7. Starting the Cluster # Services must start in order. slurmdbd must be running before slurmctld tries to connect to it.\n# On arbiter $ sudo systemctl start mariadb $ sudo systemctl start slurmdbd $ sudo systemctl start slurmctld # On each compute node $ sudo systemctl start slurmd After services are up, initialize the accounting database:\n$ sacctmgr -i add cluster cluster $ sacctmgr -i add account root Description=\u0026#34;Root\u0026#34; Organization=\u0026#34;Cluster\u0026#34; $ sacctmgr -i add user wpaik Account=root Check cluster status:\n[wpaik@carrier ~]$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST cpu* up infinite 2 idle interceptor-[01-02] gpu up infinite 1 idle corsair-01 All nodes idle and ready. If nodes show as down or drain instead of idle, resume them:\n$ scontrol update NodeName=ALL State=RESUME \u0026gt; 8. 
Submitting Your First Jobs # Interactive Job # [wpaik@carrier ~]$ srun --pty bash [wpaik@interceptor-01 ~]$ hostname interceptor-01 [wpaik@interceptor-01 ~]$ exit srun assigned you to interceptor-01 because it is the first node in the default cpu partition.\nBatch Job # Create a simple batch script:\n#!/bin/bash #SBATCH --job-name=hello #SBATCH --partition=cpu #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --mem=500M #SBATCH --time=00:05:00 #SBATCH --output=hello_%j.out echo \u0026#34;Running on: $(hostname)\u0026#34; echo \u0026#34;Job ID: $SLURM_JOB_ID\u0026#34; date sleep 10 echo \u0026#34;Done.\u0026#34; Submit and monitor:\n$ sbatch hello.sh Submitted batch job 1 $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST 1 cpu hello wpaik R 0:03 1 interceptor-01 $ cat hello_1.out Running on: interceptor-01 Job ID: 1 Fri May 9 21:00:00 EDT 2026 Done. Multi-Node Job # #!/bin/bash #SBATCH --job-name=multinode #SBATCH --partition=cpu #SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 #SBATCH --mem-per-cpu=1G #SBATCH --output=multinode_%j.out srun hostname $ sbatch multinode.sh Submitted batch job 2 $ cat multinode_2.out interceptor-01 interceptor-01 interceptor-01 interceptor-01 interceptor-02 interceptor-02 interceptor-02 interceptor-02 Eight tasks across two physical machines, coordinated by Slurm.\nGPU Job # #!/bin/bash #SBATCH --job-name=gpu_test #SBATCH --partition=gpu #SBATCH --nodes=1 #SBATCH --gres=gpu:1 #SBATCH --output=gpu_%j.out nvidia-smi Checking Efficiency with seff # After a job completes, check how efficiently it used the requested resources:\n$ seff 1 Job ID: 1 Cluster: cluster User/Group: wpaik/wpaik State: COMPLETED (exit code 0) Cores: 1 CPU Utilized: 00:00:01 CPU Efficiency: 10.00% of 00:00:10 core-walltime Job Wall-clock time: 00:00:10 Memory Utilized: 1.20 MB Memory Efficiency: 0.24% of 500.00 MB CPU efficiency is low because sleep 10 does nothing. Memory efficiency is low because we requested 500MB but the script barely used any. 
This is exactly the kind of feedback seff is designed to give. Right-size your resource requests based on what jobs actually use.\n\u0026gt; 9. Common Issues # Nodes stuck in down or drain state after startup\n$ scontrol update NodeName=ALL State=RESUME If they keep going back to down, check the slurmd log on the affected node:\n$ ssh interceptor-01 \u0026#34;sudo tail -n 50 /var/log/slurm/slurmd.log\u0026#34; Slurm UID mismatch (Security violation)\nIf srun hangs or you see authentication errors in the logs, check that the slurm user has the same UID on every node:\n$ ansible all_nodes -m shell -a \u0026#34;id slurm\u0026#34; -b If UIDs differ, use 08_sync_slurm_uid.yaml from the GitHub repository to fix them. Note that if the target UID is occupied by another system user on a particular node, you will need to reassign that user to a different UID first before moving slurm into place.\nMPI jobs fail with PMI errors\nCheck that MpiDefault=pmix is in slurm.conf and that slurm-libpmi is installed on compute nodes. Also verify that the PMIx security mode is set:\n$ cat /etc/profile.d/pmix.sh export PMIX_MCA_psec=native slurmdbd fails to start\nCheck permissions on /etc/slurm/slurmdbd.conf. It must be mode 600 and owned by the slurm user:\n$ ls -la /etc/slurm/slurmdbd.conf -rw------- 1 slurm slurm 312 Apr 27 09:00 /etc/slurm/slurmdbd.conf Also verify MariaDB is running before starting slurmdbd:\n$ sudo systemctl status mariadb seff shows no memory data\nseff requires JobAcctGatherType=jobacct_gather/cgroup in slurm.conf and ConstrainRAMSpace=yes in cgroup.conf. Both require cgroup v2. Verify with stat -fc %T /sys/fs/cgroup.\ndnf install fails with No match for argument even though the RPM is on disk\nTwo distinct causes both surface as this same error:\nSELinux context inherited from NFS. Ansible\u0026rsquo;s per-task staging directory defaults to ~/.ansible/tmp/, which on this cluster lives on NFS-mounted /home. 
Files copied through it pick up the nfs_t SELinux context, and dnf silently refuses to handle them as local RPMs. Confirm with ls -lZ /tmp/slurm_rpms/ — if the context is nfs_t, this is it. The permanent fix is the remote_tmp = /var/tmp/.ansible-${USER}/tmp line in ansible.cfg from Episode 4. As an immediate workaround:\nsudo restorecon -Rv /tmp/slurm_rpms/ dnf exclude pinning was added before install. If /etc/dnf/dnf.conf already contains exclude=slurm* from a previous run, dnf strips the matching argument and reports it as missing. Check with grep slurm /etc/dnf/dnf.conf. For a reinstall, either remove the line first or pass --disableexcludes=all:\nsudo dnf install -y --disableexcludes=all /tmp/slurm_rpms/slurm-*.rpm dnf install fails with nothing provides libgdk-x11-2.0.so.0 or libgtk-x11-2.0.so.0\nEPEL is not enabled on the failing node. The Slurm base RPM depends on GTK2 runtime libraries that are not in Rocky 10\u0026rsquo;s default repositories. Install EPEL on the affected node and retry:\nsudo dnf install -y epel-release Or rebuild Slurm with --without gtk so the GTK2 dependency is removed entirely.\n\u0026gt; 10. What is Next # The cluster is now a real HPC system. 
Jobs are scheduled, resources are tracked, and seff shows efficiency data after each run.\nThe next episode covers Slurm accounting in depth: setting up accounts and users in slurmdbd, configuring partitions with resource limits, and fair share scheduling so heavy users do not monopolize the cluster.\nAll Ansible playbooks, configuration files, and the Slurm build scripts from this episode are in the GitHub repository.\nHappy Computing!\n","date":"20 May 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-05/","section":"Posts","summary":"","title":"[HPC From Scratch] Episode 5: Slurm - Installing the Job Scheduler","type":"posts"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/tags/cluster/","section":"Tags","summary":"","title":"Cluster","type":"tags"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/tags/home-lab/","section":"Tags","summary":"","title":"Home Lab","type":"tags"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/series/hpc-from-scratch/","section":"Series","summary":"","title":"HPC From Scratch","type":"series"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/tags/munge/","section":"Tags","summary":"","title":"Munge","type":"tags"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"Each series is a self-contained progression: start from part 1 and follow through in order. Posts within a series link to each other automatically.\n","date":"20 May 2026","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/tags/slurm/","section":"Tags","summary":"","title":"Slurm","type":"tags"},{"content":"One drive. One login. Every node sees the same home directory.\nWelcome back to HPC From Scratch. 
In Episode 3, we set up the network, installed Rocky Linux on all six nodes, configured DHCP and NAT, and hardened SSH. The cluster is networked and secured. Now it needs two things before Slurm makes any sense: shared storage and centralized authentication.\nWithout these two pieces, you are manually copying files to every node and creating the same user account six times. This episode fixes both problems.\n\u0026gt; 1. Why Shared Storage Matters # Without NFS, submitting an MPI job across two nodes means your input data has to exist on both nodes. You either copy it manually or write a script to sync it. Neither is sustainable.\nWith NFS, the Samsung 990 Pro on arbiter (the management node) exports a single /home directory. Every node in the cluster mounts it. Write a script on the login node, run it from any compute node. The file is already there.\nThis also matters for Slurm. When a job writes output files, they land in /home on the NFS share. You do not need to SSH into compute nodes to retrieve results.\nPrerequisites\nBefore starting this episode:\nAll nodes are running Rocky Linux 10 with network configured (Episode 3) arbiter has the Samsung 990 Pro NVMe drive installed (Episode 2) SSH key-based login is working from arbiter to all other nodes \u0026gt; 2. Ansible Setup # From this episode onward, we use Ansible to apply configuration across all nodes at once. Without it, every change means SSHing into six machines individually.\nAnsible runs from arbiter. We keep it in /opt/ansible rather than a home directory so it stays off the NFS share. 
Ansible configuration files contain SSH keys and vault passwords that should not be visible to every node in the cluster.\nInstall Ansible # [wpaik@arbiter ~]$ sudo dnf install ansible-core [wpaik@arbiter ~]$ sudo mkdir -p /opt/ansible [wpaik@arbiter ~]$ sudo chown wpaik:wpaik /opt/ansible [wpaik@arbiter ~]$ cd /opt/ansible SSH Key # Generate a dedicated key for Ansible and distribute it to all nodes:\n[wpaik@arbiter ansible]$ mkdir .ssh [wpaik@arbiter ansible]$ ssh-keygen -t ed25519 -f .ssh/worker_ed25519 -N \u0026#34;\u0026#34; [wpaik@arbiter ansible]$ for node in 192.168.50.1 192.168.50.15 192.168.50.32 192.168.50.11 192.168.50.19; do ssh-copy-id -i .ssh/worker_ed25519.pub wpaik@$node done Inventory and Config # Create hosts.ini:\n[head] carrier.cluster.local ansible_host=192.168.50.1 [management] arbiter.cluster.local ansible_host=192.168.50.50 ansible_connection=local [workers] interceptor-01.cluster.local ansible_host=192.168.50.15 interceptor-02.cluster.local ansible_host=192.168.50.32 [gpu] corsair-01.cluster.local ansible_host=192.168.50.11 [visualization] observer.cluster.local ansible_host=192.168.50.19 [compute:children] workers gpu [all_nodes:children] head management workers gpu visualization [all_nodes:vars] ansible_user=wpaik cluster_network=192.168.50.0/24 cluster_domain=cluster.local cluster_realm=CLUSTER.LOCAL Note that arbiter uses ansible_connection=local since it is the Ansible controller itself.\nCreate ansible.cfg:\n[defaults] private_key_file = /opt/ansible/.ssh/worker_ed25519 inventory = ./hosts.ini host_key_checking = False log_path = ./log/ansible.log vault_password_file = /opt/ansible/.ansible_vault_pw remote_tmp = /var/tmp/.ansible-${USER}/tmp The last line, remote_tmp, deserves a note since it is the one setting that bites you only later. By default Ansible writes its per-task staging files into ~/.ansible/tmp/ on the remote node. 
After we set up NFS in section 3, every node\u0026rsquo;s /home lives on the NFS share, so that staging directory ends up on NFS. Files written there get the nfs_t SELinux context, which dnf refuses to handle when installing local RPMs in later episodes. The failure mode is misleading — dnf reports No match for argument for an RPM file that visibly exists on disk. Pinning remote_tmp to a local path on each node (/var/tmp is always local) sidesteps this entirely. It costs nothing now and saves a long debugging session in Episode 5.\nVerify connectivity:\n[wpaik@arbiter ansible]$ ansible all -m ping carrier.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } arbiter.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } interceptor-01.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } interceptor-02.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } corsair-01.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } observer.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } All six nodes responding. From here on, playbooks handle the repetitive work.\n\u0026gt; 3. NFS Server Setup # All commands in this section run on arbiter.\nPartition the NVMe Drive with LVM # A single large partition works, but LVM gives us the flexibility to allocate separate volumes for home directories, work storage, shared software, and scratch space. This mirrors how storage is typically organized on a real HPC cluster.\nFirst, verify the NVMe drive:\n[wpaik@arbiter ~]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 223.6G 0 disk ├─sda1 8:1 0 600M 0 part /boot/efi ├─sda2 8:2 0 1G 0 part /boot └─sda3 8:3 0 222G 0 part ├─rl-root 253:0 0 70G 0 lvm / └─rl-swap 253:1 0 7.7G 0 lvm [SWAP] nvme0n1 259:0 0 931.5G 0 disk The SATA boot drive is sda. The NVMe is nvme0n1. 
Create a physical volume, volume group, and four logical volumes:\n# Install LVM tools $ sudo dnf install -y lvm2 # Create physical volume and volume group $ sudo pvcreate /dev/nvme0n1 $ sudo vgcreate vg_nfs /dev/nvme0n1 # Create logical volumes $ sudo lvcreate -L 167G -n lv_home vg_nfs $ sudo lvcreate -L 251G -n lv_work vg_nfs $ sudo lvcreate -L 84G -n lv_shared vg_nfs $ sudo lvcreate -L 251G -n lv_scratch vg_nfs # Format as XFS $ sudo mkfs.xfs /dev/vg_nfs/lv_home $ sudo mkfs.xfs /dev/vg_nfs/lv_work $ sudo mkfs.xfs /dev/vg_nfs/lv_shared $ sudo mkfs.xfs /dev/vg_nfs/lv_scratch Create mount points and mount:\n$ sudo mkdir -p /nfsdata/{home,work,shared,scratch} $ sudo mount /dev/vg_nfs/lv_home /nfsdata/home $ sudo mount /dev/vg_nfs/lv_work /nfsdata/work $ sudo mount /dev/vg_nfs/lv_shared /nfsdata/shared $ sudo mount /dev/vg_nfs/lv_scratch /nfsdata/scratch Add to /etc/fstab for persistence:\n$ echo \u0026#39;/dev/vg_nfs/lv_home /nfsdata/home xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab $ echo \u0026#39;/dev/vg_nfs/lv_work /nfsdata/work xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab $ echo \u0026#39;/dev/vg_nfs/lv_shared /nfsdata/shared xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab $ echo \u0026#39;/dev/vg_nfs/lv_scratch /nfsdata/scratch xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab Bind mount /nfsdata/home to /home on arbiter itself, so the management node also uses the NFS storage:\n$ echo \u0026#39;/nfsdata/home /home none bind 0 0\u0026#39; | sudo tee -a /etc/fstab $ sudo mount -a Verify the final layout:\n[wpaik@arbiter ~]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 223.6G 0 disk ├─sda1 8:1 0 600M 0 part /boot/efi ├─sda2 8:2 0 1G 0 part /boot └─sda3 8:3 0 222G 0 part ├─rl-root 253:0 0 70G 0 lvm / ├─rl-swap 253:1 0 7.7G 0 lvm [SWAP] └─rl-home 253:6 0 144.3G 0 lvm nvme0n1 259:0 0 931.5G 0 disk ├─vg_nfs-lv_home 253:2 0 167G 0 lvm /home │ /nfsdata/home ├─vg_nfs-lv_work 253:3 0 251G 0 lvm /nfsdata/work ├─vg_nfs-lv_shared 253:4 0 84G 
0 lvm /nfsdata/shared └─vg_nfs-lv_scratch 253:5 0 251G 0 lvm /nfsdata/scratch The bind mount makes lv_home appear twice: once at /nfsdata/home (the actual mount point) and once at /home (the bind mount that arbiter itself uses). The other three volumes only mount at their /nfsdata paths on arbiter. Client nodes will mount them at /work, /shared, and /scratch via NFS.\nConfigure the NFS Server # $ sudo dnf install -y nfs-utils $ sudo systemctl enable --now nfs-server Configure /etc/exports:\n/nfsdata/home 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) /nfsdata/work 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) /nfsdata/shared 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) /nfsdata/scratch 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) A quick note on the options: rw allows read and write, sync commits writes to disk before responding (safer), no_subtree_check avoids a performance penalty when exporting subdirectories, and no_root_squash lets root on client nodes act as root on the share, which Slurm will need later.\nNote on no_root_squash: This is appropriate for a trusted internal cluster network. Our cluster is physically isolated on the 192.168.50.x subnet. On a shared cluster with untrusted users, use root_squash instead.\nApply and open the firewall:\n$ sudo exportfs -ra $ sudo firewall-cmd --permanent --add-service={nfs,rpc-bind,mountd} $ sudo firewall-cmd --reload # Verify $ sudo showmount -e localhost Export list for localhost: /nfsdata/scratch 192.168.50.0/24 /nfsdata/shared 192.168.50.0/24 /nfsdata/work 192.168.50.0/24 /nfsdata/home 192.168.50.0/24 \u0026gt; 4. NFS Client Setup # Rather than SSHing into each node manually, use Ansible. 
Run from /opt/ansible on arbiter:\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/nfs_setup.yaml -K What the playbook does on each client node: installs nfs-utils, sets the SELinux boolean for NFS home directories, creates mount points for /work, /shared, and /scratch, adds all four NFS mounts to /etc/fstab with _netdev, and mounts them.\nThe _netdev option tells the system to wait for network availability before mounting. Without it, a node that boots faster than arbiter will fail to mount and potentially hang at boot.\nThe playbook also enables XFS quota on arbiter and reboots it to apply. This is covered in the full playbook in the GitHub repository.\nVerify from carrier after rebooting:\n[wpaik@carrier ~]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/rl-root 70G 5.4G 65G 8% / arbiter.cluster.local:/nfsdata/home 167G 8.2G 159G 5% /home arbiter.cluster.local:/nfsdata/work 251G 4.9G 247G 2% /work arbiter.cluster.local:/nfsdata/shared 84G 23G 62G 27% /shared arbiter.cluster.local:/nfsdata/scratch 251G 22G 230G 9% /scratch Note: The playbook reboots worker and GPU nodes automatically. carrier (the head node) requires a manual reboot after the playbook completes since it is the SSH entry point into the cluster. After rebooting carrier, verify mounts with df -h.\nBefore moving on to FreeIPA, run the Chrony playbook to synchronize time across all nodes:\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/chrony_setup.yaml -K This sets up carrier as the NTP server for the cluster and configures all other nodes to sync from it. FreeIPA uses Kerberos for authentication, and Kerberos will reject tickets if the time difference between nodes exceeds 5 minutes. 
Running Chrony before FreeIPA avoids that problem.\nTest that the share works:\n# Create a test file from interceptor-01 [wpaik@interceptor-01 ~]$ touch /home/nfs_test.txt # Verify it appears on interceptor-02 [wpaik@interceptor-02 ~]$ ls /home/nfs_test.txt /home/nfs_test.txt One file, visible everywhere.\n\u0026gt; 5. Time Synchronization (Chrony) # Before setting up FreeIPA, all nodes need to be synchronized to the same time source. FreeIPA uses Kerberos for authentication, and Kerberos will reject tickets if the clock difference between nodes exceeds 5 minutes. On a fresh cluster this is usually fine, but it is better to set it up explicitly.\ncarrier acts as the NTP server for the cluster. It syncs from external sources (time.cloudflare.com, pool.ntp.org) and serves time to all internal nodes. The other nodes sync from carrier.\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/chrony_setup.yaml -K Verify sync status on any node after the playbook completes:\n$ chronyc tracking Reference ID : C0A83201 (carrier.cluster.local) Stratum : 3 System time : 0.000123456 seconds fast of NTP time Last offset : +0.000045678 seconds RMS offset : 0.000089012 seconds Reference ID pointing to carrier.cluster.local confirms the node is syncing from carrier.\n\u0026gt; 6. The Problem with Local Users # NFS solves the file sharing problem. But it creates a new one.\nNFS uses UID (User ID) and GID (Group ID) numbers to handle file permissions, not usernames. When user will on interceptor-01 has UID 1001, and user will on interceptor-02 has UID 1002 (because you created the accounts in a different order), they see different permissions on the same NFS files.\n# On interceptor-01 $ id will uid=1001(will) gid=1001(will) # On interceptor-02 $ id will uid=1002(will) gid=1002(will) # The NFS file owned by will on interceptor-01 (uid=1001) # looks like it belongs to a different user on interceptor-02 You can work around this by manually synchronizing UIDs across every node. 
On a six-node cluster with a few users, that is tedious but manageable. On a real cluster with hundreds of users, it is not viable.\nThe proper solution is centralized authentication: one place where user accounts are defined, and every node pulls from that source. This is what FreeIPA provides.\nPre-flight: UID Alignment # NFS does not compare usernames. It compares the numeric UID and GID stamped on every file. If wpaik has UID 1000 on arbiter but UID 1001 on interceptor-01, every file written from interceptor-01 lands on the share owned by UID 1001, and arbiter cannot find a matching user. Reads and writes silently misbehave or fail outright.\nFor a fresh six-node build done in one sitting, this usually does not bite. Rocky\u0026rsquo;s installer assigns UID 1000 to the first user created during installation, so as long as wpaik was the first user on every node, the numbers line up by themselves. The hazard appears later: a node reinstalled out of band, a kickstart that differs between machines, or an extra account created during install before wpaik. The UID drifts, NFS quietly breaks, and the failure mode is confusing because everything else looks fine.\nCheck before mounting anything:\n[wpaik@arbiter ansible]$ ansible all_nodes -a \u0026#34;id wpaik\u0026#34; Every node should report the same uid= and gid=. If one differs, align it against arbiter\u0026rsquo;s value (typically 1000, but verify) before continuing.\nThe fix runs on the misaligned node, as a different sudoer or as root, with no active wpaik session. The example below assumes arbiter has wpaik at UID 1000 and the misaligned node currently has 1001. 
Substitute your actual values.\n# On the misaligned node, as root or another sudoer [root@interceptor-01 ~]# who | grep wpaik # confirm no live session [root@interceptor-01 ~]# pkill -KILL -u wpaik # kill any leftovers # If NFS is already mounted, unmount first [root@interceptor-01 ~]# umount /home # use -l if busy # Renumber the account [root@interceptor-01 ~]# groupmod -g 1000 wpaik [root@interceptor-01 ~]# usermod -u 1000 -g 1000 wpaik # Fix ownership of files under the old UID. # -xdev keeps find on the local filesystem, so other partitions # and NFS mounts (if any are still present) are not touched. [root@interceptor-01 ~]# find / -xdev -uid 1001 -exec chown -h 1000 {} + [root@interceptor-01 ~]# find / -xdev -gid 1001 -exec chgrp -h 1000 {} + # Verify [root@interceptor-01 ~]# id wpaik uid=1000(wpaik) gid=1000(wpaik) groups=1000(wpaik),10(wheel) If wpaik belonged to extra groups before (wheel, for example), check with groups wpaik and re-add anything that got dropped during the usermod.\nThis is a stopgap. FreeIPA in Section 7 replaces local accounts with centralized identity and the question stops mattering. Until then, UID alignment is something you manage by hand whenever a node joins the cluster out of cycle.\n\u0026gt; 7. FreeIPA Server Installation # FreeIPA bundles several services into one package: LDAP (directory), Kerberos (authentication), DNS, and a certificate authority. The installation is opinionated and sets everything up together.\nAll commands in this section run on arbiter.\nPrerequisites # FreeIPA requires a fully qualified domain name (FQDN). Verify it resolves correctly before proceeding:\n[wpaik@arbiter ~]$ hostname -f arbiter.cluster.local [wpaik@arbiter ~]$ ping -c 1 arbiter.cluster.local PING arbiter.cluster.local (192.168.50.50) 56(84) bytes of data. Also verify at least 1.5GB of free RAM. 
The installer is memory-hungry:\n$ free -h total used free Mem: 15Gi 800Mi 14Gi Install and Run the Server Setup # $ sudo dnf install -y freeipa-server freeipa-server-dns $ sudo ipa-server-install \\ --domain=cluster.local \\ --realm=CLUSTER.LOCAL \\ --ds-password=\u0026lt;your_directory_manager_password\u0026gt; \\ --admin-password=\u0026lt;your_admin_password\u0026gt; \\ --hostname=arbiter.cluster.local \\ --ip-address=192.168.50.50 \\ --no-ntp \\ --unattended A few things to note: --realm must be uppercase, --no-ntp skips NTP configuration since we manage time sync with Chrony separately, and --unattended skips interactive prompts. The installer takes 5-10 minutes and configures LDAP, Kerberos, and the CA.\nAfter completion, open the required firewall ports:\n$ sudo firewall-cmd --permanent --add-service={freeipa-ldap,freeipa-ldaps,kerberos,dns,http,https} $ sudo firewall-cmd --reload Verify the Installation # $ kinit admin Password for admin@CLUSTER.LOCAL: $ klist Ticket cache: KCM:0 Default principal: admin@CLUSTER.LOCAL Valid starting Expires Service principal 04/27/26 09:00:00 04/28/26 09:00:00 krbtgt/CLUSTER.LOCAL@CLUSTER.LOCAL $ ipa user-find --------------- 0 users matched --------------- No users yet. We will add them after enrollment.\nSet the default shell to bash (the FreeIPA default is /bin/sh):\n$ ipa config-mod --defaultshell=/bin/bash \u0026gt; 8. FreeIPA Client Enrollment # Before enrolling, add arbiter to /etc/hosts on every node. The enrollment process needs to resolve arbiter.cluster.local, and at this point SSSD is not yet configured. 
Doing this beforehand ensures enrollment does not fail on DNS resolution.\nThe Ansible playbook handles this automatically:\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/freeipa_setup.yaml -K If you prefer to do it manually on each node:\n# Add arbiter to /etc/hosts $ echo \u0026#34;192.168.50.50 arbiter.cluster.local arbiter\u0026#34; | sudo tee -a /etc/hosts # Install and enroll $ sudo dnf install -y freeipa-client oddjob-mkhomedir $ sudo ipa-client-install \\ --server=arbiter.cluster.local \\ --domain=cluster.local \\ --realm=CLUSTER.LOCAL \\ --principal=admin \\ --password=\u0026lt;your_admin_password\u0026gt; \\ --mkhomedir \\ --no-ntp \\ --unattended The --mkhomedir flag tells the system to create a home directory on first login. Since /home is NFS-mounted from arbiter, the directory lands on the NFS share and is immediately visible from all nodes.\nAfter enrollment, confirm each node can reach the IPA server:\n[wpaik@interceptor-01 ~]$ ipa user-find --------------- 0 users matched --------------- If this returns a response (even 0 users), the client is enrolled and talking to the server.\nCreate a Test User # Back on arbiter:\n[wpaik@arbiter ~]$ kinit admin $ ipa user-add testuser \\ --first=Test \\ --last=User \\ --password $ ipa user-find testuser -------------- 1 user matched -------------- User login: testuser First name: Test Last name: User Home directory: /home/testuser Login shell: /bin/bash UID: 99100XXXX GID: 99100XXXX Notice the UID range. FreeIPA assigns UIDs starting well above the range used by local system accounts, avoiding any collision. 
The exact starting range depends on how FreeIPA was configured during installation, but whatever it assigns will be identical on every node in the cluster.\nFor ongoing user management, the scripts/user_creation.sh script in the GitHub repository handles the full process: FreeIPA account creation, home directory setup with correct NFS ownership, XFS quota, and Slurm accounting entry.\nAccessing the FreeIPA Web UI # The FreeIPA web interface is reachable from outside the cluster using sshuttle, a VPN-over-SSH tool that routes traffic through the login node.\nOn your local machine:\n# Install sshuttle $ sudo dnf install sshuttle # Fedora/RHEL # or: pip install sshuttle # Add arbiter to your local /etc/hosts $ echo \u0026#34;192.168.50.50 arbiter arbiter.cluster.local\u0026#34; | sudo tee -a /etc/hosts # Open the tunnel (keep this terminal open) $ sshuttle -r wpaik@carrier.cluster.local 192.168.50.0/24 --dns Then open a browser and go to https://arbiter.cluster.local/ipa/ui/. Accept the self-signed certificate warning and log in with the admin credentials.\n\u0026gt; 9. Verification # SSH as the new user from the login node to a compute node:\n[wpaik@carrier ~]$ ssh testuser@interceptor-01 Password: Creating home directory for testuser. [testuser@interceptor-01 ~]$ pwd /home/testuser [testuser@interceptor-01 ~]$ id uid=99100XXXX(testuser) gid=99100XXXX(testuser) groups=99100XXXX(testuser) Now check the same user from a different node:\n[testuser@interceptor-02 ~]$ id uid=99100XXXX(testuser) gid=99100XXXX(testuser) groups=99100XXXX(testuser) Same UID on both nodes. Files written on interceptor-01 have correct permissions on interceptor-02. The home directory is the same NFS path regardless of which node you land on.\nOne account. Every node. One home directory.\nTroubleshooting Common Issues # Enrollment fails with DNS error: The playbook adds arbiter.cluster.local to /etc/hosts before enrollment. 
If it still fails, verify the entry exists on the failing node:\n$ getent hosts arbiter.cluster.local 192.168.50.50 arbiter.cluster.local arbiter If missing, add it manually:\n$ echo \u0026#34;192.168.50.50 arbiter.cluster.local arbiter\u0026#34; | sudo tee -a /etc/hosts NFS mount fails after FreeIPA enrollment: FreeIPA updates /etc/nsswitch.conf. Confirm that sss and files are both listed for passwd and group:\n$ grep -E \u0026#34;^(passwd|group)\u0026#34; /etc/nsswitch.conf passwd: sss files systemd group: sss files systemd If NFS mounts hang after enrollment:\n$ sudo setsebool -P use_nfs_home_dirs 1 Home directory not created on first login:\n$ sudo systemctl enable --now oddjobd Node freezes on boot after NFS setup: A stale resume=UUID in GRUB can cause boot hangs. From the GRUB menu, press e, remove the resume=UUID=... argument, then Ctrl+X to boot. Once up:\n$ sudo grubby --update-kernel=ALL --remove-args=\u0026#34;resume=UUID=\u0026lt;UUID\u0026gt;\u0026#34; \u0026gt; 10. What is Next # The cluster now has shared storage and centralized authentication. Every node shares the same home directory and every user has a consistent identity across all nodes.\nNext episode we install Slurm, the job scheduler.
With NFS and FreeIPA already in place, Slurm has everything it needs to schedule jobs across nodes and write output files back to a shared location.\nAll configuration files and Ansible playbooks from this episode are in the GitHub repository.\nHappy Computing!\n","date":"5 5월 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-04/","section":"Posts","summary":"","title":"[HPC From Scratch] Episode 4: NFS Storage \u0026 FreeIPA: One Drive, One Login","type":"posts"},{"content":"","date":"5 5월 2026","externalUrl":null,"permalink":"/tags/freeipa/","section":"Tags","summary":"","title":"FreeIPA","type":"tags"},{"content":"","date":"5 5월 2026","externalUrl":null,"permalink":"/tags/nfs/","section":"Tags","summary":"","title":"NFS","type":"tags"},{"content":"Date: April 28, 2026 Venue: Northeastern University, Boston, MA\nOverview # A hands-on workshop for university researchers who want to scale computation beyond a single CPU core. This session walks through core parallel computing concepts, real benchmark results, and working code examples that can be run directly on the cluster.\nTopics Covered # Serial vs. parallel execution: pipelining and data parallelism Flynn\u0026rsquo;s Taxonomy: SISD, SIMD, MISD, MIMD Shared vs. distributed memory models and when to use each Amdahl\u0026rsquo;s Law, Gustafson\u0026rsquo;s Law, and strong vs. weak scaling CPU parallelism in practice: Conway\u0026rsquo;s Game of Life (serial, OpenMP, MPI+OpenMP) GPU computing fundamentals: CUDA workflow and memory model Scaling ML workloads with PyTorch: single GPU, multi-GPU, and multi-node DDP Parallel tools for Python, R, and MATLAB Mapping parallelism to Slurm: --ntasks vs. 
--cpus-per-task Materials # Workshop Slides \u0026amp; Materials (GitHub) Workshop Recordings (Spring 2026) ","date":"28 4월 2026","externalUrl":null,"permalink":"/talks/neu-talk-02/","section":"Talks \u0026 Workshops","summary":"","title":"Introduction to Parallel Computing","type":"talks"},{"content":"","date":"28 4월 2026","externalUrl":null,"permalink":"/tags/parallel-computing/","section":"Tags","summary":"","title":"Parallel-Computing","type":"tags"},{"content":"","date":"28 4월 2026","externalUrl":null,"permalink":"/talks/","section":"Talks \u0026 Workshops","summary":"","title":"Talks \u0026 Workshops","type":"talks"},{"content":"","date":"28 4월 2026","externalUrl":null,"permalink":"/tags/workshop/","section":"Tags","summary":"","title":"Workshop","type":"tags"},{"content":"A laptop, a home router, and a gigabit switch. One isolated cluster subnet.\nWelcome back to HPC From Scratch. In Episode 2, we upgraded the four M715q nodes with dual-channel RAM and an NVMe drive, and fixed the iGPU memory trap that can crash Slurm jobs. This episode brings the cluster online: installing Rocky Linux, designing the network, and turning a laptop into a DHCP server, NAT gateway, and SSH bastion for the internal cluster subnet.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. The Topology Decision # In production HPC, management and compute networks are strictly wired, physically separated, and connected through managed switches with VLANs. A single enterprise managed switch can cost more than this entire cluster.\nFor a home build, there are two realistic paths:\nFlat home network. Plug every node into the home router. Easy, but every node is exposed to the same network as phones, TVs, and IoT devices. No isolation, and one compromised device can reach the whole cluster. Physical isolation with a dedicated switch. All cluster nodes live on their own subnet behind a cheap unmanaged switch. The login node bridges the two worlds. I went with option 2. 
The Netgear GS308E provides the isolation. The login node sits at the boundary, handling DHCP, DNS, and NAT for the internal cluster subnet. Worker nodes never see the home network directly.\nThe result is the same pattern used in production HPC: the login node at the edge, an internal fabric behind it, and no direct external exposure for compute nodes. The difference is scale. Gigabit Ethernet instead of InfiniBand. An unmanaged consumer switch instead of a spine-leaf topology. Same architecture, different order of magnitude.\nNote: The HP Envy GPU node (corsair-01) connects to the same switch and gets the same base OS and network setup as every other node. The GPU side of that box will be configured in a later episode.\n\u0026gt; 2. OS Installation # Every node runs Rocky Linux 10, minimal install. I used the NanoKVM to mount the ISO and drive the installer over my browser, rotating it between machines. A monitor and keyboard work the same way if you do not have a NanoKVM.\nThe installation itself is unremarkable: boot the ISO, select minimal install, pick the boot drive, let it run, reboot.\nTwo things worth pre-planning while installing:\nCreate a sudoer user on every node. This is the account you will SSH into later. Root SSH will be disabled, so without this account you will lock yourself out.\nUse the same username across all nodes. When you run ssh-copy-id later, the local username is assumed by default, so ssh-copy-id arbiter works if the user matches on both sides. When FreeIPA comes in a later episode, it will replace these local accounts with centralized identity, but consistency makes the transition smoother.\n\u0026gt; 3. Login Node on WiFi # The login node is carrier, a refurbished Lenovo IdeaPad 1 laptop. It has WiFi and one Ethernet port. Most build guides would tell you that a login node should be wired. I put this one on WiFi on purpose.\nWhy WiFi for the external side? 
The login node needs internet access for package updates, pulling datasets, and remote SSH from outside the home. Running Ethernet to the home router would work, but it would consume one of the eight switch ports I need for cluster nodes, and it would require an extra cable across the room. WiFi removes that constraint at the cost of bandwidth the login node does not actually need.\nWhy Ethernet for the internal side? All heavy traffic (NFS reads, MPI messages, scheduler heartbeats) has to stay on the wired switch at full gigabit. The login node\u0026rsquo;s Ethernet port is the gateway into that fabric.\nThere are three laptop-specific steps to cover before anything else works.\nEssential packages. Throughout the series we will need a compiler, git, and a reasonable editor:\nsudo dnf upgrade -y sudo dnf install -y epel-release sudo dnf install -y vim git wget tree curl gcc-c++ cmake m4 Lid-close fix. By default, closing a laptop lid triggers systemd-logind to suspend the machine. For a login node this is catastrophic: the cluster loses its DHCP server, NAT gateway, and SSH entry point the moment you close the lid. The fix is a one-line change in /usr/lib/systemd/logind.conf (package updates can overwrite files under /usr/lib, so a drop-in under /etc/systemd/logind.conf.d/ is the more durable place for it):\nHandleLidSwitch=ignore After sudo systemctl restart systemd-logind, the laptop can live closed on top of the cluster stack without suspending.\nRouting priority. With two active interfaces (WiFi and Ethernet), Linux has to decide which one handles outbound internet traffic. It picks based on the route metric: lower metric wins. By default the wired connection often gets a lower metric than WiFi, which means internet traffic would be routed out through the cluster switch, which has no path to the home router.
The fix is to force WiFi to have the lowest metric:\nnmcli connection modify \u0026lt;WIFI NAME\u0026gt; ipv4.route-metric 10 nmcli connection down \u0026lt;WIFI NAME\u0026gt; \u0026amp;\u0026amp; nmcli connection up \u0026lt;WIFI NAME\u0026gt; Find the connection name with nmcli connection show. After this, ip route show default should list WiFi as the first (primary) default route.\n\u0026gt; 4. DHCP: Handing Out IPs # The worker nodes need IP addresses. They have no connection to the home router, so the home router\u0026rsquo;s DHCP cannot reach them. The login node has to become their DHCP server.\nFirst, give the login node a fixed address on the cluster side. Workers will use this as their gateway:\nnmcli connection modify \u0026lt;WIRED NAME\u0026gt; ipv4.addresses 192.168.50.1/24 ipv4.method manual nmcli connection up \u0026lt;WIRED NAME\u0026gt; Now install and configure dnsmasq. I picked it over isc-dhcp-server because it is lightweight, single-binary, and handles both DHCP and DNS. For a six-node cluster, anything more is overkill.\nsudo dnf install -y dnsmasq sudo mv /etc/dnsmasq.conf /etc/dnsmasq.conf.bak The replacement /etc/dnsmasq.conf is about ten lines:\ninterface=\u0026lt;WIRED INTERFACE\u0026gt; dhcp-range=192.168.50.10,192.168.50.50,12h dhcp-option=3,192.168.50.1 dhcp-option=6,1.1.1.1,8.8.8.8 log-queries log-dhcp Find the interface name with nmcli device. Each line does one job:\ninterface= restricts dnsmasq to the wired side only. Without this, dnsmasq would try to answer DHCP requests on WiFi too, which would fight with the home router. dhcp-range= defines the pool of IPs dnsmasq will hand out, and the lease duration (12 hours). dhcp-option=3,192.168.50.1 advertises the login node as the default gateway. This is how workers learn where to send traffic destined for the internet. dhcp-option=6,1.1.1.1,8.8.8.8 tells workers which DNS servers to use (Cloudflare and Google as public fallbacks). log-queries and log-dhcp turn on verbose logging. 
Invaluable during initial bring-up. Turn them off once the cluster is stable. Open the firewall for DHCP and DNS, then start the service:\nsudo firewall-cmd --permanent --add-service=dhcp sudo firewall-cmd --permanent --add-service=dns sudo firewall-cmd --reload sudo systemctl enable --now dnsmasq Tip: journalctl -u dnsmasq -f on the login node during worker boot shows the full DHCP handshake as it happens (DHCPDISCOVER, DHCPOFFER, DHCPREQUEST, DHCPACK). Very useful for diagnosing why a worker is not getting an address.\n\u0026gt; 5. NAT: Getting Workers to the Internet # DHCP handed out IPs in the 192.168.50.x range. Those are private addresses, defined by RFC 1918 as non-routable on the public internet. If a worker sends a packet to dnf.rocky.example.com, it goes out to the cluster switch, bounces around, and dies. It has no path out.\nThe fix is Network Address Translation (NAT). The login node rewrites the source address on every outbound packet to its own WiFi-side IP (itself a private address on the home network, which the home router translates again on the way out). Reply packets come back to the WiFi IP, and the login node looks up which internal host the packet belongs to and forwards it back. This is the same trick your home router does for every device in the house.\nTwo pieces are needed.\nIP forwarding. By default, a Linux machine will not forward packets between interfaces. It has to be explicitly allowed:\nsudo sysctl -w net.ipv4.ip_forward=1 echo \u0026#34;net.ipv4.ip_forward = 1\u0026#34; | sudo tee /etc/sysctl.d/99-ipforward.conf The first command enables forwarding immediately. The second persists it across reboots.\nMasquerade rule. With forwarding enabled, the kernel will route packets between interfaces, but it will not rewrite their source addresses. A masquerade rule on firewalld tells the kernel to do that rewriting:\nsudo firewall-cmd --permanent --add-masquerade sudo firewall-cmd --reload Verify:\nsudo firewall-cmd --list-all | grep masquerade Should show masquerade: yes.\nBring workers online and test. Power on a worker node.
On the login node, check the leases file:\ncat /var/lib/dnsmasq/dnsmasq.leases Each line contains the lease expiry time, MAC address, IP, hostname, and client ID. SSH into the worker using the sudoer account you created during install:\nssh \u0026lt;user\u0026gt;@192.168.50.11 ping -c 3 1.1.1.1 If the ping works, DHCP, routing, and NAT are all doing their job. Pinging a hostname instead of an IP address checks DNS as well.\n\u0026gt; 6. Hostnames Instead of IPs # Typing IP addresses everywhere gets old fast. Worse, if you ever renumber the subnet, every script, config file, and commit history has the wrong addresses baked in. Hostnames are indirection, and indirection is cheap insurance.\nI use these names:\nHostname IP Role carrier 192.168.50.1 Login node arbiter 192.168.50.50 Management / NFS interceptor-01 192.168.50.15 Compute interceptor-02 192.168.50.32 Compute observer 192.168.50.19 Visualization corsair-01 192.168.50.11 GPU On each node, including the login node:\nsudo hostnamectl set-hostname \u0026lt;HOSTNAME\u0026gt; Then add every node to /etc/hosts on the login node:\n192.168.50.1 carrier.cluster.local carrier 192.168.50.15 interceptor-01.cluster.local interceptor-01 192.168.50.32 interceptor-02.cluster.local interceptor-02 192.168.50.11 corsair-01.cluster.local corsair-01 192.168.50.19 observer.cluster.local observer 192.168.50.50 arbiter.cluster.local arbiter From now on, ssh arbiter works from the login node instead of ssh 192.168.50.50. This is a stopgap. FreeIPA in a later episode brings up a proper DNS server so hostnames resolve cluster-wide without touching /etc/hosts on each node.\n\u0026gt; 7. Hardening the Exposed Surface # Only the login node is reachable from the home WiFi. Workers sit behind NAT on their own subnet, so nothing on the home network can reach them directly. Hardening effort goes into carrier.\nThree things matter here: the SSH config itself, brute-force protection, and a small systemd fix specific to laptops.\nSSH drop-in config. 
Rocky 10\u0026rsquo;s default /etc/ssh/sshd_config includes files from /etc/ssh/sshd_config.d/*.conf, and the first value wins when the same setting appears in multiple files. This is a drop-in config system: you do not edit the main config, you add a new file with only the things you want to change.\nThe only real change I make is disabling direct root SSH login:\nsudo tee /etc/ssh/sshd_config.d/99-custom.conf \u0026gt; /dev/null \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; PermitRootLogin no EOF sudo sshd -t # validate syntax sudo systemctl reload sshd Because the first value wins and the drop-ins are read in lexical order, a 99- file only takes effect for settings that no earlier drop-in has already set. sudo sshd -T | grep -i permitrootlogin shows the effective value.\nA few settings stay at their upstream defaults on purpose:\nPublic key authentication is enabled by default. ssh-copy-id works without any config change. Password authentication is also enabled by default, and I keep it. HPC users coming from university clusters are used to password login, and FreeIPA in a later episode will route that through centralized auth anyway. The combination of fail2ban and a decent password policy is a reasonable defense. Host keys (the server\u0026rsquo;s identity, not user keys) auto-load when no explicit HostKey directive is set. Rocky 10 generates RSA, ECDSA, and Ed25519 host keys at first boot. No config needed. Make sure the SSH port is reachable through firewalld:\nsudo firewall-cmd --permanent --add-service=ssh sudo firewall-cmd --reload fail2ban. Port 22 attracts brute-force attempts even on home networks. A compromised IoT device on the same WiFi is enough to start one. 
fail2ban watches auth logs, and when it sees too many failures from the same IP in a short window, it adds a temporary firewall rule to drop traffic from that IP.\nFollowing upstream fail2ban guidance, the configuration is a short jail.local that overrides only what I want to change:\nsudo dnf install -y fail2ban sudo tee /etc/fail2ban/jail.local \u0026gt; /dev/null \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; [DEFAULT] bantime = 10m maxretry = 3 [sshd] enabled = true mode = aggressive EOF sudo systemctl enable --now fail2ban Three failed auth attempts from an IP and it is banned for ten minutes at the firewall level. mode = aggressive combines the normal, ddos, and extra SSH filters. normal catches standard auth failures, ddos catches connections that close before authentication completes (a signature of some scanners), and extra adds a few less common patterns. Check what is currently banned with sudo fail2ban-client status sshd.\nCluster-wide passwordless SSH. With key auth enabled and the SSH jail in place, the last piece is distributing keys. On the login node, as your regular sudoer user (not root, which cannot SSH anyway):\nssh-keygen -t ed25519 ssh-copy-id \u0026lt;user\u0026gt;@arbiter ssh-keygen creates a private/public keypair in ~/.ssh/ (id_ed25519 and id_ed25519.pub). ssh-copy-id logs into the target machine with password auth once, then appends your public key to ~/.ssh/authorized_keys on that machine. On subsequent SSH attempts, the server sees your public key, verifies you have the matching private key, and lets you in without asking for a password. Repeat for each worker. After that, ssh arbiter from the login node should not prompt for a password.\nOptional: sshd startup override for laptop login nodes. On this laptop-based login node I ran into occasional boot-time issues where sshd failed to start before the Ethernet interface was fully configured. I did not capture the exact error at the time, so I cannot confirm the root cause with certainty. 
The standard fix is a systemd override that makes sshd wait for network-online.target and retry on failure. If your sshd is in a failed state after reboot, check journalctl -u sshd -b. If you are building this on desktop or server hardware, you likely do not need it.\nApply it with sudo systemctl edit sshd.service and paste:\n[Unit] Wants=network-online.target After=network-online.target StartLimitIntervalSec=0 [Service] Restart=on-failure RestartSec=5s systemctl edit creates the drop-in file at /etc/systemd/system/sshd.service.d/override.conf and runs daemon-reload automatically.\nInternal Nodes: Firewall Off # Everything in this section so far has been about carrier. The other five nodes are a different story.\nThey sit on 192.168.50.0/24 behind NAT. Nothing on the home WiFi can reach them directly, and the only inbound path is through carrier. firewalld on these nodes adds no real defense, but it does block things that need to work: NFS callbacks, FreeIPA enrollment over Kerberos and LDAP, and the long list of dynamic ports Slurm uses for srun and step launches. Maintaining accurate firewall rules across all of that is tedious and easy to get wrong.\nThe simpler approach, and the standard practice on isolated HPC fabrics, is to turn it off on every node that is not the login node:\nsudo systemctl disable --now firewalld The security boundary is carrier. Inside the boundary, full trust between nodes.\n\u0026gt; 8. Why WiFi Is Not the Bottleneck # The most common question about this topology is whether WiFi on the login node bottlenecks the cluster. It does not, because management traffic and compute traffic take different paths.\nEditing code over SSH, pulling packages from dnf, running git pull, monitoring the system from a browser: all of this goes out over WiFi. None of it is bandwidth-sensitive. WiFi throughput at a few hundred Mbps is more than enough.\nThe heavy lifting happens entirely on the gigabit switch. 
When interceptor-01 reads a dataset from arbiter\u0026rsquo;s NFS share, that traffic goes node-to-switch-to-node without ever touching the login node. When an MPI job on two workers exchanges messages, same thing. Full gigabit, predictable latency, no WiFi involvement.\nThe compute fabric is purely wired. The WiFi side is only for management and internet access. Someone streaming 4K video on the home network has zero impact on cluster performance.\n\u0026gt; 9. What is Next # The cluster is now networked, addressable, and reachable. Every node has an OS. Hostnames resolve. The login node handles DHCP, NAT, and SSH for the internal subnet. From any machine on the home network, I can SSH into carrier and from there into any worker, password-free.\nIn Episode 4, we mount the Samsung 990 Pro on arbiter as shared NFS storage and bring up FreeIPA for centralized user management. After that, a single user account created once will work across every node in the cluster, and all nodes will share a home directory tree.\nAll configuration files and the full command reference for this episode are on GitHub.\nHappy Computing!\n","date":"21 4월 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-03/","section":"Posts","summary":"","title":"[HPC From Scratch] Episode 3: The WiFi Login Node","type":"posts"},{"content":"","date":"21 4월 2026","externalUrl":null,"permalink":"/tags/networking/","section":"Tags","summary":"","title":"Networking","type":"tags"},{"content":"Four nodes. 16GB each. One hidden BIOS setting that can crash your Slurm jobs.\nWelcome back to HPC From Scratch. In Episode 1, we covered the full cluster architecture, cost breakdown, and network layout. This episode focuses on the compute backbone: upgrading the four Lenovo ThinkCentre M715q nodes with dual-channel RAM and NVMe storage, and fixing a BIOS setting that silently eats your memory.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. 
What We Are Working With # Each M715q is a tiny Micro Form Factor PC. Here is what they shipped with from eBay:\nSpec Stock Configuration CPU AMD Ryzen 5 Pro 2400GE (4C/8T, 35W TDP) RAM 8GB DDR4 SO-DIMM (single stick, single-channel) Boot Drive 240GB 2.5\u0026quot; SATA SSD M.2 Slot Empty (NVMe capable) iGPU AMD Radeon RX Vega 11 The Ryzen 5 Pro 2400GE is a 35-watt part. Quiet and low-power, which matters when you have four of them sitting on your desk. 4 cores and 8 threads per node gives us 32 threads total across all the M715q nodes.\n8GB of single-channel RAM is often not enough for most HPC workloads. And the boot drive being a 2.5\u0026quot; SATA SSD turned out to work in our favor for the storage upgrade.\n\u0026gt; 2. The Storage Upgrade Path # When I opened the first M715q, I found the 240GB SATA SSD sitting in the 2.5\u0026quot; bay and an empty M.2 NVMe slot on the motherboard.\nBecause the OS boots from the SATA drive, the high-speed M.2 slot is free. I realized that I could install a 1TB Samsung 990 Pro into that slot on the management node. This drive serves as the NFS storage for the entire cluster.\nIf the boot drive had been an M.2 SSD instead (which I believe is a default option for these units), the upgrade path would have been different. I would have bought a standard SATA SSD for NFS instead. You work with the hardware you get.\nA PCIe Gen 4 NVMe drive is probably overkill for a Gigabit Ethernet network. Even more so because the M715q\u0026rsquo;s M.2 slot is PCIe 3.0, so the 990 Pro runs at Gen 3 speeds anyway. The network will bottleneck long before the drive does. We will benchmark the throughput we actually get in a later episode.\nNote: Only the management node gets the NVMe drive. The other three M715q nodes keep their stock 240GB SATA SSDs as boot drives. There is no need for local fast storage on compute nodes when jobs read data from NFS.\n\u0026gt; 3. RAM Upgrade: 8GB to 16GB Dual-Channel # 8GB is not enough for most HPC workloads. 
Instead of replacing the existing stick with a single 16GB module, I added a second 8GB stick.\nEach M715q came with one 8GB DDR4 SO-DIMM in one slot, leaving the second slot empty. I bought matching 8GB sticks and installed them in the empty slots. This gives us two benefits:\nDouble the capacity (8GB to 16GB) Dual-channel memory bandwidth Dual-channel matters for compute. With a single stick, the CPU accesses memory through one channel. With a stick in each slot, it can read and write through two channels simultaneously. This roughly doubles the theoretical memory bandwidth, which directly affects performance in memory-bound numerical workloads, including many MPI codes.\nRAM Compatibility\nThe M715q uses DDR4 SO-DIMM (laptop-sized) memory. When buying used RAM, match the specifications as closely as possible to the existing stick:\nSpec What to Match Form Factor DDR4 SO-DIMM Capacity 8GB (to match existing stick) Speed DDR4-2666 or higher (the 2400GE supports up to 2933) Voltage 1.2V (standard DDR4) (The M715q units I purchased came with DDR4-2666.)\nI bought my RAM sticks on eBay. The four sticks cost a total of $78, averaging about $20 per node for the upgrade. If you buy 16GB kits (2x8GB) new, expect to pay more, but the two sticks in a kit are guaranteed to match each other.\nTip: If you are unsure about compatibility, check the spec sheet for the existing stick online and search for a matching part.\nInstallation\nOpening the M715q is straightforward. Remove one screw on the back panel, slide the top cover off, and the internals are fully exposed. Once you remove the 2.5\u0026quot; SATA bay (one screw, then slide forward), the two SO-DIMM slots are clearly visible. Push the new stick into the empty slot until the clips snap into place.\nI upgraded all four nodes. The management node took a bit longer because it also got the Samsung 990 Pro NVMe drive. The other three were just RAM, so I ran through them quickly.\n\u0026gt; 4. 
The iGPU Memory Trap # This is the part that might save you hours of debugging later.\nAfter installing the RAM, I booted the management node from a Linux live USB using the NanoKVM (no monitor or keyboard needed). I opened a terminal and ran:\n$ free -m total used free shared buff/cache available Mem: 15661 1656 10369 73 3989 14005 15,661 MiB. We installed 16GiB (16,384 MiB). Where did the other ~700 MiB go?\nThe main culprit: the integrated Vega GPU. Ryzen APUs share system RAM with the iGPU (integrated GPU). The GPU reserves a portion of your physical memory as video memory (VRAM), and the operating system never sees it.\nI confirmed this with the following command:\n$ dmesg | grep VRAM The output showed 256MB allocated to VRAM. The rest of the gap is memory that the firmware and the kernel reserve for themselves, which free does not count in its total.\nBIOS Setting: UMA Frame Buffer Size\nThe amount of RAM reserved for the iGPU is controlled by a BIOS setting called UMA Frame Buffer Size. On my units, the default was set to Auto, which allocated 256MB.\nI explicitly set it to 256MB (the lowest available option). Why bother changing it if Auto was already picking 256MB? Because Auto lets the firmware decide, and that decision could change after a BIOS update or a hardware configuration change. If the iGPU suddenly grabs 512MB instead of 256MB, your Slurm jobs could start failing and the error messages will not point you to the BIOS.\nPinning it to a fixed value removes the guesswork.\nWhy This Matters for Slurm\nWhen you configure Slurm later in this series, each node\u0026rsquo;s memory must be declared in slurm.conf using the RealMemory parameter. If you set RealMemory=16000 because you installed 16GB, you are promising Slurm memory that does not exist. Depending on the configuration, the node may be drained for reporting less memory than declared, or jobs may be scheduled into memory that is not there and fail with out-of-memory errors.\nThe correct approach:\nBoot the node Run free -m and note the total value Use that number (or slightly below it) as RealMemory in your Slurm configuration # Example slurm.conf entry NodeName=interceptor-01 CPUs=8 RealMemory=15600 State=UNKNOWN Every megabyte counts. 
Document it now, and save yourself the debugging later.\n\u0026gt; 5. Upgrade Cost Summary # Here is what this episode\u0026rsquo;s upgrades cost:\nItem Count Unit Price (USD) Total (USD) DDR4 8GB SO-DIMM (Micron) 2 15.00 30.00 DDR4 8GB SO-DIMM (Hynix) 2 24.00 48.00 Samsung 990 Pro 1TB NVMe 1 109.90 109.90 Episode Total $187.90 Combined with the four M715q units from Episode 1 ($343.60), the total compute backbone cost so far is $531.50. That covers four Ryzen nodes with 16GB dual-channel RAM each and 1TB of NVMe storage for NFS.\nPer node (excluding the shared NFS drive): roughly $105 for a fully upgraded Ryzen 4C/8T compute node with 16GB of RAM.\n\u0026gt; 6. What is Next # The compute backbone is assembled and verified. But hardware without a network is just a pile of metal.\nIn Episode 3, we will look at the rest of the cluster: the HP Envy TE01 GPU node with its Intel i7-10700F, the Gigabit network switch that connects everything, and why the login node uses WiFi to bridge the cluster to the internet.\nHappy Computing!\n","date":"25 3월 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-02/","section":"Posts","summary":"","title":"[HPC From Scratch] Episode 2: RAM, NVMe, and the iGPU Memory Trap","type":"posts"},{"content":"","date":"25 3월 2026","externalUrl":null,"permalink":"/tags/hardware/","section":"Tags","summary":"","title":"Hardware","type":"tags"},{"content":"A 6-node cluster for $1,264. No server rack, no enterprise budget.\nWelcome to HPC From Scratch, a new series on The Login Node. The HPC 101 and Special Topics series covered how to use an HPC cluster. This series covers how to build one.\nOver the next several episodes, I will walk through the full process of building a functional HPC cluster from consumer hardware: sourcing parts, installing the OS, configuring Slurm, setting up identity management with FreeIPA, benchmarking, and upgrading. 
Every configuration file will be available on my GitHub.\nThis first episode covers what is in the cluster, where I got each part, how the network is laid out, and how this compares to running cloud instances.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. Why Build a Cluster? # There are two common alternatives, and both have trade-offs.\nCloud (AWS, GCP, Azure): Running multi-node compute instances 24/7 gets expensive. Even with a 3-year savings plan, two modest EC2 instances cost over $2,300 per year (see Section 5). That is fine for burst workloads, but it is not practical for always-on experimentation and learning.\nSingle workstation: A high-end desktop gives you raw compute power, but it does not teach you distributed systems. You will never hit a network bottleneck, debug a Slurm scheduling conflict, or troubleshoot MPI on a single machine. You need multiple nodes for that.\nI wanted a miniature version of a real supercomputer architecture that I could test, break, and fix on my desk. It runs the same software stack as a university research cluster: Slurm for job scheduling, FreeIPA for identity management, NFS for shared storage, and MPI for parallel workloads.\n\u0026gt; 2. Bill of Materials # All prices are what I actually paid between late 2024 and late 2025. Due to recent price increases in the PC parts market, your total may be higher if you replicate this build today.\nItem Count Unit Price (USD) Total (USD) Condition Lenovo IdeaPad 1 1 161.00 161.00 Refurbished Lenovo ThinkCentre M715q 4 85.90 343.60 Used HP Envy TE01 1 400.00 400.00 Used DDR4 SODIMM (Micron) 2 15.00 30.00 Used DDR4 SODIMM (Hynix) 2 24.00 48.00 Used Netgear GS308E 1 21.50 21.50 New Samsung 990 Pro 1TB 1 109.90 109.90 New Sabrent USB-C Hub 1 59.90 59.90 New 10Gbps Cat 6 Ethernet Cable (x5) 1 9.90 9.90 New NanoKVM 1 69.90 69.90 New Rubber Feet 1 9.90 9.90 New Total Cost 1,263.60 Where I sourced these:\nThe four ThinkCentre M715q units and the RAM came from eBay. 
The HP Envy TE01 was a Craigslist cash deal (no receipt for that one). The Samsung 990 Pro, Netgear switch, USB-C hub, cables, and rubber feet came from Amazon. The NanoKVM was ordered directly from the manufacturer. The IdeaPad 1 was a refurbished unit from Lenovo.\nThe key was patience. I did not buy everything at once. I watched eBay listings for weeks, picked up the Craigslist deal when it appeared, and bought new components during sales. The M715q units averaged under $86 each. At that price, four of them cost less than a single mid-range GPU.\nNote on future upgrades: An RTX 5060 Ti and a new power supply are planned for the GPU node. These are not included in the cost above because they are optional upgrades, not part of the initial build. The GPU upgrade will be covered in a dedicated episode.\n\u0026gt; 3. Cluster Architecture # Hostname Role Hardware CPU Notes carrier Login Node Lenovo IdeaPad 1 AMD Ryzen 3 7920U (8vCPU, ~7GB RAM) WiFi to internet, Ethernet to cluster switch arbiter Management Node Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, ~14GB RAM) Slurm controller, FreeIPA server interceptor-01 CPU Compute Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, ~14GB RAM) Slurm compute interceptor-02 CPU Compute Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, ~14GB RAM) Slurm compute corsair-01 GPU Compute HP Envy TE01 Intel i7-10700F (16 vCPU, ~32GB RAM) GTX 1660 Super (upgrade planned) observer Visualization Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, ~14GB RAM) Visual/monitoring tasks Mixing AMD Ryzen and Intel across nodes looks messy at first glance. But in production HPC, heterogeneous architectures are standard.\nTake El Capitan, the world\u0026rsquo;s fastest supercomputer as of the November 2024 TOP500 list. It uses AMD MI300A APUs that pack CPU and GPU cores into a single package. My cluster splits those roles across separate nodes instead. 
The core idea is the same: different processors handling different parts of a workload. This cluster does that at desk scale.\nAll nodes run Rocky Linux. The software stack includes Slurm 25.11 for job scheduling, FreeIPA for centralized identity and authentication, NFS for shared storage (served from the Samsung 990 Pro), and OpenMPI for parallel workloads. Monitoring runs on Prometheus and Grafana. All configuration is managed through Ansible playbooks.\n\u0026gt; 4. Network Layout # The network topology is simple.\nAll cluster nodes connect to a Netgear GS308E Gigabit managed switch on a 192.168.50.x subnet. The switch is unmanaged in practice: no VLANs, no trunking. Internal cluster traffic stays physically isolated on this switch.\nThe login node (carrier) has two network interfaces. Its WiFi connects to the home router for internet access. Its Ethernet connects to the cluster switch. This makes the login node a bridge between the outside world and the internal cluster network.\nThis is the same pattern used in production HPC: the login node sits at the boundary between the external network and the internal fabric. The difference is scale and bandwidth. Gigabit Ethernet instead of InfiniBand or Slingshot. A consumer switch instead of a spine-leaf topology.\n\u0026gt; 5. AWS Cost Comparison # To put the build cost in perspective, here is what a roughly comparable cloud setup would cost on AWS.\nThe comparison uses two c6g.2xlarge instances, which match the CPU compute nodes (interceptor-01 and interceptor-02) in core count and memory. This does not include the management node, visualization node, login node, or GPU node. 
The actual cluster has more capacity than what is represented by two EC2 instances.\nHome Cluster (2 CPU nodes) AWS EC2 (2x c6g.2xlarge) vCPUs per node 8 8 Memory per node ~14 GB 16 GB Architecture x86 (AMD Ryzen 5 Pro) ARM (AWS Graviton2) Network 1 Gbps (managed switch) Up to 10 Gbps Total one-time cost $1,264 N/A Annual cost Electricity only $2,300 (3-yr Savings Plan, N. Virginia) Break-even ~7 months vs. cloud N/A Caveat: This comparison matches node count and memory, not raw performance. The c6g.2xlarge instances use newer ARM (Graviton2) cores and have significantly faster networking. The point is not that the home cluster outperforms EC2; it does not. But for learning distributed systems, job scheduling, and cluster administration, building your own hardware pays for itself fast and gives you experience that cloud instances cannot.\nThe AWS estimate was generated using the AWS Pricing Calculator with the following configuration: 2x c6g.2xlarge, US East (N. Virginia), Linux, Compute Savings Plans (3-year, no upfront), 24/7 consistent workload.\n\u0026gt; 6. What is Next # In Episode 2, we will open up the Lenovo ThinkCentre M715q and go through the hardware in detail. 
I will show you how to install the RAM upgrades and fix a critical BIOS setting where the integrated Vega GPU reserves a chunk of system memory by default.\nAfter that, the series will cover:\nOperating system installation and initial configuration Slurm installation and multi-node job scheduling FreeIPA setup for centralized authentication NFS shared storage configuration GPU upgrade (RTX 5060 Ti swap and power supply replacement) Benchmarking and performance tuning Cable management (yes, eventually) All configuration files and Ansible playbooks will be published on my GitHub as we go.\nHappy Computing!\n","date":"13 3월 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-01/","section":"Posts","summary":"","title":"[HPC From Scratch] Episode 1: Building Real HPC on a Budget","type":"posts"},{"content":"","date":"24 2월 2026","externalUrl":null,"permalink":"/tags/linux/","section":"Tags","summary":"","title":"Linux","type":"tags"},{"content":"Date: February 24, 2026 Venue: Northeastern University, Boston, MA\nOverview # A hands-on workshop designed for university researchers and faculty who are new to Linux and high-performance computing (HPC) environments. This session covers essential command-line skills needed to navigate and work efficiently on HPC clusters.\nTopics Covered # Navigating the Linux filesystem and managing files Working with text editors and file permissions Environment variables and shell configuration Essential command-line utilities for research workflows Tips for transitioning from GUI-based workflows to the terminal Materials # Workshop Slides \u0026amp; Materials (GitHub) Recording # Workshop Recordings (Spring 2026) ","date":"24 2월 2026","externalUrl":null,"permalink":"/talks/neu-talk-01/","section":"Talks \u0026 Workshops","summary":"","title":"Linux Essentials for HPC Researchers","type":"talks"},{"content":"Stop using your laptop as a middleman.\nWelcome to the first HPC Special Topics post. 
This is a standalone deep dive that builds on concepts from the HPC 101 series.\nIn the Data Transfer post, we learned how to move files using scp and rsync. Those tools work great for laptop-to-cluster transfers. But what about cloud storage?\nImagine this: a professor shares a 200GB dataset on Google Drive. Without the right tool, you would download it to your laptop (2 hours on a good day), then scp it to the cluster (another 2 hours). That is 4 hours of babysitting file transfers.\nWhat if you could skip the laptop entirely and pull data straight from Google Drive to your /scratch directory with a single command?\nThat is exactly what Rclone does.\nWe will also go beyond the basics and explore an optimization technique often discussed by experienced HPC engineers: how parallel threading can drastically change your transfer speeds and when it does not.\n\u0026gt; 1. Why Rclone? # Rclone is a command-line program to manage files on cloud storage. Think of it as rsync, but for the cloud. It supports over 70 cloud storage providers, including Google Drive, Dropbox, OneDrive, Box, AWS S3, and even SFTP.\n*(Click the image to watch the tutorial on YouTube)* Why does this matter on HPC?\nDirect Transfer. Move data from Google Drive to your cluster\u0026rsquo;s /scratch space without touching your laptop. No more download-upload-download cycles.\nParallelization. Unlike scp which sends one file at a time through a single stream, Rclone can transfer multiple files simultaneously. This is where things get interesting (more on this in Section 4).\nReliability. Rclone handles retries, checksums, and interrupted transfers automatically. If your connection drops at 99%, it picks up where it left off just like rsync -P but for cloud storage.\nVersatility. One tool, 70+ backends. Whether your collaborator shares data on Google Drive, your institution uses Box, or your pipeline stores results on S3, Rclone handles them all with the same interface.\n\u0026gt; 2. 
Setting Up Rclone on HPC # Important: Many HPC clusters prohibit running large data transfers on the login node. Check your cluster\u0026rsquo;s policy first. If large transfers are restricted, run Rclone inside a compute job:\n$ srun --pty bash $ module load rclone $ rclone copy ... Step 1: Loading Rclone # On many HPC clusters, Rclone is already available as a module.\n$ module avail rclone $ module load rclone If Rclone is not available as a module, you can install it locally in your home directory:\n# Download and unzip $ curl -O https://downloads.rclone.org/rclone-current-linux-amd64.zip $ unzip rclone-current-linux-amd64.zip # Move the binary to your local bin $ mkdir -p ~/bin $ cp rclone-*/rclone ~/bin/ # Verify $ ~/bin/rclone version Note: If you install locally, make sure ~/bin is in your $PATH, or use the full path ~/bin/rclone when running commands.\nStep 2: Connecting to Google Drive (The Headless Challenge) # This is the step that trips up most beginners. Since your HPC cluster does not have a web browser, you must use the headless setup to authenticate.\nRun rclone config and choose n for a new remote. Name it something memorable (e.g., gdrive). Select the provider number for Google Drive. For all other prompts (Client ID, Secret, Scope, Root Folder ID, Service Account, Advanced config), just press Enter to accept the defaults. When asked \u0026ldquo;Use auto config?\u0026rdquo;, choose n. This is crucial for remote servers without a browser. Rclone will provide a URL. Copy and paste this URL into your local laptop\u0026rsquo;s browser. Log in to your Google account, authorize Rclone, and copy the verification code back into your HPC terminal. When asked about Team Drive, choose n (unless you use one). Confirm with y to save. (Check the Rclone overview of cloud storage systems for detailed steps to connect to other cloud providers.)\n$ rclone config # Follow the prompts above # ... 
# Verify the connection $ rclone lsd gdrive: # You should see your Google Drive folders listed If you see your folders, you are connected.\nTip: The same process works for Dropbox, OneDrive, and Box. Just choose a different provider number in step 3. Each provider has slightly different authentication steps, but Rclone walks you through them interactively.\n\u0026gt; 3. Essential Commands # Before we dive into optimization, let\u0026rsquo;s cover the commands you will use daily.\nListing and Browsing\n# List top-level directories in your cloud $ rclone lsd gdrive: # List files in a specific folder $ rclone ls gdrive:my_project/data # Show directory tree (great for exploring) $ rclone tree gdrive:my_project --max-depth 2 # Check storage usage $ rclone about gdrive: Copying Data\n# Cloud -\u0026gt; Cluster (the most common use case) $ rclone copy gdrive:my_data ~/scratch/my_data -P # -P: Shows real-time progress, speed, and ETA # Cluster -\u0026gt; Cloud (backing up results) $ rclone copy ~/scratch/results gdrive:results -P Copy vs. Sync: Know the Difference\n# copy: Only adds new files. Never deletes anything at the destination. $ rclone copy gdrive:data ~/scratch/data -P # sync: Makes destination identical to source. DELETES files at # destination that don\u0026#39;t exist at source. Use with caution! $ rclone sync gdrive:data ~/scratch/data -P Warning: rclone sync will delete files at the destination that are not present at the source. Always double-check your command before running sync. When in doubt, use copy.\nAt this point, you have everything you need to use Rclone as a daily tool. The next sections explore how to make it faster.\n\u0026gt; 4. The Optimization Challenge: Threads vs. 
Bandwidth # *(Click the image to watch the tutorial on YouTube)* A common insight among experienced HPC engineers is that in many real-world WAN scenarios, a single TCP stream cannot fully utilize available bandwidth due to latency, TCP window limits, and provider-side throttling. The solution? Open more streams.\nRclone has a key flag for this:\n--transfers=N # Number of files to transfer in parallel (default: 4) This raised a few questions worth testing:\nDoes increasing threads always make things faster? Is there a point of diminishing returns? Does uploading (send) behave the same as downloading (receive)? The Experiment\nEnvironment: 4-core HPC compute node, 1Gbps network, Rclone with default Google Drive API (shared client ID). Scenario A: A single large 5GB file (generated with /dev/urandom to prevent compression shortcuts). Scenario B: 1,000 small files (1MB each, also random data). Variable: --transfers set to 1, 4, 8, 16, and 32. Repetitions: 3 runs per condition to ensure consistency. \u0026gt; 5. Benchmark Results # Scenario A: The Single Giant (5GB) # *(Transfer time for a single 5GB file across different thread counts.)* The line is flat. Whether you set --transfers to 1 or 32, the transfer time barely changes.\nWhy? Because --transfers controls file-level parallelism. It determines how many files are transferred simultaneously. If you only have one file, there is nothing to parallelize. One file, one stream, regardless of the thread count.\nThis is a common misconception: --transfers=16 does not split a single file into 16 chunks. It opens 16 slots for 16 separate files.\nAdvanced Note: Rclone does provide --multi-thread-streams for chunk-level parallel downloads of single large files on supported backends. However, this works only for downloads and its effectiveness varies by provider. For most use cases, the --transfers flag covered here is what you want.\nTakeaway: For large single files, increasing --transfers has no effect. 
The transfer speed is determined by your network bandwidth and the cloud provider\u0026rsquo;s per-stream throughput.\nScenario B: The Small File Storm (1,000 × 1MB) # This is where threading shines.\n*(Transfer time for 1,000 small files (1MB each) across different thread counts.)* With a single thread, uploading 1,000 files took 1,293 seconds (over 21 minutes). At 8 threads, it dropped to 199 seconds (about 3 minutes). That is a 6.5x speedup just by changing one flag.\nDownloads tell a slightly different story: 1 thread took 307 seconds, while 4 threads brought it down to 93 seconds (a 3.3x improvement). But beyond 4 threads, download speed barely changed.\nWhy are small files so sensitive to threading? Each file transfer involves API calls, metadata verification, checksum validation, and connection overhead. With a single thread, you wait for all of this to complete before starting the next file. Multiple threads hide this per-file latency by overlapping transfers, which is why the speedup is so dramatic.\n\u0026gt; 6. Finding the Sweet Spot # *(Speedup factor relative to single-thread baseline.)* The Plateau Effect # Upload gains essentially stop after 8 threads (downloads level off even earlier, at around 4). Why?\nAPI Rate Limits. Google Drive (and most cloud providers) limit the number of API requests per second. Adding more threads beyond the provider\u0026rsquo;s limit just leads to throttling and retries. The limit is especially tight with the default client ID, which is shared by all Rclone users.\nTip for Power Users: Creating your own Google API client ID can significantly increase your API quota and may shift the optimal thread count higher. See the Rclone Google Drive documentation for details.\nOverhead. Managing 32 concurrent transfers has a cost of its own: connection setup, checksum verification, and retry logic all compete for the same resources.\nSend (Upload) vs. 
Receive (Download) # Notice that downloading is significantly faster and saturates earlier than uploading across all conditions.\nWhen you upload, the cloud provider must verify, index, and store each file as it arrives. When you download, the provider serves files from optimized CDN infrastructure with less per-file processing overhead. This asymmetry means your optimal --transfers value may differ depending on the direction of your transfer.\nEfficiency: Why 8 Is the Magic Number # We can measure how efficiently each thread contributes to speedup:\n$$ Efficiency = \\frac{Speedup}{Number \\: of \\: Threads} × 100\\% $$ Threads Send Speedup Efficiency 1 1.0x 100% 4 3.9x 98% 8 6.5x 81% 16 6.6x 41% 32 6.6x 21% At 8 threads, you get 81% efficiency, and each thread is pulling its weight. At 32 threads, efficiency drops to 21%. You are using 4x the resources for essentially zero additional speedup.\nFor this specific setup (1Gbps network, default Google Drive API client), 8 threads was the sweet spot. Your optimal number may differ depending on your network speed, cloud provider, and API configuration, but the methodology for finding it is the same: test, measure, compare.\nNote: These numbers are specific to Google Drive with the default shared API client ID. Your results may vary depending on the cloud provider, network speed, and API configuration. The methodology, however, applies universally.\n\u0026gt; 7. Summary \u0026amp; Recommendations # Rclone is more than a convenience tool. It is a direct pipeline between your cloud storage and your cluster.\nKey Takeaways:\nSkip the laptop. Use Rclone to transfer data directly between cloud and cluster. Threads matter for small files. Threads hide per-file latency overhead. Thousands of files? Use --transfers 8 or --transfers 16. Threads do not help single large files. --transfers is file-level parallelism, not file-splitting. Uploads and downloads behave differently. Downloads saturate earlier. Plan accordingly. 
Don\u0026rsquo;t overdo it. Setting threads to 64 will likely trigger API throttling and slow you down. Pack when possible. Even with Rclone, 100,000 tiny files will be slow. Consider using tar to bundle them first (as we covered in the Data Transfer post). Scenario Recommended Command Many small files rclone copy remote:path local:path --transfers 8 -P Few large files rclone copy remote:path local:path -P Directory sync rclone sync remote:path local:path -P (use with caution) Check before transfer rclone lsd remote: and rclone about remote: What is Next?\nWe have added another essential tool to our HPC toolkit. In the next series, we will shift gears completely from using the cluster to building one. We will talk about hardware, networking, and how to turn a pile of parts into a working HPC system.\nSee you in the next series!\nHappy Computing!\n","date":"16 2월 2026","externalUrl":null,"permalink":"/posts/hpc-special-topics-01/","section":"Posts","summary":"","title":"[HPC Special Topics] Rclone: Cloud-to-Cluster Data Transfers","type":"posts"},{"content":"","date":"16 2월 2026","externalUrl":null,"permalink":"/tags/cloud-storage/","section":"Tags","summary":"","title":"Cloud Storage","type":"tags"},{"content":"","date":"16 2월 2026","externalUrl":null,"permalink":"/tags/performance-tuning/","section":"Tags","summary":"","title":"Performance Tuning","type":"tags"},{"content":"","date":"16 2월 2026","externalUrl":null,"permalink":"/tags/rclone/","section":"Tags","summary":"","title":"Rclone","type":"tags"},{"content":"In the real world, hitting \u0026ldquo;Submit\u0026rdquo; is just the beginning.\nWelcome to the finale of the HPC 101 series.\nSo far, we have covered the essentials: Logging in, Moving Data, and Managing Environments. Finally, you submitted your job.\nBut sometimes, things go wrong.\nYour job stays \u0026ldquo;Pending\u0026rdquo; forever. It crashes 2 seconds after starting. It runs for 3 days but produces empty files. 
Today, we will learn the \u0026ldquo;Survival Skills\u0026rdquo; for HPC. We will cover how to debug failed jobs, how to check your resource efficiency, and why you are stuck in the queue.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. In-depth Monitoring (scontrol) # You submitted a job. You type squeue --me. It says PD (Pending). OK, but after 10 minutes, it\u0026rsquo;s still pending. Or maybe it\u0026rsquo;s running, but you don\u0026rsquo;t know where.\n$ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 12345 cpu bash user123 PD 0:00 1 (Priority) 12346 gpu bash user123 PD 0:00 1 (Resources) squeue gives you a quick summary, but sometimes you need the Full Report. Use the command scontrol show job \u0026lt;JOBID\u0026gt;.\n$ scontrol show job 12345 JobId=12345 JobName=bash UserId=user123(123456) GroupId=users(1000) ... JobState=PENDING Reason=Priority ... StartTime=2026-01-25T21:00:00 EndTime=Unknown NodeList=(null) WorkDir=/home/user123/my_project Command=/bin/bash ... Key fields to look for:\nJobState \u0026amp; Reason: Tells you exactly why it is waiting (e.g., Resources, Priority). StartTime: The scheduler\u0026rsquo;s estimated start time. (Note: This can change if higher priority jobs enter the queue). NodeList: If running, this shows which specific compute node you are using (e.g., compute-node-01). WorkDir: Confirms where your script is running and where output files will be saved. Linux Tip: What is grep? The output of scontrol is very long. We can filter it using a pipe | and grep.\n| (Pipe): Takes the output of the left command and passes it to the right command. grep: Think of it as \u0026ldquo;Ctrl + F\u0026rdquo; for the terminal. It prints only the lines containing your keyword. # Show me ONLY the StartTime line $ scontrol show job 12345 | grep StartTime StartTime=2026-01-25T21:00:00 EndTime=Unknown \u0026gt; 2. The Emergency Button (scancel) # Oops! 
You just realized you requested 100 nodes instead of 1 node. Or maybe your code is stuck in an infinite loop.\nDon\u0026rsquo;t just let it fail. Kill it immediately.\n# Slurm: cancel a specific job $ scancel 12345 # Slurm: cancel ALL jobs by user $ scancel -u user123 # PBS: cancel a specific job $ qdel 12345 # PBS: cancel ALL jobs (depends on system, usually manual loop or specific command) $ qselect -u user123 | xargs qdel \u0026gt; 3. The Detective Work (sacct \u0026amp; Logs) # You came back from coffee, and your job is gone from the queue. Did it finish? Or did it fail? Since it is not in the queue (squeue), we need to check the History.\nStep 1: Check the State (sacct) # The command is sacct (Slurm Accounting). By default, the output is messy, so we use format options.\n$ sacct -j 12345 --format=JobID,State,AllocCPUS,ReqMem,MaxRSS,Elapsed,ExitCode JobID State AllocCPUS ReqMem MaxRSS Elapsed ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 12345 FAILED 1 2G 00:10:15 137:0 12345.batch FAILED 1 00:10:15 137:0 Common States:\nCOMPLETED: Success! (Exit Code 0:0) CANCELLED: The job was killed. TIMEOUT: The job ran longer than the requested --time. FAILED: The code crashed (Non-zero exit code). Step 2: Read the Logs # sacct tells you what happened, but not why. To find the \u0026ldquo;why\u0026rdquo;, look at the output file you defined in your script (e.g., #SBATCH -o result.out).\n# Look at the END of the file first $ tail -n 20 result.out Common Error Messages:\ncommand not found: Did you module load? ModuleNotFoundError: Did you conda activate or install the package? killed / oom-kill: You ran out of memory. Step 3: Get Notified (Pro Tip) # Jobs often fail when you are not watching. Let Slurm email you. Add this to your job script:\n#SBATCH --mail-type=FAIL,END #SBATCH --mail-user=you@email.com FAIL: Notify when the job fails. END: Notify when it finishes (success or failure). \u0026gt; 4. 
Resource Efficiency (seff) # This is the most important part for becoming a \u0026ldquo;Power User\u0026rdquo;.\nImagine you reserved a banquet table for 40 people, but you ate dinner alone. The restaurant manager (Scheduler) would be angry. In HPC, this happens when you request --cpus-per-task=40 but your Python script only uses 1 core.\nHow do you check your efficiency? Use seff.\n$ seff 12345 Job ID: 12345 Cluster: cluster User/Group: user123/users State: COMPLETED (exit code 0) Cores: 8 CPU Utilized: 00:01:25 CPU Efficiency: 11.81% of 00:12:00 core-walltime Job Wall-clock time: 00:01:30 Memory Utilized: 12.09 MB Memory Efficiency: 0.15% of 8.00 GB (8.00 GB/node) Note: Some clusters may not have seff enabled. In that case, use sacct with AveCPU, MaxRSS.\n$ sacct -j 12345 --format=JobID,State,AveCPU,MaxRSS JobID State AveCPU MaxRSS ------------ ---------- ---------- ---------- 12345 COMPLETED 12345.batch COMPLETED 00:01:30 12384K How to interpret the output:\nCPU Efficiency:\nBad (\u0026lt; 50%): You requested too many cores. If your code is not parallelized, request only 1 core.\nGood (~ 90%): You are utilizing resources well.\nMemory Efficiency:\nBad (\u0026lt; 10%): You requested too much RAM. Reduce --mem next time.\nDangerous (\u0026gt; 95%): You are on the edge of crashing (OOM). Increase --mem slightly (e.g., by 20%).\nWhy does this matter? Smaller jobs fit into \u0026ldquo;gaps\u0026rdquo; in the cluster more easily. By requesting only what you need, your jobs will start faster!\n\u0026gt; 5. Why is my job pending? (Fairshare) # Sometimes, your job stays in PD (Pending) state with reason Priority or Resources, even though there seem to be empty nodes.\nThis is likely due to Fairshare. Think of it as a \u0026ldquo;Karma System\u0026rdquo;.\nThe cluster is a shared resource. If you ran thousands of heavy jobs last week, your \u0026ldquo;Karma\u0026rdquo; goes down. You wait in line. 
If you haven\u0026rsquo;t used the cluster for a while, your \u0026ldquo;Karma\u0026rdquo; is high. You jump the queue. Checking the Reason Explicitly\nInstead of guessing, you can ask Slurm exactly why you are waiting:\n$ squeue -j 12345 -o \u0026#34;%.18i %.9T %.30R\u0026#34; JOBID STATE NODELIST(REASON) 12345 PENDING (Priority) This reveals the specific REASON code:\nPriority: Just wait. It\u0026rsquo;s Fairshare logic. Resources: The cluster is busy, or you requested a specific node that is busy. QOSMaxJobsLimit: You hit the limit of allowed running jobs. Dependency: It\u0026rsquo;s waiting for another job to finish. Don\u0026rsquo;t panic. Usually, you just need to wait.\n\u0026gt; 6. Summary \u0026amp; Cheatsheet # Debugging Mindset (Read This Once) # If a job fails, always ask these questions in order:\nDid it start? (squeue, scontrol) -\u0026gt; If not, check your script syntax. Did it finish or crash? (sacct) -\u0026gt; Check the State. Why did it crash? (logs) -\u0026gt; Read the .err file. Did I request the right resources? (seff) -\u0026gt; Check memory usage. Can I make it smaller? -\u0026gt; Smaller jobs run faster. Congratulations! You have officially graduated from HPC 101. You are no longer just a guest; you are a resident of the cluster.\nGoal Command Check Details scontrol show job \u0026lt;JOBID\u0026gt; Kill Job scancel \u0026lt;JOBID\u0026gt; Check History sacct -j \u0026lt;JOBID\u0026gt; Check Efficiency seff \u0026lt;JOBID\u0026gt; What\u0026rsquo;s Next? In the next series, we will change gears completely. We will stop being a \u0026ldquo;User\u0026rdquo; and start thinking like an \u0026ldquo;Engineer\u0026rdquo;. 
I will start a new series on How to Build an HPC Cluster from scratch.\nSee you in the next series!\n","date":"28 1월 2026","externalUrl":null,"permalink":"/posts/hpc101-04/","section":"Posts","summary":"","title":"[HPC 101] Job Debugging: Why Did My Job Fail?","type":"posts"},{"content":"","date":"28 1월 2026","externalUrl":null,"permalink":"/tags/debugging/","section":"Tags","summary":"","title":"Debugging","type":"tags"},{"content":"","date":"28 1월 2026","externalUrl":null,"permalink":"/series/hpc-101/","section":"Series","summary":"","title":"HPC 101","type":"series"},{"content":"","date":"28 1월 2026","externalUrl":null,"permalink":"/tags/seff/","section":"Tags","summary":"","title":"Seff","type":"tags"},{"content":"","date":"28 1월 2026","externalUrl":null,"permalink":"/tags/troubleshooting/","section":"Tags","summary":"","title":"Troubleshooting","type":"tags"},{"content":"It looks like it worked. But you just created a hidden mess.\nWelcome back to the HPC 101 series.\nIn the previous post, we learned how to transfer data. Now, you are ready to run your Python code. You log in, type pip install numpy, and hit Enter.\nUnlike the old days, you might not see a \u0026ldquo;Permission Denied\u0026rdquo; error. Instead, you see this:\n[user123@compute-node-01 ~]$ pip install numpy Defaulting to user installation because normal site-packages is not writeable Collecting numpy Using cached numpy-2.3.5... Installing collected packages: numpy Successfully installed numpy-2.3.5 It says \u0026ldquo;Successfully installed\u0026rdquo;. So everything is fine, right?\nNo. You just fell into the \u0026ldquo;User Install\u0026rdquo; trap.\nToday, we will learn why this \u0026ldquo;automatic\u0026rdquo; installation is dangerous in HPC and how to build a proper \u0026ldquo;Private Laboratory\u0026rdquo; using Virtual Environments.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. 
The Trap: The \u0026ldquo;Backpack\u0026rdquo; Problem # When the system notices you cannot write to the global library, it quietly installs packages into your hidden home folder (usually ~/.local/lib/python3.x/site-packages).\nLet\u0026rsquo;s use an analogy.\nSystem Python: This is the Restaurant Kitchen Pantry. It has standard ingredients. You are not allowed to touch it. User Install (pip install): This is your Backpack. Since you can\u0026rsquo;t use the pantry, you stuff ingredients into your backpack. Virtual Environment: This is a Separate Lunchbox. Why is the Backpack (User Install) bad?\nNo Isolation (Dependency Hell): Project A needs NumPy 1.20 and Project B needs NumPy 2.0. If you put them both in your backpack, they get squashed together. You broke Project A to fix Project B.\n\u0026gt; 2. The Solution: Your Private Lunchbox # Instead of stuffing everything into one backpack, you should use Virtual Environments.\nThink of a Virtual Environment as your own private lunchbox.\nIsolation: You create a \u0026ldquo;Project A Box\u0026rdquo; and a \u0026ldquo;Project B Box\u0026rdquo;. They never touch each other. Safety: Even if you mess up the installation in one box, you just throw that box away. Your other projects are safe. In HPC, using environments is not just \u0026ldquo;good practice\u0026rdquo;. It is the only way to survive.\n\u0026gt; 3. Choose Your Tool: Conda vs. Venv # There are two main tools for creating environments: Conda and Venv. Which one should you use?\nWhat is it?\nA cross-platform package manager that installs Python packages and external libraries (C, C++, CUDA).\nPros: Manages Python Versions: You can create an environment with Python 3.8 today and Python 3.12 tomorrow. Binary Dependencies: Handles complex libraries with GPU support (CUDA/cuDNN) automatically.\nCons: Heavy: It takes up more disk space than venv. Slow: Sometimes the \u0026ldquo;solver\u0026rdquo; takes a long time to resolve dependencies. 
Shell Pollution: Improper use of conda init can break your terminal (See Step 2).\nNote: Due to recent Anaconda licensing changes, many HPC centers are transitioning from Anaconda/Miniconda to Miniforge, which uses the free conda-forge channel by default. The commands and workflow are identical. See the Anaconda Terms of Service for details.\nRecommendation: Best choice for Science, Engineering, and AI/ML projects.\nWhat is it?\nA built-in Python module that creates lightweight virtual environments.\nPros: Lightweight \u0026amp; Fast: Built into Python, creates environments instantly. Clean: Doesn\u0026rsquo;t touch your shell configuration files.\nCons: Limited: Cannot install non-Python tools (like CUDA drivers) easily. Dependent: You are tied to the system\u0026rsquo;s Python version (if system has Python 3.6, your venv is 3.6).\nRecommendation: Good for simple Python scripts or pure software development.\n\u0026gt; 4. Let’s Build Your Environment # Let\u0026rsquo;s get practical. Here is how you set up your environment.\nStep 1: Create the Environment # # 1. Load the module $ module load miniconda3 # 2. Create environment $ conda create --name myenv python=3.13 # To store in your Lab\u0026#39;s group directory to save space in Home $ conda create --prefix /projects/myLAB/myCondaEnv python=3.13 # 1. Load the python module $ module load python/3.13.10 # 2. Create environment $ python3 -m venv /projects/myLAB/myenv # To store in your Lab\u0026#39;s group directory to save space in Home $ python3 -m venv /projects/myLAB/myVenv Step 2: Activate (The Safe Way) # WARNING: Do NOT run conda init\nMany tutorials tell you to run conda init. In HPC, this is dangerous. It modifies your .bashrc file to automatically activate the (base) environment every time you log in. This causes:\nConflict with system modules (OpenMPI, GCC). Open OnDemand Failure: It may prevent Jupyter or RStudio sessions from starting. 
If you already ran it, disable auto-activation:\n$ conda config --set auto_activate_base false Instead, use source activate \u0026lt;ENV\u0026gt; or the full path.\n# The \u0026#34;HPC Safe\u0026#34; way (Recommended) $ source activate /projects/myLAB/myCondaEnv # OR if you are using module system properly: $ conda activate /projects/myLAB/myCondaEnv # Source the activate script $ source /projects/myLAB/myVenv/bin/activate Step 3: Install Packages # # Handles binary deps (CUDA, etc.) better $ module load cuda/12.8 # Make sure to match the cuda version (myCondaEnv) $ conda install -c conda-forge cupy cuda-version=12.8 (myCondaEnv) $ conda install numpy pandas # Works in both Conda and Venv (myVenv) $ pip install matplotlib huggingface-hub Rule of Thumb:\nNever run pip install unless you are inside an activated environment.\nStep 4: Deactivate # # For conda $ conda deactivate # For venv $ deactivate \u0026gt; 5. Maintenance: Cleaning the Trash (Cache) # One day, you might see this error: Disk quota exceeded.\nYou check your folder, and it seems small. Where did the space go? Both Conda and Pip store downloaded files in a hidden Cache folder (~/.conda/pkgs or ~/.cache/pip). These can grow to 10GB+ easily.\nThe Clean Way:\n# Remove unused packages and caches $ conda clean --all # Remove pip cache $ pip cache purge The \u0026ldquo;Nuclear\u0026rdquo; Option: If your disk is 100% full, the commands above might fail (because they can\u0026rsquo;t create a lock file). In that case, you have to delete them manually.\n# WARNING: Be careful with rm -rf # For conda $ rm -rf ~/.conda/pkgs/* # For pip $ rm -rf ~/.cache/pip/* Don\u0026rsquo;t worry, deleting cache won\u0026rsquo;t break your installed environments. It just deletes the downloaded installers.\n\u0026gt; 6. 
Summary \u0026amp; Cheatsheet # Using environments on HPC is about keeping your workspace clean and avoiding the \u0026ldquo;Disk Quota Exceeded\u0026rdquo; error.\nAction Conda Command Venv Command Create conda create --prefix \u0026lt;path\u0026gt; python -m venv \u0026lt;path\u0026gt; Activate source activate \u0026lt;path\u0026gt; source \u0026lt;path\u0026gt;/bin/activate Install conda install / pip install pip install Clean conda clean --all pip cache purge My Advice: For new installations, Miniforge is the safest choice for AI/HPC projects. It gives you the same Conda workflow with conda-forge packages and no licensing concerns. If your cluster already provides Miniconda or Anaconda, those work just fine too. And please, avoid conda init to keep your login clean.\nHappy Computing!\n","date":"18 1월 2026","externalUrl":null,"permalink":"/posts/hpc101-03/","section":"Posts","summary":"","title":"[HPC 101] Virtual Environments: How to Build Your Own Workspace","type":"posts"},{"content":"","date":"18 1월 2026","externalUrl":null,"permalink":"/tags/conda/","section":"Tags","summary":"","title":"Conda","type":"tags"},{"content":"","date":"18 1월 2026","externalUrl":null,"permalink":"/tags/miniconda/","section":"Tags","summary":"","title":"Miniconda","type":"tags"},{"content":"","date":"18 1월 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"18 1월 2026","externalUrl":null,"permalink":"/tags/venv/","section":"Tags","summary":"","title":"Venv","type":"tags"},{"content":"Let\u0026rsquo;s turn that scary black screen into a hacker\u0026rsquo;s playground.\nLinux beginners usually consider the black terminal screen as a scary tool that might explode if they touch a wrong key. You don\u0026rsquo;t want to accidentally press a button that blows up your \u0026ldquo;home\u0026rdquo; directory. 
But once you get used to it, this screen makes you look like a cool hacker.\nBy the end of this post, you’ll be able to move around the Linux terminal, manage files, and edit text without panic.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. The Dark House # Navigating the CLI (Command Line Interface) is like waking up in the middle of the night.\nIf someone takes your pretty looking GUI (Graphical User Interface) away and throws you into a CLI screen, it might feel like a power outage.\nImagine you took a nap on a couch and woke up at 3 AM. It is pitch black and you know how to get to your bed, but you can\u0026rsquo;t see anything on the way.\nUsing a terminal is exactly the same. You need to verify where you are and what is around you before you take a step. If you know exactly where to go, you can walk straight to your room. But you should always be careful not to trip.\n\u0026gt; 2. Navigating Your Home # Okay, it\u0026rsquo;s 3 AM, the lights are out, and you don\u0026rsquo;t have a flashlight. You want to go to your bed.\nFirst, you need to locate yourself. This is what pwd (Print Working Directory) does. It tells you exactly where you are standing.\n$ pwd /home/my_family/first_floor Before you move, you want to know what\u0026rsquo;s around you so you don\u0026rsquo;t have to kick the table. The ls (List) command is your hands feeling the surroundings. You can add options to see hidden items or extra details.\n$ ls couch kitchen lamp restroom staircase trash_bin TV $ ls -a couch kitchen lamp .phone .remote restroom TV user123_room # Now you found a phone and a remote hidden under the couch! # (Files starting with \u0026#39;.\u0026#39; are hidden in Linux) $ ls -l total 32 -rwxr-xr-x. 1 family family 4096 Dec 27 20:22 couch drwxr--r--. 1 family family 4096 Dec 6 20:02 kitchen -rwxr--r--. 1 family family 10517 Dec 26 18:51 lamp drwxr-xr-x. 1 family family 4096 Dec 26 17:49 restroom -rwxr-xr-x. 
1 parents family 840 Dec 26 18:03 TV -rwxr-xr-x. 1 family family 840 Dec 26 18:03 trash_bin drwxr-xr-x. 1 user123 family 4096 Dec 26 18:03 user123_room # You can see full details about each item or room # You won\u0026#39;t see hidden items here Note: You can combine -a and -l as -la to see full details of all items.\nThe detailed view (-l) shows some cryptic codes. Don\u0026rsquo;t worry: they look like secret codes, but we can decode them.\n-rwxrw-r-- 1 user group 46 Feb 14 16:37 File.txt ^ ^ ^ ^ ^ ^ ^ ^ | | | | | | | | 1 2 3 4 5 6 7 8 File Type: - (File) or d (Directory) Permissions: rwxrw-r-- (Who can do what) Owner \u0026amp; Group: Who owns this item Size \u0026amp; Time: How big and when it was last touched Let\u0026rsquo;s take a closer look at the first part. It defines who is allowed to enter the room or touch the item.\nType Owner Group Others [-] [rwx] [rw-] [r--] | | | | File Read Read Read Write Write Exec If you see rwx, it means the owner has full power (Read, Write, and Execute).\nNow that we have learned how to read the map, let\u0026rsquo;s start moving.\nLet\u0026rsquo;s go to the restroom first. Since you can see it in your list, you can walk straight in. Use the cd (Change Directory) command.\n$ cd restroom $ pwd /home/my_family/first_floor/restroom $ ls bath_tub body_wash hand_soap shampoo shower sink toilet towel You finished your business and want to go to your bedroom (user123_room). But wait, ls shows no room here! You have two options:\nStep out to the hall, check locations, and then enter your room. $ cd .. $ cd user123_room Go out and immediately enter your room in one go. $ cd ../user123_room What is \u0026ldquo;..\u0026rdquo;? In Linux, a single dot . represents Here (Current location), and double dots .. represent Parent location (One level up). (Unfortunately, it stops at 2 dots. There is no \u0026ldquo;...\u0026rdquo; or \u0026ldquo;....\u0026rdquo;)\nRelative vs. Absolute Path Using dots (. or ..) 
works within your house or close to your current position. But what if you are at a friend\u0026rsquo;s house or far away from your room? You can get out of your friend\u0026rsquo;s restroom (..), but you won\u0026rsquo;t find your room (user123_room) there. In that case, you need an Absolute Path (Full address).\n# Relative path (Works only if you are in the hallway) $ cd user123_room # Absolute path (Works from anywhere in the universe) $ cd /home/my_family/first_floor/user123_room \u0026gt; 3. Magic Spells: File Operations # Now, we need some more imagination. You are not just a person in the dark; you are a Wizard. Your magic wand can create rooms and items, or make trash disappear.\nCreation (mkdir, touch)\nFirst, let\u0026rsquo;s create an empty room. The spell is mkdir (Make Directory).\n$ mkdir new_room If you want to create an item (an empty file), use touch.\n$ touch magic_scroll.txt Teleportation (mv)\nYou want to move a TV from the living room to your new room. Cast mv (Move).\n$ mv /home/my_family/first_floor/TV /home/my_family/first_floor/new_room/ Note: In Linux, renaming is just moving a file to the same place with a new name.\n$ mv old_name.txt new_name.txt Cloning (cp)\nIf you take the TV, your dad will be sad. Let\u0026rsquo;s create a clone of it using cp (Copy).\n# Copy TV to the parent directory $ cp ./TV ../ Now everyone is happy!\nDestruction (rm)\nYour mom asked you to take out the trash. With your magic power, you can simply incinerate it. Use rm (Remove).\n$ rm ./trash_bin Tip: When using rm, prefer relative paths so you clearly see what you\u0026rsquo;re deleting.\nWarning: Unlike Windows/Mac, Linux rm does not move deleted files to a Recycle Bin. When you rm a file, it\u0026rsquo;s gone forever. It is incinerated. So, please be careful when you cast this spell.\n\u0026gt; 4. 
X-Ray Vision (Checking Files) # While you were out, your parents left a note on the table.\nHey, we are leaving to pick up your cousin from the airport.\nPlease clean up the kitchen.\nDon\u0026rsquo;t watch TV all evening.\n\u0026hellip; (100 lines more) \u0026hellip;\nMake sure to finish your homework before we are back.\nCall if you want us to pick up anything for dinner.\nHow do you read this?\ncat: Prints the entire note at once less: Opens content in a text viewer and lets you scroll up and down head: Peeks at the top few lines tail: Peeks at the bottom few lines My Suggestion:\nCommand Use Case cat Short file (Fits in one screen) less Long file (Log files or code) head Just checking the beginning tail Checking the latest update (End of logs) Note: less is a contents viewer. You can press q to close it. ESC won\u0026rsquo;t close the viewer.\n# Read the first 2 lines $ head -2 note.txt Hey, we are leaving to pick up your cousin from the airport. Please clean up the kitchen. # Read the last 2 lines $ tail -2 note.txt Make sure to finish your homework before we are back. Call if you want us to pick up anything for dinner. \u0026gt; 5. Write It Down (Editors) # You want to write a reply. On the terminal, you can\u0026rsquo;t open Microsoft Word. You need terminal editors like nano or vim.\nAnd\u0026hellip; Let\u0026rsquo;s not talk about emacs now. I\u0026rsquo;m sorry if you are an emacs fan.\nOption 1: Nano (The Notepad) If you want a simple sticky note and a pen, use nano. It\u0026rsquo;s very beginner friendly.\n$ nano reply.txt You can simply type whatever you want. The shortcuts are listed at the bottom of the screen.\n^ means the Ctrl key. To Save: Press Ctrl + O (Write Out), then Enter. To Exit: Press Ctrl + X. Option 2: Vim (The Pro Tool) It is a powerful tool but a bit trickier. The most important concept is Modes.\nNormal Mode: You cannot type text. You can view contents or give commands. Insert Mode: You can actually type and edit. 
How to survive inside Vim:\nType vim reply.txt. Press i to start typing (Insert Mode). When done, press Esc (to exit Insert Mode). Type :wq and Enter (Write and Quit). Stuck and panicking? Press Esc, then type :q! and Enter (Force Quit without saving). \u0026gt; 6. Secret Tips # Let\u0026rsquo;s keep these tips between us.\n1. Tab Autocomplete (Magic Key) Don\u0026rsquo;t type long filenames manually. Just type the first few letters and hit the TAB key.\n$ cd /home/my_family/first_f [TAB] # Becomes: $ cd /home/my_family/first_floor/ 2. History (Arrow Keys) Have you used the command before? Don\u0026rsquo;t retype the whole command. Just browse previously used commands with the Up/Down Arrow keys and run the one you need.\n3. The Abort Button (Ctrl+C) Stuck in a running program? Or typed a wrong command that you want to cancel? Press Ctrl + C.\n[user@linux]$ i_wrote_a_very_long_random_command [Ctrl+C] [user@linux]$ i_wrote_a_very_long_random_command^C [user@linux]$ (Canceled!) 4. The Clean Slate (clear) Is your screen too messy? Type clear. It wipes the screen clean.\nThe Forbidden Spell One last warning. Don\u0026rsquo;t ever run this:\n$ rm -rf / This is the Nuke Button for your Linux world. It tries to delete everything from the root directory. Once launched, there is no going back.\nSummary\nNavigate: pwd (Where am I?), ls (What\u0026rsquo;s around here?), cd (Enter/Change location). Manage: mkdir (Create directory), touch (Create file), cp (Copy), mv (Move/Rename), rm (Destroy). View: cat (Short), less (Long), head/tail (Top/Bottom). Edit: nano (Simple), vim (Advanced). Survival: TAB to autocomplete, Ctrl+C to abort. Great job! 
You can now move comfortably in the darkness.\nHappy Computing!\n","date":"9 1월 2026","externalUrl":null,"permalink":"/posts/linux101-01/","section":"Posts","summary":"","title":"[Linux 101] The Terminal: Don't Be Afraid of the Dark","type":"posts"},{"content":"","date":"9 1월 2026","externalUrl":null,"permalink":"/series/linux-101/","section":"Series","summary":"","title":"Linux 101","type":"series"},{"content":"","date":"9 1월 2026","externalUrl":null,"permalink":"/tags/terminal/","section":"Tags","summary":"","title":"Terminal","type":"tags"},{"content":"We have the computing power. Now we need some Data.\nMoving files between your local machine (laptop/workstation) and the HPC cluster is a daily routine for researchers. You have your code, input data, and eventually, the results. This guide covers some basics for file transfer, from \u0026ldquo;packing\u0026rdquo; your files to handling massive datasets.\n*(Click the image to watch the tutorial on YouTube)* \u0026gt; 1. The Golden Rule: Pack Before You Move # Think of this process like moving into a new house.\nIn the previous post, we compared the HPC cluster to a Hotel. Let\u0026rsquo;s assume your laptop is your old house. Now, you need to move your belongings (data) to the new place (HPC cluster).\nImagine you have 10,000 pairs of socks (small data files). Would you carry them one by one to the moving truck? No, since it\u0026rsquo;ll take forever, you would put them in a box first.\nIn HPC, transferring thousands of small files individually kills network performance due to overhead. So, you should always archive your files or folder first.\nChoose Your Box: Tar vs. 
Zip # # Packing (create archive) $ tar -czf my_data.tar.gz my_folder # -c: Create # -z: Gzip compression # -f: File name # Unpacking (extract archive) $ tar -xf my_data.tar.gz # -x: Extract # -f: File name # (On most modern systems, tar detects compression automatically) # Packing (create archive) $ zip -r my_data.zip my_folder # -r: Recursive (includes all subdirectories) # Unpacking (extract archive) $ unzip my_data.zip \u0026gt; 2. Direct Download (Web to HPC) # Scenario: Your data is hosted on a website.\nDo not download it to your laptop just to upload it again to the cluster. That is unnecessary double work. Just order your \u0026ldquo;delivery\u0026rdquo; directly to your new house (Cluster)!\nUse wget or curl on a cluster\u0026rsquo;s compute node (or a designated data transfer node, if your cluster provides one). Using a login node for large file transfers is usually not recommended.\n# Option 1: Using wget # wget \u0026lt;File Address\u0026gt; $ wget https://example.com/dataset.tar.gz # Option 2: Using curl # curl -o \u0026lt;File Name\u0026gt; \u0026lt;File Address\u0026gt; $ curl -o dataset.tar.gz https://example.com/dataset.tar.gz \u0026gt; 3. Transfer Tools: SCP vs. Rsync # Scenario: The files are on your laptop. (Note: Run the following commands on your Local Terminal, not inside the cluster)\nSCP (The \u0026ldquo;Simple Throw\u0026rdquo;) # If you have a small file or a single packed archive, use scp (Secure Copy). It is simple and quick.\n# Upload: Laptop -\u0026gt; Cluster $ scp my_data.tar.gz \u0026lt;USER\u0026gt;@\u0026lt;HOST_NAME\u0026gt;:~/ # Example: scp my_data.tar.gz user123@login.university.edu:~/ # Download: Cluster -\u0026gt; Laptop $ scp \u0026lt;USER\u0026gt;@\u0026lt;HOST_NAME\u0026gt;:~/results.tar.gz ./ # Example: scp user123@login.university.edu:~/results.tar.gz ./ Rsync (The \u0026ldquo;Smart Mover\u0026rdquo;) # What if your file is huge (e.g., 100GB) and your WiFi disconnects at 99%? 
scp will fail, and you will have to start over from 0%. That is going to be a nightmare.\nTry rsync instead. It compares the source and destination and sends only what is missing. If the connection drops, rerun the same command and it resumes from where it left off.\n$ rsync -azP my_big_data \u0026lt;USER\u0026gt;@\u0026lt;CLUSTER\u0026gt;:~/ # Example: rsync -azP data.tar.gz user123@login.university.edu:~/ Understanding the flags (-azP):\n-a: Archive mode. Preserves permissions, timestamps, and symbolic links. -z: Compresses file data during the transfer, which can help on slow connections. -P: Shows a Progress bar and keeps Partially transferred files (so a transfer can resume). Rule of Thumb:\nSmall file or Simple transfer? Use SCP Big file or Unstable network? Use Rsync \u0026gt; 4. GUI Clients (WinSCP \u0026amp; FileZilla) # \u0026ldquo;I hate the terminal. Can I just drag and drop?\u0026rdquo;\nYes, you can! If you are not comfortable with command-line tools yet, or if you just want to browse files visually, use an SFTP Client.\nRecommended Tools # Windows only: WinSCP (Most popular) Windows/Mac/Linux: FileZilla Windows/Mac: Cyberduck How to Connect # The settings are exactly the same as your SSH connection.\nFile Protocol: SFTP Host name: Your cluster address (e.g., login.university.edu) Port number: 22 (Default SSH port) User/Password: Your credentials Once connected, you will see your laptop\u0026rsquo;s files on the left and the cluster\u0026rsquo;s files on the right. Just drag and drop to transfer!\nNote for Globus Users: If you need to transfer massive datasets (Terabytes/Petabytes) between institutions or clusters, ask your system administrator about Globus. It is a high-performance transfer service often supported by research centers. It is typically faster and more reliable than SCP/SFTP for large data.\n\u0026gt; 5. Code Management with Git # Scenario: Moving your Python/C++ scripts.\nShould you use rsync for your code? You can, but there is a better method. Treat your code like books in a library. 
You can keep old editions while adding new ones, and check out any of them whenever you need. Try Git.\nLaptop: Commit and push your code to GitHub/GitLab.\n# Commit your changes (-a stages modified tracked files; use git add for new files) $ git commit -a -m \u0026#34;Commit Message\u0026#34; # Push your changes to GitHub $ git push Cluster: Clone or Pull the repository.\n# On the Cluster $ git clone https://github.com/username/my-project.git # Pull changes $ git pull This keeps your version history safe and makes collaboration much easier.\n\u0026gt; 6. Storage Quota # Warning: Remember the \u0026ldquo;Hotel Room\u0026rdquo; analogy? Your room has an occupancy limit. We call it Quota.\nIf you fill up your disk space, your jobs will crash, and you might not be able to save files or even log in.\nHow to check? Commands vary by institution. Common examples include:\n$ quota -s $ lfs quota -u user123 /home/user123 $ check_usage Please check your user documentation or ask your support team for the specific command. Always check your available space before transferring a massive dataset.\nSummary\nPack your small files (tar or zip). Use wget or curl for web data. Use scp for quick, small transfers. Use rsync -azP for large, robust transfers. Use git for code. Nice job! You have learned how to prepare your data. 
In the next post, we will learn how to manage software environments using Conda.\nHappy Computing!\n","date":"2 1월 2026","externalUrl":null,"permalink":"/posts/hpc101-02/","section":"Posts","summary":"","title":"[HPC 101] Data Transfer: How to Move Files In and Out","type":"posts"},{"content":"","date":"2 1월 2026","externalUrl":null,"permalink":"/tags/file-transfer/","section":"Tags","summary":"","title":"File Transfer","type":"tags"},{"content":"","date":"2 1월 2026","externalUrl":null,"permalink":"/tags/git/","section":"Tags","summary":"","title":"Git","type":"tags"},{"content":"","date":"2 1월 2026","externalUrl":null,"permalink":"/tags/rsync/","section":"Tags","summary":"","title":"Rsync","type":"tags"},{"content":"","date":"2 1월 2026","externalUrl":null,"permalink":"/tags/scp/","section":"Tags","summary":"","title":"SCP","type":"tags"},{"content":"Welcome to the HPC 101 series!\nThis guide covers the essential knowledge of High-Performance Computing (HPC). If you are new to HPC, don\u0026rsquo;t worry. We\u0026rsquo;ll walk through basics step-by-step, from logging in to submitting your first job.\n\u0026gt; 1. What is HPC? # High-Performance Computing (HPC) utilizes supercomputers or computer clusters to solve complex computational problems. While a standard workstation can handle everyday tasks, HPC is designed for massive scale, widely used in fields ranging from engineering and science to finance and psychology. It is a rapidly growing technology, especially in the age of AI and Machine Learning.\nResearch institutes and companies around the world use HPC to develop new products or run intensive simulations. One of the world’s fastest HPC systems, El Capitan, is hosted by Lawrence Livermore National Laboratory. (Reference).\nWhy do we use HPC? # HPC is a powerful tool that allows researchers and engineers to solve problems demanding high computational performance which cannot be handled by normal desktop PCs. 
Here are some example cases,\nAI/ML: Training large models using multiple GPUs Pharmaceutics: Simulating molecular dynamics to develop new medicines Physics/Chemistry: Running quantum chemistry or simulating protein folding Meteorology: Processing large data for accurate weather forecasting \u0026gt; 2. How to SSH into an HPC Cluster # Before we compute, we need to connect to the cluster. Watch the tutorial video below or follow the text guide below.\n*(Click the image to watch the tutorial on YouTube)* What is SSH? # SSH (Secure Shell) is a network protocol that enables secure connections between computers. It is used for remote access, command execution, and file transfers. Don\u0026rsquo;t worry if these terms sound too technical. Simply, think of it as a secure tunnel connecting your PC to the HPC cluster.\nLet\u0026rsquo;s connect! # Open a terminal window.\nLinux/Mac: Open the built-in Terminal app Windows: Use Command Prompt (CMD), PowerShell, or third party tools like PuTTY or MobaXterm Type the following command:\n$ ssh \u0026lt;YOUR_ID\u0026gt;@\u0026lt;CLUSTER_HOST_NAME\u0026gt; # Example: $ ssh user123@login.university.edu (Note: The $ sign indicates the command-line prompt. Do not type it.)\nSecurity Prompt: If this is your first time connecting, you will see a message asking: \u0026ldquo;Are you sure you want to continue connecting?\u0026rdquo; Type yes and press Enter.\nEnter Password:\nType your user password.\nNote: You will NOT see asterisks (****) or a cursor moving. This is a standard security feature in Linux. Just type your password and press Enter.\nSuccess:\nIf you see a screen similar to the one below, you have successfully logged in!\n[user123@login-node-01 ~]$ \u0026gt; 3. How to use Modules # On HPC, you can’t simply install software with sudo apt-get or sudo dnf. Instead, we use the Module System.\n*(Click the image to watch the tutorial on YouTube)* What is the Module System? 
# Most HPC clusters manage software using a module system like Environment Modules or Lmod. Unlike your personal computer, where you can install software system-wide, an HPC cluster is shared by many users, so software is provided through modules. This brings several benefits:\nNo Conflicts: Different users can use different software versions simultaneously Reproducibility: You can keep your environment consistent for your research Auto-loading: When you load a module (e.g., OpenMPI), it can automatically load necessary dependencies (e.g., GCC compilers), depending on how your cluster is configured Essential Commands # Here is a cheat sheet for module commands:\n# View list of ALL available modules on the system $ module avail # Load a specific module $ module load \u0026lt;NAME\u0026gt;/\u0026lt;VERSION\u0026gt; # Example: module load openmpi/4.1.8 # View list of CURRENTLY loaded modules $ module list # Unload a module $ module unload \u0026lt;NAME\u0026gt; # Unload ALL modules $ module purge Recommended Practices # Avoid .bashrc: Do not put module load commands in your .bashrc file. This could cause conflicts and login issues. Check availability first: Use module avail to see the exact name and version. Be specific: Always specify the version number (e.g., module load openmpi/4.1.8). If you do not, the default version is loaded, and that default may change after a system update. \u0026gt; 4. Submit Your First Job with Slurm # Now, you\u0026rsquo;re ready to submit a job.\n*(Click the image to watch the tutorial on YouTube)* What is a Job Scheduler? # In an HPC environment, you do not run heavy calculations directly on the Login Node. Instead, you submit a \u0026ldquo;job\u0026rdquo; to a Scheduler like Slurm, PBS, SGE, or LSF. The scheduler manages resources and assigns your job to available Compute Nodes.\nNote: This tutorial primarily focuses on Slurm, one of the most widely used schedulers in modern HPC systems. PBS/Torque examples are provided for reference, but commands and options may vary. 
Always check your cluster\u0026rsquo;s documentation for scheduler-specific syntax.\nInteractive Jobs: Useful for development, debugging, or tasks requiring a GUI. You get a shell on a compute node. Batch Jobs: Useful for long-running tasks. You submit a script, and the system runs it when resources are available.\nThe \u0026ldquo;Hotel\u0026rdquo; Analogy # Beginners sometimes make the mistake of running heavy tasks directly after logging in. Please don’t do that.\nThink of the HPC cluster as a Hotel.\nLogin Node = Hotel Lobby: This is where you check in. It’s a shared space. You wouldn’t set up a tent and sleep in the lobby, right? Compute Node = Guest Room: This is your private room where you can actually work (sleep). Scheduler = Receptionist: You ask the receptionist (Scheduler) for a room (Resources), and they assign you one. We use a job scheduler like Slurm to ask for resources.\nLet\u0026rsquo;s submit an Interactive Job # Use this when you need to test or debug code in real time.\nRequest a session (get a room):\n[user123@login-node-01]$ srun --pty bash srun: job 12345 queued and waiting for resources srun: job 12345 has been allocated resources [user123@compute-node-01]$ # Note: your cluster may require specifying a partition: # $ srun -p interactive --pty bash Your hostname will change from login-node-01 to compute-node-01. You are now in your “Guest Room”. When you are done, type exit to return to the login node (lobby):\n[user123@compute-node-01 ~]$ exit [user123@login-node-01 ~]$ Let\u0026rsquo;s submit a Batch Job # This is for long-running simulations. You write a \u0026ldquo;batch script\u0026rdquo; (reservation request) and submit it.\nCreate a script (e.g., job_script.sh) using a text editor like vim or nano. 
#!/bin/bash # Tells the system that this is a Bash script #SBATCH --account=myAcct # Account name #SBATCH --partition=myPart # Partition name #SBATCH --job-name=first_job # Job name #SBATCH --output=result.out # Standard output log #SBATCH --error=result.err # Standard error log #SBATCH --nodes=1 # Number of nodes #SBATCH --ntasks=1 # Number of tasks (processes) #SBATCH --time=00:10:00 # Time limit (HH:MM:SS) #SBATCH --mem-per-cpu=4G # Memory per cpu # Load necessary modules module load python/3.12.12 # Run your command echo \u0026#34;Hello, HPC World!\u0026#34; python3 --version #!/bin/bash # Tells the system that this is a Bash script #PBS -A myAcct # Account name #PBS -q myQueue # Queue name #PBS -N first_job # Job name #PBS -o result.out # Standard output log #PBS -e result.err # Standard error log #PBS -l nodes=1:ppn=1 # Number of nodes and processors per node #PBS -l walltime=00:10:00 # Time limit (HH:MM:SS) #PBS -l pmem=4gb # Memory per cpu # Load necessary modules module load python/3.12.12 # Change to submission directory cd $PBS_O_WORKDIR # Run your command echo \u0026#34;Hello, HPC World!\u0026#34; python3 --version Notes: Make sure to modify the script to meet your requirements\n(Important: Replace \u0026ldquo;myAcct\u0026rdquo; and \u0026ldquo;myPart\u0026rdquo; with your actual account and partition names provided by your system administrator.) 
#SBATCH: Slurm directives readable to Slurm scheduler\n(#SBATCH is one word not \u0026ldquo;# SBATCH\u0026rdquo;) Actual tasks located under Slurm directives Your job will get terminated once your tasks are done\n(in case you submitted a longer time than required) Submit the job: $ sbatch job_script.sh Submitted batch job 12345 $ qsub job_script.sh 12345.headnode (Remember this Job ID (12345) and reference this number in your ticket!)\nCheck the status: $ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 12345 myPart first_job user123 R 0:02 1 compute-01 $ qstat -u user123 Job ID Name User Time Use S Queue -------- -------- -------- -------- - ----- 12345 first_job user123 0:02 R myQueue Job Status Columns (Slurm):\nColumn Description JOBID Your Job\u0026rsquo;s assigned ID PARTITION Partition name NAME Job name USER User name ST Job status: R=Running, PD=Pending, F=Failed, S=Suspended, CG=Completing TIME Time elapsed since job started NODES Number of requested nodes In case you want to cancel the job, use scancel \u0026lt;JOBID\u0026gt; or qdel \u0026lt;JOBID\u0026gt;\n$ scancel 12345 $ qdel 12345 View results: Once the job finishes (or disappears from squeue), check the output file:\n# Success log $ cat result.out Hello, HPC World! Python 3.12.12 # Error log (If something went wrong) $ cat result.err Summary\nSSH: The secure tunnel to enter the cluster Modules: Load software Login Node (Lobby): Only for checking in Compute Node (Room): The actual place to run your work, assigned by the Scheduler Job Submission: Use sbatch for batch scripts and srun for interactive job Congratulations! You have successfully logged in, set up your environment, and run your first job. 
In the next post, we will move our bags (Data) to this new hotel room.\nNeed Help?\nCheck your cluster\u0026rsquo;s documentation for specific Slurm configurations Use man sbatch to see all available options Most clusters have a help channel or support email Happy Computing!\n","date":"27 12월 2025","externalUrl":null,"permalink":"/posts/hpc101-01/","section":"Posts","summary":"","title":"[HPC 101] First Steps to HPC: SSH, Modules, and Slurm","type":"posts"},{"content":"","date":"27 12월 2025","externalUrl":null,"permalink":"/tags/ssh/","section":"Tags","summary":"","title":"SSH","type":"tags"},{"content":"","date":"27 12월 2025","externalUrl":null,"permalink":"/tags/tutorial/","section":"Tags","summary":"","title":"Tutorial","type":"tags"},{"content":"","date":"16 12월 2025","externalUrl":null,"permalink":"/tags/first-commit/","section":"Tags","summary":"","title":"First Commit","type":"tags"},{"content":"Welcome to The Login Node. This blog documents my journey in HPC infrastructure and performance engineering. More experiments and lab notes are coming soon. Stay tuned.\n","date":"16 12월 2025","externalUrl":null,"permalink":"/posts/hello-world/","section":"Posts","summary":"","title":"Hello World!","type":"posts"},{"content":" 🔧 HPC From Scratch \u0026ndash; Build a 6-node cluster from ordinary PC parts for under $1,300. Hardware selection, OS installation, networking, Slurm, Ansible, and GPU workloads. Start here. 🎓 HPC 101 \u0026ndash; SSH, the module system, Slurm basics, and job debugging. A series for researchers who are new to HPC. Start here. 🐧 Linux 101 \u0026ndash; Command-line basics for anyone unfamiliar with the terminal. Start here. ","date":"2025년 1월 1일","externalUrl":null,"permalink":"/ko/","section":"","summary":"","title":"","type":"page"},{"content":"Hello, I\u0026rsquo;m Will Paik. Welcome to The Login Node.\nI scale and optimize AI/ML models in large-scale HPC environments. There is always a subtle tension in the world of supercomputing. System administrators insist, \u0026ldquo;The servers must not go down!\u0026rdquo; while researchers demand, \u0026ldquo;Just make it run faster!\u0026rdquo; My role is to find the technical sweet spot between the two.\nMy current day job is HPC machine learning performance engineer. 
By day, I optimize large clusters for training huge AI models. By night, I build a mini supercomputer in my room (and occasionally fry it) and run experiments so I can explain the underlying principles in simple terms.\nCORE STACK: Slurm Linux Docker/Apptainer PyTorch Distributed Ansible\nWhat This Blog Covers # The Login Node is an HPC and ML infrastructure engineering blog. It is a space for people who want to understand how the system actually works, not just how to submit a job and wait.\nThe content is organized into three series:\n🔧 HPC From Scratch \u0026ndash; Build a 6-node cluster from ordinary PC parts for under $1,300. Hardware selection, OS installation, networking, Slurm, Ansible, and GPU workloads. Start here. 🎓 HPC 101 \u0026ndash; SSH, the module system, Slurm basics, and job debugging. A series for researchers who are new to HPC. Start here. 🐧 Linux 101 \u0026ndash; Command-line basics for anyone unfamiliar with the terminal. Start here. Home Cluster # Role Hardware Specs Login node Lenovo IdeaPad 1 Ryzen 5 7520U, 8GB RAM Management node Lenovo ThinkCentre M715q Ryzen 5 2400GE, 16GB RAM Visualization node Lenovo ThinkCentre M715q Ryzen 5 2400GE, 16GB RAM Worker nodes (x2) Lenovo ThinkCentre M715q Ryzen 5 2400GE, 16GB RAM GPU node HP Envy TE01 Core i7-10700F, 32GB RAM, GTX 1660 Super Storage (via management node) 1TB NVMe SSD (NFS) Network Gigabit managed switch 8 ports, VLAN support Software stack: Rocky Linux 10, Slurm 25, Ansible, Apptainer, Prometheus + Grafana (in progress)\nBackground # I earned a Ph.D. in Aerospace Engineering (with a minor in Computational Science) from Penn State University, and went on to support more than 500 researchers over eight years. I currently work at Northeastern University. My aerospace background shaped how I look at large-scale optimization problems, and I now apply that approach to GPU clusters instead of spacecraft trajectories.\nMy full career history is available on the Career page.\nContact # GitHub LinkedIn YouTube ","date":"2025년 1월 1일","externalUrl":null,"permalink":"/ko/about/","section":"","summary":"","title":"About","type":"page"},{"content":"Date: 2015–2016 Institution: Pennsylvania State University, University Park, PA\nI served as a mentor and instructor for engineering undergraduates, focusing on computational methods and programming logic.\nAerospace Analysis: Assisted students with numerical methods and engineering analysis. Programming for Engineers: Mentored students on MATLAB programming logic and algorithm development. 
(This entry archives past academic teaching experience at Penn State University.)\n","date":"1 1월 2015","externalUrl":null,"permalink":"/talks/teaching-psu/","section":"Talks \u0026 Workshops","summary":"","title":"Academic Teaching Experience (2015–2016)","type":"talks"},{"content":"","date":"1 1월 2015","externalUrl":null,"permalink":"/tags/teaching/","section":"Tags","summary":"","title":"Teaching","type":"tags"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"I have more than nine years of experience in system architecture and performance optimization in HPC. I currently focus on maximizing the real-world performance of large-scale AI/ML workloads, and I have worked across environments ranging from university clusters and state-level research infrastructure to a home cluster I built myself.\nExperience # HPC Machine Learning Performance Engineer Research Computing, Northeastern University | January 2025 – Present\nOptimize distributed ML workloads on production HPC clusters Analyze and benchmark AI/ML application performance Help researchers across disciplines optimize their computing workflows GPU Training Benchmarking, AICR Benchmarking Group Massachusetts AI Computing Resource (AICR) | 2026\nParticipate in benchmarking Massachusetts state AI research infrastructure, separately from Northeastern University Evaluate GPU training workloads on B200 and RTX Pro 6000 clusters in multi-node configurations HPC Software Consultant Institute for Computational and Data Sciences, Penn State | January 2017 – December 2024\nSupported more than 500 researchers over 8 years Optimized cluster performance and provided user support Developed container environments for reproducible research (Singularity Hub contributor) Handled system monitoring, resource allocation, and job optimization Parallel Computing Support Application Engineer (Internship) MathWorks | Summer 2021\nOptimized the MATLAB Parallel Computing Toolbox Developed distributed computing performance benchmarks Wrote documentation for integrating MATLAB with HPC Technical Skills # Category Tools Schedulers Slurm, PBS (job arrays, dependency chains, resource optimization) Parallel computing MPI (OpenMPI, Intel MPI), OpenMP, CUDA Storage NFS, parallel file systems, data management strategies Containers Singularity/Apptainer, Docker, Podman Automation Ansible, Bash scripting, system provisioning Monitoring Prometheus, Grafana, performance metrics Languages Python, C/C++, Fortran, MATLAB, Shell Version control Git, GitLab CI/CD Projects # Project Description HPC From Scratch Built a 6-node cluster from consumer hardware. Slurm, Ansible, NFS, FreeIPA, Lmod. 
PyTorch DDP Benchmark Multi-GPU/multi-node distributed training scaling benchmark for HPC clusters. [GitHub] pkg_audit RPM package consistency audit tool with full Slurm partition sweeps and Ansible remediation. [GitHub] 4D LiDAR SLAM Optimization Parallelized point cloud processing for real-time ROS 2 performance Side Projects (coming soon) Game of Life web app, browser poker game, ESP32-P4 thermal camera Other Engineering Projects NIST First Responder UAS Indoor Challenge (2022) Awards: 3rd place + First Responder\u0026rsquo;s Choice ($80,000 prize) Built a custom quadcopter for GPS-denied indoor emergency scenarios. [nist.gov]\nVFS Design-Build-Vertical Flight Student Competition (2021, 2022) Awards: 3rd place (2022); 1st place preliminary report + Best Computational Simulation award (2021). [engr.psu.edu]\n29th and 30th Intelligent Ground Vehicle Competition (2022, 2023) Team lead; developed an autonomous ground robot with a 20-pound payload.\nInteractive and Collaborative Robot Assist Project (2022) Pennsylvania State University, Robot Ethics and Aerial Vehicles Lab.\n9th and 10th ESA Global Trajectory Optimization Competition (2017, 2019) Developed parallel algorithms for complex trajectory optimization. [psu.edu]\nEducation # Pennsylvania State University\nDegree Year Notes Ph.D. in Aerospace Engineering 2024 Minor in Computer Science. Dissertation: Multiple Gravity-Assist Trajectory Design with Continuous-Thrust Synergetic Maneuvers M.S. in Aerospace Engineering 2015 Minor in Computer Science. Thesis: Optimal Orbit Raising Via Particle Swarm Optimization B.S. in Aerospace Engineering 2013 Talks and Workshops # See the Talks page for details.\nTalk Venue Year Introduction to Parallel Computing Northeastern University Spring 2026 Linux Essentials for HPC Researchers Northeastern University Spring 2026 Aerospace Analysis, Programming for Engineers TA Penn State 2015–2016 ","externalUrl":null,"permalink":"/ko/cv/","section":"","summary":"","title":"CV","type":"page"}]