[HPC From Scratch] Episode 2: RAM, NVMe, and the iGPU Memory Trap

Four nodes. 16GB each. One hidden BIOS setting that can crash your Slurm jobs.

Welcome back to HPC From Scratch. In Episode 1, we covered the full cluster architecture, cost breakdown, and network layout. This episode focuses on the compute backbone: upgrading the four Lenovo ThinkCentre M715q nodes with dual-channel RAM and NVMe storage, and fixing a BIOS setting that silently eats your memory.

(Click the image to watch the tutorial on YouTube)

> 1. What We Are Working With

Each M715q is a tiny Micro Form Factor PC. Here is what they shipped with from eBay:

CPU: AMD Ryzen 5 Pro 2400GE (4C/8T, 35W TDP)
RAM: 8GB DDR4 SO-DIMM (single stick, single-channel)
Boot Drive: 240GB 2.5” SATA SSD
M.2 Slot: Empty (NVMe capable)
iGPU: AMD Radeon RX Vega 11

The Ryzen 5 Pro 2400GE is a 35-watt part: quiet and low-power, which matters when you have four of them sitting on your desk. At 4 cores and 8 threads per node, the four M715q nodes give us 32 threads total.

The stock 8GB of single-channel RAM is not enough for most HPC workloads. And the fact that the boot drive was a 2.5” SATA SSD turned out to work in our favor for the storage upgrade.


> 2. The Storage Upgrade Path

When I opened the first M715q, I found the 240GB SATA SSD sitting in the 2.5” bay and an empty M.2 NVMe slot on the motherboard.

Because the OS boots from the SATA port, the high-speed M.2 slot is free. I realized that I could install a 1TB Samsung 990 Pro into that slot on the management node. This drive serves as the NFS storage for the entire cluster.
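For reference, here is a minimal sketch of how that drive could be wired up on the management node. The mount point `/srv/nfs` and the subnet `10.0.0.0/24` are placeholder assumptions, not the cluster's actual values; the real configuration comes later in the series.

```shell
# Sketch only; mount point, UUID, and subnet are placeholder assumptions.
# /etc/fstab entry on the management node:
#   UUID=<990-pro-uuid>  /srv/nfs  ext4  defaults,noatime  0 2
# /etc/exports entry, allowing the cluster subnet:
#   /srv/nfs  10.0.0.0/24(rw,sync,no_subtree_check)
```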

If the boot drive had been an M.2 SSD instead (which I believe is a default option for these units), the upgrade path would have been different. I would have bought a standard SATA SSD for NFS instead. You work with the hardware you get.

A PCIe Gen 4 NVMe drive is probably overkill for a Gigabit Ethernet network. Even more so because the M715q’s M.2 slot is PCIe 3.0, so the 990 Pro runs at Gen 3 speeds anyway. The network will bottleneck long before the drive does. We will benchmark the throughput we actually get in a later episode.
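The back-of-the-envelope numbers make the mismatch obvious. These are rough theoretical ceilings, not measured figures:

```shell
# Gigabit Ethernet line rate: 1 Gbit/s = 125 MB/s.
# PCIe 3.0 x4 NVMe sequential read: roughly 3,500 MB/s.
awk 'BEGIN { printf "drive/network ratio: ~%.0fx\n", 3500 / 125 }'
```

Even a basic SATA SSD at ~550 MB/s would saturate the Gigabit link several times over.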

Note: Only the management node gets the NVMe drive. The other three M715q nodes keep their stock 240GB SATA SSDs as boot drives. There is no need for local fast storage on compute nodes when jobs read data from NFS.


> 3. RAM Upgrade: 8GB to 16GB Dual-Channel

8GB is rarely enough for HPC data processing. The upgrade strategy was simple: add a second stick instead of replacing the existing one with a single larger module.

Each M715q came with one 8GB DDR4 SO-DIMM in one slot, leaving the second slot empty. I bought matching 8GB sticks and installed them in the empty slots. This gives us two benefits:

  1. Double the capacity (8GB to 16GB)
  2. Dual-channel memory bandwidth

Dual-channel matters for compute. With a single stick, the CPU accesses memory through one channel. With two sticks in both slots, it can read and write through two channels simultaneously. This roughly doubles the theoretical memory bandwidth, which directly affects performance in memory-bound workloads like MPI and numerical computation.

RAM Compatibility

The M715q uses DDR4 SO-DIMM (laptop-sized) memory. When buying used RAM, match the specifications as closely as possible to the existing stick:

Form Factor: DDR4 SO-DIMM
Capacity: 8GB (to match the existing stick)
Speed: DDR4-2666 or higher (the 2400GE supports up to DDR4-2933)
Voltage: 1.2V (standard DDR4)

(The M715q units I purchased came with DDR4-2666.)
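To see exactly what is already installed before buying, the SMBIOS tables report each slot's size, speed, and part number. This needs root, and the exact locator names vary by board:

```shell
# One block is printed per memory slot; empty slots show "No Module Installed".
sudo dmidecode -t memory | grep -E 'Locator:|Size:|Speed:|Part Number:'
```

After the upgrade, the same command is a quick way to confirm both slots are populated, which is what enables dual-channel operation.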

I bought my RAM sticks on eBay. The four sticks cost a total of $78, averaging about $20 per node for the upgrade. If you buy 16GB kits (2x8GB) new, expect to pay more, but compatibility is guaranteed.

Tip: If you are unsure about compatibility, look up the part number printed on the existing stick and search for the same or an equivalent part online.

Installation

Opening the M715q is straightforward. Remove one screw on the back panel, slide the top cover off, and the internals are fully exposed. Once you remove the 2.5” SATA bay (one screw, then slide forward), the two SO-DIMM slots are clearly visible. Push the new stick into the empty slot until the clips snap into place.

I upgraded all four nodes. The management node took a bit longer because it also got the Samsung 990 Pro NVMe drive. The other three were just RAM, so I ran through them quickly.


> 4. The iGPU Memory Trap

This is the part that might save you hours of debugging later.

After installing the RAM, I booted the management node into a Linux Live USB using the NanoKVM (no monitor or keyboard needed). I opened a terminal and ran:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15661        1656       10369          73        3989       14005

15,661 MiB. We installed 16GiB (16,384 MiB). Where did the other ~700 MiB go?

The answer: the integrated Vega GPU. Ryzen APUs share system RAM with the iGPU (integrated GPU). The GPU reserves a portion of your physical memory as video memory (VRAM), and the operating system never sees it.
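The gap itself is easy to quantify using the numbers from my node above:

```shell
# Installed (16 GiB) minus what the kernel reports, in MiB:
awk 'BEGIN { print 16 * 1024 - 15661 }'
```

Of those ~723 MiB, 256 MiB is the VRAM carve-out; the rest goes to firmware and other hardware reservations.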

I confirmed this with the following command:

$ dmesg | grep VRAM

The output showed 256MB allocated to VRAM.
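On a booted system with the amdgpu driver loaded, the same figure can also be read from sysfs. The card index may differ on your machine:

```shell
# amdgpu exposes the VRAM carve-out in bytes (card index may vary):
#   cat /sys/class/drm/card0/device/mem_info_vram_total
# Converting 256 MiB to the byte count you should expect to see:
awk 'BEGIN { print 256 * 1024 * 1024 }'
```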

(Image: M715q VRAM allocation shown in dmesg)

BIOS Setting: UMA Frame Buffer Size

The amount of RAM reserved for the iGPU is controlled by a BIOS setting called UMA Frame Buffer Size. On my units, the default was set to Auto, which allocated 256MB.

I explicitly set it to 256MB (the lowest available option). Why bother changing it if Auto was already picking 256MB? Because Auto let the firmware decide, and that decision could change after a BIOS update or a hardware configuration change. If the iGPU suddenly grabs 512MB instead of 256MB, your Slurm jobs could start failing and the error messages will not point you to the BIOS.

Pinning it to a fixed value removes that variable.

(Image: UMA Frame Buffer Size setting in the M715q BIOS)

Why This Matters for Slurm

When you configure Slurm later in this series, each node’s memory must be declared in slurm.conf using the RealMemory parameter. If you set RealMemory=16000 because you installed 16GB, Slurm will try to allocate memory that does not exist. Jobs will crash with out-of-memory errors.

The correct approach:

  1. Boot the node
  2. Run free -m and note the total value
  3. Use that number (or slightly below it) as RealMemory in your Slurm configuration
# Example slurm.conf entry
NodeName=interceptor-01 CPUs=8 RealMemory=15600 State=UNKNOWN
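Steps 2 and 3 can be scripted. This sketch subtracts a little headroom below the reported total; the 256 MiB margin is my own convention, not a Slurm requirement:

```shell
# Print a conservative RealMemory value: reported total minus headroom.
free -m | awk '/^Mem:/ { print $2 - 256 }'
```

Slurm can also report its own view of the node with `slurmd -C`, which prints a ready-made NodeName line worth cross-checking against this number.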

Every megabyte counts. Document it now, and save yourself the debugging later.


> 5. Upgrade Cost Summary

Here is what this episode’s upgrades cost:

DDR4 8GB SO-DIMM (Micron): 2 x $15.00 = $30.00
DDR4 8GB SO-DIMM (Hynix): 2 x $24.00 = $48.00
Samsung 990 Pro 1TB NVMe: 1 x $109.90 = $109.90
Episode Total: $187.90

Combined with the four M715q units from Episode 1 ($343.60), the total compute backbone cost so far is $531.50. That covers four Ryzen nodes with 16GB dual-channel RAM each and 1TB of NVMe storage for NFS.

Per node (excluding the shared NFS drive): roughly $105 for a fully upgraded Ryzen 4C/8T compute node with 16GB of RAM.


> 6. What is Next

The compute backbone is assembled and verified. But hardware without a network is just a pile of metal.

In Episode 3, we will look at the rest of the cluster: the HP Envy TE01 GPU node with its Intel i7-10700F, the Gigabit network switch that connects everything, and why the login node uses WiFi to bridge the cluster to the internet.


Happy Computing!