From 1.2 GB to 469 MB
I had 5.2 million FileNode objects. I didn’t need them.
I asked a friend to test Renala. He works in graphic design, has a good eye for UI, and I knew he would be thorough and would not pull any punches. His opinion on whether something looks and feels right is worth more to me than most. He also owns an older Intel Mac with 16 GB of RAM, and one of his external disks had something like 13 million files on it.
He ran a full disk scan. The first thing he reported was not the memory. It was this: “I don’t understand what the black blocks are.” I spent an embarrassing amount of time chasing a rendering bug. It was not a rendering bug. He had unchecked the “Directory headers” display option, which hid the label text but kept the header background visible. I missed it because I had never turned that option off myself.
Then he mentioned, politely, that Activity Monitor showed Renala using 1.2 GB of RAM during the scan and that his machine had started swapping.
I had run the same scan dozens of times on an M3 Pro with 18 GB. I had never once looked at Activity Monitor. Every developer thinks their machine is representative.
The original data model was straightforward. FileNode was a class holding a UUID, a stored path string, an array of children, size, modification date, and a few flags. For 5.2 million files, it was a problem.
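Reconstructed from that description, the shape was roughly this. Field names and exact types are illustrative, not the shipped code:

```swift
import Foundation

// A sketch of the original per-file model, reconstructed from the field list
// above. Names and exact types are illustrative, not the shipped code.
final class FileNode {
    let id = UUID()                  // 16 bytes of identifier, per node
    let path: String                 // full path, stored redundantly
    var children: [FileNode] = []    // a separate heap-allocated array per directory
    var size: Int64 = 0
    var modificationDate = Date.distantPast
    var isDirectory = false          // "a few flags"

    init(path: String) { self.path = path }
}
```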
Five million class instances means five million separate heap allocations. Each one carries bookkeeping you did not ask for: a roughly 16-byte object header for class metadata and reference counting, on top of the fields you actually wanted. Heap fragmentation means the memory actually consumed is larger than the neat sum of the field sizes. On a machine with 16 GB of RAM, five million objects scattered across virtual memory is the difference between “runs fine” and “the OS starts swapping.”
The numbers here use Activity Monitor’s Memory column, because that is the number a user actually sees when the app starts swapping.1
The cost per FileNode broke down roughly like this:
Before (per FileNode class instance):

- UUID: 16 bytes
- path String: ~40 bytes average (heap)
- children array: 24-byte header + 8 bytes per child pointer
- metadata (size, mtime, flags): ~24 bytes
- Swift object overhead: ~16 bytes

Total per node: ~120 bytes. × 5.2M nodes = ~624 MB (plus heap fragmentation).
That part was easy to diagnose. The harder part was deciding what the replacement should look like.
Stage 1, the compact model that proved the direction
The first pass was not the final design. It was the first compact model that made the problem small enough to think about.
The UUID alone was 16 bytes per node. For 5.2 million nodes: 83 MB just for identifiers. The first fix for that part is arithmetic: switch to a UInt32 NodeID, 4 bytes each, total 21 MB. That was enough.2
Every node stored its full path, like /Users/dinh/Library/Application Support/SomeApp/SomeSubfolder/somefile.dat. That string is reconstructible from data you already have. Storing the full path is redundant. But redundancy at scale stops being a convenience and starts being a tax: an average path of 40-50 bytes, multiplied by 5.2 million nodes, is over 200 MB of strings that do not need to exist.
So the first compact target looked roughly like this:
First compact target:

- parentIndex: 4 bytes (UInt32)
- nameIndex: 4 bytes (UInt32)
- size: 8 bytes (Int64)
- type + flags: 4 bytes

Total per node: 20 bytes. × 5.2M nodes = ~104 MB, plus a name pool ceiling of ~156 MB (UTF-8, ~30 bytes average name). Total ceiling: ~260 MB.
Remove stored paths. Reconstruct on demand by walking up the parent chain. The name stays, because you need it for display and for reconstruction. The path goes.
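As a sketch, the stage-1 node boils down to something like this. The field names and the packing of type and flags into a single UInt32 are assumptions, not the shipped layout:

```swift
// A sketch of the stage-1 compact node. Field names and the packing of
// type + flags into one UInt32 are assumptions, not the shipped layout.
struct CompactNode {
    var size: Int64          // 8 bytes
    var parentIndex: UInt32  // 4 bytes: index of the parent in the flat store
    var nameIndex: UInt32    // 4 bytes: index into a shared name table
    var typeAndFlags: UInt32 // 4 bytes
}

// 20 bytes of fields; Int64 alignment pads the array stride to 24 bytes.
print(MemoryLayout<CompactNode>.size, MemoryLayout<CompactNode>.stride) // 20 24
```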
```mermaid
graph TD
    accTitle: FileNode memory layout before and after compaction
    accDescr: The original FileNode class uses approximately 120 bytes per node (UUID, path string, children array, metadata, object overhead), totaling 624 MB for 5.2 million nodes. The compact struct target uses 20 bytes per node (parent index, name index, size, flags), totaling 104 MB.
    subgraph FN["FileNode (class instance, heap)"]
        FN1["UUID: 16 bytes"]
        FN2["path String: ~40 bytes (heap)"]
        FN3["children: [FileNode], 24-byte header<br/>+ 8 bytes per child pointer"]
        FN4["metadata (size, mtime, flags): ~24 bytes"]
        FN5["Swift object overhead: ~16 bytes"]
        FN6["Per node: ~120 bytes<br/>5.2M nodes: ~624 MB + heap fragmentation"]
    end
    subgraph SN["First compact target"]
        SN1["parentIndex: 4 bytes"]
        SN2["nameIndex: 4 bytes"]
        SN3["size: 8 bytes"]
        SN4["type + flags: 4 bytes"]
        SN5["Target: 20 bytes<br/>5.2M nodes: ~104 MB + pooled name bytes"]
    end
    FN6 -.->|"replaced by"| SN5
```
| Structure | Per-node cost | Total at 5.2M nodes |
|---|---|---|
| FileNode (class, UUID) | ~120 bytes, plus per-directory child arrays | ~624 MB |
| First compact target | 20 bytes | ~104 MB |
Approximate figures, but enough to prove the point.3 The real villain was not the scan itself; it was object shape, redundant strings, and millions of tiny allocations.
The UInt32 IDs and path removal were the first pass. They proved I could claw back hundreds of megabytes. The version that shipped is messier, larger, and much more useful.
Stage 2, the store that actually shipped
The shipping design still uses a flat ContiguousArray<SlimNode>. SlimNode is a struct, not a class: in a ContiguousArray, struct elements are stored inline, one after another in memory, with no heap pointer per element. But the real SlimNode is wider than the first-pass sketch because it has to carry more than wishful thinking.
Current SlimNode in the shipped NodeStore:

- sizeOnDisk: 8 bytes
- cloudSize: 8 bytes
- id: 4 bytes (UInt32)
- nameOffset: 4 bytes (UInt32)
- totalFileCount: 4 bytes (UInt32)
- firstChild: 4 bytes (UInt32)
- childCount: 4 bytes (UInt32)
- lastModifiedDays: 4 bytes (UInt32)
- nameLength: 2 bytes (UInt16)
- flags: 2 bytes (UInt16)

Struct size: 44 bytes. Array stride: 48 bytes. × 5.2M nodes = ~250 MB.
Struct size is the payload itself. Array stride is the actual spacing in memory from one element to the next after alignment padding, which is why 44 bytes of fields turns into 48 bytes per stored entry.
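A sketch of that layout, using the field list above. The two 8-byte size fields are assumed to be Int64; this is not the literal shipped source:

```swift
// A sketch of the shipped SlimNode layout, built from the field list above.
// The 8-byte size fields are assumed to be Int64; the rest match the stated widths.
struct SlimNode {
    var sizeOnDisk: Int64          // 8 bytes
    var cloudSize: Int64           // 8 bytes
    var id: UInt32                 // 4 bytes
    var nameOffset: UInt32         // 4 bytes into the shared name pool
    var totalFileCount: UInt32     // 4 bytes: files in this subtree
    var firstChild: UInt32         // 4 bytes: index of the first child in the flat store
    var childCount: UInt32         // 4 bytes: number of direct children
    var lastModifiedDays: UInt32   // 4 bytes
    var nameLength: UInt16         // 2 bytes
    var flags: UInt16              // 2 bytes
}

// 44 bytes of fields; 8-byte alignment pads the array stride to 48.
print(MemoryLayout<SlimNode>.size, MemoryLayout<SlimNode>.stride) // 44 48
```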
The original FileNode held an [FileNode] array for its children. That array header is 24 bytes (pointer, length, capacity). Each element is an 8-byte pointer to a heap-allocated child object. For a directory with 50 children: 24 + 400 bytes, plus 50 separate heap allocations for the children themselves.
The NodeStore replaces all of this with the same ContiguousArray<SlimNode>. Every node in the entire scan lives in one contiguous block of memory, with no per-element heap allocation. Each SlimNode stores the index of its first child and the number of direct children. To find a node’s children, you read that range directly. No per-directory child array. No scan for matching parents. Just a slice of the flat store. The tradeoff: every tree operation becomes index arithmetic instead of following object references. The code is less readable, but the memory layout is predictable.
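Sketched against the SlimNode from the previous listing, the lookup is just index arithmetic. NodeStore, childRange(of:), and directChildrenSize(of:) are illustrative names, not the shipped API:

```swift
// A sketch of the flat store, assuming the SlimNode struct sketched above.
struct NodeStore {
    var nodes = ContiguousArray<SlimNode>()   // every node, one contiguous block

    // Children are a contiguous run: index arithmetic instead of a per-directory array.
    func childRange(of index: Int) -> Range<Int> {
        let node = nodes[index]
        let first = Int(node.firstChild)
        return first ..< first + Int(node.childCount)
    }

    // Example traversal: sum the direct children's sizes without allocating anything.
    func directChildrenSize(of index: Int) -> Int64 {
        var total: Int64 = 0
        for childIndex in childRange(of: index) {
            total += nodes[childIndex].sizeOnDisk
        }
        return total
    }
}
```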
The array is laid out in breadth-first order: the scan writes nodes level by level, but within each level, a directory’s children are written together as a contiguous segment. That is what matters for traversal: siblings live together, the next read is predictable, and the CPU spends less time chasing random heap pointers.
ContiguousArray matters here because it guarantees the inline, contiguous element storage the whole design depends on: predictable layout beats wishful thinking.4
The name pool is a single ContiguousArray<UInt8> holding UTF-8 bytes. Each SlimNode stores nameOffset and nameLength, and repeated names are interned (stored once and referenced by offset), so node_modules or index.js gets stored once and referenced many times. One byte buffer, one growth pattern, no per-name heap object.
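A minimal sketch of that pooling; the interning dictionary and the method names are assumptions, not the shipped code:

```swift
// A minimal name pool sketch: one UTF-8 byte buffer plus an interning map so
// repeated names are stored once. Names and structure are illustrative.
struct NamePool {
    private(set) var bytes = ContiguousArray<UInt8>()
    private var interned: [String: (offset: UInt32, length: UInt16)] = [:]

    // Returns the (offset, length) pair a SlimNode would store.
    mutating func intern(_ name: String) -> (offset: UInt32, length: UInt16) {
        if let existing = interned[name] { return existing }   // "node_modules" stored once
        let utf8 = Array(name.utf8)
        let entry = (offset: UInt32(bytes.count), length: UInt16(utf8.count))
        bytes.append(contentsOf: utf8)
        interned[name] = entry
        return entry
    }

    // Rebuilds a display name from its offset and length.
    func name(at offset: UInt32, length: UInt16) -> String {
        let start = Int(offset)
        return String(decoding: bytes[start ..< start + Int(length)], as: UTF8.self)
    }
}
```

In this sketch the interning dictionary is only needed while the pool is being built; once the scan is done, the byte buffer and the per-node offsets are all that has to stay resident.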
The first compact model assumed path reconstruction would just fall out of the flat store. It did not. The shipping design solves that separately. After the scan, Renala builds parentIndex and nameIndex dictionaries once in the background. UI code then walks the parent chain in $O(depth)$, collects names, and joins them into a path. The expensive part moved out of the hover loop. That was the point.
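A sketch of that $O(depth)$ walk, assuming dictionaries keyed by a UInt32 node ID; the names are illustrative:

```swift
// A sketch of O(depth) path reconstruction from the post-scan indexes.
// NodeID, parentIndex, and nameIndex are illustrative names for the
// dictionaries described above, not the shipped API.
typealias NodeID = UInt32

func reconstructPath(of id: NodeID,
                     parentIndex: [NodeID: NodeID],
                     nameIndex: [NodeID: String]) -> String {
    var components: [String] = []
    var current: NodeID? = id
    while let node = current {
        if let name = nameIndex[node] { components.append(name) }
        current = parentIndex[node]          // nil at the root ends the walk
    }
    // e.g. "Users/dinh/Library/…" relative to the scan root
    return components.reversed().joined(separator: "/")
}
```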
The old version hit 1.2 GB during the scan that triggered this work. The new version settled around 469 MB at idle after the same kind of large scan. On the Intel Mac, no more swapping. The scan completes, memory stabilizes, and the rest of the operating system can breathe.
The 469 MB was measured at idle after scan completion, not during the scan or rendering. Peak resident memory during conversion is higher because the old FileNode tree and the new NodeStore overlap briefly, and the working set also includes scan metadata and the treemap rectangle cache.1
The two-stage arc also explains why 469 MB is not a mystery number. The first compact model told me the problem was solvable. The shipping layout told me what it actually cost. The flat store alone is about 250 MB at 5.2M nodes. The rest is the cost of making that store usable inside the real app.5 But it is still dramatically lower than what shipped initially, and the reduction is felt immediately on constrained hardware.
The flat layout had a secondary benefit: serialization. Contiguous buffers are easier to write and read than a recursive object graph. In my measurements, cache encode time dropped by about 15x and decode by about 7.2x. This matters because cache load time is app launch time when reopening a saved scan. A user who scanned yesterday and opens Renala today sees results in under a second instead of waiting for a tree to be rebuilt from serialized nodes.
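A rough illustration of why that holds, reusing the SlimNode sketched earlier: the node buffer round-trips as one raw byte copy instead of a recursive encode of an object graph. This ignores versioning, the name pool, and endianness, and is not the shipped cache format:

```swift
import Foundation

// Encode: copy the contiguous node buffer straight into a Data blob.
func encodeNodes(_ nodes: ContiguousArray<SlimNode>) -> Data {
    nodes.withUnsafeBufferPointer { Data(buffer: $0) }
}

// Decode: reinterpret the bytes as SlimNode values again. Assumes the data
// was produced by encodeNodes on the same platform.
func decodeNodes(_ data: Data) -> ContiguousArray<SlimNode> {
    data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
        ContiguousArray(raw.bindMemory(to: SlimNode.self))
    }
}
```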
Roads not taken
Three approaches were considered and rejected between the first compact model and the store that finally shipped.
Lazy subtree streaming. Only materialize the nodes visible in the current treemap view. That saves RAM, but a disk analyzer needs a complete snapshot. Search, drill-down navigation, and stable treemap layout all assume the full tree is in memory.
Live tree DFS for path reconstruction. After removing stored path strings, the first path reconstruction approach walked the tree with a depth-first search on every hover event. At 5.2 million nodes, that was $O(N)$ work in the hover path. Building parentIndex and nameIndex once after the scan turned it into $O(depth)$ chain walks.
Name-in-parentIndex tuple. An early variant stored (parentID: NodeID, name: String) in one dictionary entry instead of separate parentIndex and nameIndex maps. It worked, but it blurred two different jobs. Parent walks are hot. Name lookup is colder.
The parent-child relationships are the same. The names, sizes, and types are the same. What changed is how memory is laid out.
The first pass removed obvious waste. The second pass made the flat store real. A late discovery confirmed the pattern: the .children computed property was still allocating a fresh Array on every call, copying child references into a new heap object each time. Hot-path code (treemap layout, sidebar filtering) called it constantly. Eliminating it via flat index traversal dropped that call site’s self-time from 5,432ms to 8ms (the heartbeat logger flagged this one too). The bottleneck moved again.
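The shape of that fix, sketched against the NodeStore from earlier; property and method names are illustrative:

```swift
// Before (sketch): a computed accessor that allocates a fresh Array on every call.
extension NodeStore {
    var rootChildren: [SlimNode] {
        childRange(of: 0).map { nodes[$0] }   // new heap allocation per access
    }
}

// After (sketch): hot paths iterate the index range directly, with no allocation.
extension NodeStore {
    func forEachChild(of index: Int, _ body: (Int) -> Void) {
        for childIndex in childRange(of: index) { body(childIndex) }
    }
}
```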
Game developers and systems programmers see this move all the time: flatten the graph, store the data contiguously, and replace object references with indices. It is less common in application code because the scale rarely justifies it. At five million nodes, the scale does not ask for permission.
The friend with the Intel Mac sent a message after the fix: “It seems much faster than yesterday, even with thermal throttling. Good.” And then: “I get the impression there are no more memory leaks.” There never were memory leaks. There was just too much memory being used on purpose. Somehow that felt worse.
References
- Detect and diagnose memory issues: WWDC 2021
- Memory Usage Performance Guidelines: Apple Developer Library
- Locality of reference: Wikipedia
- ContiguousArray: Swift Documentation
Footnotes
1. The 1.2 GB baseline came from an Intel Mac during the scan that exposed the problem. The 469 MB result came from an optimized build on an M3 Pro after scan completion, so this is not a controlled before-and-after on identical hardware. Field sizes come from the current struct layout. The final idle number also includes pooled names, side indexes, scan metadata, and treemap cache state.
2. A `UInt32` can address 4.2 billion nodes, which is far beyond any scan Renala is likely to produce. The counter is sequential and resets per scan session, so it does not need to behave like a persistent global identifier.
3. This was the first target, not the final shipped layout. The name-pool line is a ceiling based on rough average filename length, not a claim about the exact working set of the final implementation.
4. In pure Swift, `Array` can also be very efficient for struct elements. The win from `ContiguousArray` here is the stricter guarantee about contiguous storage and the absence of bridging edge cases.
5. The biggest additions are pooled name bytes, the `parentIndex` and `nameIndex` tables built after the scan, scan metadata, and the treemap rectangle cache.
