Testing a Visual App
526 tests. Zero of them look at a pixel.
How do you test a disk analyzer? The treemap is visual. The scanner touches the real filesystem. The UI is a Canvas that paints pixels. None of these lend themselves to the standard “call a function, check the return value” pattern.
The answer, for Renala, was to test everything except the pixels. But the more useful answer is why each piece gets the test investment it does. Every hour writing a test is an hour not building features. The question is not “how much coverage?” It is “where does that hour prevent the most damage?”
The risk map
Testing is risk allocation. Here is Renala’s, rendered as a treemap (deliciously meta). Each rectangle is sized by lines of code, colored by risk level. The grey zone is about 40% of the codebase. It has near-zero unit test coverage. That is the point.
The red zone is tiny. It is also the only zone where a bug means permanent data loss. The code paths that call FileManager.trashItem have a confirmation dialog, batch error handling, and test coverage. The grey zone, SwiftUI views, is enormous. The views are thin wrappers around ViewModels. A rendering glitch is cosmetic. A file deletion bug is catastrophic. Allocate accordingly.
Algorithm tests: is the layout correct?
The squarified treemap algorithm is a pure function: given a list of sizes and a bounding rectangle, it returns a list of rectangles. No side effects, no state. This makes it the easiest part of the app to test.
graph LR
accTitle: Squarify function test properties
accDescr: The squarify function takes sizes and bounds as input and produces rectangles as output. Four properties are verified: areas proportional to sizes, aspect ratios bounded, deterministic results, and all rectangles inside bounds.
SIZES["[sizes]"] --> FN["squarify()"]
BOUNDS["[bounds]"] --> FN
FN --> RECTS["[rects]"]
RECTS -.- P1["✓ Areas ∝ sizes"]
RECTS -.- P2["✓ Aspect ratios bounded"]
RECTS -.- P3["✓ Deterministic"]
RECTS -.- P4["✓ All rects inside bounds"]
Determinism sounds obvious until you realize that LLVM optimization changes between Xcode versions can alter floating-point results for the same source code.1 A layout that drifts by a few points after an Xcode update is a regression no user will report but every user will notice.
SquarifiedTreemapTests contains 20 tests (9 layout, 11 cushion and shading) that verify the properties a correct squarified layout must satisfy:
- Area proportionality: the area of each output rectangle is proportional to the input size, within floating-point tolerance.
- Aspect ratio bounds: no rectangle has an aspect ratio worse than a threshold derived from the input distribution.
- Determinism: the same input produces the same output, every time, across runs.
- Bounding containment: every output rectangle fits within the input bounding rectangle.
- Cushion coefficients: the inherited cushion parameters from parent directories are correctly accumulated at each nesting level.
Here is the proportionality test. Four files, known sizes, 1% tolerance:
@Test("Area proportionality: rect areas match file size ratios within 1%")
func areaProportionality() {
let root = makeFileNode(name: "root", size: 0, children: [
makeFileNode(name: "a.txt", size: 400),
makeFileNode(name: "b.txt", size: 300),
makeFileNode(name: "c.txt", size: 200),
makeFileNode(name: "d.txt", size: 100),
])
let bounds = CGRect(x: 0, y: 0, width: 1000, height: 1000)
let rects = layout.layout(node: root, bounds: bounds, minRectArea: 0)
#expect(rects.count == 4)
let totalArea = Double(bounds.width * bounds.height)
let totalSize: Double = 1000
for rect in rects {
let expectedFraction = Double(rect.fileNode.sizeOnDisk) / totalSize
let actualFraction = Double(rect.frame.width * rect.frame.height) / totalArea
let error = abs(actualFraction - expectedFraction) / expectedFraction
#expect(
error < 0.01,
"File \(rect.fileNode.name): expected \(expectedFraction), got \(actualFraction)"
)
}
}
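Determinism gets the same treatment. A minimal sketch in the same style, reusing makeFileNode and the layout API from the test above (the sizes are illustrative, and like the rest of the suite it assumes import Testing and Foundation):

@Test("Determinism: identical input produces identical rects")
func layoutIsDeterministic() {
    let root = makeFileNode(name: "root", size: 0, children: [
        makeFileNode(name: "a.txt", size: 512),
        makeFileNode(name: "b.txt", size: 128),
    ])
    let bounds = CGRect(x: 0, y: 0, width: 800, height: 600)
    // Two runs, bit-exact comparison: any floating-point drift fails here.
    let first = layout.layout(node: root, bounds: bounds, minRectArea: 0)
    let second = layout.layout(node: root, bounds: bounds, minRectArea: 0)
    #expect(first.map(\.frame) == second.map(\.frame))
}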
These tests run in milliseconds. They catch regressions that would be invisible to the eye but visible to a user who notices “this folder used to be bigger than that one.”
Scanner tests: real files, not mocks
The scanner’s correctness depends on how the OS returns file attributes.2 A mock would test the parsing logic against what I think the OS returns. A real directory tests it against what the OS actually returns.
Protocol-based testing boundaries make this practical. The production scanner uses getattrlistbulk. Tests can substitute a simpler scanner when testing higher-level logic that does not depend on the syscall layer:
public protocol DirectoryScannerProtocol: Sendable {
    func scan(
        root: URL,
        options: ScannerOptions,
        powerMode: ScanPowerMode, // throttle vs full-speed
        progress: @escaping @Sendable (ScanProgress) -> Void
    ) async throws -> ScanResult
}
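A test double then only has to conform to the protocol. A hypothetical in-memory stub, returning a pre-built ScanResult so higher-level tests never touch the disk:

// Hypothetical stub: conforms to the same protocol, does no I/O.
struct StubScanner: DirectoryScannerProtocol {
    let canned: ScanResult // pre-built tree supplied by the test

    func scan(
        root: URL,
        options: ScannerOptions,
        powerMode: ScanPowerMode,
        progress: @escaping @Sendable (ScanProgress) -> Void
    ) async throws -> ScanResult {
        canned // higher-level logic sees a ready-made result instantly
    }
}

ViewModel tests can inject something like this; the scanner tests themselves keep using the real filesystem.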
The scanner tests create real temporary directories with real files, scan them, and verify the resulting tree. DirectoryScannerTests has 31 tests, SpotlightScannerTests adds 7, and the volume manager tests contribute 22 more, covering:
- Tree consistency: parent-child relationships match the filesystem structure.
- Cancellation: a scan can be cancelled mid-traversal and the partial result is still consistent.
- Parallel consistency: multiple concurrent scans of the same directory produce identical trees.
- Hidden files: files starting with a dot are included or excluded based on the scan configuration.
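The shape of such a test, sketched with standard Foundation temp-dir plumbing (the final assertion assumes ScanResult exposes a root node; Renala's actual shape may differ):

@Test("Scanning a real temp directory yields a consistent tree")
func scansRealDirectory() async throws {
    // Real files on the real filesystem: the OS, not a mock, supplies attributes.
    let dir = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
    try FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
    defer { try? FileManager.default.removeItem(at: dir) }
    try Data(count: 4_096).write(to: dir.appendingPathComponent("a.bin"))
    try Data(count: 8_192).write(to: dir.appendingPathComponent("b.bin"))

    let result = try await DirectoryScanner().scan(
        root: dir, options: ScannerOptions(), progress: { _ in })
    #expect(result.root.children.count == 2)
}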
The orphan node bug
The cancellation test caught a real bug. A scan cancelled mid-traversal was leaving orphan children in the node store, Renala’s flat backing tree from article 9: the parent reported N children, but only some were actually populated. Invisible until the next layout pass tried to read children that were never filled in.
The fix: clean up orphans before returning partial results. The test: verify tree consistency after every cancel. This is why real-file tests matter. A mock designed to simulate partial completion could theoretically surface this, but real I/O latency and OS scheduling jitter made the race condition manifest without anyone designing a test for it.
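The invariant the post-cancel check enforces can be sketched like this (childCount versus children is illustrative of the reported-versus-populated split described above; Renala's actual field names may differ):

// Recursively verify that every child a parent claims actually exists.
func assertNoOrphans(_ node: FileNode) {
    #expect(node.childCount == node.children.count,
        "Orphaned children under \(node.name)")
    for child in node.children {
        assertNoOrphans(child)
    }
}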
The performance gate
Performance tests live behind an environment variable: RENALA_PERF_TESTS=1. They do not run in CI by default because they are slow and hardware-dependent.
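Renala's exact gating code is not shown here, but Swift Testing's .enabled(if:) condition trait is one plausible way to express that kind of gate:

// Hypothetical sketch: the test is skipped unless the env var opts in.
@Test(
    "Perf tests run only when RENALA_PERF_TESTS=1",
    .enabled(if: ProcessInfo.processInfo.environment["RENALA_PERF_TESTS"] == "1")
)
func gatedPerfCheck() async throws {
    // ...slow, hardware-dependent measurements go here...
}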
The key test is O(n) linearity. SyntheticFixtureGenerator creates trees of 1,000 and 10,000 nodes. The test runs the scanner on each, measures elapsed time, and verifies that the ratio stays below 15x for a 10x input increase:
@Test("Scan scales linearly (1K → 10K files)")
func scanScalesLinearly() async throws {
let smallRoot = try SyntheticFixtureGenerator.createOnDisk(
fileCount: 1_000, folderCount: 10)
let largeRoot = try SyntheticFixtureGenerator.createOnDisk(
fileCount: 10_000, folderCount: 100)
defer {
try? FileManager.default.removeItem(at: smallRoot)
try? FileManager.default.removeItem(at: largeRoot)
}
let smallMs = try await SyntheticFixtureGenerator.measureMs {
_ = try await DirectoryScanner().scan(
root: smallRoot, options: ScannerOptions(), progress: { _ in })
}
let largeMs = try await SyntheticFixtureGenerator.measureMs {
_ = try await DirectoryScanner().scan(
root: largeRoot, options: ScannerOptions(), progress: { _ in })
}
let ratio = largeMs / max(smallMs, 0.01)
#expect(ratio < 15,
"Scan scales super-linearly: \(String(format: "%.1f", ratio))x for 10x input")
}
A ratio of 10 would be perfectly linear. A ratio of 100 would be quadratic. O(n log n) sits in between: for the 10x jump from 1,000 to 10,000 files it predicts roughly 10 · log(10,000)/log(1,000) ≈ 13x. The 15x threshold leaves room for cache effects and single-run variance while still catching anything worse than O(n log n).3 The check is deliberately blunt: a fixed 1K run, a fixed 10K run, and a hard fail once the larger run crosses 15x the smaller one.
This is a guardrail, not a benchmark. The absolute numbers depend on the hardware. The shape of the curve does not.
Integration: scan to layout
ScanToTreemapIntegrationTests wires the full pipeline: scan a temp directory, feed the result into the layout engine, verify that the output rectangles have areas proportional to the input file sizes.
graph LR
accTitle: Integration test: scanner to layout pipeline
accDescr: A temporary directory is scanned to produce a FileNode tree, which is laid out into a TreemapRect array. A verification step checks whether areas are proportional to file sizes. Failure indicates a type mismatch or off-by-one bug between layers.
TMP["Temp dir"] -->|scan| TREE["FileNode tree"]
TREE -->|layout| RECTS["TreemapRect array"]
RECTS -->|verify| CHECK{"Areas ∝ file sizes?"}
CHECK -->|yes| PASS["✓"]
CHECK -->|no| FAIL["Type mismatch? Off-by-one?"]
This catches integration bugs where the scanner produces a valid tree that the layout engine misinterprets: type mismatches, off-by-one in child counts, root node handling. Two tests, just enough to catch the category of bug that lives between layers.
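A sketch of one such test, reusing the scanner and layout APIs from earlier sections (file sizes are multiples of the 4 KiB block size so size-on-disk tracks logical size; the result.root access assumes the same ScanResult shape as above):

@Test("Scanned tree lays out with proportional areas")
func scanFeedsLayout() async throws {
    let dir = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
    try FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
    defer { try? FileManager.default.removeItem(at: dir) }
    // 3:1 size ratio, block-aligned so the on-disk numbers match.
    try Data(count: 12_288).write(to: dir.appendingPathComponent("big.bin"))
    try Data(count: 4_096).write(to: dir.appendingPathComponent("small.bin"))

    let result = try await DirectoryScanner().scan(
        root: dir, options: ScannerOptions(), progress: { _ in })
    let rects = layout.layout(
        node: result.root,
        bounds: CGRect(x: 0, y: 0, width: 1000, height: 1000),
        minRectArea: 0)
    // The 3:1 size ratio must survive the layer boundary as a 3:1 area ratio.
    let areas = rects.map { Double($0.frame.width * $0.frame.height) }.sorted()
    #expect(abs(areas[1] / areas[0] - 3.0) < 0.1)
}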
What is NOT tested
The grey zone in the risk map above. SwiftUI views are still mostly untested. There are no snapshot tests, no visual regression tests, no screenshot comparisons, and over 30 view files have zero unit test coverage. That does not mean the UI is completely untouched: there are a few XCTest UI checks at the edges for specific workflows and accessibility fixes. They count as tests. They just do not belong to the fast, routinely exercised safety net, and they do not add up to systematic view coverage.
This is a deliberate trade-off, not an oversight. SwiftUI views in Renala are thin: they read from the ViewModel and render. The logic lives in the ViewModel and the model layer, both of which are tested. The residual risk is real: a thin view can still bind to the wrong property or invert a conditional. But broad view testing would require either XCUITest (slow, brittle, screen-resolution-dependent) or a snapshot framework (constant maintenance as Apple changes rendering between OS versions). For a freeware app built solo or by a very small team, it was reasonable to keep some extra coverage at the edges without pretending that it formed a disciplined UI test strategy.
Error edge cases, such as disk permission denied mid-scan, filesystem changes during scan, and corrupt attribute buffers, are partially tested but not exhaustively. The scanner handles these defensively in production, but the test suite does not manufacture every pathological condition.
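For a sense of what manufacturing one such condition looks like, here is a hedged sketch of a permission-denied case using standard Foundation attribute calls (the assertion on the surviving child count is illustrative of what a defensive scanner should report):

@Test("Scan survives a permission-denied subdirectory")
func scanHandlesPermissionDenied() async throws {
    let dir = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
    let locked = dir.appendingPathComponent("locked")
    try FileManager.default.createDirectory(at: locked, withIntermediateDirectories: true)
    // Strip all permissions so the scanner cannot descend into "locked".
    try FileManager.default.setAttributes(
        [.posixPermissions: 0o000], ofItemAtPath: locked.path)
    defer {
        // Restore permissions so cleanup can actually delete the directory.
        try? FileManager.default.setAttributes(
            [.posixPermissions: 0o755], ofItemAtPath: locked.path)
        try? FileManager.default.removeItem(at: dir)
    }
    // The scan should complete and report the readable part of the tree.
    let result = try await DirectoryScanner().scan(
        root: dir, options: ScannerOptions(), progress: { _ in })
    #expect(result.root.children.count == 1)
}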
File deletion operations are the exception. Look at the red zone in the risk map. The code paths that call FileManager.trashItem have a confirmation dialog, batch error handling, and test coverage. The cost of a bug there is not a wrong number on screen. It is data loss.
Swift Testing vs XCTest
Renala uses Swift Testing (import Testing, @Test, #expect) for all new tests. Out of 54 test files, 51 use Swift Testing and 3 use XCTest: a 94/6 split. The three holdouts are UI test files because XCUIApplication has no Swift Testing equivalent yet.
That split says more about the testing philosophy than the tooling. The fast path is Swift Testing through swift test: it runs constantly and carries most of the confidence. The few UI tests stay in XCTest because they exercise boundaries Swift Testing does not reach well yet, and because some of them depend on slower, fixture-driven workflows that are not practical to run all the time. They are useful extra coverage, not the backbone of the suite.
The difference in practice:
// Swift Testing
@Test("Area proportionality within 1%")
func areaProportionality() {
    let rects = layout.layout(node: root, bounds: bounds)
    #expect(rects.count == 4)
    #expect(error < 0.01, "Fraction mismatch")
}

// XCTest
func testColorModeDescriptionShownForFileType() throws {
    let win = openDisplaySettings()
    selectColoringMode("By File Type", in: win)
    let desc = win.staticTexts.containing(
        NSPredicate(format: "value CONTAINS 'category'")
    ).firstMatch
    XCTAssertTrue(desc.waitForExistence(timeout: 3))
}
Swift Testing has better assertion diagnostics, parameterized tests via @Test(arguments:), and a cleaner syntax. The migration from XCTest was mechanical for unit tests. The parameterized tests are genuinely useful: testing the squarified layout across multiple input distributions without duplicating test functions. XCTest remains for the bits that are awkward, slower, and less central, which is exactly the point of the overall allocation strategy.
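A sketch of that parameterized pattern, with illustrative size distributions and an illustrative aspect-ratio bound (the real threshold derives from the input, per the property list above):

@Test("Aspect ratios stay bounded across size distributions", arguments: [
    [400, 300, 200, 100],    // even spread
    [1_000, 10, 10, 10],     // one dominant file
    [250, 250, 250, 250],    // uniform
])
func aspectRatiosBounded(sizes: [Int]) {
    let root = makeFileNode(name: "root", size: 0, children:
        sizes.enumerated().map { makeFileNode(name: "f\($0.offset)", size: $0.element) })
    let rects = layout.layout(
        node: root,
        bounds: CGRect(x: 0, y: 0, width: 1000, height: 1000),
        minRectArea: 0)
    for rect in rects {
        let longSide = max(rect.frame.width, rect.frame.height)
        let shortSide = max(min(rect.frame.width, rect.frame.height), 1)
        // 8 is an illustrative bound, not Renala's actual threshold.
        #expect(longSide / shortSide < 8, "Degenerate rect: \(rect.fileNode.name)")
    }
}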
The velocity thesis
graph TB
accTitle: Test coverage risk allocation by component
accDescr: Five tiers from most to least tested. Critical (red): file operations at 100% coverage. High (orange): scanner with boundary tests. Medium (amber): ViewModel async state machine. Low (green): pure algorithm and model functions. Untested by design (grey): thin SwiftUI view wrappers.
subgraph RED["🔴 Critical: 100% coverage"]
F["File Operations<br/>trash, delete, move"]
end
subgraph ORANGE["🟠 High: boundary tests"]
S["Scanner<br/>real files, real OS"]
end
subgraph AMBER["🟡 Medium: state machine"]
V["ViewModel<br/>async transitions"]
end
subgraph GREEN["🟢 Low: pure functions"]
A["Algorithm + Models<br/>deterministic, fast"]
end
subgraph GREY["⚪ Untested: by design"]
U["SwiftUI Views<br/>thin wrappers"]
end
RED --- ORANGE --- AMBER --- GREEN --- GREY
526 tests. Zero pixels. The treemap could be rendering everything upside-down and the tests would pass. That is the gap, and it is an acceptable one.
These tests do not prove the app is correct. They prove it has not gotten worse. Every commit runs against a suite that says “the algorithm still produces proportional rectangles, the scanner still builds consistent trees, the performance is still roughly linear, and no file gets deleted without confirmation.” That is enough to ship with confidence.
References
- Meet Swift Testing: WWDC 2024
- Testing in Xcode: WWDC 2019
- Swift Testing documentation: Apple Developer Documentation
Footnotes
1. Swift does not enable -ffast-math by default, so bit-exact reproducibility is the norm for a given optimization level. The risk is real but narrow: major compiler upgrades, not routine Xcode patches. ↩
2. getattrlistbulk writes attribute values into a raw byte buffer with 4-byte alignment. The caller parses the buffer by advancing a pointer through each attribute in the order specified by the attribute bitmap. A wrong offset or a misread length field corrupts every subsequent attribute. This is the kind of parsing logic where the OS is the only reliable source of truth. ↩
3. The test runs each workload once, not averaged. The generous threshold is the trade-off: it absorbs measurement noise at the cost of not detecting mild regressions. For a guardrail that runs locally, this is acceptable. ↩
