
# Finding Duplicates at Scale

*Hashing 300 GB of files. On a laptop. Without killing it.*
The feature request was simple: find duplicate files. Group them, show the wasted space, let the user decide what to delete.
The naive implementation writes itself. Hash every file with SHA-256. Group by hash. Done. It is correct, deterministic, and will finish sometime next week.
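For contrast, the naive version really is a few lines. This sketch injects the hash function as a parameter to stay self-contained; in practice it would be a streaming SHA-256 over the full file contents, which is exactly the cost the rest of this post is about avoiding.

```swift
import Foundation

// Naive duplicate detection: hash every file, group by digest, keep
// groups with more than one member. Correct, but it reads every byte
// of every file on the disk.
func naiveDuplicates(_ urls: [URL], hash: (URL) throws -> String) -> [[URL]] {
    var byDigest: [String: [URL]] = [:]
    for url in urls {
        // Skip unreadable files rather than aborting the whole scan.
        guard let digest = try? hash(url) else { continue }
        byDigest[digest, default: []].append(url)
    }
    return byDigest.values.filter { $0.count > 1 }
}
```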
Think about what “hash every file” actually means on a 5.2-million-node scan. You are reading 300+ GB from disk and computing SHA-256 on every byte, with no way to tell the user when it will finish. On an NVMe SSD, minutes. On an HDD or a network volume, much longer. And most of that I/O is wasted, because the vast majority of files on any disk are not duplicates of anything.
## The pipeline
The insight is that most files cannot possibly be duplicates of each other, and you can determine that without reading them. A 4 KB file and a 400 MB file are not duplicates. This single observation eliminates most candidates before any I/O happens.
From there, the pipeline adds phases. Each phase is cheaper than SHA-256 and eliminates more candidates before the expensive work starts.
Phase 1: size grouping. Group all files by byte size. Files with unique sizes are discarded immediately. No I/O. In practice, this removes the vast majority of the file set. Most files on a typical disk are unique in size.
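Phase 1 is a dictionary operation. A sketch, with a hypothetical `FileRecord` standing in for whatever the scanner actually produces:

```swift
import Foundation

// Hypothetical record type for a scanned file.
struct FileRecord {
    let path: String
    let size: Int
}

// Phase 1: group by byte size, then drop any group with a single member.
// A file with a unique size cannot have a duplicate, so it is eliminated
// with zero I/O.
func sizeCandidates(_ files: [FileRecord]) -> [[FileRecord]] {
    Dictionary(grouping: files, by: \.size)
        .values
        .filter { $0.count > 1 }
}
```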
Phase 2: extension filter. Within each size group, sub-group by file extension. A .jpg and a .mp3 that happen to share a size are not going to be duplicates. This is a heuristic, not a guarantee, but it reduces false positives cheaply and keeps the candidate set small.
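Phase 2 is the same shape of operation, still on metadata already in memory. A sketch:

```swift
import Foundation

// Phase 2: within one size group, sub-group by lowercased extension.
// Purely a dictionary operation on paths, so still no I/O. Lowercasing
// treats "JPG" and "jpg" as the same extension.
func extensionSubgroups(_ urls: [URL]) -> [[URL]] {
    Dictionary(grouping: urls, by: { $0.pathExtension.lowercased() })
        .values
        .filter { $0.count > 1 }
}
```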
Phase 3: partial read comparison. For candidates that share both size and extension, read the first 4 KB of each file and compare byte-by-byte with memcmp. Many files that are the same size have different content within the first few kilobytes. A handful of cheap reads and a fast comparison eliminates most remaining false positives without touching the rest of the file.
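One way to sketch phase 3 in Swift: instead of pairwise memcmp, bucket the candidates by their first 4 KB (`Data` equality is a byte-wise comparison under the hood), which eliminates the same files. The bucketing approach is my phrasing, not necessarily how the shipped code does it.

```swift
import Foundation

let prefixLength = 4096  // first 4 KB, per the pipeline description

// Read at most the first 4 KB of a file.
func filePrefix(_ url: URL) throws -> Data {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    return try handle.read(upToCount: prefixLength) ?? Data()
}

// Within a size+extension group, keep only files whose 4 KB prefix
// matches at least one other file's prefix.
func prefixCandidates(_ urls: [URL]) -> [[URL]] {
    var buckets: [Data: [URL]] = [:]
    for url in urls {
        guard let prefix = try? filePrefix(url) else { continue }
        buckets[prefix, default: []].append(url)
    }
    return buckets.values.filter { $0.count > 1 }
}
```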
Phase 4: SHA-256. Only the survivors of phases 1 through 3 reach this stage. Now you are hashing a small, curated set of files that are genuinely likely to be duplicates. The I/O cost is real but proportional to actual candidates, not the entire disk.
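Phase 4 can use CryptoKit's incremental API so a 400 MB survivor is never loaded into memory at once. A sketch; the 1 MB chunk size is my assumption, not a figure from the shipped app:

```swift
import CryptoKit
import Foundation

// Stream a file through SHA-256 in 1 MB chunks and return the hex digest.
func sha256Hex(of url: URL) throws -> String {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }
    var hasher = SHA256()
    // read(upToCount:) returns nil at end of file.
    while let chunk = try handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}
```

Grouping the survivors by this digest, exactly as the naive version would have done for every file on disk, yields the final duplicate groups.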
```mermaid
graph TD
    A[All files] -->|group by size| B[Size-matched candidates]
    B -->|eliminate unique sizes| C[Extension filter]
    C -->|group by size+ext| D[Partial read compare<br/>first 4 KB memcmp]
    D -->|survivors| E[SHA-256 hash]
    E -->|group by hash| F[Duplicate groups]
    F -->|250ms batch| G[UI update]
```
Each phase eliminates a progressively smaller but more expensive-to-eliminate set of candidates:
| Phase | Candidates remaining | Cost |
|---|---|---|
| Size match only | ~20% (80% eliminated) | No I/O |
| + Extension check | ~10% (90% eliminated) | No I/O |
| + Partial memcmp (4 KB) | ~3% (97% eliminated) | Minimal I/O |
| + Full SHA-256 | exact duplicate groups (100% accurate) | I/O proportional to survivors only |
The numbers are approximate and depend heavily on the filesystem contents. A developer machine with many build artifacts will see different ratios than a photo library. The principle holds regardless: the expensive phase operates on a tiny fraction of the original set.
## Bounded concurrency
Phases 3 and 4 involve file I/O. The temptation is to parallelize as aggressively as possible. Spawn a goroutine, or a Swift Task, for every file. Let the scheduler sort it out.
This is wrong for SSDs and catastrophic for HDDs. Sixty parallel I/O requests do not make a drive read sixty times faster. They make it spend more time deciding what to read than actually reading. The sweet spot was 6 workers in a bounded TaskGroup, found through measurement on an NVMe SSD. A spinning disk would want fewer. A fast RAID array might handle more. The number is not the point. The point is that bounded concurrency, where you pick a limit and stick to it, almost always beats unbounded concurrency for I/O.
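The bounded pattern can be sketched with a `TaskGroup` that seeds `width` tasks and only adds a new one when an old one finishes, so in-flight I/O stays constant. `hashFile` is a stand-in for the phase-4 work; the structure, not the body, is the point.

```swift
import Foundation

// Run `hashFile` over `urls` with at most `width` tasks in flight.
func hashAll(
    _ urls: [URL],
    width: Int = 6,
    hashFile: @escaping @Sendable (URL) async -> String
) async -> [URL: String] {
    await withTaskGroup(of: (URL, String).self) { group in
        var results: [URL: String] = [:]
        var iterator = urls.makeIterator()

        // Seed the group with `width` tasks.
        for _ in 0..<width {
            guard let url = iterator.next() else { break }
            group.addTask {
                let digest = await hashFile(url)
                return (url, digest)
            }
        }
        // Each completion starts the next task, keeping the number of
        // concurrent I/O operations at `width` until the work runs out.
        while let (url, digest) = await group.next() {
            results[url] = digest
            if let next = iterator.next() {
                group.addTask {
                    let digest = await hashFile(next)
                    return (next, digest)
                }
            }
        }
        return results
    }
}
```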
## The UI problem
The algorithm was fast. The app was not.
The pattern: find a duplicate group, append it to an array, trigger a UI update. Find another, append, update. Two thousand times in a few seconds. Each append forced SwiftUI to diff the entire results array. As the array grew, each diff took longer. The UI thread fell further behind with every result. The algorithm was producing answers faster than the screen could show them. A performance problem caused by being too fast. I did not see that one coming.
The fix was batching. Instead of publishing each result immediately, the worker accumulates results and flushes them to the @MainActor on a 250ms timer. The UI updates 4 times per second regardless of how many groups are found between flushes. The diff cost becomes constant rather than quadratic.
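A minimal sketch of the batching pattern: the producer appends into an actor-protected buffer, and a repeating 250 ms task drains the buffer to the main actor in one array mutation per tick. The names (`DuplicateGroup`, `ResultsModel`, `ResultBatcher`) are illustrative, not from the shipped app, and the model property would be `@Published` in the real SwiftUI code.

```swift
import Foundation

struct DuplicateGroup: Sendable { let paths: [String] }

@MainActor
final class ResultsModel {
    var groups: [DuplicateGroup] = []  // @Published in the real SwiftUI model
}

// The producer calls `add` at whatever rate it finds groups; nothing
// touches the UI until a flush tick.
actor ResultBatcher {
    private var pending: [DuplicateGroup] = []

    func add(_ group: DuplicateGroup) { pending.append(group) }

    // Hand off everything accumulated since the last flush.
    func drain() -> [DuplicateGroup] {
        defer { pending.removeAll() }
        return pending
    }
}

// Flush the buffer to the main actor every 250 ms, one append per tick,
// so SwiftUI diffs at a fixed 4 Hz regardless of producer speed.
func startFlushing(from batcher: ResultBatcher, to model: ResultsModel) -> Task<Void, Never> {
    Task {
        while !Task.isCancelled {
            try? await Task.sleep(nanoseconds: 250_000_000)  // 250 ms tick
            let batch = await batcher.drain()
            if !batch.isEmpty {
                await MainActor.run { model.groups.append(contentsOf: batch) }
            }
        }
    }
}
```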
This pattern is not specific to duplicate detection. Anywhere you have a high-frequency producer feeding a SwiftUI consumer, batching is the answer.
## The numbers
| Approach | Groups/second |
|---|---|
| Naive (hash all files) | ~113 |
| Pipeline (multi-phase) + 250ms batch flush | ~397 |
The 3.5x improvement from 113 to 397 groups per second came entirely from the batching change. The algorithm did not change. The I/O pattern did not change. The only difference was how often the results were handed off to the UI layer.
That is the pattern: most performance problems in features like this have two components. The algorithm, and the delivery mechanism. Fix the algorithm first, because a bad algorithm cannot be batched into acceptability. But do not stop there. The path from computed result to rendered pixel has its own costs, and they compound.
The duplicate detector shipped. On a warm filesystem cache, it processes a typical developer machine in under a minute. The SHA-256 phase, which felt like the whole problem at the start, turned out to be the smallest part of it. The actual bottleneck was a SwiftUI array diff. It always is.
## References
- SHA-2 (Wikipedia): SHA-256 algorithm specification and properties.
- Apple CryptoKit documentation: Swift APIs for SHA-256 and other cryptographic operations.
- WWDC 2019: Advances in App Background Execution: Background task scheduling and concurrency constraints on Apple platforms.
- Swift TaskGroup documentation: Structured concurrency with bounded worker pools.