Finding Duplicates at Scale

Hashing 300 GB of files. On a laptop. Without killing it.

The feature request was simple: find duplicate files. Group them, show the wasted space, let the user decide what to delete.

The naive implementation writes itself. Hash every file with SHA-256 (a cryptographic function that produces a fixed-size fingerprint of a file’s contents; identical files always produce identical hashes). Group by hash. Done. It is correct, deterministic, and will finish sometime next week.
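In sketch form, the naive version really is just a few lines. This assumes CryptoKit for the hashing and a pre-built list of file URLs; it is the shape of the idea, not the shipping code.

```swift
import CryptoKit
import Foundation

// Naive duplicate detection: hash every file, group by digest.
// Correct, but it reads every byte of every file on the disk.
func naiveDuplicateGroups(in files: [URL]) throws -> [[URL]] {
    var groups: [SHA256.Digest: [URL]] = [:]
    for url in files {
        let digest = SHA256.hash(data: try Data(contentsOf: url)) // whole file in memory
        groups[digest, default: []].append(url)
    }
    return groups.values.filter { $0.count > 1 }
}
```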

Think about what “hash every file” actually means on a 5.2 million node scan. You are reading 300+ GB from disk, computing SHA-256 on every byte, with no way to tell the user when it will finish. On an NVMe SSD, minutes. On an HDD or network volume, much longer. And most of that I/O is wasted, because the vast majority of files on any disk are not duplicates of anything.

The pipeline

The insight is that most files cannot possibly be duplicates of each other, and you can determine that without reading them. A 4 KB file and a 400 MB file are not duplicates. This single observation eliminates most candidates before any I/O happens.

From there, the pipeline adds phases. Each phase is cheaper than SHA-256 and eliminates more candidates before the expensive work starts.

Phase 1: size grouping. Group all files by byte size. Files with unique sizes are discarded immediately. No I/O. In practice, this removes the vast majority of the file set. Most files on a typical disk are unique in size.
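A minimal sketch of that pass, assuming the scanner has already produced URL-and-size pairs (the `ScannedFile` type here is illustrative, not the app's actual model):

```swift
import Foundation

struct ScannedFile {
    let url: URL
    let size: Int64
}

// Phase 1: bucket files by byte size. A size that appears only once
// cannot have a duplicate, so those buckets are dropped with zero I/O.
func sizeCandidates(_ files: [ScannedFile]) -> [Int64: [ScannedFile]] {
    Dictionary(grouping: files, by: \.size)
        .filter { $0.value.count > 1 }
}
```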

Phase 2: name filter. Within each size group, sub-group by filename (case-insensitive). A report.pdf and backup.pdf that happen to share a size are not going to be duplicates. This is a heuristic, not a guarantee: it can miss duplicates where one copy was saved under a different name.1 But it reduces the candidate set cheaply and in practice the tradeoff is worth it.
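Inside each size bucket, the name filter is just another grouping pass, this time on the lowercased filename. A sketch, operating on plain URLs:

```swift
import Foundation

// Phase 2: within one size bucket, sub-group by case-insensitive filename.
// A file whose name appears only once in the bucket has no same-named partner.
func nameCandidates(in sizeBucket: [URL]) -> [[URL]] {
    Dictionary(grouping: sizeBucket) { $0.lastPathComponent.lowercased() }
        .values
        .filter { $0.count > 1 }
}
```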

Phase 3: partial read comparison. For candidates that share both size and name, read the first 8 KB of each file and compare byte-by-byte with memcmp.2 If the prefixes match, read the last 8 KB and compare those too. The suffix check catches 1.9% of candidates that were identical at the start but diverged at the end. 1.9% sounds negligible until you consider that each candidate eliminated here would otherwise need a full SHA-256 pass over a potentially multi-gigabyte file.
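The probes might look roughly like this for a single pair of same-size files. It is a sketch; the real code works on whole candidate groups, and `Data`'s `==` (effectively a byte-for-byte memcmp) stands in for an explicit memcmp call:

```swift
import Foundation

// Phase 3: compare the first and last 8 KB of two same-size files.
// Reads at most ~16 KB per file and bails out on the first mismatch.
func probablyIdentical(_ a: URL, _ b: URL, fileSize: Int, probeSize: Int = 8 * 1024) throws -> Bool {
    let fa = try FileHandle(forReadingFrom: a)
    let fb = try FileHandle(forReadingFrom: b)
    defer { try? fa.close(); try? fb.close() }

    // Prefix probe: first 8 KB (or the whole file, if it is smaller).
    if try fa.read(upToCount: probeSize) != fb.read(upToCount: probeSize) {
        return false
    }

    // Suffix probe: last 8 KB, only worth a separate read for files larger than the probe.
    if fileSize > probeSize {
        let tail = UInt64(fileSize - probeSize)
        try fa.seek(toOffset: tail)
        try fb.seek(toOffset: tail)
        if try fa.read(upToCount: probeSize) != fb.read(upToCount: probeSize) {
            return false
        }
    }
    return true   // a strong hint, not proof; later phases confirm
}
```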

Phase 3.5: hash-once for pairs. For candidate groups of exactly two files, there is a shortcut that avoids hashing both copies. Read both files in lockstep, comparing each chunk with memcmp while hashing only the first file. If every chunk matches, both files are provably identical, and the single hash serves as their shared fingerprint. If any chunk diverges, stop early and discard without finishing either file.
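For a single pair, the lockstep read might look like this sketch. CryptoKit's incremental SHA-256 plays the role of the one hash, and the 1 MB chunk size is an assumption:

```swift
import CryptoKit
import Foundation

// Phase 3.5: read two same-size files in lockstep, comparing each chunk and
// hashing only the first file. A full match proves identity and yields one
// shared digest; a mismatch stops the read early with nothing wasted.
func hashOnceIfIdentical(_ a: URL, _ b: URL, chunkSize: Int = 1 << 20) throws -> SHA256.Digest? {
    let fa = try FileHandle(forReadingFrom: a)
    let fb = try FileHandle(forReadingFrom: b)
    defer { try? fa.close(); try? fb.close() }

    var hasher = SHA256()
    while true {
        let chunkA = try fa.read(upToCount: chunkSize) ?? Data()
        let chunkB = try fb.read(upToCount: chunkSize) ?? Data()
        guard chunkA == chunkB else { return nil }      // diverged: discard, stop reading
        if chunkA.isEmpty { return hasher.finalize() }  // both files ended: identical
        hasher.update(data: chunkA)                     // hash only the first copy's bytes
    }
}
```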

59.2% of groups reaching this stage were pairs. For the majority case, one read pass and one hash was enough.

Phase 4: SHA-256. Only the survivors of the earlier phases reach this stage. Now you are hashing a small, curated set of files that are genuinely likely to be duplicates. The I/O cost is real but proportional to actual candidates, not the entire disk.
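The survivors get a plain streaming hash, along these lines (CryptoKit again; the chunk size is an assumption):

```swift
import CryptoKit
import Foundation

// Phase 4: stream a file through SHA-256 in fixed-size chunks so that even
// multi-gigabyte survivors never need to be held in memory at once.
func sha256Digest(of url: URL, chunkSize: Int = 1 << 20) throws -> SHA256.Digest {
    let handle = try FileHandle(forReadingFrom: url)
    defer { try? handle.close() }

    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: chunkSize), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize()
}
```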

```mermaid
graph TD
    accTitle: Multi-phase duplicate detection pipeline
    accDescr: Files are grouped by size, then unique sizes eliminated. Remaining candidates pass through a name filter, an 8 KB prefix and suffix comparison, a hash-once step for pairs, and finally full SHA-256 hashing. Confirmed duplicate groups are batched to the UI every 250 ms.
    A[All files] -->|group by size| B[Size-matched candidates]
    B -->|eliminate unique sizes| C[Name filter]
    C -->|group by size+name| D[Partial read compare<br/>8 KB prefix+suffix]
    D -->|prefix+suffix match| D2[Hash-once for pairs]
    D2 -->|remaining groups| E[SHA-256 hash]
    E -->|group by hash| F[Duplicate groups]
    F -->|250ms batch| G[UI update]
```

Each phase eliminates a progressively smaller but more expensive-to-eliminate set of candidates:

| Phase | Files surviving | Cost |
| --- | --- | --- |
| Size match only | ~20% remain (80% eliminated) | No I/O |
| + Name check | ~10% remain (90% eliminated) | No I/O |
| + Prefix memcmp (8 KB) | ~4% remain (96% eliminated) | Minimal I/O |
| + Suffix memcmp (8 KB) | ~3% remain (97% eliminated) | Minimal I/O |
| + Hash-once (pairs) | ~1.5% remain (98.5% eliminated) | One hash per pair |
| + Full SHA-256 | 100% accurate | I/O proportional to survivors only |

The numbers are approximate and depend heavily on the filesystem contents. A developer machine with many build artifacts will see different ratios than a photo library. The principle holds regardless: the expensive phase operates on a tiny fraction of the original set.

Click Step to advance through the pipeline one phase at a time. The interactive simplifies to four main phases: the hash-once-for-pairs optimization (Phase 3.5) and the name filter’s full-filename matching are omitted for clarity. Watch how most files are eliminated before any disk I/O happens.

Bounded concurrency

Phases 3 and 4 involve file I/O. The temptation is to parallelize as aggressively as possible. Spawn a Go goroutine, or a Swift Task, for every file. Let the scheduler sort it out.

This is wrong for SSDs and catastrophic for HDDs. Sixty parallel I/O requests do not make a drive read sixty times faster. They saturate the I/O queue and add contention in the driver stack.3 The sweet spot was 6 workers in a TaskGroup with a manual concurrency cap,4 found through measurement on an NVMe SSD with small random reads. A spinning disk would want fewer. A fast RAID array might handle more. The number is not the point. The point is that bounded concurrency, where you pick a limit and stick to it, almost always beats unbounded concurrency for I/O.
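The manual cap is the seed-then-replace pattern described in footnote 4. A sketch, with `process(_:)` standing in for whatever per-file work a phase does:

```swift
import Foundation

// Placeholder for the real per-file work (partial read, hash, ...).
func process(_ url: URL) async { /* ... */ }

// Bounded concurrency: seed the group with maxConcurrency tasks, then add
// exactly one new task for each task that completes.
func processBounded(_ files: [URL], maxConcurrency: Int = 6) async {
    await withTaskGroup(of: Void.self) { group in
        var remaining = files.makeIterator()

        for _ in 0..<maxConcurrency {
            guard let url = remaining.next() else { break }
            group.addTask { await process(url) }
        }
        while await group.next() != nil {
            if let url = remaining.next() {
                group.addTask { await process(url) }
            }
        }
    }
}
```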

The UI problem

The algorithm was fast. The app was not.

The pattern: find a duplicate group, append it to an array, trigger a UI update. Find another, append, update. Two thousand times in a few seconds. Each append forced SwiftUI to reconcile the entire results list: diffing view identities, running layout, committing the render tree.5 As the array grew, each update took longer. The UI thread fell further behind with every result. The algorithm was producing answers faster than the screen could show them. A performance problem caused by being too fast. I did not see that one coming.

The fix was batching. Instead of publishing each result immediately, the worker accumulates results and flushes them to the @MainActor on a 250ms timer. The UI updates 4 times per second regardless of how many groups are found between flushes. The update cost per flush becomes constant rather than growing with the array.
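A sketch of that producer side, using a buffer plus an elapsed-time check rather than a literal timer; `ResultsModel` and `DuplicateGroup` are illustrative names, not the app's actual types:

```swift
import Foundation
import SwiftUI

struct DuplicateGroup: Identifiable {
    let id = UUID()
    let files: [URL]
}

// Illustrative SwiftUI-facing model; all mutations happen on the main actor.
@MainActor
final class ResultsModel: ObservableObject {
    @Published var groups: [DuplicateGroup] = []
}

// Batching: accumulate results off the main actor and flush at most every
// 250 ms, so the SwiftUI diff cost per update stays roughly constant.
actor ResultBatcher {
    private let model: ResultsModel
    private var pending: [DuplicateGroup] = []
    private var lastFlush = Date.distantPast
    private let interval: TimeInterval = 0.25

    init(model: ResultsModel) { self.model = model }

    func add(_ group: DuplicateGroup) async {
        pending.append(group)
        if Date().timeIntervalSince(lastFlush) >= interval {
            await flush()
        }
    }

    func flush() async {
        guard !pending.isEmpty else { return }
        let batch = pending
        pending.removeAll()
        lastFlush = Date()
        await MainActor.run { model.groups.append(contentsOf: batch) }
    }
}
```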

This pattern is not specific to duplicate detection. Anywhere you have a high-frequency producer feeding a SwiftUI consumer, batching is the answer.

The numbers

Throughput measured via PerfLogWriter on M3 Pro, warm NVMe cache.6 Baseline: sequential single-threaded processing. Optimized: 6-worker TaskGroup + 250ms batch flush.

| Approach | Throughput | Bottleneck |
| --- | --- | --- |
| Sequential (baseline) | ~1.3 groups/s | Single-threaded I/O + per-result UI diff |
| Pipeline + batch flush | ~397 groups/s | SwiftUI array diff (batched) |

The improvement is roughly 300× end-to-end, from sequential single-threaded processing to a pipelined architecture with batched UI updates. The algorithm change (multi-phase elimination) reduced I/O dramatically. The batching change eliminated the SwiftUI diff storm. Neither alone would have been enough; together they transformed a feature that felt broken into one that finishes in seconds.

That is the pattern: most performance problems in features like this have two components. The algorithm, and the delivery mechanism. Fix the algorithm first, because a bad algorithm cannot be batched into acceptability. But do not stop there. The path from computed result to rendered pixel has its own costs, and they compound.

The duplicate detector shipped. On a warm filesystem cache, it processes a typical developer machine in under a minute. The SHA-256 phase, which felt like the whole problem at the start, turned out to be the smallest part of it. The actual bottleneck was SwiftUI’s update cycle. For list-heavy features with high-frequency updates, it usually is.

Footnotes

  1. The name filter introduces a risk of false negatives: two identical files with different names (e.g., photo.jpg copied and renamed to backup.jpg) will be excluded as candidates. For a duplicate finder, missed duplicates are the more dangerous failure mode compared to extra comparisons. The tradeoff is deliberate: on a typical filesystem, the savings from skipping cross-name comparisons vastly outweigh the rare missed duplicate.

  2. memcmp is a C standard library function that compares two memory buffers byte-by-byte and returns zero if they are identical. The real cost of this phase is the disk reads, not the comparison itself.

  3. On an HDD, excessive parallelism causes seek thrashing. On an NVMe SSD, the mechanism is different: the controller’s command queue saturates and the driver stack adds lock contention. The symptom is the same (diminishing returns), but the underlying cause is not.

  4. TaskGroup does not have a built-in concurrency limit. The bounding is a manual pattern: seed N tasks upfront, then add a new task only after awaiting a completed one. The implementation uses maxConcurrency = 6 with an inFlight counter to enforce the cap.

  5. SwiftUI does not diff the raw array. It diffs the view tree produced by ForEach over the array, reconciling view identities, then runs layout and commits to Core Animation. “Array diff” is shorthand for this full pipeline, which is where the cost accumulates.

  6. “Warm cache” means the operating system has recently read these files and holds them in memory, so reads are served from RAM rather than hitting the disk. Cold-cache performance would be significantly slower for the I/O phases.