Union-Find (Disjoint Set Union)
Union-Find (Disjoint Set Union)
Some problems are really one question asked over and over: are these two things already in the same group? Are these two computers on the same network? Do these two accounts belong to the same person? Did adding this friendship just merge two previously separate cliques? You could answer each query by walking the graph from scratch, but if connections keep arriving and you keep asking, that walking adds up fast. Union-Find is the structure built for exactly this: it keeps track of a bunch of disjoint groups and lets you merge two of them or check whether two elements share a group, both in almost constant time.
The mental model is a forest. Every element points at a parent, and if you keep following parents you eventually reach a root, which points at itself. Two elements are in the same set precisely when they climb to the same root. Merging two sets is then almost embarrassingly cheap: make one root point at the other, and two trees become one.
Two operations, and the two tricks that make them fast
The naive version is find, walk to the root, and union, point one root at the other. On their own they can degrade into long spindly chains, so find crawls forever. Two small optimizations fix that and are the whole reason the structure is famous.
The first is union by size: when you merge, always hang the smaller tree under the bigger root, so trees stay bushy instead of stringy. The second is path compression: every time you walk from a node up to its root, re-point the nodes you passed straight at the root, so the next find is a single hop. Together they flatten the forest so aggressively that the amortized cost of an operation is O(α(n)), where α is the inverse Ackermann function, a quantity so slow-growing it is below 5 for any n you could ever store. In practice, constant.
class DSU:
def __init__(self, n):
self.parent = list(range(n)) # everyone starts as their own root
self.size = [1] * n
def find(self, x):
while self.parent[x] != x:
self.parent[x] = self.parent[self.parent[x]] # path compression
x = self.parent[x]
return x
def union(self, a, b):
ra, rb = self.find(a), self.find(b)
if ra == rb:
return False # already in the same set
if self.size[ra] < self.size[rb]:
ra, rb = rb, ra # ra is the bigger root
self.parent[rb] = ra # smaller tree hangs under bigger root
self.size[ra] += self.size[rb]
return True
Picture two sets, one holding 1, 2, 3 with root 1, another holding 4, 5 with root 4. Calling union(3, 5) finds both roots, sees the first tree is bigger, and hangs 4 (and its child 5) under 1. Later, find(5) walks 5 to 4 to 1 and, on the way back, points 5 and 4 directly at 1, so every future lookup on them is instant.
In the wild: connectivity and cycle detection as a graph grows
Here is where it stops being a toy. Imagine you are wiring up a network by laying cables one at a time, and for each new cable you want two answers: are these two machines already reachable from each other, and does this cable close a redundant loop? Both are one union call. If the two endpoints already share a root, the cable is redundant, it would create a cycle, so you skip it; otherwise the union merges their components.
def has_cycle(n, edges): # undirected graph, edges arriving in any order
dsu = DSU(n)
for u, v in edges:
if not dsu.union(u, v): # union returns False when u, v already connected
return True # this edge closes a cycle
return False
This is not a contrived example, it is the beating heart of Kruskal's minimum spanning tree: sort the edges cheapest first, then add each one only if it connects two different components, using exactly this "union returns False means cycle, skip it" logic. The same machinery counts connected components (how many distinct roots remain), groups friend circles, and merges duplicate user accounts as "these two emails are the same person" facts stream in. Any time relationships accumulate and you keep asking "are these two together yet," it is the same class of problem, and it is Union-Find underneath.
The trigger
You are repeatedly merging groups and asking whether two elements are in the same group, with connections arriving over time rather than known all at once. The tell is any phrasing about connectivity, equivalence classes, "same network," "same component," or detecting a cycle as edges are added. If you catch yourself about to re-run a graph search just to answer "are these two linked," reach for Union-Find.
Where it shows up
- Graph connectivity and components: number of provinces, friend circles, counting islands when regions merge, network reachability.
- Cycle detection in undirected graphs and Kruskal's MST, where the cycle check is the algorithm.
- Equivalence and grouping: account or entity merging, "are these variables the same" in type inference, percolation and clustering.
Where it bites
The classic mistake is skipping one of the two optimizations and quietly getting a slow structure; without both, find can crawl and your near-constant guarantee evaporates. Watch the return value of union too, since the "did this actually merge, or were they already together" boolean is what powers cycle detection, and ignoring it throws away the very signal you needed. And remember the roots are internal bookkeeping: the number that comes back from find is just a representative of the set, not anything meaningful on its own, so never treat it as a label with outside significance.
When it is the wrong tool
Union-Find only ever joins sets; it has no cheap way to split them or delete an edge. The instant a problem needs disconnection, "these two are no longer related," the structure falls apart, and you need something heavier like a link-cut tree or an offline approach that processes deletions in reverse. It is also the wrong tool when you need the actual path between two nodes rather than a yes-or-no on connectivity, because path compression deliberately destroys the tree shape; for routes, use a real graph traversal. And if connectivity is static and you only ask once, a single BFS or DFS flood fill is simpler and just as fast, with no bookkeeping to maintain.
Its neighbors
It is the fast, incremental cousin of Graph Traversals, which stay the right choice when you need paths or a one-shot component scan. It is inseparable from Greedy, powering Kruskal's MST as the cheap cycle check. It shares the "relationships over a set of items" territory with Topological Sort (though that answers ordering, not grouping), and it often leans on Hashmaps to map arbitrary labels, emails, coordinates, strings, onto the integer indices the structure wants.