Two Heaps
Two Heaps
Finding the median of a fixed array is easy: sort it and look in the middle. The problem gets interesting when the data will not hold still. Numbers keep streaming in, and after every single one you need the current median, right now, without re-sorting the growing pile each time. Latency samples arriving from a service, sensor readings, a running "middle value" on a dashboard. Sorting from scratch on every arrival is quadratic and hopeless. The Two Heaps pattern is the elegant way out, and the idea is almost visual.
Split the numbers you have seen into two halves: a lower half and an upper half. Keep the lower half in a max-heap, so its largest element sits on top, and the upper half in a min-heap, so its smallest element sits on top. If you keep the two halves balanced in size, then those two tops are the elements straddling the exact middle of the data. The median is either the average of the two tops or, when one half holds one extra element, that half's top. You never sort anything; you just keep the boundary between the halves under your thumb.
Keeping the halves balanced
Two invariants do all the work. First, every element in the lower half is no bigger than every element in the upper half. Second, the two halves never differ in size by more than one. Each new number gets routed to the correct half, and then you rebalance if one side got too heavy.
import heapq
class MedianFinder:
def __init__(self):
self.lo = [] # max-heap (Python heaps are min-heaps, so store negatives)
self.hi = [] # min-heap: the upper half
def add(self, x):
heapq.heappush(self.lo, -x) # tentatively to lower half
heapq.heappush(self.hi, -heapq.heappop(self.lo)) # hand its max up to the upper half
if len(self.hi) > len(self.lo): # keep lo the same size or one bigger
heapq.heappush(self.lo, -heapq.heappop(self.hi))
def median(self):
if len(self.lo) > len(self.hi):
return -self.lo[0] # odd count: the extra element is the middle
return (-self.lo[0] + self.hi[0]) / 2 # even count: average the two tops
The push-then-shuffle dance guarantees both invariants at once: shoving the new value through the lower heap and passing its maximum up keeps the halves correctly ordered, and the final size check keeps them balanced. Feed it the stream 5, 3, 8, 1 and the medians come out 5, 4, 5, 4, each in O(log n) time to insert and O(1) to read. After those four numbers the lower half holds 1 and 3 with 3 on top, the upper half holds 5 and 8 with 5 on top, and the median is the average of the tops, 4.
In the wild: the running median of a live stream
This is not an interview curiosity; it is how you track a median over data that never stops. Picture a monitoring service watching request latencies: samples pour in continuously and you want the median latency at any instant, because the median is far more honest than the average when a few slow outliers would otherwise drag the mean around. You cannot store and re-sort millions of samples per second. So you keep two heaps, feed each new latency in, and read the median off the tops in constant time whenever the dashboard refreshes. It is exactly the same machinery as the toy stream, only the numbers are microseconds and they never stop coming.
The pattern generalizes past the literal median to any problem where you need to keep a balance point between two groups as data flows. The "maximize capital" problem uses two heaps, one ordering projects by the capital they require and one ordering the affordable ones by profit, to greedily pick the best reachable project at each step. Sliding-window median keeps the same two heaps but adds lazy deletion as the window moves. The common thread is always two priority queues facing each other across a boundary you want to hold steady.
The trigger
You need the median, or some middle or boundary element, repeatedly, as data streams in rather than all at once. The tell is "running median," "median of a data stream," or any situation where you must keep re-answering "what is in the middle right now" without the luxury of re-sorting. More broadly, if a problem wants you to balance two groups around a pivot and keep querying that pivot, think two heaps.
Where it shows up
- Streaming statistics: running median of latencies, sensor readings, or any live metric where the middle matters more than the mean.
- Sliding-window median: the same two heaps with lazy deletion as the window advances.
- Greedy selection around a threshold: "maximize capital" style problems, scheduling that repeatedly pulls the best item on one side of a boundary.
Where it bites
The perennial bug is the heap direction: the lower half needs a max-heap and the upper half a min-heap, and in languages with only a min-heap you fake the max-heap by negating, which is easy to get backwards and quietly returns nonsense. The rebalancing is the other landmine, since letting the sizes drift apart by more than one breaks the median formula, so the size check has to run after every insert. And be deliberate about the even-count case: returning one top instead of the average of both is an off-by-a-median error that passes small tests and fails on real data.
When it is the wrong tool
If you only need the median once, on a static array, two heaps are overkill; QuickSelect finds it in O(n) average time with no ongoing structure to maintain. If you need arbitrary order statistics, the 90th percentile, the k-th smallest for varying k, heaps only cheaply expose the boundary you balanced around, so reach instead for a balanced BST, an order-statistics tree, or a Fenwick tree over the value range. Frequent deletion of arbitrary elements is also awkward, since binary heaps cannot remove a middle element cheaply and you end up bolting on lazy deletion or indexed heaps, at which point a pair of multisets may be cleaner. And for a small, bounded amount of data, just sort; the constant factors of two heaps are not worth it.
Its neighbors
It is a specialized sibling of Top-K Elements, trading a single heap for two that face each other. It hands off to Kth Largest/Smallest Elements and its QuickSelect engine whenever the query is one-shot rather than streaming. It pairs with Sliding Window for windowed medians, and it is a favorite Design Problem in its own right, where "implement a MedianFinder" is really just a request to wire these two heaps together correctly.