David Estévez

Hashmaps & Frequency Counting

If two pointers is the pattern you reach for when data is sorted, the hashmap is what you reach for when it is not, and you are willing to spend memory to avoid paying for order. The whole superpower is a single operation: "have I seen this before, and how many times," answered in constant time. That one question, asked as you sweep through data once, dissolves a startling number of nested-loop problems into a single pass.

There are really two faces to it. One is membership and lookup: remember things you have seen so you can recognize them instantly later. The other is counting: instead of storing a flag, store a tally. Both are the same trade, memory for O(1) access, and knowing that a hashmap turns "search for a match" into "look it up" is half of practical algorithm design.

The complement trick: remember what you have seen

The cleanest illustration is the classic "find two numbers that add to a target." The brute force checks every pair. The hashmap version realizes that as you walk the array, the partner each number needs is fully determined, so you just ask whether you have already passed it.

def two_sum(nums, target):
    seen = {}                              # value -> where we saw it
    for i, x in enumerate(nums):
        if target - x in seen:             # its partner already went by
            return (seen[target - x], i)
        seen[x] = i
    return None

One pass, O(n) time, O(n) memory, no sorting required. The move, remember each element so a future one can find it in O(1), is the reusable idea; the target sum is just the dressing.

For example, scan the array [2, 7, 11, 15] for a pair summing to 9. At the value 7, the partner needed is 9 - 7 = 2, which is already in the seen map from index 0, so the pair (2, 7) is found in one pass with a constant-time lookup instead of a nested loop. Swap the stored value for a running count and the same loop becomes frequency counting, the tool behind "most common word," "first non-repeating character," and grouping anagrams by their letter tallies.
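The three counting problems just named can be sketched as follows. This is a minimal illustration, not a canonical implementation: the function names are hypothetical, the inputs are assumed to be non-empty where noted, and sorted letters stand in for the letter tally because they make a convenient hashable key.

```python
from collections import Counter, defaultdict


def most_common_word(words):
    """Return the word with the highest tally (assumes words is non-empty)."""
    counts = Counter(words)                 # value -> how many times we saw it
    return counts.most_common(1)[0][0]


def first_non_repeating(text):
    """Return the first character that occurs exactly once, or None."""
    counts = Counter(text)                  # first sweep: tally every character
    for ch in text:                         # second sweep: first tally of 1 wins
        if counts[ch] == 1:
            return ch
    return None


def group_anagrams(words):
    """Group words that use exactly the same letters."""
    groups = defaultdict(list)
    for word in words:
        # Sorted letters are a hashable stand-in for the letter tally.
        groups["".join(sorted(word))].append(word)
    return list(groups.values())
```

Note that first_non_repeating sweeps the text twice, once to count and once to pick, which is still linear time.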

In the wild: the hash join

Databases turn this trick into one of their most important operations. Joining two tables means finding rows that share a key, and the naive nested-loop join compares every row of one table to every row of the other, O(n * m). The hash join refuses. It builds a hashmap from one table, keyed by the join column, then probes it once per row of the other table.

def hash_join(left, right, key):
    index = {}
    for row in left:                       # build phase: index one side
        index.setdefault(key(row), []).append(row)
    out = []
    for row in right:                      # probe phase: look up, do not scan
        for match in index.get(key(row), ()):
            out.append((match, row))
    return out

That is O(n + m) expected time, plus the size of the output, instead of O(n * m), and it is the same idea as the two-sum complement trick, scaled up to database engines: build an index of what you have, then replace every search with a lookup. Whenever a query plan says "hash join," this is what it is doing. The same logic, build a lookup structure once and stop scanning, is why adding the right index can turn a query from minutes into milliseconds. Same class of problem, from an interview warm-up to the core of a database.
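A small usage sketch makes the build and probe phases concrete. The users and orders rows, the by_user key function, and the nested_loop_join baseline are all hypothetical, invented for illustration; hash_join is repeated in compact form only so the snippet runs on its own.

```python
def hash_join(left, right, key):
    index = {}
    for row in left:                        # build phase: index one side
        index.setdefault(key(row), []).append(row)
    # probe phase: one lookup per right-hand row
    return [(match, row) for row in right for match in index.get(key(row), ())]


def nested_loop_join(left, right, key):
    # Baseline for comparison: compares every pair of rows, O(n * m).
    return [(l, r) for r in right for l in left if key(l) == key(r)]


def by_user(row):
    return row["user_id"]


users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Grace"}]
orders = [
    {"user_id": 2, "item": "keyboard"},
    {"user_id": 1, "item": "monitor"},
    {"user_id": 2, "item": "mouse"},
    {"user_id": 3, "item": "cable"},        # no matching user, so it is dropped
]

joined = hash_join(users, orders, by_user)
buyers = [user["name"] for user, _order in joined]
```

Both functions return the same pairs here; only the amount of comparing differs. The order with no matching user disappears, which makes this an inner join.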

The trigger

You catch yourself writing a nested loop to find matches, count occurrences, or check "have I seen this," or the problem says "most frequent," "duplicate," "anagram," or "group by." Any of those is the hashmap knocking. The instinct to trade memory for an O(1) lookup is the whole pattern.

Where it shows up

Frequency counting: most common elements, first unique, anagram checks, all built on a single counting pass with a Counter.
Membership and dedup: has this appeared before, are there duplicates, what is missing from a range.
Lookup to kill a nested loop: two-sum and its family, hash joins, mapping keys to precomputed values.
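
The membership and dedup uses in the list above can be sketched with a Python set, which behaves like a hashmap that stores keys only. The function names are hypothetical, and the items are assumed to be hashable.

```python
def has_duplicates(items):
    """True as soon as any item is seen a second time."""
    seen = set()
    for item in items:
        if item in seen:                    # O(1) average membership test
            return True
        seen.add(item)
    return False


def dedupe_keep_order(items):
    """Drop repeats while keeping each item's first position."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def missing_from_range(nums, lo, hi):
    """Values in lo..hi (inclusive) that never appear in nums."""
    present = set(nums)
    return [v for v in range(lo, hi + 1) if v not in present]
```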

Where it bites

Hashmaps assume keys are hashable and stable, so a mutable object as a key, or a floating-point value with rounding noise, will betray you. Iteration order is language-dependent and never sorted: Python dicts preserve insertion order (since 3.7), but many built-in maps, such as Java's HashMap or Go's map, make no ordering promise, which quietly breaks code that assumed otherwise. And the O(1) is an average: a bad hash or adversarial keys can degrade lookups toward O(n), though in practice most built-in maps defend against this, for example by randomizing hashes.
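
A short Python sketch of these pitfalls follows. It describes Python's behavior specifically, and other languages differ. The choice to round to 9 decimal places is an assumption about what should count as "the same" number, not a general rule.

```python
from collections import Counter

# 1. Mutable keys: a Python list is unhashable, so using one raises TypeError.
try:
    {[1, 2]: "pair"}
    list_key_ok = True
except TypeError:
    list_key_ok = False
tuple_key = {(1, 2): "pair"}                # an immutable tuple works

# 2. Float noise: values that "should" be equal land under different keys.
noisy = Counter([0.1 + 0.2, 0.3])
# Assumption: 9 decimal places is precise enough for this data.
rounded = Counter(round(x, 9) for x in [0.1 + 0.2, 0.3])

# 3. Order: a Python dict iterates in insertion order, not sorted order.
tally = {"b": 1, "a": 2}
insertion_order = list(tally)
sorted_order = sorted(tally)                # sort explicitly when order matters
```

Here noisy ends up with two keys and rounded with one, and sorted order has to be requested explicitly.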

When it is the wrong tool

If the data is already sorted and the question is about a pair or a range, Two Pointers answers it in O(1) extra space, so do not spend O(n) memory on a hashmap you did not need. If you need order, ranking, or "the next larger key," a plain hashmap cannot help; you want a balanced tree or a sorted structure. And when memory is the binding constraint and the universe of keys is huge, an exact map may be too expensive, which is when probabilistic structures like Bloom filters or count-min sketches take over.
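
Two of these alternatives can be sketched briefly, assuming the input is already an ascending list. The function names are hypothetical, and a sorted list with the standard bisect module stands in for the "sorted structure"; a balanced tree would serve the same role where inserts are frequent.

```python
from bisect import bisect_right


def pair_sum_sorted(nums, target):
    """Two pointers on an ascending list: O(n) time, O(1) extra space."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        total = nums[lo] + nums[hi]
        if total == target:
            return (lo, hi)
        if total < target:
            lo += 1                         # need a bigger sum
        else:
            hi -= 1                         # need a smaller sum
    return None


def next_larger(sorted_keys, key):
    """Smallest key strictly greater than `key`, or None if there is none."""
    i = bisect_right(sorted_keys, key)      # binary search, O(log n)
    return sorted_keys[i] if i < len(sorted_keys) else None
```

Neither function allocates a map, and next_larger answers a question a plain hashmap has no way to ask.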

Its neighbors

It is the counting engine feeding Top-K Elements and Kth Largest/Smallest whenever "top" means "most frequent." It supplies the per-window tallies inside a Sliding Window, and it is the natural alternative the moment a Two Pointers solution loses its sorted-input assumption.
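
As a rough sketch of how the tally feeds two of these neighbors: a Counter supplies the frequencies that a heap then ranks, and the same Counter can be updated incrementally as a fixed-size window slides. Both function names are hypothetical, and k is assumed to be at least 1.

```python
import heapq
from collections import Counter


def top_k_frequent(items, k):
    """The k most frequent items, most frequent first."""
    counts = Counter(items)                 # the counting engine
    pairs = heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
    return [item for item, _count in pairs]


def distinct_per_window(items, k):
    """Number of distinct items in each window of k consecutive items."""
    counts = Counter()
    out = []
    for i, item in enumerate(items):
        counts[item] += 1                   # item enters the window
        if i >= k:
            old = items[i - k]              # item leaving the window
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]             # keep only live keys
        if i >= k - 1:
            out.append(len(counts))
    return out
```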

References

Introduction to Algorithms (CLRS), 4th ed., Cormen, Leiserson, Rivest, Stein, 2022
The Algorithm Design Manual, 3rd ed., Steven Skiena, 2020
