Disjoint Sets, also known as the Union-Find data structure, is a data structure that keeps track of a set of elements partitioned into a number of disjoint (non-overlapping) subsets. It provides operations for adding new sets, merging sets (union), and finding a representative member of a set.
History:
The disjoint-set data structure was first introduced by Bernard A. Galler and Michael J. Fischer in 1964. It was later refined by several researchers, including Robert Tarjan, who developed the optimized "union by rank" and "path compression" techniques in 1975, which greatly improved the time complexity of the operations.- Each set is represented by a representative element, often called the "root" or "parent" of the set.
- The elements within a set can form a tree-like structure, where each element points to its parent. The root of the tree is the representative of the set and points to itself.
- Two sets are considered disjoint if their representative elements are different.
- MakeSet(x): Creates a new set with the element x as its only member. The representative of the set is x itself.
- Find(x): Determines the representative element of the set that contains element x. It follows the parent pointers from x until it reaches the root element, which is the representative of the set.
- Union(x, y): Merges the sets containing elements x and y into a single set. It first finds the representatives of the sets containing x and y using the Find operation. If the representatives are different, it updates the parent pointer of one representative to the other, effectively merging the sets.
How it works:
The disjoint-set data structure maintains a collection of sets, each containing a number of elements. Initially, each element is in its own singleton set. The sets are represented using a forest of trees, where each tree represents a set, and the root of the tree is the representative of the set.When performing the Union operation to merge two sets, the representative elements of both sets are found using the Find operation. If the representatives are different, one of the representatives is chosen as the new root, and the other representative's parent pointer is updated to point to the new root. This effectively merges the two sets into one.
To optimize the Find operation and keep the trees shallow, two techniques are commonly used:
- Union by Rank: When merging two sets, the set with the smaller rank (an upper bound on the height of the tree) is attached to the root of the set with the larger rank. This helps keep the trees balanced and prevents them from becoming too tall.
- Path Compression: During the Find operation, once the root of a set is found, all the nodes along the path from the element to the root are updated to directly point to the root. This flattens the tree structure and makes future Find operations more efficient.
- Detecting connected components in an undirected graph
- Checking for cycles in an undirected graph (Kruskal's algorithm for minimum spanning trees)
- Tracking connected components in dynamic graphs
- Implementing Kruskal's algorithm for finding minimum spanning trees
- Image segmentation and labeling
- Least common ancestor problem in trees
The time complexity of the disjoint-set operations, when using both union by rank and path compression, is nearly constant on average (amortized time complexity). The Find and Union operations have a time complexity of O(α(n)), where α(n) is the inverse Ackermann function, which is nearly constant for all practical values of n. This makes disjoint sets a very efficient data structure for managing sets and solving problems related to connectivity and partitioning.