In this post we provide a brief description of K-d tree. K-d tree is a geometric data structure that provides efficient support for the following queries
- nearest neighbor search
- range queries
There are two common ways to implement the K-d tree. The main difference is the choice of the separator. We could use the point to
represent the separator line, which implies the node in the K-d tree is the point in the data set. Or we could use nodes in K-d tree to represent the separator line directly. The two approaches are illustrated in the diagram below:
To construct the tree, we first need to sort points in each dimension. Soring takes \(O(nlogn)\) time and there are \(d\) dimensions so overall it takes \(O(dnlogn)\) to sort points. Then we choose an axis and split the tree evenly. One way to choose the axis is the round-robin approach. Suppose we use x-axis to split the tree. As we can see in the diagram below, we will have two groups of the points: pink group and yellow group. Because we use x-axis to split the tree, the points in the same group are grouped automatically in the x row. It's not the case for the y row. Therefore, we need to regroup points for other dimensions. Then we can process the subtree recursively. The regrouping process takes \(O(dn)\) time. We have the following formula for time complexity
which gives \(T(n,d) \approx O(dnlogn)\).
Application: Nearest Neighbor Search
There are two components in the nearest neighbor search algorithm:
- Heuristic: We first identify the cell of the target point because that's the area that most likely contains the nearest neighbor of the target. Identifying the cell means we will follow a path from the root of the K-d tree to one of the leaves. During this process, we will keep track of the shortest distance between the target point and any visited points.
- Pruning: It's likely that we don't need to visit every subtree because the known shortest distance so far defines a searchable area. If the area represented by the subtree is completely out of the searchable area, then we don't need to process the subtree.
On average, the time complexity of nearest neighbot search is O(log N) but the worst case is N even if K-d tree is balanced.
----- END -----
©2019 - 2022 all rights reserved