Python - Recursive Data Structures

QTM 350: Data Science Computing

Davi Moreira

Introduction


This topic material is based on Professor Mike Gelbart's Algorithms and Data Structures course. It was adapted for our purposes.

Outline:

  • Recursive data structures
  • Trees, binary search trees
  • Nearest neighbours intro
  • $k$-d trees
  • Amortization

Learning objectives

  • Implement a simple recursive data structure such as a linked list.
  • Compare and contrast a binary search tree to a hash table.
  • Select an appropriate algorithm for finding nearest neighbours.
In [ ]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import altair as alt

from collections import defaultdict

import sklearn.neighbors

Motivation for recursion

  • Last class we talked about recursive functions, which call themselves.
  • We saw how this made the binary search code more elegant.
  • Today we'll see a stronger motivation for recursive functions and recursive thinking.

Recursive data structures

We have talked about nested data structures, like:

In [ ]:
x = [[1, 2, 3], ["a", "b", "c"]]
x
  • This is a list of lists.
  • We also had dicts of dicts; another example is the JSON format.
  • We can also have a doll within a doll:

(Image attribution to Fanghong, from Wikipedia article.)

Let's check the following class:

In [ ]:
class TreasureBox:
    """
    A linked list, aka treasure box. The user adds and retrieves items from it.
    """

    def __init__(self, treasure):
        self.next = None # data type: TreasureBox
        self.treasure = treasure # data type: whatever

    def append_outer(self, treasure):
        """Add a new treasure box to the outside by putting lastest box inside it.

        Parameters
        ----------
        treasure : object
           the treasure stored in the new outermost box

        Returns
        -------
        new_box : TreasureBox
             new treasure box object containing previous boxes inside    

        Example
        --------
        >>> box = box.append_outer(10)
        """

        new_box = TreasureBox(treasure)
        new_box.next = self
        return new_box

    def append_inner(self, treasure):
        """Add a new treasure box inside the innermost current box.

        Parameters
        ----------
        treasure : object 
           the treasure stored in the new innermost box

        Returns
        -------
        self : TreasureBox
             the original outermost box, now with the new box nested innermost

        Example
        --------
        >>> box.append_inner(55)
        """

        if self.next is None:
            self.next = TreasureBox(treasure)
        else:
            self.next.append_inner(treasure)
        return self

    def get(self, depth):
        """Get the treasure by going depth levels deep into the treasure boxes.

        Parameters
        ----------
        depth : int 
           the number of boxes to unwrap before retrieving the treasure

        Returns
        -------
        object
             the treasure at the specified depth, or None if the depth is out of bounds

        Example
        --------
        >>> box = TreasureBox(12)
        >>> box = box.append_outer(9)
        >>> box = box.append_inner(55)
        >>> box.get(0)
        9
        >>> box.get(1)
        12
        >>> box.get(2)
        55
        """

        if depth == 0:
            return self.treasure

        if self.next is None:
            return None  # Index out of bounds

        return self.next.get(depth-1)

The TreasureBox class is a representation of a nested structure where each box can contain another box, akin to a set of Russian dolls. It is a custom implementation of a linked list data structure. Each instance of TreasureBox represents a node in the linked list, with two main attributes: next, which points to the next TreasureBox object, and treasure, which can hold a value of any data type.

When utilizing the TreasureBox class, one can imagine each instance as an individual container with the capability to encapsulate another container within it or to be wrapped by another container itself. This characteristic makes the TreasureBox an excellent model for understanding fundamental recursive concepts and for practical applications that require nesting or hierarchical structures, such as organizational charts, file systems, and decision trees.

Here's a breakdown of the class and its methods:

  • __init__(self, treasure): This is the constructor for the TreasureBox class. It initializes a new instance with the provided treasure and sets next to None, indicating that there is no subsequent TreasureBox connected yet.

  • append_outer(self, treasure): This method creates a new TreasureBox with the specified treasure and inserts it outside the current one. It does this by setting the next attribute of the new box to point to the current box (self). This effectively wraps the current box inside the new one. The method returns the new box.

  • append_inner(self, treasure): This method adds a new TreasureBox inside the innermost box of the current chain. It does this recursively by calling append_inner on the next box until it finds a box that does not have a next box (self.next is None). It then creates a new TreasureBox and assigns it to the next attribute of the innermost box. The method returns the original outermost box (self).

  • get(self, depth): This method retrieves the treasure from a TreasureBox at a specified depth. If the depth is 0, it returns the treasure of the current box. If the depth is greater than 0, it recursively calls get on the next box, decrementing the depth each time, until it reaches the specified depth. If the depth is greater than the number of boxes, it returns None, indicating that the index is out of bounds (there's no box at that depth).

Examples of how to use these methods:

  • To add a box outside the current box (wrapping the current one inside), append_outer is called, with the treasure of the new box as an argument.

  • To add a box inside the current chain (at the innermost position), append_inner is used with the desired treasure for the new inner box.

  • To retrieve the treasure at a certain depth, get is called with the depth as an argument.

The TreasureBox class uses recursion to traverse and modify the linked list structure in both the append_inner and get methods. The append_outer method, on the other hand, is not recursive and simply reassigns the next attribute of a new TreasureBox instance.

In practice, the TreasureBox recursive data structure demonstrates the elegance and power of recursion in both constructing and interacting with complex, nested data. The simplicity of its API—providing methods to add new elements and retrieve contents—makes it accessible for teaching recursion and for implementing real-world data structures that exhibit hierarchical characteristics.

In [ ]:
box = TreasureBox("$5")
box = box.append_outer("$100")
box = box.append_outer("$20")

Let's interpret the code above:

  1. box = TreasureBox("$5"): An instance of the TreasureBox class is created with a treasure of "$5". This is the initial box.

  2. box = box.append_outer("$100"): The append_outer method is called on the box object. A new TreasureBox instance is created with a treasure of "$100". This new box is set to encompass the original box. The next attribute of this new TreasureBox points to the original box containing "$5". The variable box is then updated to reference this new outer box.

  3. box = box.append_outer("$20"): The append_outer method is invoked again, this time on the updated box object (which now contains the "$100" treasure and encapsulates the "$5" treasure). A new TreasureBox with a treasure of "$20" is created and is set to encompass the "$100" box. The variable box is updated once more to reference this newest outer box.

After these operations, we have a nested structure of TreasureBox instances that can be visualized like this:

  • The outermost TreasureBox contains "$20".
  • Wrapped inside it, the middle TreasureBox contains "$100".
  • Within the middle one, the innermost TreasureBox contains "$5".

The current reference box points to the outermost box. If we were to call the get method with different depths on the final box object, we would get:

In [104]:
box.get(0)
Out[104]:
'$20'
In [ ]:
box.get(1)
In [ ]:
box.get(2)

Question: What will happen with the code below?

In [ ]:
box.get(3)

Answer: <!--- If we were to call box.get(3), it would attempt to retrieve the treasure from a level deeper than the deepest level currently available in the TreasureBox structure. Given that there are only three boxes, the depth of 3 exceeds the number of nested boxes. The get method is designed to return None if it tries to access a depth that doesn't exist. Therefore, box.get(3) would return None to indicate that there is no fourth level and therefore no treasure to retrieve at that depth.--->

In [ ]:
box = TreasureBox("Initial box")
box = box.append_inner("Box in initial box")
box = box.append_inner("Box in the second box")
In [ ]:
box.get(0)
In [ ]:
box.get(1)
In [ ]:
box.get(2)

Activity: without running the code, what will this return?

In [ ]:
box = TreasureBox("A")

box = box.append_inner("B")
box = box.append_outer("C")
box = box.append_outer("D")
box = box.append_inner("E")

box.get(3)
  • In computer science, this is called a linked list.
  • It's not just a nested data structure (list of lists); the definition of the data type itself is recursive.
    • What is a list? It contains stuff (including possibly lists).
    • What is a treasure box? It contains one thing, and another treasure box (see the small sketch below).
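
To see how a recursive function can mirror this recursive definition, here is a small sketch (count_boxes is our own helper, not part of the lecture's class):

In [ ]:
def count_boxes(box):
    """Count the nested TreasureBoxes by following .next recursively."""
    if box is None:
        return 0
    return 1 + count_boxes(box.next)

count_boxes(box)  # 5 for the "A"-"E" activity box above (if that cell has been run)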

Recursive algorithms vs. data structures

  • We have a relationship between recursive function calls (see append_inner and get) and recursive data types (see __init__).
  • Recursion is an important idea in understanding algorithms and data structures.

Could we just implement this with lists?

  • Yes: for example, each list could contain 2 elements, an item and another list; or, as in the version below, we could simply keep a flat list of treasures.
  • But using OOP we can make the intent clearer and the code less bug-prone.
In [ ]:
class TreasureBox:
    def __init__(self, treasure):
        # The list stores treasures, with the first element being the outermost box
        self.treasures = [treasure]

    def append_outer(self, treasure):
        # Adding a treasure to the start of the list represents wrapping all existing boxes
        self.treasures.insert(0, treasure)

    def append_inner(self, treasure):
        # Adding a treasure to the end of the list represents adding a box inside the innermost box
        self.treasures.append(treasure)

    def get(self, depth):
        # Retrieve the treasure at the given depth
        if depth < len(self.treasures):
            return self.treasures[depth]
        else:
            return None  # Out of bounds
In [ ]:
# Usage:
box = TreasureBox("A")
box.append_inner("B")
box.append_outer("C")
box.append_outer("D")
box.append_inner("E")

# Now we should have the following structure: ["D", "C", "A", "B", "E"]
print(box.get(0))  # Outputs "D"
print(box.get(1))  # Outputs "C"
print(box.get(2))  # Outputs "A"
print(box.get(3))  # Outputs "B"
print(box.get(4))  # Outputs "E"

Trees, binary search trees

  • Trees are recursive data structures, like the linked lists above.

  • In the class activity you will implement a set based on trees instead of a hash table.

  • Below we have a generic tree:

(Image attribution to TotoBaggins, from Wikipedia article.)

Tree terminology:

  • A tree is either empty or a node with zero or more children that are themselves trees (or "subtrees").
  • If $X$ is the child of $Y$, then $Y$ is the parent of $X$ (e.g. Captain A is a child of Colonel B; Colonel B is the parent of Captain A).
  • The root is the only node without a parent (e.g. General).
  • A leaf is a node that does not have children (e.g. Private A).
  • An internal node is a node that is not a leaf (e.g. Captain A).
  • The height (aka depth) of the tree is the largest number of edges connecting the root to a leaf (here, 4).

Let's build a simple binary tree class using Python. A binary tree is a tree where each node has at most 2 children (the above is not a binary tree). So each tree node will have a label and two children.

In [ ]:
class BinaryTree:
    
    def __init__(self, item):
        self.item = item
        self.left = None  # type = BinaryTree
        self.right = None # type = BinaryTree
    
    def insert(self, item):
        # pick a random side to insert on
        left = np.random.rand() < 0.5
        if left:
            if self.left is None:
                self.left = BinaryTree(item)
            else:
                self.left.insert(item)
        else:
            if self.right is None:
                self.right = BinaryTree(item)
            else:
                self.right.insert(item)
    
    def contains(self, item):
        if self.item == item:
            return True
        
        if self.left is not None:
            if self.left.contains(item):
                return True

        if self.right is not None:
            if self.right.contains(item):
                return True

        return False
    
    # We would want some more functions here, e.g. to add/remove things from the tree.

Let's check the BinaryTree class. It represents a binary tree data structure. A binary tree is a tree data structure in which each node has at most two children, referred to as the left child and the right child.

__init__(self, item)

  • Purpose: Initializes a new BinaryTree node with the specified item.
  • Parameters: item - the value or data to store in the node.
  • Attributes:
    • self.item: Stores the data or value of the current node.
    • self.left: A reference to the left child node, initially set to None.
    • self.right: A reference to the right child node, also initially set to None.

insert(self, item)

  • Purpose: Inserts a new item into the binary tree.
  • Process:
    • Randomly decides whether to insert the item on the left or right by generating a random number between 0 and 1 and checking if it is less than 0.5.
    • If the chosen side (left or right) is None (i.e., there is no child node), it creates a new BinaryTree node with the item on that side.
    • If the chosen side already has a child node, it recursively calls the insert method on that child node.

contains(self, item)

  • Purpose: Checks whether the binary tree contains a specified item.
  • Process:
    • If the current node's item matches the searched item, it returns True.
    • If the item doesn't match and there is a left child, it recursively checks the left subtree by calling contains on the left child. If the left subtree contains the item, it returns True.
    • Similarly, if there is a right child, it recursively checks the right subtree. If the right subtree contains the item, it returns True.
    • If the item is not found in either the current node or recursively in its children, it returns False.

Note

  • This implementation of insert does not create a balanced binary tree, nor does it follow the rules of a binary search tree where items are systematically placed to the left or right based on comparisons. Instead, items are placed randomly to the left or right, which can lead to an unbalanced and inefficient tree for search operations.

Let's manually build a binary tree containing some of the information in the previous example:

In [ ]:
tree = BinaryTree("General")
tree.insert("Colonel A")
tree.insert("Colonel B")
tree.insert("Captain A")
tree.insert("Captain B")
In [ ]:
tree.contains("Clown")
In [ ]:
tree.contains("Captain B")
In [ ]:
type(tree)
In [ ]:
type(tree.left)
  • The key idea here is that, like TreasureBox, the BinaryTree object stores more binary tree objects.
  • However, each TreasureBox only stores one TreasureBox, whereas each BinaryTree stores two BinaryTrees. (A small sketch of a recursive height function follows.)
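
As a quick illustration of recursion on trees, here is a sketch of a height function (our own helper, not part of the class above), matching the edge-counting definition of height from the terminology section:

In [ ]:
def height(tree):
    """Return the number of edges on the longest root-to-leaf path."""
    if tree is None:
        return -1  # an empty tree has height -1, so a single node has height 0
    return 1 + max(height(tree.left), height(tree.right))

height(tree)  # the exact value depends on the random insertions above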

Binary search trees (BSTs)

A binary search tree (BST) is a node-based binary tree data structure which has the following properties:

  • Ordered Structure: Each node contains a key (and optionally, an associated value) and has two distinguished sub-trees, typically referred to as the left subtree and the right subtree.

  • Binary Tree Property: Each tree node has at most two children.

  • Search Property: For any node, all elements in its left subtree are less than the node’s key, and all elements in the right subtree are greater than the node’s key. This must be true for all nodes, meaning the tree is recursively ordered.

  • No Duplicates: Each node's key must be unique within the tree, preventing multiple instances of the same key in the tree.

Operations on Binary Search Trees

BSTs support several operations (a minimal code sketch follows this list), including:

  • Insertion: Adding a new key to the tree while maintaining the BST property. This is done by traversing the tree from the root to a leaf, following the search property at each step to decide the turn direction.

  • Deletion: Removing a key from the tree. This can be complex because it must ensure that the BST property is preserved. There are three scenarios: deleting a leaf node (no children), a node with one child, and a node with two children.

  • Search: Locating a node with a given key. The tree is traversed starting from the root, moving left or right depending on whether the key is less than or greater than the current node’s key, respectively.

  • Traversal: Visiting all the nodes in a systematic manner. Common traversal methods include in-order (left, root, right), pre-order (root, left, right), and post-order (left, right, root) traversals.

  • Finding Min/Max: The minimum key is found at the leftmost node, and the maximum key is at the rightmost node.

  • Successor/Predecessor: The successor of a node is the next node in the in-order traversal, while the predecessor is the node that precedes the current node in the in-order traversal.
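
To make these properties concrete, here is a minimal BST sketch with insert, contains, and in-order traversal (BSTNode is our own name; this sketch omits deletion and balancing):

In [ ]:
class BSTNode:
    """A minimal binary search tree: insert, membership test, sorted traversal."""

    def __init__(self, key):
        self.key = key
        self.left = None   # type: BSTNode
        self.right = None  # type: BSTNode

    def insert(self, key):
        if key < self.key:
            if self.left is None:
                self.left = BSTNode(key)
            else:
                self.left.insert(key)
        elif key > self.key:
            if self.right is None:
                self.right = BSTNode(key)
            else:
                self.right.insert(key)
        # equal keys are ignored (no duplicates)

    def contains(self, key):
        if key == self.key:
            return True
        if key < self.key:
            return self.left is not None and self.left.contains(key)
        return self.right is not None and self.right.contains(key)

    def in_order(self):
        """Yield keys in sorted order (left, root, right)."""
        if self.left is not None:
            yield from self.left.in_order()
        yield self.key
        if self.right is not None:
            yield from self.right.in_order()

root = BSTNode(8)
for k in [3, 10, 1, 6, 14]:
    root.insert(k)
print(root.contains(6), root.contains(7))  # True False
print(list(root.in_order()))               # [1, 3, 6, 8, 10, 14]

Note how the in-order traversal yields the keys in sorted order, which is exactly the point made under Advantages below.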

Advantages

  • Efficiency: Operations like search, insertion, and deletion can be done in $O(\log n)$ time complexity on average, which is efficient compared to linear data structures like arrays or linked lists.

  • Sorted Data: In-order traversal of a BST yields the keys in sorted order, which can be useful for applications that require sorted data.

Disadvantages

  • Worst-case scenarios: In the worst case, such as when inserting sorted keys, a BST can become unbalanced and resemble a linked list with $O(n)$ time complexity for operations.

  • Maintenance: BSTs may require balancing to ensure that the tree remains efficient for all operations. Balancing a BST is a complex process and is handled by specialized trees like AVL trees or red-black trees.

BSTs are widely used due to their dynamic nature and efficiency for operations that involve searchable data. They form the basis for more complex data structures like sets, maps, and associative arrays in many programming libraries.

Nearest neighbours intro

  • A common problem is to find the nearest neighbours of a point.
  • We can start in 2D:
In [ ]:
# You can ignore the code - we'll just look at the plot
###

# Creating Data
n = 20
np.random.seed(1)
X = np.random.rand(n, 2)
In [ ]:
# Creating Dataframe
data = pd.DataFrame({'x0': X[:, 0], "x1" : X[:, 1]})
data['m0'] = .5
data['m1'] = .5

# Creating Plot
chart1 = alt.Chart(data).mark_circle(size=60).encode(
    x= alt.X('x0',
             axis=alt.Axis(grid=False)),
    y = alt.Y('x1',
             axis=alt.Axis(grid=False)))


chart2 = alt.Chart(data).mark_circle().encode(
    x= alt.X('m0',
              axis=alt.Axis(title = "")),
    y = alt.Y('m1',
               axis=alt.Axis(title = "")),
    color = alt.value('red'),
    size = alt.value(50)
)

chart1 + chart2
  • Which blue points are nearest to the red ("query") point?
    • To define "nearest" we need a notion of distance.
    • For now, we'll use Euclidean distance (the one you're used to from day-to-day life; a quick sketch follows this list).
    • In future courses, this might change.
    • Choosing a distance metric is actually important in machine learning.
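
As a quick sketch of the distance computation we assume throughout (plain numpy, nothing course-specific):

In [ ]:
import numpy as np

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
np.sqrt(np.sum((p - q) ** 2))  # 5.0, the same as np.linalg.norm(p - q)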

The algorithmic approach is:

  1. Find the distance from the red point to all the blue points.
  2. Find the smallest distances.
In [ ]:
# It's OK if you don't understand this code, especially during lecture
# (It uses numpy broadcasting.)


def nearest_neighbour(data, query):
    """
    Find the point in the data that is nearest to the query point.

    Parameters
    ----------
    data : numpy.ndarray
        a 2D array containing the points as rows
    query : numpy.ndarray
        a 1D array containing the query point
    
    Returns
    -------
    int
        the index of the nearest point
                 
    Example
    --------
    >>> array = np.array([[1, 1], [2, 5], [5, 6], [3, 0], [9, 9]])
    >>> nearest_neighbour(array, [10, 10])
    4
    """
    
    if query.ndim == 1:
        query = query[None]

    return np.argmin(np.sum((data - query)**2, axis=1))
In [ ]:
print(X)
In [ ]:
query = np.array([0.5, 0.5])
nn = nearest_neighbour(X, query)
nn
In [ ]:
X[8]

Question: what is the time complexity of nearest_neighbour if we have $n$ points in $k$ dimensions?









Answer: $O(nk)$, because we have to loop over all $n$ points, and computing the distance requires looping over the $k$ dimensions.

  • Problem: this may be way too slow!
  • For example, if you want to find similar items on Amazon, and they have a billion items, you don't want to have to look through all of them every time.

Recursive data structures and nearest neighbour algorithms

Recursive data structures and nearest neighbor algorithms are concepts from different domains of computer science but can be connected in the context of data organization and search algorithms.

Recursive Data Structures: These are data structures that are defined in terms of smaller instances of the same type of data structure. For example, in a binary search tree (BST), each node has a value and pointers to two children, which are themselves binary search trees. Recursive data structures are often used to represent hierarchical data, and they naturally lend themselves to recursive algorithms for traversal and management.

Nearest Neighbor Algorithms: These algorithms are used to find the closest point(s) to a given point from a set of points. They are a key part of many applications, such as classification in machine learning, where an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k-NN algorithms).

The Connection: Recursive data structures like BSTs can be used to efficiently implement nearest neighbor searches, especially in higher dimensions. For instance, the k-d tree (k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. A k-d tree is a binary tree that recursively subdivides the space into halves, with each node representing an axis-aligned hyperrectangle. When searching for the nearest neighbours of a point, you use the tree structure to quickly eliminate large portions of the search space, significantly speeding up the search compared to a brute-force approach that compares the point against every other point in the set.

Here’s how k-d trees connect recursive data structures with nearest neighbor searches:

  • They use a recursive data structure (a binary tree variant) to recursively divide the space into smaller regions.
  • They enable efficient nearest neighbor searches by utilizing the tree's properties to prune the search space.
  • When searching for the nearest neighbor(s) of a point, the algorithm recursively traverses the tree, going deeper into the branches that may contain the nearest neighbor and skipping branches that don't, based on the minimum possible distance between the point and the regions represented by those branches.

$k$-d trees

  • Sometimes we speed things up with faster algorithms.
  • But, as we've seen with trees and hash tables, sometimes we speed things up with better data structures.
  • One of the classic ways to speed up nearest neighbours is a data structure called the $k$-d tree.

A k-d tree is a space-partitioning data structure that is particularly useful for organizing and searching for points in a k-dimensional space. Its applications are broad and include the following:

  1. Nearest Neighbor Searches: The primary use of k-d trees is to quickly find the nearest neighbor(s) of a given point. This is useful in many domains, such as pattern recognition, statistical machine learning for classification tasks, and data mining.

  2. Range Searches: K-d trees enable efficient range searching, where you can query for all points within a given range. For instance, in a mapping application, a k-d tree can be used to find all locations within a certain distance from a user's current position.

  3. Geospatial Indexing: In geographic information systems (GIS), k-d trees are employed to index spatial data like points on a map, allowing for fast querying of location data.

  4. Dimensionality Reduction: They can be used in dimensionality reduction problems, helping to organize points in high-dimensional spaces for algorithms like the k-nearest neighbors (k-NN) to work more efficiently.

  5. Collision Detection: In computer graphics and gaming, k-d trees can help in collision detection by quickly identifying potential intersections between objects in space.

  6. Robotic Motion Planning: K-d trees are used in robotics for the rapid exploration of configuration spaces, which helps in planning the motions of robots avoiding obstacles.

  7. Multidimensional Keys in Databases: Database systems use k-d trees for indexing multidimensional keys, allowing for efficient processing of complex queries that involve multiple attributes.

  8. Approximate Matching: In problems where an exact match is not required, or it's too expensive to find, k-d trees can be used to quickly find approximate matches.

  9. Data Compression: By structuring data points in space, k-d trees can support certain types of data compression schemes.

K-d trees offer average-case efficient performance, which is generally $O(\log n)$ for many of their operations. However, their performance can degrade to $O(n)$ in the worst case, particularly in high-dimensional spaces (the so-called "curse of dimensionality"). For high-dimensional spaces, other data structures such as ball trees or approximate methods might be more suitable.
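
Before walking through our own educational implementation below, here is a minimal sketch using scikit-learn's production implementation (sklearn.neighbors.KDTree, the same class we use for the timing experiments later):

In [ ]:
import numpy as np
import sklearn.neighbors

rng = np.random.default_rng(0)
pts = rng.random((100, 3))            # 100 points in 3 dimensions
tree = sklearn.neighbors.KDTree(pts)  # build the tree once...
dist, ind = tree.query(pts[:1], k=2)  # ...then query cheaply; k=2 nearest points
print(ind, dist)                      # the query point itself, plus its nearest neighbour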

In [ ]:
# You do not need to read/understand this code,
# But I think you're capable of understanding it with a bit of time spent.
# Feel free to ask if you have questions!


class KDTree:
    """A k-d tree data structure for fast nearest neighbour searches"""

    def __init__(self):

        self.location = None
        self.leftSubTree = None # type KDTree
        self.rightSubTree = None # type KDTree
        self.dim = None
        self.data = None

    def build(self, data, depth=0):
        """
        Build the k-d tree from the given data.
        Implementation inspired by https://en.wikipedia.org/wiki/K-d_tree

        Parameters
        ----------
        data : numpy.ndarray
            a 2D array where each row is a point in space
        depth : int 
            this can be ignored, for internal bookkeeping (default: 0)    
        """
        nrows = data.shape[0]
        self.dim = depth % data.shape[1]
        self.data = data

        self.location = np.median(data[:, self.dim])
        # (equivalently, data[nrows//2, self.dim] after sorting,
        # although np.median averages the two middle values when nrows is even)

        if nrows == 1:
            return

        data = data[np.argsort(data[:, self.dim])]

        self.leftSubTree = KDTree()
        self.leftSubTree.build(data[:nrows//2], depth+1)

        self.rightSubTree = KDTree()
        self.rightSubTree.build(data[nrows//2:], depth+1)

    def approximateNearestNeighbour(self, query):
        """
        Find the nearest neighbor to the query point.
        However, this is just approximate; it finds a point in
        the same rectangle, not necessarily the actual nearest neighbour.
        This is just for educational purposes; a correct algorithm
        exists but it's too messy to put here.

        Parameters
        ----------
        query : numpy.ndarray 
            a point in space

        Returns
        -------
        numpy.ndarray
            the coordinates of the point closest to the query
        """
        if self.data.shape[0] == 1:
            return self.data[0]

        if query[self.dim] < self.location:
            return self.leftSubTree.approximateNearestNeighbour(query)
        else:
            return self.rightSubTree.approximateNearestNeighbour(query)

    def plot2d(self, depth=1, minx=0.0, maxx=1.0, miny=0.0, maxy=1.0):
        """
        Plot the k-d tree.

        Parameters
        ----------
        depth : int
            how deep to go down the tree when plotting (default: 1)
        minx : float
            the left edge of the plot (default: 0.0)
        maxx : float
            the right edge of the plot (default: 1.0)
        miny : float
            the bottom edge of the plot (default: 0.0)
        maxy : float
            the top edge of the plot (default: 1.0)

        Returns
        -------
        altair.Chart
            a layered chart showing the points and the partition lines
        """

        data = pd.DataFrame({'x0': self.data[:, 0], "x1": self.data[:, 1]})
        chart1 = alt.Chart(data).mark_circle(size=60).encode(
            x=alt.X('x0',
                    axis=alt.Axis(grid=False)),
            y=alt.Y('x1',
                    axis=alt.Axis(grid=False)))
        charts_list = [chart1]
        if depth == 0:
            return

        if self.dim == 0:
            data2 = pd.DataFrame(
                {'x0': (self.location, self.location), "x1": (miny, maxy)})
            chart2 = alt.Chart(data2).mark_line().encode(
                x=alt.X('x0',
                        axis=alt.Axis(grid=False)),
                y=alt.Y('x1',
                        axis=alt.Axis(grid=False)))
            charts_list.append(chart2)

            if self.leftSubTree is not None:
                chart0l = self.leftSubTree.plot2d(
                    depth-1, minx=minx, maxx=self.location, miny=miny, maxy=maxy)
                charts_list.append(chart0l)
            if self.rightSubTree is not None:
                chart0r = self.rightSubTree.plot2d(
                    depth-1, minx=self.location, maxx=maxx, miny=miny, maxy=maxy)
                charts_list.append(chart0r)
        elif self.dim == 1:
            data3 = pd.DataFrame(
                {'x0': (minx, maxx), "x1": (self.location, self.location)})
            chart3 = alt.Chart(data3).mark_line().encode(
                x=alt.X('x0',
                        axis=alt.Axis(grid=False)),
                y=alt.Y('x1',
                        axis=alt.Axis(grid=False)))
            charts_list.append(chart3)
            if self.leftSubTree is not None:
                chart1l = self.leftSubTree.plot2d(
                    depth-1, minx=minx, maxx=maxx, miny=miny, maxy=self.location)
                charts_list.append(chart1l)
            if self.rightSubTree is not None:
                chart1r = self.rightSubTree.plot2d(
                    depth-1, minx=minx, maxx=maxx, miny=self.location, maxy=maxy)
                charts_list.append(chart1r)

        params = list(filter(None, charts_list))
        return alt.layer(*params).encode()


Basic idea:

  • In each recursive step, there is a certain number of datapoints. If there's only one, we're done.
  • Otherwise, for one of the two dimensions (we alternate back and forth), find the median value along the dimension.
  • Split the data into two subsets based on being above or below that median, and build a (sub)tree for each of those subsets.
  • Starting from the full dataset, you will create a tree where each leaf is a datapoint.
  • You can find an approximate nearest neighbour by traversing down the tree using the same decision points as were used to originally split the data; the final leaf is the desired neighbour.
In [ ]:
# np.median(X[:,0])
In [ ]:
kdt = KDTree()
kdt.build(X)
kdt.plot2d(depth=1)

The plot shows the first level of partitioning in a k-d tree.

We know that in a k-d tree, data is recursively split along axis-aligned hyperplanes. At the first level (depth 0), the root of the k-d tree, the split is typically along the x-axis (assuming we're considering the first dimension as x and the second as y). This split is done at the median point of the data along that axis, which aims to partition the data into two halves of equal size.

In the plot, the blue dots represent the data points from X. The vertical blue line represents the median split along the x-axis (which is labeled as x0 on the plot). This line divides the plot into two regions:

  • To the left of the line, where the x-values of points are less than the median (x0 < median).

  • To the right of the line, where the x-values of points are greater than or equal to the median (x0 >= median).

This median split is the root node of the k-d tree, which then would recursively partition the data within each of the two halves if depth were greater than 1. However, since depth=1, we only see the first split and not any further subdivisions.

In [ ]:
kdt = KDTree()
kdt.build(X)
In [ ]:
kdt.plot2d(depth=2)

Now, the 2-dimensional plot shows the first two levels of partitioning in a k-d tree.

The lines divide the space into rectangles, each associated with a node in the k-d tree. The blue dots represent the data points, and their position relative to the lines indicates in which node of the k-d tree they would be located. The structure of these partitions reflects the hierarchical, recursive nature of the k-d tree, and the plot visualizes how a two-level k-d tree organizes the data in 2D space. Each subsequent depth would further subdivide the regions until each leaf node of the tree contains a small number of points, potentially down to a single point.

  • Clusters in a k-d tree plot simply represent how the k-d tree organizes the data points into different partitions for efficient nearest neighbor searches.

  • Each rectangular region formed by the intersecting lines can be thought of as a "bucket" that groups points together based on the splitting criteria of the tree at each depth. These regions become nodes in the k-d tree.

  • The points within each rectangle are not necessarily the closest to each other in the dataset; they just happen to fall within the same partition based on the dimension and the median value at that node in the tree.

  • Dots over the lines represent data points that lie exactly on the splitting hyperplane at any given depth of the tree.

  • In 2D space, these "lines" are the splitting criteria — one line splits across the x0 axis and the other across the x1 axis. The dots on the lines are the points where the x0 or x1 value is equal to the median of the respective dimension used for the split at that level of the tree.

  • When building the k-d tree, these points could be assigned to either the left or right subtree arbitrarily, or some implementations may have a rule to decide on which side of the split such points should go.

  • The precise position of the dots, whether they're over the lines or not, is significant for how the tree will navigate during nearest neighbor searches, but it does not provide direct information about the density or the true "closeness" of the points unless we consider the actual distances between them.

In [ ]:
kdt.plot2d(depth=3)
In [ ]:
kdt.plot2d(depth=4)
In [ ]:
kdt.plot2d(depth=6)

Let's find the approximate nearest neighbor to the query point. Our method will find a point in the same rectangle, not necessarily the actual nearest neighbour.

In [ ]:
kdt.approximateNearestNeighbour(np.array([1, 1]))

The code above is a method call to approximateNearestNeighbour on an instance of the KDTree class. It was used to find an approximate nearest neighbor to the query point defined by the numpy array [1, 1]. Let's check what it does step by step:

  1. Query Point: The point we're interested in finding the nearest neighbor for is represented as a numpy array with coordinates [1, 1]. This point is in a 2-dimensional space, and in the context of the k-d tree, we're searching for the closest point in the dataset that the tree has been built with.

  2. Searching: The method begins at the root of the k-d tree and compares the query point to the splitting value (location) of the node, which is determined during the building of the tree. The splitting value is the median of the dataset along the dimension that the node represents.

  3. Traversing the Tree: Depending on whether the query point's coordinate in the current dimension is less than or greater than the node's splitting value, the search moves to the left or right subtree, respectively.

  4. Recursive Descent: This process is recursive — at each node, the algorithm decides which branch to follow (left or right) based on the comparison with the location and continues until it reaches a leaf node.

  5. Approximation: Once a leaf node is reached, the method returns the data point associated with that leaf node as the approximate nearest neighbor. It's approximate because the algorithm does not backtrack to check other possible closer points that may be in different branches of the tree but happen to be closer to the query point due to the partitioning strategy.

  6. Result: The result of the method is the coordinates of the data point that is the approximate nearest neighbor to the point [1, 1].

It's important to note that the approximation might not always give the true nearest neighbor because it doesn't perform a backtracking search, which would be required to find the exact nearest neighbor in a k-d tree. The backtracking search would involve checking other branches of the tree if there's a possibility that they could contain a closer point than what was found in the initial descent down the tree.

Let's check whether, for this particular query point [1, 1], the approximate nearest neighbour identified by the k-d tree's approximateNearestNeighbour method happens to be the actual nearest neighbour:

In [ ]:
X[nearest_neighbour(X, np.array([1,1]))]

Yes, it is!

Let's check a new query point:

In [ ]:
kdt.approximateNearestNeighbour(np.array([0.5, 0.5]))
In [ ]:
X[nearest_neighbour(X, np.array([0.5,0.5]))]

Timing experiments

We'll time scikit-learn's KDTree and compare it to brute force.

In [ ]:
n_sizes = [100, 1000, 10_000, 100_000]

results = defaultdict(list)
results["n"] = n_sizes

d = 10

for n in n_sizes:
    print('n: ', n)
    X = np.random.rand(n, d)
    query = np.random.rand(1, d)

    print("  KDTree")
    time = %timeit -q -o -r 3 sklearn.neighbors.KDTree(X)
    results["KDTree init"].append(time.average)
    KDT = sklearn.neighbors.KDTree(X)

    time = %timeit -q -o -r 3 KDT.query(query)
    results["KDTree query"].append(time.average)

    print("  Brute force")
    time = %timeit -q -o -r 3 nearest_neighbour(X, query)
    results["Brute force"].append(time.average)

The code conducts a performance comparison between two methods of finding the nearest neighbor(s) in a dataset: using the k-d tree data structure and using a "brute-force" approach. The comparison is done for different sizes of datasets, and the time it takes to initialize the data structure and to query the nearest neighbor is measured and stored. Here's a step-by-step explanation:

  1. Dataset Sizes: n_sizes is a list of integers that represent different sizes of datasets that the code will use for testing.

  2. Results Storage: results is a defaultdict of lists, which is a convenient way to store the results of the timing experiments for each method and dataset size. The list n_sizes is also stored with the key "n".

  3. Dimension: d represents the number of dimensions in the dataset. The code sets d = 10, indicating that each point in the dataset is a 10-dimensional vector.

  4. Benchmarking Loop: The code loops over each size n in n_sizes. For each n, the following steps are performed:

    • Generate a random dataset X with n samples, each of d dimensions, using np.random.rand(n, d).
    • Generate a random query point with d dimensions using np.random.rand(1, d).
  5. k-d Tree Benchmarking:

    • Initialize a KDTree from sklearn.neighbors with the dataset X, and measure the time it takes to initialize the tree using %timeit -q -o -r 3. This is done quietly (-q), the timing result is captured (-o), and 3 repetitions are used (-r 3). The average time is then appended to results["KDTree init"].
    • Perform a nearest neighbor query on the KDTree with the generated query point and measure the query time, appending the average time to results["KDTree query"].
  6. Brute Force Benchmarking:

    • Execute a nearest neighbor search using a brute-force function nearest_neighbour with the dataset X and the query point, measuring the time it takes and appending the average to results["Brute force"].

It is expected that the k-d tree method will be significantly faster, especially as the size of the dataset grows, due to its $O(\log n)$ average query complexity, compared to the brute-force method's linear complexity, $O(n)$.

In [ ]:
df = pd.DataFrame(results, columns=list(results.keys()))
df

Question: What does the time complexity look like for the 3 columns?







Answer: Brute force looks linear; the query looks roughly logarithmic; the initialization grows faster than linear, perhaps $O(n \log n)$ or perhaps something else. We won't worry about the exact rate here.
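
As a quick visual check (a sketch; it assumes the df built above), a log-log plot makes the growth rates easier to eyeball: linear growth shows up as a straight line of slope 1.

In [ ]:
import matplotlib.pyplot as plt

plt.figure()
for col in ["KDTree init", "KDTree query", "Brute force"]:
    plt.loglog(df["n"], df[col], marker="o", label=col)
plt.xlabel("n")
plt.ylabel("seconds")
plt.legend()
plt.show()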

Question: Which is better, the $k$-d tree or brute force?







Answer: It depends how many queries you do.

A detailed interpretation of the results in the data frame:

  • As expected, the initialization time for the k-d tree (KDTree init) increases with the size of the dataset. It's much lower for small datasets but becomes more significant for larger datasets. This is due to the overhead of building the tree structure, which involves sorting points and creating tree nodes.

  • The query time for the k-d tree (KDTree query) remains very low across all dataset sizes, highlighting the efficiency of k-d trees for nearest neighbor searches. It doesn't increase significantly as the dataset size grows, which demonstrates the $O(\log n)$ complexity of the query operation in a k-d tree.

  • The brute force search times are low for smaller datasets but increase linearly with the size of the dataset. For 100 and 1,000 points, brute force is faster than initializing a k-d tree but slower when comparing query times.

  • For the largest dataset size (100,000), the k-d tree query time is still low, whereas the brute force method takes significantly longer. This demonstrates the inefficiency of the brute force approach for large datasets, where the linear search through all data points becomes costly.

In conclusion, these results support the claim that while k-d trees take time to build, they provide very fast query responses, and they outperform the brute force method as the size of the dataset increases.

Amortization

Let's focus on $n=10000$ points in $d=10$ dimensions (matching the timing code above). Then,

  • $k$-d tree initialization takes $\approx 4$ ms
  • $k$-d tree query takes $\approx 0.1$ ms
  • brute force search takes $\approx 1$ ms

Question: How many queries do we need to do such that the $k$-d tree is better?







Answer: around 5.
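
A quick back-of-the-envelope check, using the approximate timings quoted above (all values in milliseconds, and all rough estimates):

In [ ]:
init = 4.0        # k-d tree build
tree_query = 0.1  # per k-d tree query
brute = 1.0       # per brute-force query

# The k-d tree wins once init + q * tree_query < q * brute:
q_break_even = init / (brute - tree_query)
q_break_even  # ~4.4, so the k-d tree wins from about 5 queries onward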

  • So if we're doing 100 queries, the $k$-d tree is much better.
  • This reflects a general phenomenon in algorithms: doing a lot of work up front to save time later.
    • We saw this earlier with sorting a list and then doing binary search multiple times.
  • We say the up-front effort is amortized (or spread out) over the many queries.
  • In some cases, we can make more precise calculations.
    • For example, we say hash table operations are $O(1)$.
    • In fact, once in a while a slower operation must be done.
    • However, we can show that an $O(n)$ operation only needs to be done once every $n$ steps.
    • In which case we say the cost is amortized and the overall cost is still $O(1)$.
    • This is an important idea; a small illustration follows this list.
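
As a concrete illustration of amortization (a sketch, not from the lecture): appending to a Python list occasionally triggers a costly reallocation, but because the list over-allocates, the average cost per append stays constant.

In [ ]:
import sys

lst = []
prev = sys.getsizeof(lst)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev:  # a reallocation happened on this append
        print(f"resize after append #{len(lst)}: {prev} -> {size} bytes")
        prev = size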

Other nearest neighbour approaches

  • Note: there are other nearest neighbour approaches besides $k$-d trees, including some very fast approximate algorithms.
  • In general, you can often do something faster if the result can be slightly wrong.
  • There are approaches based on hashing instead of trees.
  • Here are some resources:

Thank you!