Python - Searching, Sorting, Hash

QTM 350: Data Science Computing

Davi Moreira

Introduction


This topic's material is based on Professor Mike Gelbart's Algorithms and Data Structures course and was adapted for our purposes.

In [ ]:
import numpy as np
import pandas as pd
from collections import defaultdict

import matplotlib.pyplot as plt
import altair as alt

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

Outline:

  • Linear search and binary search intro
  • Code timings
  • Sorting
  • Hash tables, hash functions
  • Lookup tables, Python dict
  • Python's defaultdict

Learning objectives

  • Describe some basic sorting algorithms and their time complexities.
  • Describe the binary search algorithm.
  • Explain why searching in a sorted list is faster than in an unsorted list.
  • Explain the pros and cons of hash tables, and how they work at a high level.
  • Apply python's dict and defaultdict data structures.

Linear search and binary search intro

We return to the problem of checking whether an element is present in a collection.

In [ ]:
def search_unsorted(data, key):
    """
    Searches the key in data using linear search 
    and returns True if found and False otherwise. 

    Parameters
    ----------
    data : list
           the elements to search within
    key  : int
           the key to search for

    Returns
    -------
    bool
        boolean if key is contained in the data 

    Examples
    --------
    >>> search_unsorted([1, 7, 67, 35, 45], 3)
    False
    >>> search_unsorted([1, 7, 67, 35, 45], 7)
    True
    """

    for element in data:
        if element == key:
            return True
    return False

Let's see some tests below! These assertions check that search_unsorted behaves as expected in typical and boundary scenarios. If any assertion fails, Python raises an AssertionError, indicating a bug or unexpected behavior in the function's implementation.

In [ ]:
# Some tests

# key is the first element in the list
assert search_unsorted([4, 7, 9, -12, 1000], 4)

# key is the last element in the list
assert search_unsorted([4, 7, 9, -12, 1000], 1000)

# key occurs multiple times in the list
assert search_unsorted([4, 7, 9, -12, 4, 1000], 4)

# key is larger than the largest element in the list
assert not search_unsorted([4, 7, 9, -12, 1000], 2000)

# key is smaller than the smallest element in the list
assert not search_unsorted([4, 7, 9, -12, 1000], -18)

# nothing is in an empty list
assert not search_unsorted([], 1)

Question: What is the time complexity of search_unsorted, as a function of the length of the list, $n$?









Answer: <!---The time complexity of the search_unsorted function is $O(n)$ because in the worst case the function loops over $n$ elements.--->

  • If the list is already sorted, we can search much faster with binary search.
  • See the "binary search video" that was recommended as pre-class viewing.
  • We start in the middle and can just restrict ourselves to searching half the list after a comparison.
  • Note: the input list must be sorted for the code to work.
In [ ]:
def search_sorted(data, key):
    """
    Searches the key in data using binary search 
    and returns True if found and False otherwise. 

    Parameters
    ----------
    data : list
           a list of sorted elements to search within
    key  : int
           the key to search for

    Returns
    -------
    bool :
        boolean if key is contained in the data 

    Examples
    --------
    >>> search_sorted([1, 7, 35, 45, 67], 3)
    False
    >>> search_sorted([1, 7, 35, 45, 67], 7)
    True
    """
    
    while len(data) > 0:
        mid = len(data)//2
        if data[mid] == key:
            return True
        if key < data[mid]:
            data = data[:mid] 
        else:
            data = data[mid+1:]
    return False
In [ ]:
data = [-12, 4, 7, 9, 45, 45, 987, 1000, 2000]
In [ ]:
# Test cases for binary search

# key is the first element in the list
assert search_sorted(data, -12)

# key is the last element in the list
assert search_sorted(data, 2000)

# key occurs multiple times in the list
assert search_sorted(data, 45)

# key is larger than the largest element in the list
assert not search_sorted(data, 3000)

# key is smaller than the smallest element in the list
assert not search_sorted(data, -18)

# nothing is in an empty list
assert not search_sorted([], 1)

Question: What is the time complexity of search_sorted, as a function of the length of the list, $n$?









Answer: <!---The time complexity of the search_sorted function is $O(\log n)$ because in the worst case, the function loops over $\log n$ elements, as the search space reduces by half in each iteration of the loop. --->

The binary search approach involves repeatedly dividing the search interval in half.

Algorithm Description

  1. Initialize the search range: The search range initially encompasses the entire list.
  2. Find the middle element: Calculate the index of the middle element of the current search range.
  3. Comparison:
    • If the middle element is equal to the key, the search is successful and returns True.
    • If the key is smaller than the middle element, adjust the search range to the left half (excluding the middle element).
    • If the key is larger than the middle element, adjust the search range to the right half (excluding the middle element).
  4. Repeat the process with the new search range.
  5. Termination: If the search range becomes empty (no elements to search), return False.
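
A note on the implementation above: each slice such as data[:mid] copies its elements into a new list, a cost that the analysis below sets aside. A minimal index-based variant (an illustrative sketch, not the course's implementation) avoids that copying by tracking the search range with two indices:

In [ ]:
def search_sorted_indexed(data, key):
    """Binary search with index bounds instead of slicing (no copies made)."""
    low, high = 0, len(data) - 1
    while low <= high:
        mid = (low + high) // 2
        if data[mid] == key:
            return True
        if key < data[mid]:
            high = mid - 1  # continue in the left half
        else:
            low = mid + 1   # continue in the right half
    return False

assert search_sorted_indexed([1, 7, 35, 45, 67], 7)
assert not search_sorted_indexed([1, 7, 35, 45, 67], 3)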

Time Complexity Analysis

  • Worst-case and average-case time complexity:

    • At each step of the binary search, the size of the search range (the list) is halved. This halving continues until either the key is found or the list cannot be divided further (i.e., it becomes empty).

    • The number of times a list of size $ n $ can be halved until it reduces to a size of 1 is $\log_2 n$. This logarithmic reduction implies that the time complexity of binary search in both the average and worst cases is $ O(\log n) $, where $ n $ is the number of elements in the list.

  • Best-case time complexity:

    • The best case occurs if the key is exactly at the middle of the list in the first check, leading to a single comparison.

    • Thus, the best-case time complexity is $ O(1) $, representing a constant time operation.

Therefore, the theoretical time complexity of the search_sorted function remains $ O(\log n) $ for the average and worst cases due to the logarithmic reduction in problem size at each step of the algorithm.

Why is the number of times a list of size $ n $ can be halved until it reaches size 1 equal to $ \log_2 n $?

Understanding the Process of Halving:

When you halve a list (or any set of elements), each halving operation reduces the number of elements to half of its current size. For example:

  • If you start with $ n $ elements, after one halving, you have $ \frac{n}{2} $ elements.

  • After a second halving, you have $ \frac{n}{4} $ elements, and so forth.

The halving continues until you cannot halve further, which ideally ends at 1 element.

Mathematical Explanation:

To find how many times you can perform this operation, you are essentially asking: how many times must you multiply 2 by itself to reach $ n $? Mathematically, this question translates into solving for $ k $ in the equation:

$$ 2^k = n $$

Where:

  • $ 2 $ is the base, representing the halving.
  • $ k $ is the number of times you multiply 2 by itself to reach $ n $.
  • $ n $ is the initial size of your list.

Using Logarithms:

The logarithm is the inverse operation of exponentiation. To solve for $ k $, you take the logarithm base 2 of both sides of the equation $ 2^k = n $:

$$ k = \log_2(n) $$

This tells you that $ k $ (the number of halvings) is equal to $ \log_2(n) $.

Intuitive Example:

If $ n = 8 $:

  • After 1 halving: 4 elements ($ \frac{8}{2} $)
  • After 2 halvings: 2 elements ($ \frac{4}{2} $)
  • After 3 halvings: 1 element ($ \frac{2}{2} $)

So, $ 2^3 = 8 $ leads to $ \log_2(8) = 3 $. This tells us it takes 3 halvings to reduce 8 elements to 1, aligning with $ \log_2(n) = 3 $ for $ n = 8 $.

Thus, the operation of binary search, which depends on this process of halving, has a logarithmic time complexity ($ O(\log n) $), as the list size $ n $ can be divided $ \log_2(n) $ times until only one or no elements are left.
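
A quick numerical sanity check of this halving argument (a small illustrative snippet, not part of the original notes); for sizes that are not exact powers of two, the count is $\lfloor \log_2 n \rfloor$:

In [ ]:
import math

def count_halvings(n):
    """Count how many integer halvings it takes to reduce n to 1."""
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count

for n in [8, 1024, 1_000_000]:
    print(n, count_halvings(n), math.log2(n))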

Question: What happens if you call search_unsorted on sorted data? What happens if you call search_sorted on unsorted data?







Answer:

In [ ]:
search_sorted([3, 2, 1], 1)

Question: Why doesn't the search_sorted function start by verifying that the list is indeed sorted?







Answer: <!---because this would take $O(n)$ time, defeating the purpose of the $O(\log\, n)$ lookup.--->

Code timing

Below we empirically measure the running times of 4 approaches:

  1. Using in with a Python set (same as last class)
  2. search_sorted, that is, our implementation of binary search above
  3. Using in with a Python list (same as last class)
  4. search_unsorted, that is, our implementation of linear search above
In [ ]:
list_sizes = [100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]

results = defaultdict(list)
results["size"] = list_sizes

key = -1

for list_size in list_sizes:
    print('List size: ', list_size)
    x = np.random.randint(1e8, size=list_size)

    time = %timeit -q -o -r 1 search_unsorted(x, key)
    results["Unsorted list linear"].append(time.average)
    # Note: -q prevents it from printing to the terminal
    #       -o sends the result to a variable (average time in seconds)
    #       -r 1 makes it run only 1 trial instead of the default of 7, which saves time

    time = %timeit -q -o -r 1 (key in x)
    results["Unsorted list in"].append(time.average)

    x.sort()
    time = %timeit -q -o -r 1 search_sorted(x, key)
    results["Sorted list binary"].append(time.average)

    x_set = set(x)
    time = %timeit -q -o -r 1 (key in x_set)
    results["Python set in"].append(time.average)

Understanding the code above:

Question: Why do I search for $-1$ in the code above? Why not $1$?

Answer: <!---we search for -1 because we know it will not be in the array. This gives us a worst case timing, because searching is the slowest if it has to keep looking; it can be faster if it finds something right away. For example, if it's the 1st element, then linear search would seem extremely fast.--->

In [ ]:
df = pd.DataFrame(results, columns=list(results.keys()))
df

Are these consistent with the time complexities we expected?

In general, we would expect to see the following trends in the data:

  • Unsorted List Linear Search:

    • Time should increase linearly with the size of the list.
    • This is because a linear search goes through each element one by one, so the time complexity is $O(n)$.
  • Unsorted List with 'in' Operator:

    • Time should also increase linearly, similar to the linear search.
    • The 'in' operator effectively performs a linear search when used on a list, so it also has a time complexity of $O(n)$.
  • Sorted List Binary Search:

    • Time should increase logarithmically with the size of the list.
    • Binary search has a time complexity of $O(\log n)$, so even as the list size grows large, the time should not increase significantly.
  • Python Set with 'in' Operator:

    • Time should remain approximately constant regardless of the size of the set.
    • The 'in' operator on a set is a hash table lookup with a time complexity of $O(1)$ in the average case.
In [ ]:
df_long = pd.melt(df, id_vars="size", var_name="method", value_name="time (s)")

# Create a line chart using Altair
chart = alt.Chart(df_long).mark_line().encode(
    alt.X('size', scale=alt.Scale(type='log')),  # Setting the X-axis to a logarithmic scale
    alt.Y('time (s)', scale=alt.Scale(type='log')),  # Setting the Y-axis to a logarithmic scale
    color='method'  # Coloring the lines by the 'method' column
).configure_axis(grid=False)  # Disabling the grid lines

# Display the chart
chart

Note that the search_sorted we wrote is actually slower than the linear search using in when the list is smaller than $10,000$ elements. Remember, big-O is just an "asymptotic" trend. There could be:

  • Large/small constants, like $1000\log(n)$ vs. $n$.
  • "Lower order terms", like $\log(n)+100 \log \log(n) + 100$ vs. $n$.

We are probably seeing the former; my code performs fewer steps (better complexity), but each step is much slower because the implementation is not optimized. (Often, though, we don't care too much about how code performs for very small inputs.)
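
(Incidentally, Python's standard library ships an optimized binary search in the bisect module. Below is a sketch of using it for membership testing; the wrapper name search_sorted_bisect is ours:)

In [ ]:
from bisect import bisect_left

def search_sorted_bisect(data, key):
    """Binary search via the standard library's bisect module."""
    i = bisect_left(data, key)  # index of the first element >= key
    return i < len(data) and data[i] == key

search_sorted_bisect([1, 7, 35, 45, 67], 7)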

The two fastest methods (the orange and blue curves) look quite similar, possibly because of the log scale on the y-axis. Let's plot time vs. $\log n$ to try to tell the difference.

In [ ]:
df_long = pd.melt(df[["size", "Sorted list binary", "Python set in"]],
                  id_vars="size",  # 'size' is the identifier column
                  var_name="method",  # The name of the new column that will contain the method names
                  value_name="time (s)")  # The name of the new column that will contain the time values

# Creating a line chart with Altair using the melted DataFrame. Each method will have a unique line.
chart = alt.Chart(df_long).mark_line().encode(
    alt.X('size', scale=alt.Scale(type='log')),  # X-axis represents 'size' on a logarithmic scale
    alt.Y('time (s)'),  # Y-axis represents the time in seconds
    color='method'  # The lines are colored based on the 'method' to differentiate the search methods
).configure_axis(grid=False)  # Disabling grid lines for a cleaner look

# Display the chart
chart

We can see that the set lookup really is constant, while binary search is logarithmic: in the plot, time is linear in $\log(n)$.

Sorting

  • Sorting is a very popular topic in Algorithms and Data Structures courses.
  • We'll start with "selection sort".
In [ ]:
def selection_sort(x):
    """Sorts x inplace... slowly.

    Parameters
    ----------
    x : list
           the list to be sorted

    Returns
    -------
    list :
        the sorted list 

    Examples
    --------
    >>> selection_sort([7, 1, 67, 35, 45])
    [1, 7, 35, 45, 67]
    >>> selection_sort([357, 6, 55, 12, 112])
    [6, 12, 55, 112, 357]
    """

    n = len(x)

    for i in range(n):
        # Get the index of the smallest value from location i onward
        min_ind = np.argmin(x[i:]) + i

        # Swap this with element i
        x[i], x[min_ind] = x[min_ind], x[i] # swap
    return x
  • How does the above code work?
    • For the remaining part of the array x[i:], find the smallest element.
    • Swap this with the current element.
    • Repeat until reaching the end of the array.
  • We will revisit this code next class.

Here's a step-by-step explanation:

  1. Function Definition:

    • The function selection_sort takes a single argument x, which is a list of elements that need to be sorted.
  2. Length of List:

    • The variable n is assigned the length of the list x. This will be used to control the iterations of the sorting process.
  3. Sorting Loop:

    • A for loop is initiated with the range from 0 to n-1 (since range in Python is non-inclusive of the end value). The loop variable i represents the current position in the list that the algorithm is attempting to fill with the correct element.
  4. Finding the Minimum Element:

    • Within the loop, the function uses np.argmin(x[i:]) to find the index of the smallest element in the sublist x[i:], which is the portion of the list from index i to the end. It then adds i to this index because np.argmin returns the index within the sliced list x[i:], not the full list x.
  5. Swapping Elements:

    • Once the index of the minimum element (min_ind) is found, a swap operation is performed to place this minimum element into its correct position i in the list. The swapping is done using the tuple unpacking feature in Python, which allows for the values of x[i] and x[min_ind] to be exchanged in a single line.
  6. Returning the Sorted List:

    • After the loop has finished executing, all elements have been moved to their correct positions, and the list x is sorted. The sorted list x is returned.
  7. Function Annotations:

    • The function includes a docstring with a brief description, parameter annotations, return value description, and examples of usage.
  8. In-place Sorting:

    • It's important to note that selection sort is performed in-place, meaning the list x is sorted without requiring additional memory for another list.
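
To make the mechanics concrete, here is a small trace (illustrative only) that prints the list after each pass of the outer loop:

In [ ]:
x = [7, 1, 67, 35, 45]
for i in range(len(x)):
    min_ind = np.argmin(x[i:]) + i       # index of the smallest remaining element
    x[i], x[min_ind] = x[min_ind], x[i]  # swap it into position i
    print(f"pass {i}: {x}")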

Question: What is the time complexity of this method?







Answer: <!--- $O(n^2)$. argmin itself takes $O(n)$, and this is called $n$ times, for a total of $O(n^2)$. The actual number of steps is more like $n^2/2$. It requires $n-1$ iterations to place the smallest element, then $n-2$ to place the next smallest, and so forth, for each element in the list.--->

Detailed analysis:

  • The swapping takes constant time.
  • The call to np.argmin needs to look through the array.
    • This takes time proportional to the length of the array.
    • The first time, the length is $n$. Then $n-1$, then $n-2$, etc.
  • The number of steps in the above code is

$n+(n-1)+(n-2)+\ldots+3+2+1$

This is an arithmetic series; the sum is $\frac{n(n+1)}{2}=\frac12 n^2 + \frac{n}{2}$

  • We ignore the $\frac{n}{2}$ because it is very small compared to $\frac12 n^2$ when $n$ is large.
  • E.g. for $n=1000$ we have $\frac{1}{2}n^2= 500000$ vs. $\frac{n}{2}=500$.
  • We ignore the $\frac12$ because we're only interested in the growth, not the scaling factor.
  • Result: we say the above code is $O(n^2)$.

Put another way, for our purposes this is the same as code that performs $n$ steps inside the loop. The fact that it actually decreases each time is not important enough to show up in the Big O.
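
We can double-check the arithmetic-series formula numerically (a quick sanity check, not part of the analysis itself):

In [ ]:
n = 1000
sum(range(1, n + 1)) == n * (n + 1) // 2  # both sides equal 500500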

Question: Could we find a sorting algorithm that takes $\log(n)$ time?







Answer: <!---No way, because it takes $n$ steps to even inspect every element of the input!--->

Hash tables, hash functions

A hash, in the context of computing and data structures, refers to a fixed-size integer that is calculated from input data of arbitrary size using a function called a hash function. The purpose of a hash is to uniquely represent data, facilitate fast data retrieval, and manage data storage in a hash table.

  • Python's set type supports the following operations in $O(1)$ time:
    • inserting a new element
    • deleting an element
    • checking if an element is present

Hash functions

Python objects have a hash:

In [ ]:
hash("mds")
In [ ]:
hash("")

It looks like the hash function returns an integer.

In [ ]:
hash(5.5)
In [ ]:
hash(5)
In [ ]:
hash(-9999)

It looks like the hash of a Python integer is the integer itself, at least for small enough integers.

In [ ]:
hash(999999999999999999999999)

For very large integers, though, the hash is no longer the integer itself. And sometimes hashing fails altogether:

In [ ]:
hash([1, 2, 3])
In [ ]:
hash((1, 2, 3))
In [ ]:
hash(None)

If a Python set is a hash table, that means items in it must be hashable (dict has the same requirement, for keys):

In [ ]:
s = set()
In [ ]:
s.add(5.5)
In [ ]:
s.add("mds")
In [ ]:
s
In [ ]:
s.add([1, 2, 3])
In [ ]:
s.add((1, 2, 3))
In [ ]:
s
  • Typically, mutable objects are not hashable.

Hash tables

  • So, it looks like the hash function maps from an object to an integer.
  • And that Python sets use these hash functions.
  • How do they work?
  • The hash table is basically a list of lists, and the hash function maps an object to its location in the outer list.
    • But it's a bit more complicated than that.
    • The list typically expands and contracts automatically as needed.
    • These operations may be slow, but averaged or "amortized" over many operations, the runtime is $O(1)$.
    • The hash function depends on this array size.
    • There's also an issue of collisions: when two different objects hash to the same place.
  • Roughly speaking, we can insert, retrieve, and delete things in $O(1)$ time so long as we have a "good" hash function.
    • The hash function will be "good" for default Python objects, and if you end up needing to implement your own one day you should read a bit more about it.
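
To see what goes wrong without a good hash function, consider this deliberately bad (hypothetical) example: a class whose __hash__ sends every object to the same value. All items collide, and set lookups degrade from $O(1)$ toward $O(n)$:

In [ ]:
class BadHash:
    """Every instance hashes to 0, so all instances collide."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 0  # a deliberately terrible hash function
    def __eq__(self, other):
        return self.value == other.value

s = {BadHash(i) for i in range(1000)}
BadHash(-1) in s  # must compare against every colliding entry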

A simple hash table implementation

Below is a (very low-quality) hash table implementation, with only 4 buckets by default:

In [ ]:
class HashTable:
    
    def __init__(self, num_buckets=4):
        self.stuff = list() # A list of lists
        self.n = num_buckets
        
        for i in range(num_buckets):
            self.stuff.append([]) # Create the inner lists, one per bucket
        
    def add(self, item):
        if not self.contains(item):
            self.stuff[hash(item) % self.n].append(item)
        
    def contains(self, item):
        return item in self.stuff[hash(item) % self.n]
    
    def __str__(self):
        return str(self.stuff)

Here is an explanation of each part of the HashTable class:

__init__ Method:

  • This is the constructor for the HashTable class.
  • num_buckets is an optional parameter that defaults to 4, setting the number of buckets in the hash table.
  • self.stuff is initialized as an empty list, which will eventually contain other lists representing the buckets of the hash table.
  • The for loop creates num_buckets empty lists within self.stuff, each representing a bucket where items will be stored.

add Method:

  • This method adds an item to the hash table.
  • First, it checks whether the item is already contained in the table using the contains method. If not, it proceeds to add the item.
  • To determine which bucket to place the item in, the method uses the built-in hash function to generate a hash code for the item and then takes the modulo of that hash code with the number of buckets (self.n). This ensures the hash code maps to one of the existing buckets.
  • The item is then appended to the appropriate bucket list.

contains Method:

  • This method checks if an item is present in the hash table.
  • Similar to the add method, it uses the built-in hash function and modulo operation to find the bucket that the item would be placed in.
  • It then checks if the item is present in that bucket (which is a list) using Python's in operator.

__str__ Method:

  • This method defines the string representation of the HashTable object.
  • When you print an instance of the HashTable or use the str() function on it, this method returns a string representation of self.stuff, which is the list of buckets with their contained items.

This hash table implementation is a rudimentary form of what is used in more complex data structures. It doesn't handle collisions in a sophisticated way: it simply appends colliding items to the same bucket list, so lookups within a crowded bucket degrade to a linear search. It also does not provide a method for removing items or for resizing the table when it gets too full, which are common features in robust hash table implementations.
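
As a sketch of what removal might look like (a hypothetical extension, not part of the class above), note that deleting follows the same bucket-lookup logic as add and contains. Resizing would be more involved: every item must be rehashed into a new, larger list of buckets, because hash(item) % self.n changes when self.n changes.

In [ ]:
def remove(self, item):
    """Remove item from its bucket, if present."""
    bucket = self.stuff[hash(item) % self.n]
    if item in bucket:
        bucket.remove(item)

HashTable.remove = remove  # attach the new method to the existing class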

(Note: The hash function has a random seed that is set at the start of every Python session, so your actual results may vary from mine.)

Let's see how it works by adding and checking for the presence of strings.

Short explanation of each operation:

  1. ht.add("Hello Gabriel!"):

    • This line of code attempts to add the string "Hello Gabriel!" to the hash table ht.
    • It calls the add method on the HashTable instance, which calculates the hash of "Hello Gabriel!", then finds the correct bucket by taking the modulo of the hash with the number of buckets in the hash table (in this case, 4 as per the class default).
    • It then appends "Hello Gabriel!" to the list (bucket) at the index determined by the modulo operation, assuming "Hello Gabriel!" is not already in that bucket.
  2. hash("Hello Gabriel!") % 4:

    • This is a standalone operation (not a method call) that calculates the hash of the string "Hello Gabriel!" using Python’s built-in hash function.
    • The modulo operation (% 4) is then applied to the resulting hash value to determine which bucket the string would be placed into if added to the hash table. The number 4 corresponds to the default number of buckets in the HashTable.
    • This line by itself does not affect the ht instance; it is shown for illustrative purposes, to demonstrate how the bucket index for "Hello Gabriel!" is determined.
  3. ht.contains("blah"):

    • The contains method of the HashTable instance ht is called with the string "blah" as an argument to check if the string is present in the hash table.
  4. hash("blah") % 4:

    • The hash value of "blah" is computed and then modulo 4 is taken to find the bucket index where "blah" would be located if it were in the hash table.
    • The contains method checks the bucket at this index. Since "blah" was not added to the hash table, it is not present in any bucket, and therefore ht.contains("blah") evaluates to False.
In [ ]:
ht = HashTable()
print(ht)
  • So far, all 4 buckets are empty.
  • Now let's add something:
In [ ]:
ht.add("Hello Gabriel!")
print(ht)

"Hello Gabriel!" went into this bucket because

In [ ]:
hash("Hello Gabriel!")
In [ ]:
hash("Hello Gabriel!") % 4

Now let's add more things:

In [ ]:
ht.add("Welcome")
print(ht)
In [ ]:
hash("Palmeiras!") % 4
In [ ]:
ht.add("Palmeiras!")
print(ht)
In [ ]:
ht.add("Goal!")
print(ht)
In [ ]:
ht.add("Champion!")
print(ht)

If we want to look for something:

In [ ]:
ht.contains("blah")

False, because the expression below identifies the only bucket where "blah" could be, and "blah" is not found in that bucket:

In [ ]:
hash("blah") % 4
In [ ]:
ht.contains("Palmeiras!")
  • Same thing here.
  • The key idea is that you only need to look in one bucket - either it's in that bucket, or it's not in the entire hash table.
In [ ]:
print(ht)
  • Above we have some collisions: that is, 2 items in the same bucket.
  • If the main list is able to dynamically grow as the number of items grows, we can keep the number of collisions low.
  • This preserves the $O(1)$ operations.

Lookup tables, Python dict

  • Python's dict type is a dictionary
  • A dictionary should support the following operations:
    • inserting a new element
    • deleting an element
    • finding an element
  • It is much like a set except the entries, called "keys", now have some data payload associated with them, which we call "values".
  • It is also implemented as a hash table, meaning you can expect $O(1)$ operations.
  • Only the keys are hashed, so only the keys have to be hashable.
    • A list can be a value, but not a key.
In [ ]:
d = dict()
d[5] = "a"
d["b"] = 9
d
In [ ]:
5 in d
In [ ]:
9 in d  # it only searches the keys
In [ ]:
d[5]
In [ ]:
d[6]

Hashable types:

In [ ]:
d[[1,2,3]] = 10
In [ ]:
d[10] = [1,2,3] # OK

A reminder of some dictionary syntax:

In [ ]:
f = {i: i*2 for i in range(10)}
f
In [ ]:
for key, val in f.items():
    print("key =", key, " val =", val)

Nested dictionaries:

In [ ]:
g = dict()
g[5] = f
g

Python's defaultdict

The defaultdict is a subclass of dict that supplies a default value for missing keys. It simplifies code that accumulates items in collections inside a dictionary: without defaultdict, one would have to check manually whether a key exists and, if not, initialize a new list for that key before appending to it. defaultdict initializes a new list automatically when a key is missing, avoiding KeyErrors and keeping the code cleaner.

In terms of time complexity, a defaultdict behaves like a standard dictionary: retrieving and setting items takes $O(1)$ time on average. When a key does not exist, the default factory function is called, which takes constant time, so the overall time complexity remains $O(1)$. Operations on a defaultdict therefore do not change the time complexity of algorithms that use it.

In sum,

  • It's often the case that we want to add something to the value of a dictionary.
  • Example: listing multiples
In [ ]:
multiples_of_5 = list()
for i in range(100):
    if i % 5 == 0:
        multiples_of_5.append(i)

print(multiples_of_5)
In [ ]:
multiples_of_2 = list()
for i in range(100):
    if i % 2 == 0:
        multiples_of_2.append(i)

print(multiples_of_2)
In [ ]:
multiples_of_3 = list()
for i in range(100):
    if i % 3 == 0:
        multiples_of_3.append(i)

print(multiples_of_3)
In [ ]:
multiples_of_4 = list()
for i in range(100):
    if i % 4 == 0:
        multiples_of_4.append(i)

print(multiples_of_4)
  • But now let's say we want multiples of 2, 3, 4, 5, 6, 7, 8, 9 all in one place.
  • Well, we don't want to violate DRY and copy-paste the above code.
  • A dictionary would be ideal for this!
In [ ]:
multiples_of = dict()

for base_number in range(2, 10):
    print("Finding the multiples of", base_number)
    
    for i in range(100):
        if i % base_number == 0:
            if base_number not in multiples_of:    # added
                multiples_of[base_number] = list() # added
                
            multiples_of[base_number].append(i)

print(multiples_of)
  • This works but we can make it better with the defaultdict!
  • A dictionary with a default value for cases when the key does not exist.
  • Use case here: the default is an empty list!
In [ ]:
d = dict()
d["hello"]
In [ ]:
d.get("hello", 5) # equivalent to d["hello"] but returns 5 if "hello" not in d
In [ ]:
d[4] = 100000
d.get(4, 5)
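A related built-in worth knowing: dict.setdefault behaves like get, but it also inserts the default into the dictionary when the key is missing, so you can append to a possibly-missing list without a defaultdict:

In [ ]:
d.setdefault("evens", []).append(2)  # "evens" is missing, so [] is inserted first
d.setdefault("evens", []).append(4)  # "evens" exists now; the [] argument is unused
d["evens"]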
In [ ]:
from collections import defaultdict
In [ ]:
dd = defaultdict(list)
dd["hello"]
In [ ]:
dd["new key"].append(5)
In [ ]:
dd
  • The beauty here is that we can call append on a key that doesn't exist.
  • It defaults to a new empty list and then immediately appends.
  • So... we can improve our original code, if we change multiples_of to a defaultdict:
In [ ]:
multiples_of = defaultdict(list)

for base_number in range(2, 10):
    print("Finding the multiples of", base_number)
    
    for i in range(100):
        if i % base_number == 0:
            multiples_of[base_number].append(i)

print(multiples_of)

Type of defaultdict

In [ ]:
type(d)
In [ ]:
type(dd)
In [ ]:
type(d) == type(dd)
In [ ]:
isinstance(d, dict)
In [ ]:
isinstance(dd, dict)
In [ ]:
type(dd) == dict

In general, people prefer isinstance(obj, class) over type(obj) == class, since isinstance also accepts subclasses, as we just saw with defaultdict.

list vs. list()

  • Question: Why was it defaultdict(list) instead of defaultdict(list())?
  • What happens when I run this code:
In [ ]:
my_list = list()
my_list
In [ ]:
bad = defaultdict([])
In [ ]:
bad = defaultdict(list())
  • But why?
  • The code that executes first is list().
  • Now we have a particular list being passed in.
  • But we don't want one particular list, we want lots of new lists all the time.
  • So you need to pass in the ability to create lists
    • in other words, a function that creates lists.
  • That is list:
In [ ]:
list
In [ ]:
list()
In [ ]:
x = defaultdict(lambda : "Hello my name is Davi")
In [ ]:
x[5]
  • So, in that case, what is my default value?

New example: let's count how many times the character "a" appears in some text.

In [ ]:
text = "Blah blah is he still talking about dictionaries??"

(Counting a single character is easy with a built-in string method:)

In [ ]:
text.count('a')
In [ ]:
number_of_a = 0
for t in text:
    if t == "a":
        number_of_a += 1
        
number_of_a
  • Ok, but now we want to count "a" and "b".
  • Same as before, let's use a dict.
  • We already know this won't work - it's the same problem as before:
In [ ]:
number_of_times = dict()

for char in ('a', 'b'):
    for t in text:
        if t == char:
            number_of_times[char] = number_of_times[char] + 1
        
number_of_times
  • So, we use a defaultdict.
  • What do we want the default value to be?




Answer: 0

In [ ]:
defaultdict(0)

Same problem as before! So:

In [ ]:
defaultdict(lambda: 0)
In [ ]:
defaultdict(int)
In [ ]:
int()

Back to the code:

In [ ]:
number_of_times = defaultdict(int)

for char in ('a', 'b', 'c'):
    for t in text:
        if t == char:
            number_of_times[char] += 1
        
number_of_times

Counter

And finally, for the supremely lazy (read: awesome), we can use Counter:

In [ ]:
from collections import Counter

This is basically a defaultdict(int) but with some fancy methods added:

In [ ]:
number_of_times = Counter()

for char in ('a', 'b'):
    for t in text:
        if t == char:
            number_of_times[char] += 1
        
number_of_times
In [ ]:
number_of_times.most_common(1)
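
In fact, Counter can ingest an entire iterable at once, so the loop above collapses to a one-liner that counts every character:

In [ ]:
Counter(text).most_common(3)  # counts all characters in one pass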

Thank you!