Introduction
This material is based on Professor Mike Gelbart's Algorithms and Data Structures course and has been adapted for our purposes.
import numpy as np
import pandas as pd
from collections import defaultdict
import matplotlib.pyplot as plt
import altair as alt
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
Outline:
- Linear search and binary search intro
- Code timings
- Sorting
- Hash tables, hash functions
- Lookup tables, Python `dict`
- Python's `defaultdict`
Learning objectives
- Describe some basic sorting algorithms and their time complexities.
- Describe the binary search algorithm.
- Explain why searching in a sorted list is faster than in an unsorted list.
- Explain the pros and cons of hash tables, and how they work at a high level.
- Apply Python's `dict` and `defaultdict` data structures.
Linear search and binary search intro
We return to the problem of checking whether an element is present in a collection.
Linear search
def search_unsorted(data, key):
"""
Searches the key in data using linear search
and returns True if found and False otherwise.
Parameters
----------
data : list
the elements to search within
key : int
the key to search for
Returns
-------
bool
boolean if key is contained in the data
Examples
--------
>>> search_unsorted([1, 7, 67, 35, 45], 3)
False
>>> search_unsorted([1, 7, 67, 35, 45], 7)
True
"""
for element in data:
if element == key:
return True
return False
Let's see some tests below! These assertions check that the function `search_unsorted` behaves as expected under various typical and boundary scenarios. If any assertion fails, Python will raise an `AssertionError`, indicating a bug or unexpected behavior in the function's implementation.
# Some tests
# key is the first element in the list
assert search_unsorted([4, 7, 9, -12, 1000], 4)
# key is the last element in the list
assert search_unsorted([4, 7, 9, -12, 1000], 1000)
# key occurs multiple times in the list
assert search_unsorted([4, 7, 9, -12, 4, 1000], 4)
# key is larger than the largest element in the list
assert not search_unsorted([4, 7, 9, -12, 1000], 2000)
# key is smaller than the smallest element in the list
assert not search_unsorted([4, 7, 9, -12, 1000], -18)
# nothing is in an empty list
assert not search_unsorted([], 1)
Question: What is the time complexity of `search_unsorted`, as a function of the length of the list, $n$?
Answer: <!---The time complexity of the `search_unsorted` function is $O(n)$ because in the worst case the function loops over $n$ elements.--->
Binary search
- If the list is already sorted, we can search much faster with binary search.
- See the "binary search video" that was recommended as pre-class viewing.
- We start in the middle and can just restrict ourselves to searching half the list after a comparison.
- Note: the input list must be sorted for the code to work.
def search_sorted(data, key):
"""
Searches the key in data using binary search
and returns True if found and False otherwise.
Parameters
----------
data : list
a list of sorted elements to search within
key : int
the key to search for
Returns
-------
bool :
boolean if key is contained in the data
Examples
--------
>>> search_sorted([1, 7, 35, 45, 67], 3)
False
>>> search_sorted([1, 7, 35, 45, 67], 7)
True
"""
while len(data) > 0:
mid = len(data)//2
if data[mid] == key:
return True
if key < data[mid]:
data = data[:mid]
else:
data = data[mid+1:]
return False
data = [-12, 4, 7, 9, 45, 45, 987, 1000, 2000]
# Test cases for binary search
# key is the first element in the list
assert search_sorted(data, -12) == True
# key is the last element in the list
assert search_sorted(data, 2000) == True
# key occurs multiple times in the list
assert search_sorted(data, 45) == True
# key is larger than the largest element in the list
assert search_sorted(data, 3000) == False
# key is smaller than the smallest element in the list
assert search_sorted(data, -18) == False
# nothing is in an empty list
assert search_sorted([], 1) == False
Question: What is the time complexity of `search_sorted`, as a function of the length of the list, $n$?
Answer: <!---The time complexity of the `search_sorted` function is $O(\log n)$ because the search space is halved in each iteration of the loop, so in the worst case the function performs about $\log n$ iterations.--->
The binary search approach involves repeatedly dividing the search interval in half.
Algorithm Description
- Initialize the search range: the search range initially encompasses the entire list.
- Find the middle element: calculate the index of the middle element of the current search range.
- Comparison:
  - If the middle element is equal to the key, the search is successful and returns `True`.
  - If the key is smaller than the middle element, adjust the search range to the left half (excluding the middle element).
  - If the key is larger than the middle element, adjust the search range to the right half (excluding the middle element).
- Repeat the process with the new search range.
- Termination: if the search range becomes empty (no elements to search), return `False`.
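Our `search_sorted` implements the shrinking range with list slices, and slicing copies elements. For reference, here is a sketch of the same algorithm that tracks the range with two indices instead (the name `search_sorted_idx` is ours, not part of the original code), so each iteration does constant work:
def search_sorted_idx(data, key):
    """Binary search using index bounds instead of slicing.
    Tracking lo/hi avoids copying, so each iteration is O(1)."""
    lo, hi = 0, len(data)  # the current search range is data[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if data[mid] == key:
            return True
        if key < data[mid]:
            hi = mid  # keep the left half, excluding the middle element
        else:
            lo = mid + 1  # keep the right half, excluding the middle element
    return False

assert search_sorted_idx([1, 7, 35, 45, 67], 7)
assert not search_sorted_idx([1, 7, 35, 45, 67], 3)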
Time Complexity Analysis
Worst-case and average-case time complexity:
At each step of the binary search, the size of the search range (the list) is halved. This halving continues until either the key is found or the list cannot be divided further (i.e., it becomes empty).
The number of times a list of size $ n $ can be halved until it reduces to a size of 1 is $\log_2 n$. This logarithmic reduction implies that the time complexity of binary search in both the average and worst cases is $ O(\log n) $, where $ n $ is the number of elements in the list.
Best-case time complexity:
The best case occurs if the key is exactly at the middle of the list in the first check, leading to a single comparison.
Thus, the best-case time complexity is $ O(1) $, representing a constant time operation.
Therefore, the theoretical time complexity of the `search_sorted` function remains $O(\log n)$ for the average and worst cases due to the logarithmic reduction in problem size at each step of the algorithm.
Why is the number of times a list of size $n$ can be halved until it reduces to a size of 1 equal to $\log_2 n$?
Understanding the Process of Halving:
When you halve a list (or any set of elements), each halving operation reduces the number of elements to half of its current size. For example:
If you start with $ n $ elements, after one halving, you have $ \frac{n}{2} $ elements.
After a second halving, you have $ \frac{n}{4} $ elements, and so forth.
The halving continues until you cannot halve further, which ideally ends at 1 element.
Mathematical Explanation:
To find how many times you can perform this operation, you are essentially asking, "How many times must you multiply 2 by itself (the base of the halving) to reach $ n $?" Mathematically, this question translates into solving for $ k $ in the equation:
$$ 2^k = n $$
Where:
- $ 2 $ is the base, representing the halving.
- $ k $ is the number of times you multiply 2 by itself to reach $ n $.
- $ n $ is the initial size of your list.
Using Logarithms:
The logarithm is the inverse operation of exponentiation. To solve for $ k $, you take the logarithm base 2 of both sides of the equation $ 2^k = n $:
$$ k = \log_2(n) $$
This tells you that $ k $ (the number of halvings) is equal to $ \log_2(n) $.
Intuitive Example:
If $ n = 8 $:
- After 1 halving: 4 elements ($ \frac{8}{2} $)
- After 2 halvings: 2 elements ($ \frac{4}{2} $)
- After 3 halvings: 1 element ($ \frac{2}{2} $)
So, $ 2^3 = 8 $ leads to $ \log_2(8) = 3 $. This tells us it takes 3 halvings to reduce 8 elements to 1, aligning with $ \log_2(n) = 3 $ for $ n = 8 $.
Thus, the operation of binary search, which depends on this process of halving, has a logarithmic time complexity ($ O(\log n) $), as the list size $ n $ can be divided $ \log_2(n) $ times until only one or no elements are left.
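We can also verify this count numerically; below is a small illustration (the helper name `halvings` is ours). The count agrees with $\log_2 n$ up to rounding:
import math

def halvings(n):
    """Count how many times n can be halved (integer division) before reaching 1."""
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count

for n in [8, 1024, 10**6]:
    print(f"n = {n:8d}   halvings = {halvings(n):2d}   log2(n) = {math.log2(n):.2f}")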
Question: What happens if you call `search_unsorted` on sorted data? What happens if you call `search_sorted` on unsorted data?
Answer: `search_unsorted` still works correctly on sorted data; it just doesn't exploit the ordering. `search_sorted` on unsorted data can silently return the wrong answer: in the call below, the key 1 is in the list, yet the first comparison discards the half that contains it and the function returns `False`.
search_sorted([3, 2, 1], 1)
Question: Why doesn't the `search_sorted` function start by verifying that the list is indeed sorted?
Answer: <!---Because this would take $O(n)$ time, defeating the purpose of the $O(\log n)$ lookup.--->
Code timing
Below we empirically measure the running times of 4 approaches:
- Using `in` with a Python `set` (same as last class)
- `search_sorted`, that is, our implementation of binary search above
- Using `in` with a Python `list` (same as last class)
- `search_unsorted`, that is, our implementation of linear search above
list_sizes = [100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
results = defaultdict(list)
results["size"] = list_sizes
key = -1
for list_size in list_sizes:
print('List size: ', list_size)
x = np.random.randint(1e8, size=list_size)
time = %timeit -q -o -r 1 search_unsorted(x, key)
results["Unsorted list linear"].append(time.average)
# Note: -q prevents it from printing to the terminal
# -o sends the result to a variable (average time in seconds)
# -r 1 makes it run only 1 trial instead of the default of 7, which saves time
time = %timeit -q -o -r 1 (key in x)
results["Unsorted list in"].append(time.average)
x.sort()
time = %timeit -q -o -r 1 search_sorted(x, key)
results["Sorted list binary"].append(time.average)
x_set = set(x)
time = %timeit -q -o -r 1 (key in x_set)
results["Python set in"].append(time.average)
Understanding the code above:
Question: Why do I search for $-1$ in the code above? Why not $1$?
Answer: <!---we search for -1 because we know it will not be in the array. This gives us a worst case timing, because searching is the slowest if it has to keep looking; it can be faster if it finds something right away. For example, if it's the 1st element, then linear search would seem extremely fast.--->
df = pd.DataFrame(results, columns=list(results.keys()))
df
Are these consistent with the time complexities we expected?
In general, we would expect to see the following trends in the data:
Unsorted List Linear Search:
- Time should increase linearly with the size of the list.
- This is because a linear search goes through each element one by one, so the time complexity is $O(n)$.
Unsorted List with 'in' Operator:
- Time should also increase linearly, similar to the linear search.
- The 'in' operator effectively performs a linear search when used on a list, so it also has a time complexity of $O(n)$.
Sorted List Binary Search:
- Time should increase logarithmically with the size of the list.
- Binary search has a time complexity of $O(\log n)$, so even as the list size grows large, the time should not increase significantly.
Python Set with 'in' Operator:
- Time should remain approximately constant regardless of the size of the set.
- The 'in' operator on a set is a hash table lookup with a time complexity of $O(1)$ in the average case.
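As a rough sanity check on these expectations (a sketch of ours using the `df` computed above, not part of the original notes), we can fit a line to log(time) versus log(size): a slope near 1 suggests $O(n)$ growth, while a slope near 0 suggests constant or logarithmic growth.
log_sizes = np.log10(df["size"].to_numpy(dtype=float))
for method in ["Unsorted list linear", "Unsorted list in",
               "Sorted list binary", "Python set in"]:
    # Slope of log(time) vs. log(size) = the empirical scaling exponent
    slope = np.polyfit(log_sizes, np.log10(df[method].to_numpy(dtype=float)), 1)[0]
    print(f"{method:22s} slope = {slope:.2f}")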
df_long = pd.melt(df, id_vars="size", var_name="method", value_name="time (s)")
# Create a line chart using Altair
chart = alt.Chart(df_long).mark_line().encode(
alt.X('size', scale=alt.Scale(type='log')), # Setting the X-axis to a logarithmic scale
alt.Y('time (s)', scale=alt.Scale(type='log')), # Setting the Y-axis to a logarithmic scale
color='method' # Coloring the lines by the 'method' column
).configure_axis(grid=False) # Disabling the grid lines
# Display the chart
chart
Note that the `search_sorted` binary search we wrote is actually slower than the linear search using `in` when the list is smaller than $10,000$ elements. Remember, big-O is just an "asymptotic" trend. There could be:
- Large/small constants, like $1000\log(n)$ vs. $n$.
- "Lower order terms", like $\log(n)+100 \log \log(n) + 100$ vs. $n$.
We are probably seeing the former; my code performs fewer steps (better complexity), but each step is much slower because the implementation is not optimized. (Often, though, we don't care too much about how code performs for very small inputs.)
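For instance, here is a quick numeric illustration of the first point (ours, not from the original notes): an algorithm costing $1000\log_2(n)$ steps stays slower than one costing $n$ steps until $n$ reaches roughly 14,000.
for n in [100, 1_000, 10_000, 100_000]:
    # A logarithmic algorithm with a big constant factor vs. a linear one
    print(f"n = {n:7d}   1000*log2(n) = {1000 * np.log2(n):9.0f}")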
The two fastest methods (the orange and blue curves) look quite similar, possibly because of the log scale on the y-axis. Let's plot time vs. $\log n$ to try and tell the difference.
df_long = pd.melt(df[["size", "Sorted list binary", "Python set in"]],
id_vars="size", # 'size' is the identifier column
var_name="method", # The name of the new column that will contain the method names
value_name="time (s)") # The name of the new column that will contain the time values
# Creating a line chart with Altair using the melted DataFrame. Each method will have a unique line.
chart = alt.Chart(df_long).mark_line().encode(
alt.X('size', scale=alt.Scale(type='log')), # X-axis represents 'size' on a logarithmic scale
alt.Y('time (s)'), # Y-axis represents the time in seconds
color='method' # The lines are colored based on the 'method' to differentiate the search methods
).configure_axis(grid=False) # Disabling grid lines for a cleaner look
# Display the chart
chart
We can see that the set really is constant, but the binary search is logarithmic - in the plot, time is linear in $\log(n)$.
Sorting
- Sorting is a very popular topic in Algorithms and Data Structures courses.
- We'll start with "selection sort".
def selection_sort(x):
"""Sorts x inplace... slowly.
Parameters
----------
x : list
the list needed to be sorted
Returns
-------
list :
the sorted list
Examples
--------
>>> selection_sort([7, 1, 67, 35, 45])
[1, 7, 35, 45, 67]
>>> selection_sort([357, 6, 55, 12, 112])
[6, 12, 55, 112, 357]
"""
n = len(x)
for i in range(n):
# Get the index of the smallest value from location i onward
min_ind = np.argmin(x[i:]) + i
# Swap this with element i
x[i], x[min_ind] = x[min_ind], x[i] # swap
return x
- How does the above code work?
  - For the remaining part of the array `x[i:]`, find the smallest element.
  - Swap this with the current element.
  - Repeat until reaching the end of the array.
- We will revisit this code next class.
Here's a step-by-step explanation:
Function definition:
- The function `selection_sort` takes a single argument `x`, which is a list of elements that need to be sorted.
Length of list:
- The variable `n` is assigned the length of the list `x`. This will be used to control the iterations of the sorting process.
Sorting loop:
- A `for` loop is initiated with the range from 0 to `n-1` (since `range` in Python is non-inclusive of the end value). The loop variable `i` represents the current position in the list that the algorithm is attempting to fill with the correct element.
Finding the minimum element:
- Within the loop, the function uses `np.argmin(x[i:])` to find the index of the smallest element in the sublist `x[i:]`, which is the portion of the list from index `i` to the end. It then adds `i` to this index because `np.argmin` returns the index within the sliced list `x[i:]`, not the full list `x`.
Swapping elements:
- Once the index of the minimum element (`min_ind`) is found, a swap operation is performed to place this minimum element into its correct position `i` in the list. The swapping is done using the tuple unpacking feature in Python, which allows the values of `x[i]` and `x[min_ind]` to be exchanged in a single line.
Returning the sorted list:
- After the loop has finished executing, all elements have been moved to their correct positions, and the list `x` is sorted. The sorted list `x` is returned.
Function annotations:
- The function includes a docstring with a brief description, parameter annotations, return value description, and examples of usage.
In-place sorting:
- It's important to note that selection sort is performed in-place, meaning the list `x` is sorted without requiring additional memory for another list.
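Before analyzing the running time, here is a quick lightweight check (ours, not from the original notes): compare `selection_sort` against Python's built-in `sorted` on random inputs.
# A small property test: selection_sort should agree with sorted()
for trial in range(100):
    arr = list(np.random.randint(-50, 50, size=20))
    assert selection_sort(arr.copy()) == sorted(arr)
print("selection_sort agrees with sorted() on 100 random lists")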
Question: What is the time complexity of this method?
Answer: <!---$O(n^2)$. `argmin` itself takes $O(n)$, and it is called $n$ times, for a total of $O(n^2)$. The actual number of steps is more like $n^2/2$: it requires $n-1$ comparisons to place the smallest element, then $n-2$ for the next smallest, and so forth, for each element in the list.--->
Detailed analysis:
- The swapping takes constant time.
- The call to `np.argmin` needs to look through the array.
  - This takes time proportional to the length of the array.
  - The first time, the length is $n$. Then $n-1$, then $n-2$, etc.
- The number of steps in the above code is
$n+(n-1)+(n-2)+\ldots+3+2+1$
This is an arithmetic series; the sum is $\frac{n(n+1)}{2}=\frac12 n^2 + \frac{n}{2}$
- We ignore the $\frac{n}{2}$ because it is very small compared to $\frac12 n^2$ when $n$ is large.
- E.g. for $n=1000$ we have $\frac{1}{2}n^2= 500000$ vs. $\frac{n}{2}=500$.
- We ignore the $\frac12$ because we're only interested in the growth, not the scaling factor.
- Result: we say the above code is $O(n^2)$.
Put another way, for our purposes this is the same as code that performs $n$ steps inside the loop. The fact that it actually decreases each time is not important enough to show up in the Big O.
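To see this count concretely, here is a sketch that makes the `np.argmin` scan explicit and counts comparisons (the name `selection_sort_steps` is ours). Its bookkeeping gives exactly $n(n-1)/2$ comparisons, slightly different from the $\frac{n(n+1)}{2}$ sum above but with the same $n^2/2$ growth:
def selection_sort_steps(x):
    """Selection sort that counts element comparisons."""
    n = len(x)
    steps = 0
    for i in range(n):
        min_ind = i
        for j in range(i + 1, n):  # the scan np.argmin performs, made explicit
            steps += 1
            if x[j] < x[min_ind]:
                min_ind = j
        x[i], x[min_ind] = x[min_ind], x[i]
    return steps

for n in [10, 100, 1000]:
    steps = selection_sort_steps(list(np.random.randint(1000, size=n)))
    print(f"n = {n:4d}   comparisons = {steps:6d}   n(n-1)/2 = {n * (n - 1) // 2}")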
Question: could we find a sorting algorithm that takes $\log(n)$ time?
Answer: <!---No way, because it takes $n$ steps to even inspect every element of the input!--->
Hash tables, hash functions
A hash, in the context of computing and data structures, refers to a fixed-size integer that is calculated from input data of arbitrary size using a function called a hash function. The purpose of a hash is to uniquely represent data, facilitate fast data retrieval, and manage data storage in a hash table.
- Python's `set` type supports the following operations in $O(1)$ time:
  - inserting a new element
  - deleting an element
  - checking if an element is present
Hash functions
Python objects have a hash:
hash("mds")
hash("")
It looks like the hash function returns an integer.
hash(5.5)
hash(5)
hash(-9999)
It looks like the hash of a Python integer is the integer itself, at least for small enough integers.
hash(999999999999999999999999)
Sometimes it fails?
hash([1, 2, 3])
hash((1, 2, 3))
hash(None)
If a Python `set` is a hash table, that means items in it must be hashable (`dict` has the same requirement, for keys):
s = set()
s.add(5.5)
s.add("mds")
s
s.add([1, 2, 3])
s.add((1, 2, 3))
s
- Typically, mutable objects are not hashable.
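By contrast, their immutable counterparts are hashable, so they can go into a set; for example:
hash(frozenset([1, 2, 3]))  # frozenset is the immutable cousin of set
s.add(frozenset([1, 2, 3]))
s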
Hash tables
- So, it looks like the hash function maps from an object to an integer.
- And Python `set`s use these hash functions.
- How do they work?
- The hash table is basically a list of lists, and the hash function maps an object to its location in the outer list.
- But it's a bit more complicated than that.
- The list typically expands and contracts automatically as needed.
- These operations may be slow, but averaged or "amortized" over many operations, the runtime is $O(1)$.
- The hash function depends on this array size.
- There's also an issue of collisions: when two different objects hash to the same place.
- Roughly speaking, we can insert, retrieve, and delete things in $O(1)$ time so long as we have a "good" hash function.
- The hash function will be "good" for default Python objects, and if you end up needing to implement your own one day you should read a bit more about it.
A simple hash table implementation
Below is a (very low-quality) hash table implementation, with only 4 buckets by default:
class HashTable:
def __init__(self, num_buckets=4):
self.stuff = list() # A list of lists
self.n = num_buckets
for i in range(num_buckets):
self.stuff.append([]) # Create the inner lists, one per bucket
def add(self, item):
if not self.contains(item):
self.stuff[hash(item) % self.n].append(item)
def contains(self, item):
return item in self.stuff[hash(item) % self.n]
def __str__(self):
return str(self.stuff)
Here is an explanation of each part of the `HashTable` class:
`__init__` method:
- This is the constructor for the `HashTable` class.
- `num_buckets` is an optional parameter that defaults to 4, setting the number of buckets in the hash table.
- `self.stuff` is initialized as an empty list, which will eventually contain other lists representing the buckets of the hash table.
- The `for` loop creates `num_buckets` empty lists within `self.stuff`, each representing a bucket where items will be stored.
`add` method:
- This method adds an `item` to the hash table.
- First, it checks whether the item is already contained in the table using the `contains` method. If not, it proceeds to add the item.
- To determine which bucket to place the item in, the method uses the built-in `hash` function to generate a hash code for the `item` and then takes the modulo of that hash code with the number of buckets (`self.n`). This ensures the hash code maps to one of the existing buckets.
- The item is then appended to the appropriate bucket list.
`contains` method:
- This method checks if an `item` is present in the hash table.
- Similar to the `add` method, it uses the built-in `hash` function and modulo operation to find the bucket that the `item` would be placed in.
- It then checks if the `item` is present in that bucket (which is a list) using Python's `in` operator.
`__str__` method:
- This method defines the string representation of the `HashTable` object.
- When you print an instance of the `HashTable` or use the `str()` function on it, this method returns a string representation of `self.stuff`, which is the list of buckets with their contained items.
This hash table implementation is a rudimentary form of what is used in more complex data structures. It doesn't handle collisions in a sophisticated way (it simply appends items to the list in the bucket, which could lead to all items in the same bucket in the case of hash collisions). It also does not provide a method for removing items or resizing the table when it gets too full, which are common features in robust hash table implementations.
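To sketch what those missing features could look like (an illustration of ours, not the original class; real hash tables resize automatically based on a load factor), one might extend `HashTable` like this:
class HashTableV2(HashTable):
    """A sketch extending HashTable with removal and naive resizing."""

    def remove(self, item):
        bucket = self.stuff[hash(item) % self.n]
        if item in bucket:
            bucket.remove(item)

    def resize(self, new_num_buckets):
        old_items = [item for bucket in self.stuff for item in bucket]
        self.n = new_num_buckets
        self.stuff = [[] for _ in range(new_num_buckets)]
        for item in old_items:  # bucket indices depend on self.n, so rehash everything
            self.add(item)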
(Note: Python's `hash` for strings is randomized with a seed set at the start of every session, so your actual results may vary from mine.)
Let's see how it works by adding and checking for the presence of strings.
Short explanation of each operation:
`ht.add("Hello Gabriel!")`:
- This line of code attempts to add the string `"Hello Gabriel!"` to the hash table `ht`.
- It calls the `add` method on the `HashTable` instance, which calculates the hash of `"Hello Gabriel!"`, then finds the correct bucket by taking the modulo of the hash with the number of buckets in the hash table (in this case, 4 as per the class default).
- It then appends `"Hello Gabriel!"` to the list (bucket) at the index determined by the modulo operation, assuming `"Hello Gabriel!"` is not already in that bucket.
`hash("Hello Gabriel!") % 4`:
- This is a standalone operation (not a method call) that calculates the hash of the string `"Hello Gabriel!"` using Python's built-in `hash` function.
- The modulo operation (`% 4`) is then applied to the resulting hash value to determine which bucket the string would be placed into if added to the hash table. The number 4 corresponds to the default number of buckets in the `HashTable`.
- This line by itself does not affect the `ht` instance; it is likely there for illustrative purposes, to show how the bucket index for `"Hello Gabriel!"` is determined.
`ht.contains("blah")`:
- The `contains` method of the `HashTable` instance `ht` is called with the string `"blah"` as an argument to check if the string is present in the hash table.
`hash("blah") % 4`:
- The hash value of `"blah"` is computed and then modulo 4 is taken to find the bucket index where `"blah"` would be located if it were in the hash table.
- The `contains` method checks the bucket at this index. Since `"blah"` was not added to the hash table, it is not present in any bucket, and therefore `ht.contains("blah")` evaluates to `False`.
ht = HashTable()
print(ht)
- So far, all 4 buckets are empty.
- Now let's add something:
ht.add("Hello Gabriel!")
print(ht)
"Hello Gabriel!" went into this bucket because
hash("Hello Gabriel!")
hash("Hello Gabriel!") % 4
Now let's add more things:
ht.add("Welcome")
print(ht)
hash("Palmeiras!") % 4
ht.add("Palmeiras!")
print(ht)
ht.add("Goal!")
print(ht)
ht.add("Champion!")
print(ht)
If we want to look for something:
ht.contains("blah")
False, because `hash("blah") % 4` gives the only bucket "blah" could be in, and "blah" is not found in that bucket:
hash("blah") % 4
ht.contains("Palmeiras!")
- Same thing here.
- The key idea is that you only need to look in one bucket - either it's in that bucket, or it's not in the entire hash table.
print(ht)
- Above we have some collisions: that is, 2 items in the same bucket.
- If the main list is able to dynamically grow as the number of items grows, we can keep the number of collisions low.
- This preserves the $O(1)$ operations.
Lookup tables, Python `dict`
- Python's `dict` type is a dictionary.
- A dictionary should support the following operations:
  - inserting a new element
  - deleting an element
  - finding an element
- It is much like a `set` except the entries, called "keys", now have some data payload associated with them, which we call "values".
- It is also implemented as a hash table, meaning you can expect $O(1)$ operations.
- Only the keys are hashed, so only the keys have to be hashable.
  - A list can be a value, but not a key.
d = dict()
d[5] = "a"
d["b"] = 9
d
5 in d
9 in d # it only searches the keys
d[5]
d[6]
Keys must be hashable types; values need not be:
d[[1,2,3]] = 10
d[10] = [1,2,3] # OK
A reminder of some dictionary syntax:
f = {i: i*2 for i in range(10)}
f
for key, val in f.items():
print("key =", key, " val =", val)
Nested dictionaries:
g = dict()
g[5] = f
g
Python's `defaultdict`
The `defaultdict` is a subclass of the dictionary class that returns a default value for missing keys. Its importance in time complexity discussions can be highlighted by how it simplifies code that adds to collections of items within dictionaries. Without `defaultdict`, one would have to manually check whether a key exists in the dictionary and possibly initialize a new list for that key before appending to it. `defaultdict` automatically initializes a new list when a key is not present, thus avoiding key errors and making the code cleaner and more efficient.
In terms of time complexity, the `defaultdict` functions like a standard dictionary. The time complexity for retrieving and setting items in a `defaultdict` is $O(1)$ on average. However, when a key does not exist, it calls the default factory function, which takes constant time, so the overall time complexity remains $O(1)$. This ensures that operations on `defaultdict` do not significantly affect the time complexity of algorithms that utilize it.
In sum,
- It's often the case that we want to add something to the value of a dictionary.
- Example: listing multiples
multiples_of_5 = list()
for i in range(100):
if i % 5 == 0:
multiples_of_5.append(i)
print(multiples_of_5)
multiples_of_2 = list()
for i in range(100):
if i % 2 == 0:
multiples_of_2.append(i)
print(multiples_of_2)
multiples_of_3 = list()
for i in range(100):
if i % 3 == 0:
multiples_of_3.append(i)
print(multiples_of_3)
multiples_of_4 = list()
for i in range(100):
    if i % 4 == 0:
        multiples_of_4.append(i)
print(multiples_of_4)
- But now let's say we want multiples of 2, 3, 4, 5, 6, 7, 8, 9 all in one place.
- Well, we don't want to violate DRY and copy-paste the above code.
- A dictionary would be ideal for this!
multiples_of = dict()
for base_number in range(2, 10):
print("Finding the multiples of", base_number)
for i in range(100):
if i % base_number == 0:
if base_number not in multiples_of: # added
multiples_of[base_number] = list() # added
multiples_of[base_number].append(i)
print(multiples_of)
- This works but we can make it better with the `defaultdict`!
- A dictionary with a default value for cases when the key does not exist.
- Use case here: the default is an empty list!
d = dict()
d["hello"]
d.get("hello", 5) # equivalent to d["hello"] but returns 5 if "hello" not in d
d[4] = 100000
d.get(4, 5)
from collections import defaultdict
dd = defaultdict(list)
dd["hello"]
dd["new key"].append(5)
dd
- The beauty here is that we can call `append` on a key that doesn't exist.
- It defaults to a new empty list and then immediately appends.
- So... we can improve our original code, if we change `multiples_of` to a `defaultdict`:
multiples_of = defaultdict(list)
for base_number in range(2, 10):
print("Finding the multiples of", base_number)
for i in range(100):
if i % base_number == 0:
multiples_of[base_number].append(i)
print(multiples_of)
Type of defaultdict
- Just for fun, we can look at the implementation of `defaultdict` in PyPy, which is an implementation of Python in Python (the OG Python is written in C): https://github.com/reingart/pypy/blob/master/lib_pypy/_collections.py#L387
- We can see here that `defaultdict` inherits from `dict`. Indeed:
type(d)
type(dd)
type(d) == type(dd)
isinstance(d, dict)
isinstance(dd, dict)
type(dd) == dict
So in general, people prefer the use of `isinstance(obj, class)` over `type(obj) == class`.
list vs. list()
- Question: why was it `defaultdict(list)` instead of `defaultdict(list())`?
- What happens when I run this code:
my_list = list()
my_list
bad = defaultdict([])
bad = defaultdict(list())
- But why?
- The code that executes first is `list()`.
- Now we have a particular list being passed in.
- But we don't want one particular list, we want lots of new lists all the time.
- So you need to pass in the ability to create lists - in other words, a function that creates lists.
- That is `list`, not `list()`:
list
list()
x = defaultdict(lambda : "Hello my name is Davi")
x[5]
- So, in that case, what is my default value?
text = "Blah blah is he still talking about dictionaries??"
Suppose we want to count how many times "a" occurs in text. (This can be done with a built-in Python string method:)
text.count('a')
number_of_a = 0
for t in text:
if t == "a":
number_of_a += 1
number_of_a
- Ok, but now we want to count "a" and "b".
- Same as before, let's use a dict.
- We already know this won't work - it's the same problem as before:
number_of_times = dict()
for char in ('a', 'b'):
for t in text:
if t == char:
number_of_times[char] = number_of_times[char] + 1
number_of_times
- So, we use a defaultdict.
- What do we want the default value to be?
Answer: 0
defaultdict(0)
Same problem as before! So:
defaultdict(lambda: 0)
defaultdict(int)
int()
Back to the code:
number_of_times = defaultdict(int)
for char in ('a', 'b', 'c'):
for t in text:
if t == char:
number_of_times[char] += 1
number_of_times
Counter
And finally, for the supremely ~~lazy~~ awesome, we can use `Counter`:
from collections import Counter
This is basically a `defaultdict(int)` but with some fancy methods added:
number_of_times = Counter()
for char in ('a', 'b'):
for t in text:
if t == char:
number_of_times[char] += 1
number_of_times
number_of_times.most_common(1)
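`Counter` can also consume an iterable directly, counting all of its elements in one shot:
# Count every character in text at once, then show the 3 most frequent
Counter(text).most_common(3)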