Introduction
This topic material is based on Professor Mike Gelbart Algorithms and Data Structures course. It was adapted for our purposes.
import numpy as np
import pandas as pd
import altair as alt
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict
Exercise: The Romeo and Juliet graph¶
Consider the graph output of the social network connections in Romeo and Juliet from the code below.
Attribution: the Romeo and Juliet graph, and inspiration, taken from UW CSE 140.
The code below creates a graph named rj
corresponding to the Romeo and Juliet graph above.
rj = nx.Graph()
rj.add_nodes_from(['Nurse',
# House of Capulet
'Juliet', 'Tybalt', 'Capulet',
'Friar Laurence',
# House Montague
'Romeo', 'Benvolio', 'Montague',
# Ruling house of Verona
'Escalus', 'Mercutio', 'Paris'
])
rj.add_edges_from([('Juliet', 'Nurse'),
('Juliet', 'Tybalt'),
('Juliet', 'Capulet'),
('Juliet', 'Friar Laurence'),
('Juliet', 'Romeo'),
('Capulet', 'Tybalt'),
('Capulet', 'Escalus'),
('Capulet', 'Paris'),
('Romeo', 'Friar Laurence'),
('Romeo', 'Benvolio'),
('Romeo', 'Montague'),
('Romeo', 'Mercutio'),
('Montague', 'Benvolio'),
('Montague', 'Escalus'),
('Escalus', 'Mercutio'),
('Escalus', 'Paris'),
('Paris', 'Mercutio')
])
nx.draw(rj, with_labels=True)
Graphs warmup¶
Write a function highest_degree
that takes in a graph and finds the vertex/vertices with the highest degree - in other words, the person with the largest number of friends. Your function should return a tuple with two elements:
- The maximum degree (int)
- All the nodes with that degree (set)
As always, a proper docstring is required.
Note: you can find the degree of a vertex using the following syntax:
nx.degree(rj, "Paris")
and you can iterate through all nodes in a graph like this:
for node in rj.nodes():
print(node)
Answer:
# Find the maximum degree and nodes with that degree
max_degree, nodes = highest_degree(rj)
# Assertions
assert nodes == {"Juliet", "Romeo"}, f"Expected nodes with highest degree: {'Juliet', 'Romeo'}, but got: {nodes}"
assert max_degree == 5, f"Expected max degree: 5, but got: {max_degree}"
# If no assertion error occurs, the function works as expected.
Largest distance¶
One interesting measure in a social network graph is the "distance" or number of "degrees of separation" between two people. This notion is used in academic research via the Erdős number and in the film industry via the Bacon number. For example, in the above graph, the distance between Juliet and Romeo is 1, and the distance between Juliet and Paris is 2 (via Capulet).
nx.shortest_path_length(rj, "Juliet", "Romeo")
nx.shortest_path_length(rj, "Juliet", "Paris")
Write a function largest_distance
to find the pair(s) of vertices with the largest distance (degree of separation). Your function should return a tuple with two elements:
- The maximum distance (int)
- All the pairs of nodes with that distance (set of tuples)
Note: do not include pairs twice. For example, if ('Romeo', 'Juliet')
is in the set, don't also include ('Juliet', 'Romeo')
.
Your function should work by iterating through all pairs of vertices. For each pair, it should compute the distance (degree of separation) between nodes using the function nx.shortest_path_length
used above.
def largest_distance(G):
"""
Finds the pairs of vertices with the highest degree of separation in the graph G.
Returns a tuple (max_distance, {node pairs with that distance})
Parameters
----------
G : networkx.classes.graph.Graph
a network graph
Returns
-------
max_distance: int
the maximum connections separating nodes
node pairs with max degree distance: set
nodes with largest separation
Examples
--------
>>> test = nx.Graph()
>>> test.add_nodes_from(['A', 'B', 'C', 'D'])
>>> test.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'D')])
>>> largest_distance(test)
(2, {('A', 'C'), ('B', 'D')})
"""
# YOUR CODE HERE
Answer:
max_distance_rj, pairs_rj = largest_distance(rj)
assert max_distance_rj == 3
assert len(pairs_rj) == 12
test = nx.Graph()
test.add_nodes_from(['A', 'B', 'C', 'D'])
test.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'D')])
max_distance_test, pairs_test = largest_distance(test)
assert max_distance_test == 2
assert pairs_test == {('A', 'C'), ('B', 'D')}
max_distance, node_pairs = largest_distance(rj)
# Output the result
print(f"The largest distance is: {max_distance}")
print("Pairs with the largest distance:", node_pairs)
¶
Assuming that nx.shortest_path_length
takes $O(V)$ time in the worst case, what is the worst case time complexity if your largest_distance
function from the previous part? Justify your answer.
Answer:
Exercise: computing degrees of separation with BFS¶
Let's see if we can come up with a better way to find the largest distance between two nodes in a graph; that is, an approach with a better time complexity than what we saw in the previous part.
Here is the reason why the previous approach is unnecessarily slow: in calling nx.shortest_path_length
between pairs of nodes we do a bunch of redundant/repeated computation. For example, computing the distance between Nurse and Mercutio is fairly similar to finding the distance between Juliet and Mercutio, but we're doing it all from scratch for every pair of nodes.
Below we provide a function distance_BFS
that takes in a networkx Graph
and two nodes, and uses breadth-first search (BFS) to compute the distance between the two nodes; in other words, it does the same thing as nx.shortest_path_length
. The function returns -1 if the two nodes are not connected (whereas nx.shortest_path_length
throws an error). The code here is similar to the BFS code from lecture, except that instead of only storing the nodes in the queue, we store tuples of the (node, distance) so that we can keep track of the distance.
def distance_BFS(G, node1, node2):
"""
Given a NetworkX Graph G, and start node node1
and goal node node2, distance_BFS returns the
degree of separation between node1 and node2.
Parameters
----------
G : networkx.classes.graph.Graph
the graph
node1 : str
first node
node2 : str
second node
Returns
-------
int
the distance between 2 nodes, if
they are not connected, returns -1
Examples
--------
>>> test = nx.Graph()
>>> test.add_nodes_from(['A', 'B', 'C', 'D'])
>>> test.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'D')])
>>> distance_BFS(test, 'A', 'C')
2
"""
queue = [(node1, 0)]
visited = {node1}
while queue:
vertex, distance = queue.pop(0)
if vertex == node2:
return distance
for neighbour in G.neighbors(vertex):
if neighbour not in visited:
queue.append((neighbour, distance + 1))
visited.add(neighbour)
return -1
Some tests:
assert(distance_BFS(rj, "Juliet", "Romeo")) == 1
assert(distance_BFS(rj, "Juliet", "Paris")) == 2
assert(distance_BFS(rj, "Nurse", "Paris")) == 3
assert(distance_BFS(rj, "Nurse", "Mercutio")) == 3
rj2 = rj.copy()
rj2.add_node("Santa Claus")
assert(distance_BFS(rj2, "Romeo", "Santa Claus")) == -1
Your task is to adapt/modify the above code to create a function that takes in a node and returns the furthest (largest distance) node. The changes to the above code are not that major - you shouldn't be rewriting it from scratch. I suggest you start by pasting in the body of distance_BFS
and modifying it from there.
Your code should run in $O(V)$ time.
def furthest_from_node(G, node):
"""
Find the furthest node from the input node in a Graph G.
Parameters
----------
G : networkx.classes.graph.Graph
the graph
node : str
the node
Returns
-------
tuple (int, node)
the first element is the largest distance
the second element is a set of the nodes that achieve this distance
Examples
--------
>>> test = nx.Graph()
>>> test.add_nodes_from(['A', 'B', 'C', 'D'])
>>> test.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'D')])
>>> furthest(test, 'A')
2
"""
# YOUR CODE HERE
Answer:
furthest_from_node(test, 'A')
assert furthest_from_node(test, 'A')[0] == 2
assert furthest_from_node(test, 'A')[1] == {'C'}
furthest_from_node(rj, 'Nurse')
assert furthest_from_node(rj, 'Nurse')[0] == 3
assert furthest_from_node(rj, 'Nurse')[1] == {'Paris', 'Escalus', 'Mercutio', 'Benvolio', 'Montague'}
assert furthest_from_node(rj, 'Juliet')[0] == 2
assert furthest_from_node(rj, 'Nurse')[1] == {'Paris', 'Escalus', 'Mercutio', 'Benvolio', 'Montague'}
Exercise: code timing (optional - extra credit)¶
Assuming you have implemented furthest_from_node
above, the function largest_distance_faster
below will call your furthest_from_node
function and find the pair of nodes with the largest distance. The code isn't very pretty, but it should work the same way as your largest_distance
- but faster!
def largest_distance_faster(G):
"""
Finds the pairs of vertices with the highest degree of separation in the graph G.
Returns a tuple (max_distance, {node pairs with that distance})
Parameters
----------
G : networkx.classes.graph.Graph
a network graph
Returns
-------
max_distance: int
the maximum connections separating nodes
node pairs with max degree distance: set
nodes with largest separation
Examples
--------
>>> test = nx.Graph()
>>> test.add_nodes_from(['A', 'B', 'C', 'D'])
>>> test.add_edges_from([('A', 'B'), ('B', 'C'), ('C', 'D'), ('A', 'D')])
>>> largest_distance(test)
(2, {('A', 'C'), ('B', 'D')})
"""
overall_max_dist = 0
distances = dict()
for v in G.nodes():
max_dist, max_dist_nodes = furthest_from_node(G, v)
distances[v] = (max_dist, max_dist_nodes)
overall_max_dist = max(overall_max_dist, max_dist)
node_pairs = set()
for v, output in distances.items():
output_dist, output_nodes = output
if output_dist == overall_max_dist:
for w in output_nodes:
if (w, v) not in node_pairs:
node_pairs.add((v, w))
return overall_max_dist, node_pairs
assert largest_distance_faster(test)[0] == 2
assert largest_distance_faster(test)[1] == {('A', 'C'), ('B', 'D')}
max_distance, pairs = largest_distance(rj)
assert max_distance == 3
assert len(pairs) == 12
(optional - extra credit)¶
What is the time complexity of largest_distance_faster
?
Answer:
(optional - extra credit)¶
Here's a networkx function that generates graphs:
ladder = nx.generators.classic.ladder_graph(3)
nx.draw(ladder)
You can change the size by changing its argument:
ladder = nx.generators.classic.ladder_graph(5)
nx.draw(ladder)
Using this function to generate graphs, perform a timing experiment to time largest_distance
(which you wrote) and largest_distance_faster
(which I wrote, but which calls furthest_from_node
that you wrote). Are the results consistent with the big-O running times we expected? Is largest_distance_faster
indeed faster than largest_distance
?
Answer:
Exercise: assessing virality (optional - extra credit)¶
Everyone wants their video or app to "go viral". This can occur by something spreading through a social network. Here, we will model virality as follows:
- pick some virality coefficient $\xi\in (0,1)$
- select one person (node) at random to be initially "infected"
- each currently infected person loses interest with probability $\alpha$ and becomes permanently un-infected. By default we'll use $\alpha=0.01$.
- for each infected person, each neighbour in the graph becomes infected with probability $\xi$. Note: if multiple neighbours of an un-infected node are infected, repeat this step multiple times. For example, if Mercutio and Paris like Gangnam Style, then Escalus has two chances of being infected at the current time step. Mathematically, the probability of infection is $1-(1-\xi)^2 = 2\xi-\xi^2$, but you don't need to calculate this in your code because you can just repeatedly try to infect the person.
- repeat steps 3-4 some number of times, by default $1000$.
Write a function implementing this model. Your function should return a list/array of the proportion of people infected at each iteration. Using a subgraph, explore some or all of the following questions:
- Investigate how the number of infected people proceeds as a function of time: what is the general shape you observe? Is it consistent across runs of the simulation?
- Report the maximum proportion of your population that was infected at any given time. Try this for a couple values of $\xi$ and investigate how the maximum proportion of infected people depends on $\xi$. Note: For a given value of $\xi$ you will need to run several simulations and average the results to reduce noise.
- Do your results depend significantly on the connectivity of the graph? For example, if you randomly remove half of the edges in the Facebook graph (be careful not to delete any vertices in the process!), what is the effect on the virality?
Note: this model for virality is similar to rudimentary models of how diseases spread across populations. See here.
Answer:
!jupyter nbconvert _05-py-graphs-practice.ipynb --to html --template classic --output 05-py-graphs-practice.html