Python - Graphs - Practice

QTM 350: Data Science Computing

Davi Moreira

Introduction

This topic material is based on Professor Mike Gelbart Algorithms and Data Structures course. It was adapted for our purposes.

import numpy as np
import pandas as pd
import altair as alt
import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict

Exercise: The Romeo and Juliet graph¶

Consider the graph output of the social network connections in Romeo and Juliet from the code below.

Attribution: the Romeo and Juliet graph, and inspiration, taken from UW CSE 140.

The code below creates a graph named rj corresponding to the Romeo and Juliet graph above.

rj = nx.Graph()
rj.add_nodes_from(['Nurse',
                   # House of Capulet
                   'Juliet', 'Tybalt', 'Capulet',

                   'Friar Laurence',

                   # House Montague
                   'Romeo', 'Benvolio', 'Montague',

                   # Ruling house of Verona
                   'Escalus', 'Mercutio', 'Paris'
                   ])

rj.add_edges_from([('Juliet', 'Nurse'),
                   ('Juliet', 'Tybalt'),
                   ('Juliet', 'Capulet'),
                   ('Juliet', 'Friar Laurence'),
                   ('Juliet', 'Romeo'),

                   ('Capulet', 'Tybalt'),
                   ('Capulet', 'Escalus'),
                   ('Capulet', 'Paris'),

                   ('Romeo', 'Friar Laurence'),
                   ('Romeo', 'Benvolio'),
                   ('Romeo', 'Montague'),
                   ('Romeo', 'Mercutio'),

                   ('Montague', 'Benvolio'),
                   ('Montague', 'Escalus'),

                   ('Escalus', 'Mercutio'),
                   ('Escalus', 'Paris'),
                   ('Paris', 'Mercutio')
                   ])
nx.draw(rj, with_labels=True)

Graphs warmup¶

Write a function highest_degree that takes in a graph and finds the vertex/vertices with the highest degree - in other words, the person with the largest number of friends. Your function should return a tuple with two elements:

The maximum degree (int)
All the nodes with that degree (set)

As always, a proper docstring is required.

Note: you can find the degree of a vertex using the following syntax:

nx.degree(rj, "Paris")

and you can iterate through all nodes in a graph like this:

for node in rj.nodes():
    print(node)

Answer:

# Find the maximum degree and nodes with that degree
max_degree, nodes = highest_degree(rj)

# Assertions
assert nodes == {"Juliet", "Romeo"}, f"Expected nodes with highest degree: {'Juliet', 'Romeo'}, but got: {nodes}"
assert max_degree == 5, f"Expected max degree: 5, but got: {max_degree}"

# If no assertion error occurs, the function works as expected.

Largest distance¶

One interesting measure in a social network graph is the "distance" or number of "degrees of separation" between two people. This notion is used in academic research via the Erdős number and in the film industry via the Bacon number. For example, in the above graph, the distance between Juliet and Romeo is 1, and the distance between Juliet and Paris is 2 (via Capulet).

nx.shortest_path_length(rj, "Juliet", "Romeo")

nx.shortest_path_length(rj, "Juliet", "Paris")

Write a function largest_distance to find the pair(s) of vertices with the largest distance (degree of separation). Your function should return a tuple with two elements:

The maximum distance (int)
All the pairs of nodes with that distance (set of tuples)

Note: do not include pairs twice. For example, if ('Romeo', 'Juliet') is in the set, don't also include ('Juliet', 'Romeo').

Your function should work by iterating through all pairs of vertices. For each pair, it should compute the distance (degree of separation) between nodes using the function nx.shortest_path_length used above.

def largest_distance(G):
    """
    Finds the pairs of vertices with the highest degree of separation in the graph G.
    Returns a tuple (max_distance, {node pairs with that distance})

    Parameters
    ----------
    G : networkx.classes.graph.Graph
        a network graph

    Returns
    -------
    max_distance: int 
       the maximum connections separating nodes 
    node pairs with max degree distance: set 
        nodes with largest separation  

    Examples
    --------
    >>> test = nx.Graph()
    >>> test.add_nodes_from(['A', 'B', 'C', 'D'])
    >>> test.add_edges_from([('A', 'B'), ('B', 'C'),  ('C', 'D'), ('A', 'D')])
    >>> largest_distance(test)
    (2, {('A', 'C'), ('B', 'D')})
    """
    
    # YOUR CODE HERE

Answer:

max_distance_rj, pairs_rj = largest_distance(rj)
assert max_distance_rj == 3
assert len(pairs_rj) == 12

test = nx.Graph()
test.add_nodes_from(['A', 'B', 'C', 'D'])
test.add_edges_from([('A', 'B'), ('B', 'C'),  ('C', 'D'), ('A', 'D')])
max_distance_test, pairs_test = largest_distance(test)
assert max_distance_test == 2
assert pairs_test == {('A', 'C'), ('B', 'D')}

max_distance, node_pairs = largest_distance(rj)

# Output the result
print(f"The largest distance is: {max_distance}")
print("Pairs with the largest distance:", node_pairs)

¶

Assuming that nx.shortest_path_length takes $O(V)$ time in the worst case, what is the worst case time complexity if your largest_distance function from the previous part? Justify your answer.

Answer:

Exercise: computing degrees of separation with BFS¶

Let's see if we can come up with a better way to find the largest distance between two nodes in a graph; that is, an approach with a better time complexity than what we saw in the previous part.

Here is the reason why the previous approach is unnecessarily slow: in calling nx.shortest_path_length between pairs of nodes we do a bunch of redundant/repeated computation. For example, computing the distance between Nurse and Mercutio is fairly similar to finding the distance between Juliet and Mercutio, but we're doing it all from scratch for every pair of nodes.

Below we provide a function distance_BFS that takes in a networkx Graph and two nodes, and uses breadth-first search (BFS) to compute the distance between the two nodes; in other words, it does the same thing as nx.shortest_path_length. The function returns -1 if the two nodes are not connected (whereas nx.shortest_path_length throws an error). The code here is similar to the BFS code from lecture, except that instead of only storing the nodes in the queue, we store tuples of the (node, distance) so that we can keep track of the distance.

def distance_BFS(G, node1, node2):
    """ 
    Given a NetworkX Graph G, and start node node1 
    and goal node node2, distance_BFS returns the
    degree of separation between node1 and node2. 

    Parameters
    ----------
    G : networkx.classes.graph.Graph
        the graph
    node1 : str
        first node
    node2 : str 
        second node

    Returns
    -------
    int 
        the distance between 2 nodes, if
        they are not connected, returns -1

    Examples
    --------
    >>> test = nx.Graph()
    >>> test.add_nodes_from(['A', 'B', 'C', 'D'])
    >>> test.add_edges_from([('A', 'B'), ('B', 'C'),  ('C', 'D'), ('A', 'D')])
    >>> distance_BFS(test, 'A', 'C')
    2
    """

    queue = [(node1, 0)]
    visited = {node1}

    while queue:
        vertex, distance = queue.pop(0)
        if vertex == node2:
            return distance

        for neighbour in G.neighbors(vertex):
            if neighbour not in visited:
                queue.append((neighbour, distance + 1))
                visited.add(neighbour)
    return -1

Some tests:

assert(distance_BFS(rj, "Juliet", "Romeo")) == 1
assert(distance_BFS(rj, "Juliet", "Paris")) == 2
assert(distance_BFS(rj, "Nurse", "Paris")) == 3
assert(distance_BFS(rj, "Nurse", "Mercutio")) == 3

rj2 = rj.copy()
rj2.add_node("Santa Claus")

assert(distance_BFS(rj2, "Romeo", "Santa Claus")) == -1

Your task is to adapt/modify the above code to create a function that takes in a node and returns the furthest (largest distance) node. The changes to the above code are not that major - you shouldn't be rewriting it from scratch. I suggest you start by pasting in the body of distance_BFS and modifying it from there.

Your code should run in $O(V)$ time.

def furthest_from_node(G, node):
    """ 
    Find the furthest node from the input node in a Graph G.

    Parameters
    ----------
    G : networkx.classes.graph.Graph
        the graph
    node : str
        the node

    Returns
    -------
    tuple (int, node)
        the first element is the largest distance
        the second element is a set of the nodes that achieve this distance

    Examples
    --------
    >>> test = nx.Graph()
    >>> test.add_nodes_from(['A', 'B', 'C', 'D'])
    >>> test.add_edges_from([('A', 'B'), ('B', 'C'),  ('C', 'D'), ('A', 'D')])
    >>> furthest(test, 'A')
    2
    """

    # YOUR CODE HERE

Answer:

furthest_from_node(test, 'A')

assert furthest_from_node(test, 'A')[0] == 2
assert furthest_from_node(test, 'A')[1] == {'C'}

furthest_from_node(rj, 'Nurse')

assert furthest_from_node(rj, 'Nurse')[0] == 3
assert furthest_from_node(rj, 'Nurse')[1] == {'Paris', 'Escalus', 'Mercutio', 'Benvolio', 'Montague'}

assert furthest_from_node(rj, 'Juliet')[0] == 2
assert furthest_from_node(rj, 'Nurse')[1] == {'Paris', 'Escalus', 'Mercutio', 'Benvolio', 'Montague'}

Exercise: code timing (optional - extra credit)¶

Assuming you have implemented furthest_from_node above, the function largest_distance_faster below will call your furthest_from_node function and find the pair of nodes with the largest distance. The code isn't very pretty, but it should work the same way as your largest_distance - but faster!

def largest_distance_faster(G):
    """
    Finds the pairs of vertices with the highest degree of separation in the graph G.
    Returns a tuple (max_distance, {node pairs with that distance})

    Parameters
    ----------
    G : networkx.classes.graph.Graph
        a network graph

    Returns
    -------
    max_distance: int 
       the maximum connections separating nodes 
    node pairs with max degree distance: set 
        nodes with largest separation  

    Examples
    --------
    >>> test = nx.Graph()
    >>> test.add_nodes_from(['A', 'B', 'C', 'D'])
    >>> test.add_edges_from([('A', 'B'), ('B', 'C'),  ('C', 'D'), ('A', 'D')])
    >>> largest_distance(test)
    (2, {('A', 'C'), ('B', 'D')})
    """

    overall_max_dist = 0
    distances = dict()
    for v in G.nodes():
        max_dist, max_dist_nodes = furthest_from_node(G, v)

        distances[v] = (max_dist, max_dist_nodes)
        overall_max_dist = max(overall_max_dist, max_dist)
    
    node_pairs = set()
    for v, output in distances.items():
        output_dist, output_nodes = output
        if output_dist == overall_max_dist:
            for w in output_nodes:
                if (w, v) not in node_pairs:
                    node_pairs.add((v, w))
    
    return overall_max_dist, node_pairs

assert largest_distance_faster(test)[0] == 2
assert largest_distance_faster(test)[1] == {('A', 'C'), ('B', 'D')}

max_distance, pairs = largest_distance(rj)
assert max_distance == 3
assert len(pairs) == 12

(optional - extra credit)¶

What is the time complexity of largest_distance_faster?

Answer:

(optional - extra credit)¶

Here's a networkx function that generates graphs:

ladder = nx.generators.classic.ladder_graph(3)
nx.draw(ladder)

You can change the size by changing its argument:

ladder = nx.generators.classic.ladder_graph(5)
nx.draw(ladder)

Using this function to generate graphs, perform a timing experiment to time largest_distance (which you wrote) and largest_distance_faster (which I wrote, but which calls furthest_from_node that you wrote). Are the results consistent with the big-O running times we expected? Is largest_distance_faster indeed faster than largest_distance?

Answer:

Exercise: assessing virality (optional - extra credit)¶

Everyone wants their video or app to "go viral". This can occur by something spreading through a social network. Here, we will model virality as follows:

pick some virality coefficient $\xi\in (0,1)$
select one person (node) at random to be initially "infected"
each currently infected person loses interest with probability $\alpha$ and becomes permanently un-infected. By default we'll use $\alpha=0.01$.
for each infected person, each neighbour in the graph becomes infected with probability $\xi$. Note: if multiple neighbours of an un-infected node are infected, repeat this step multiple times. For example, if Mercutio and Paris like Gangnam Style, then Escalus has two chances of being infected at the current time step. Mathematically, the probability of infection is $1-(1-\xi)^2 = 2\xi-\xi^2$, but you don't need to calculate this in your code because you can just repeatedly try to infect the person.
repeat steps 3-4 some number of times, by default $1000$.

Write a function implementing this model. Your function should return a list/array of the proportion of people infected at each iteration. Using a subgraph, explore some or all of the following questions:

Investigate how the number of infected people proceeds as a function of time: what is the general shape you observe? Is it consistent across runs of the simulation?
Report the maximum proportion of your population that was infected at any given time. Try this for a couple values of $\xi$ and investigate how the maximum proportion of infected people depends on $\xi$. Note: For a given value of $\xi$ you will need to run several simulations and average the results to reduce noise.
Do your results depend significantly on the connectivity of the graph? For example, if you randomly remove half of the edges in the Facebook graph (be careful not to delete any vertices in the process!), what is the effect on the virality?

Note: this model for virality is similar to rudimentary models of how diseases spread across populations. See here.

Answer:

!jupyter nbconvert _05-py-graphs-practice.ipynb --to html --template classic --output 05-py-graphs-practice.html

Python - Graphs - Practice

QTM 350: Data Science Computing

Davi Moreira

Introduction

Exercise: The Romeo and Juliet graph¶

Graphs warmup¶

Largest distance¶

¶

Exercise: computing degrees of separation with BFS¶

Exercise: code timing (optional - extra credit)¶

(optional - extra credit)¶

(optional - extra credit)¶

Exercise: assessing virality (optional - extra credit)¶

Have fun!