Introduction to Social and Biological Networks Final Project
Overview¶
In this final project we will use network data from a survey of social networks in 75 villages in rural southern Karnataka, India. This dataset was originally collected to study diffusion of microfinance. The details of the original publication are: Abhijit Banerjee, Arun G. Chandrasekhar, Esther Duflo, Matthew O. Jackson. The Diffusion of Microfinance. Science, Vol. 341 no. 6144, 2013.
We will use the network data to investigate the spread of a fictious pathogen. We will assume that these social networks correspond to contact networks for the given pathogen. In other words, if any two individuals are connected in the network, then the pathogen may spread between them.
Question 1: Downloading and reading in village data¶
Your first task is to download the dataset from Harvard Dataverse. We'll work with the latest version of the dataset which is available here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3BIHX&version=9.4
Proceed to download the data to your computer. You will be asked a couple of questions. One of them is: "What is the intended use of the files that you are requesting to download?" You should select the following option: "Student: completing an exercise or problem set designed for this dataset".
The downloaded file is a compressed zip file, so you need to unzip it first and place it in the same directory where this notebook is located on your computer. The folder of interest to us is called Data
. Its subfolder 1. Network Data
contains adjacency matrices as CSV files for each village. The adjacency matrices themselves are in the folder Adjacency Matrices
. The main folder comes with a file called README.pdf
which you should consult as needed to get a better understanding of the data.
The code for reading in network adjacency matrices and for constructing the network objects is below. In this question, you need to do to two things: 1) add one comment line preceding each line of code (apart from the import statements) to clarify what each line does; and 2) run the code.
Note: Different operating systems have slightly different ways of specifying file paths. The code below runs as is if you're on Mac; if you're a Windows of Linux user, you may need to modify the line specifying the path.
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
import random
import statistics
# set out the plots from matplotlib display
%matplotlib notebook
# create path variable that is the path to where the adjacency matrix data are located
path = "./datav4.0/Data/1. Network Data/Adjacency Matrices/"
# create a list of villages indexed from 1 to 77 inclusive
villages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, \
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, \
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, \
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]
# initialize an empty dictionary
Gs = {}
# for loop that iterates over each village index to read it in and create a networkx graph object
for k in villages:
# concatenate element in village with path and file type to get that certain village
# ex: filename for the first village is:
# "./datav4.0/Data/1. Network Data/Adjacency Matrices/adj_allVillageRelationships_vilno_1.csv"
#noting we need to convert the int to a str to concatenate
filename = path + "adj_allVillageRelationships_vilno_" + str(k) + ".csv"
# load in the text with delimiter , (knows where to separate) and call it A
A = np.loadtxt(filename, delimiter=",")
# convert A to a networkx Graph object and call it G
G = nx.to_networkx_graph(A)
# add the graph to the Gs dictionary with key = village index and value = networkx graph object of that village
Gs[k] = G
Question 2: Extracting largest connected components¶
Most empirical networks contain a largest connected component (LCC) that contains most network nodes. Create dictionary LCCs
that consists of the largest connected components (graph objects) of the networks in the Gs
dictionary. (You may find nx.induced_subgraph
to be useful in this problem.) Make sure that the keys match across the two dictionaries. Create a dot plot where the y-axis shows the proportion of nodes in the LCC of each graph and the x-axis corresponds to the index of the village. (The village index runs from 1 to 77 but we're skipping over a couple of villages.) The ordering of the villages on the x-axis should be such that the y-axis values increase from left to right.
# ADD YOUR CODE HERE
#missing 13, 22
plt.figure(figsize=(10,5))
LCCs = {}
for i in Gs:
LCCs[i] = nx.induced_subgraph(Gs[i], sorted(nx.connected_components(Gs[i]), key=len, reverse=True)[0])
prop = []
for i in range(len(Gs)):
prop.append(len(list(LCCs.values())[i].nodes()) / len(list(Gs.values())[i].nodes()))
d = {}
d = dict(zip([str(i) for i in list(Gs.keys())],prop))
d1 = sorted(d, key=d.get, reverse = False)
d2 = sorted(d.values(), reverse = False)
d = dict(zip(d1, d2))
# plt.figure()
plt.scatter(d.keys(), d.values())
plt.xlabel("Index of Village")
plt.ylabel("Proportion of nodes in LCC")
plt.title("Proportion of nodes in LCC vs. Index of Village")
plt.rc('xtick', labelsize=3)
Question 3: Visualizing networks and their degree and distributions¶
Visualize the LCC (graph) of each network in a single figure using a layout consisting of 13x6 panels. Using plt.figure(figsize=(20,40))
should look reasonable.
In a separate figure, in a single panel, plot the degree distribution of each LCC. To make the plot more readable, you should use a line plot and not a histogram.
# # ADD YOUR CODE HERE
fig, axes = plt.subplots(nrows=13, ncols=6, figsize=(20,40))
ax = axes.flatten()
for i in range(len(LCCs)):
nx.draw(LCCs[list(LCCs.keys())[i]], ax=ax[i], node_size=5)
ax[i].set_axis_off()
plt.figure()
for i in LCCs:
# get rid of 0 degree since we are dealing with the LCC
freq = nx.degree_histogram(LCCs[i])
freq.pop(0)
plt.plot( [i for i in range(1, len(nx.degree_histogram(LCCs[i])))], freq)
plt.rc('xtick', labelsize=15)
plt.xlabel("Degree")
plt.ylabel("Degree frequency")
plt.title("Degree distribution of each LCC")
Text(0.5, 1.0, 'Degree distribution of each LCC')
Question 4: Simulating SIR process¶
Write code to simulate discrete-time SIR spreading process on the LCCs of four villages only: 1, 31, 61, and 77. Set the S to I transition probability p_si
to 0.2 and the I to R transition probability p_ir
to 0.1. Initially only one individual, selected uniformly at random among the LCC nodes, is infected; the rest of the LCC is susceptible. Use unit infectivity in the SIR simulation: each infected node, per time step, selects only one network neighbor uniformly at random among its neighbors and potentially transmits the pathogen to it. Any node, regardless of their degree, will therefore infect 0 or 1 nodes per time step.
Run the simulation 300 times for each of the four LCCs. In four separate panels, one per village, plot the proportion of individuals in each of the three states (S, I, R) against time for each simulation run. Explain your findings.
# util functions
def recover(i_nodes, r_nodes, p):
new_recoveries = []
for i in i_nodes:
if random.random() < p:
new_recoveries.append(i)
for i in new_recoveries:
i_nodes.remove(i)
r_nodes += new_recoveries
return new_recoveries
def spread(G, s_nodes, i_nodes, p):
new_infections = []
sampled = set()
for i in i_nodes:
if G.degree(i) > 0:
sampled.add(random.choice(list(G.neighbors(i))))
for i in sampled:
if i in s_nodes and random.random() < p:
i_nodes.append(i)
new_infections.append(i)
# remove all new infected nodes from susceptible nodes
for i in new_infections:
s_nodes.remove(i)
return new_infections
def simulate(G, p_si, p_ir, num_seeds, num_time_steps):
i_nodes = random.sample(list(G.nodes()), num_seeds)
s_nodes = [node for node in list(G.nodes()) if node not in i_nodes]
r_nodes = []
num_s_nodes = [len(s_nodes)]
num_i_nodes = [len(i_nodes)]
num_r_nodes = [len(r_nodes)]
# print(s_nodes)
while num_time_steps > 0:
recover(i_nodes, r_nodes, p_ir)
spread(G, s_nodes, i_nodes, p_si)
num_s_nodes.append(len(s_nodes))
num_i_nodes.append(len(i_nodes))
num_r_nodes.append(len(r_nodes))
# if len(i_nodes) == 0:
# break
num_time_steps-=1
return num_s_nodes, num_i_nodes, num_r_nodes
def make_plot(num_nodes, num_s_nodes, num_i_nodes, num_r_nodes, num_time_steps):
h1, = plt.plot(np.array(num_s_nodes) / num_nodes)
h2, = plt.plot(np.array(num_i_nodes) / num_nodes)
h3, = plt.plot(np.array(num_r_nodes) / num_nodes)
plt.xlabel("Time")
plt.ylabel("Fraction of S, I, and R nodes")
plt.legend([h1,h2,h3], ["S nodes","I nodes","R nodes"], loc="center left")
plt.xlim([0, num_time_steps])
p_si = 0.2 # S to I transition probability
p_ir = 0.1 # I to R transition probability
# # ADD YOUR CODE HERE
# figsize=(20,40)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize = (10, 7))
ax = axes.flatten()
four_villages = [1, 31, 61, 77]
num_time_steps = 250
for j in range(300):
for i in range(len(four_villages)):
num_nodes = len(LCCs[four_villages[i]].nodes())
num_s_nodes, num_i_nodes, num_r_nodes = simulate(LCCs[four_villages[i]], p_si, p_ir, num_seeds = 1, num_time_steps = num_time_steps)
# make_plot(num_nodes, num_s_nodes, num_i_nodes, num_r_nodes, num_time_steps)
plt.sca(ax[i])
h1, = plt.plot(np.array(num_s_nodes) / num_nodes, color = "blue")
h2, = plt.plot(np.array(num_i_nodes) / num_nodes, color = "red")
h3, = plt.plot(np.array(num_r_nodes) / num_nodes, color = "green")
plt.xlabel("Time")
plt.ylabel("Fraction of S, I, and R nodes")
message = f"Village {four_villages[i]}"
plt.title(message)
plt.legend([h1,h2,h3], ["S nodes","I nodes","R nodes"], loc="center left")
plt.xlim([0, num_time_steps])
# plt.title("Village", str(four_villages[i]))
plt.suptitle("300 SIR Simulations for Villages 1, 31, 61, 77")
Text(0.5, 0.98, '300 SIR Simulations for Villages 1, 31, 61, 77')
From our charts above with p_si = 0.2 and p_ir = 0.1, in each village, we see that the fraction of infected rarely goes above 10%. It seems that the infected reachs a peak around 75-100 time steps then slowly decreases. The fraction of susceptible seems to stop around 65-70%. And the fraction of recovered slowly increases and I'm seeing it range from anywhere between 0 to 40%. These results are probably due to the fact that our p_si and p_ir are very low in the LCC with only unit infectivity. Because of the unit infectivity, each node has only one shot to spread it to others but because the probability is 0.2, I believe that is why we see such low fraction of infection. Moreover, because we only start with one infected node, that is another potential reason for these results. Overall, these four villages gave very similar results which suggests to me that the village structure / community structure is very similar in these 4 villages.
Question 5: Reproduction number¶
Reproduction number for any infectious disease can be succinctly summarized as the number of persons infected per person infecting. There are many variants of this concept. The first thing to note is that this number varies during the epidemic. The basic reproduction number $R_0$ is usually defined as the expected number of cases generated by one case in a fully susceptible population. Because the population is fully susceptible only at the very beginning of an epidemic, $R_0$ may be less informative about what happens later, especially for a network (vs. mass-action) model. The effective reproduction number $R_t$ is usually defined as the expected number of cases generated by one case at time $t$ in a partially susceptible population. If we set time $t=0$, then $R_t$ coincides with $R_0$.
Use simulation to compute (estimate) the effective reproduction number $R_t$ as a function of time for the SIR process for the LCC of village 1 and the LCC of village 31 using at least 1000 spreading process realizations for each. Make a plot with three panels: (1) plot $R_t$ for each realization over time and mean $R_t$ over time for village 1 (left panel); (2) same for village 31 (middle panel); and (3) plot the mean $R_t$ for the two villages (two curves) over time.
# ADD YOUR CODE HERE
villages = [1, 31]
# take the number of infected divded by the number of infected at that time
# new simulate function that returns Rt
def simulate_Rt(G, p_si, p_ir, num_seeds, num_time_steps):
i_nodes = random.sample(list(G.nodes()), num_seeds)
s_nodes = [node for node in list(G.nodes()) if node not in i_nodes]
r_nodes = []
num_s_nodes = [len(s_nodes)]
num_i_nodes = [len(i_nodes)]
num_r_nodes = [len(r_nodes)]
Rt = []
# print(s_nodes)
t = 0
while num_time_steps > 0:
recover(i_nodes, r_nodes, p_ir)
new_infections = spread(G, s_nodes, i_nodes, p_si)
# print("tets,",len(new_infections), "infected", num_i_nodes[t])
# print(new_infections, num_i_nodes[t])
if num_i_nodes[t] == 0:
Rt.append(0)
else:
Rt.append(len(new_infections)/num_i_nodes[t])
num_s_nodes.append(len(s_nodes))
num_i_nodes.append(len(i_nodes))
num_r_nodes.append(len(r_nodes))
num_time_steps-=1
t += 1
return Rt
fig, axes = plt.subplots(nrows=1, ncols=3, figsize = (10,7))
ax = axes.flatten()
mean_village_1 = []
plt.sca(ax[0])
for i in range(1000):
village1 = simulate_Rt(LCCs[1], p_si, p_ir, num_seeds = 1, num_time_steps = num_time_steps)
mean_village_1.append(village1)
plt.plot(village1, color = "lightblue")
plt.plot(np.mean(np.array(mean_village_1), axis=0))
plt.xlabel("Effective Reproductive \nNumber for Village 1 Over Time.\n Line is mean Rt.")
plt.ylabel("Effective Reproductive Number")
mean_village_31 = []
plt.sca(ax[1])
for i in range(1000):
village31 = simulate_Rt(LCCs[31], p_si, p_ir, num_seeds = 1, num_time_steps = num_time_steps)
mean_village_31.append(village31)
plt.plot(village31, color = "lightblue")
plt.plot(np.mean(np.array(mean_village_31), axis=0), color = "red")
plt.xlabel("Effective Reproductive \nNumber for Village 31 Over Time.\n Line is mean Rt.")
plt.sca(ax[2])
plt.plot(np.mean(np.array(mean_village_1), axis=0))
plt.plot(np.mean(np.array(mean_village_31), axis=0), color = "red")
plt.xlabel("Mean Effective Reproductive \nNumber for Village 1 (blue)\n and Village 31 (red) Over Time")
Text(0.5, 0, 'Mean Effective Reproductive \nNumber for Village 1 (blue)\n and Village 31 (red) Over Time')
Question 6: Individual-level randomized controlled vaccine trial¶
In this question, you will implement an individual-level randomized trial to assess the efficacy of a new vaccine against an infectious disease. In these trials, every individual in every village is randomized to treatment with probability 1/2 and to control with probability 1/2. On average, half of the individuals in each village will be assigned to treatment and the other half to control.
We will assume that the true S to I transmission probabilities for treated and untreated (control) individuals are known and they are p_si_treatment
and p_si_control
, respectively. We will assume that the I to R probability for recovery is known and is given by p_ir
.
Write the code to simulate this clinical trial. You will need to randomize each individual in every village to treatment or control and then propagate the SIR process on each village network independently until it naturally comes to a halt. The output for each village should be the proportion of cases in the treatment group (i.e, proportion of nodes in the treatment group that were infected at some point), the proportion of cases in the control group, and the difference between the two. Your code should print out three numbers: (1) the average proportion of cases in the treatment groups (where average is taken across all villages); (2) the average proportion of cases in the control groups (where average is taken across all villages); and (3) proportion of cases in the control group - proportion of cases in the treatment group (i.e., difference in two proportions), averaged over all villages.
Note: To simplify implementation, in this question "village" refers to the LCC of the village.
# new spread that takes into consideration of treatment vs control
def spread_Q6(G, s_nodes, i_nodes, p, group):
new_infections = []
sampled = set()
for i in i_nodes:
if G.degree(i) > 0:
sampled.add(random.choice(list(G.neighbors(i))))
for i in sampled:
if i in s_nodes and random.random() < group.get(i):
i_nodes.append(i)
new_infections.append(i)
# remove all new infected nodes from susceptible nodes
for i in new_infections:
s_nodes.remove(i)
return new_infections
def simulate_Q6(G, p_si, p_ir, num_seeds, num_time_steps, group):
i_nodes = random.sample(list(G.nodes()), num_seeds)
# print(i_nodes)
s_nodes = [node for node in list(G.nodes()) if node not in i_nodes]
r_nodes = []
total_infected = set(i_nodes)
num_s_nodes = [len(s_nodes)]
num_i_nodes = [len(i_nodes)]
num_r_nodes = [len(r_nodes)]
# print(s_nodes)
while num_time_steps > 0:
recover(i_nodes, r_nodes, p_ir)
new_infected = spread_Q6(G, s_nodes, i_nodes, p_si, group)
total_infected.update(new_infected)
num_s_nodes.append(len(s_nodes))
num_i_nodes.append(len(i_nodes))
num_r_nodes.append(len(r_nodes))
# if len(i_nodes) == 0:
# break
num_time_steps-=1
return total_infected
# fix some parameters
p_ir = 0.05
p_si_treatment = 0.05
p_si_control = 0.3
# ADD YOUR CODE HERE
total_villages = []
for i in LCCs:
group = dict(zip(list(LCCs[i].nodes()), random.choices([p_si_treatment,p_si_control], k=len(LCCs[i]))))
all_infected = simulate_Q6(LCCs[i], p_si, p_ir, num_seeds = 1, num_time_steps = num_time_steps, group = group)
# getting which group each infected node belonged to
all_infected_p = collections.Counter([group.get(key) for key in all_infected])
# total from each group
all_infected_total = collections.Counter(group.values())
try:
treatment_prop = all_infected_p.get(p_si_treatment)/all_infected_total.get(p_si_treatment)
except:
treatment_prop = 0
try:
control_prop = all_infected_p.get(p_si_control)/all_infected_total.get(p_si_control)
except:
control_prop = 0
total_villages.append([treatment_prop, control_prop,control_prop - treatment_prop])
vals = np.mean(np.array(total_villages), axis=0)
print("Average proportion of cases in the treatment groups:", vals[0])
print("Average proportion of cases in the control groups:", vals[1])
print("Average difference in proportion of cases in the control vs treatment groups:", vals[2])
Average proportion of cases in the treatment groups: 0.0930454572422352 Average proportion of cases in the control groups: 0.22984194582342846 Average difference in proportion of cases in the control vs treatment groups: 0.13679648858119328
Question 7: Cluster-level randomized controlled vaccine trial¶
In general, the direct effect of a vaccine refers to the protection against illness received by an individual because they themselves received the vaccine. In contrast, the indirect effect of a vaccine refers to the protection against illness received by an individual because others around them received the vaccine. When vaccines are rolled out at scale, we benefit from the combination of the direct and indirect effects, which is often called the total effect.
Cluster-randomized trials are used to study interventions for infectious diseases. The rationale for cluster-level vs. individual-level randomization is that the former can capture both direct and indirect effects of the intervention; this setting more closely corresponds to the situation of what would happen if the intervention would be rolled out at scale.
In this question, you will implement a cluster-level randomized trial, also called a cluster-randomized trial. In these trials, each village is randomized to treatment with probability 1/2 and to control with probability 1/2. The key point is that everyone in a given village receives the same treatment.
Write the code to simulate this clinical trial. Your code should print out three numbers: (1) the average proportion of cases in the treatment villages; (2) the average proportion of cases in the control villages; and (3) the difference between the two.
Provide a brief commment about the difference in results in Question 6 and Question 7.
Note: To simplify implementation, in this question "village" refers to the LCC of the village.
def simulate_Q7(G, p_si, p_ir, num_seeds, num_time_steps):
i_nodes = random.sample(list(G.nodes()), num_seeds)
s_nodes = [node for node in list(G.nodes()) if node not in i_nodes]
r_nodes = []
total_infected = set(i_nodes)
num_s_nodes = [len(s_nodes)]
num_i_nodes = [len(i_nodes)]
num_r_nodes = [len(r_nodes)]
# print(s_nodes)
while num_time_steps > 0:
recover(i_nodes, r_nodes, p_ir)
new_infected = spread(G, s_nodes, i_nodes, p_si)
total_infected.update(new_infected)
num_s_nodes.append(len(s_nodes))
num_i_nodes.append(len(i_nodes))
num_r_nodes.append(len(r_nodes))
# if len(i_nodes) == 0:
# break
num_time_steps-=1
return total_infected
# fix some parameters
p_ir = 0.05
p_si_treatment = 0.05
p_si_control = 0.3
# ADD YOUR CODE HERE
treatment_villages = []
control_villages = []
for i in LCCs:
curr_p_si = random.choices([p_si_treatment,p_si_control])[0]
all_infected = simulate_Q7(LCCs[i], curr_p_si ,\
p_ir, num_seeds = 1, num_time_steps = num_time_steps)
if curr_p_si == p_si_treatment:
treatment_villages.append(len(all_infected)/LCCs[i].number_of_nodes())
else:
control_villages.append(len(all_infected)/LCCs[i].number_of_nodes())
print("Average proportion of cases in the treatment villages:", np.mean(treatment_villages))
print("Average proportion of cases in the control villages:", np.mean(control_villages))
print("Average difference in proportion of cases in the control vs treatment villages:", np.mean(control_villages)- np.mean(treatment_villages))
Average proportion of cases in the treatment villages: 0.003566259163982035 Average proportion of cases in the control villages: 0.5420319868506677 Average difference in proportion of cases in the control vs treatment villages: 0.5384657276866857
In Q6, for those that had the treatment case, averaged over all the villages was generally less than 10% and control was about 15-20% the few times I ran it, leading to a difference of about 10% or so. Whereas in Q7, we saw a dramatic difference in the treatment villages vs control villages. The average proportion of cases in the treatment villages was extremely low (near 0), as seen above while control villages had more than half be cases at some point, leading to a difference of around 50%.
These results point to the idea of herd-immunity where if most of a community is protected from the disease, it's very unlikely that that disease spreads whereas on the other hand, without that protection we see lots of cases. In Q6, because we did random assigning which meant roughly half the group got the treatment and half got the control resulted in low proportion of cases for treatment and control as well. This suggests that even if we can't get the entire village on board, getting as many people as we can to be treated could have beneficial effects for everyone in the village.