Comparing the graph structures of the patients flow

Lenka

Last updated on Dec 30, 2019 14 min read Flow, Python

# we will first import all necessary libraries
import pandas as pd
import os
import numpy as np
import glob
from datetime import datetime
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from shapely.geometry import Point, LineString
#from mpl_toolkits.basemap import Basemap
import osmnx as ox
import networkx as nx
import matplotlib as mpl
from networkx.algorithms import bipartite as bi
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import glm

wgs84 = {'init':'EPSG:4326'}
bng = {'init':'EPSG:27700'}

Prescription flow model validation

Using graphs to detect changes in spatial structure of the prescription flow

Using a graphs to look at the properties of the flow network, we need to first think about what kind of graph the network actually is and what kind of connections it has.

Based on the number of edges between two points we recognize

Simple graph - Undirected graph with no loops and one edge between two points
Multiple graph - Undirected graph with no loops
Pseudograph - Graph with loops

Based on the type of the link we recognise;

Undirected graph - the flow does not have a direction
Directed graph - the flow have a direction, flow from point A to point B

We also recognize;

Incomplete graph - graph in which at least one set of nodes is missing connection
Complete graph - graph in which all the nodes are connected to all the other nodes

Lastly, we can also recognize very special case of a graph called Bipartite. What is Bipartiv graph? Recently, “bipartivity” has been proposed as an important topological characteristic of complex networks. A network (graph) G = (V,E) is called bipartite if its vertex set V can be partitioned into two subsets V1 and V2 such that all edges have one endpoint in V1 and the other in V2.

The GP and pharmacy flow are in nature Directed Incomplete pseudograph. that is because the flow goes from GP to pharmacy, always one direction, it does not contain the flows between all the pairs and also contains some loops; Gp can also dispense small amount odf prescriptions. Nevertheles, as the loops are very small proportoin of the data and we can benefit from the graph as a bipartite, I have attached dummy GPs into a pharmacy points so we can treat the loops as normal flows.

NOTE: what is the implication of this? We are loosing the information about connection of dome flows to origin, however, we are gaining an information on the process which reminds zero-inflated or hurdle process;

Step 1 - patient come to GP but leaves without a prescriprion
Step 2 - patient come to GP get the prescription virtually and recieve the drug from GP directly
Step 3 - patient come to GP, get prescription, go to nearest pharmacy to get the drug
Step 4 - patient come to GP, get the prescription and go to pharmacy close to shopping opportunity to get the drug
Step 5 - patient come to GP, get the prescription and go to other pharmacy to get the drug

How do we represent the graph in Networkx?

Here are types of graphs in the networkx package

Networkx Class	Type	Self-loops allowed	Parallel edges allowed
Graph	undirected	Yes	No
DiGraph	directed	Yes	No
MultiGraph	undirected	Yes	Yes
MultiDiGraph	directed	Yes	Yes

We know our graph is directed so Digraph should serve the purpose. However, the data consist from monthly prescription flow so for most of the nodes we can find 12 edges throughout the year, so we need to work with MultiDiGraph to make sure we can account for all of those.

1. Load the data

# load the output of the gravity models, the flow data
# set the path
file = r'./../../data/NHS/working/predicted_flow_17.csv'
# read the file
flow17 = pd.read_csv(file, index_col=None)
# plot the top of the table
flow17.head()

	P_Code	P_Name_x	P_Postcode	D_Code_x	D_Name_x	D_Postcode	N_Items	Date	loop	...	D_Left_y	D_X_y	D_Y_y	D_E_y	D_N_y	D_Setting_y	residents	Density	predict_items	predict_items_doubly
0	L81004	PORTISHEAD MEDICAL GROUP	BS20 6AQ	FA251	HORIZON PHARMACY LTD	BS34 6AS	130.0	2017-01-01	False	...	20180115.0	-2.562191	51.539195	361105.872423	182404.906640	50	456.517857	36.592857	29.316590	10.581379
1	L81004	PORTISHEAD MEDICAL GROUP	BS20 6AQ	FA613	DOWNHAM FB LTD	BS7 9JT	39.0	2017-01-01	False	...	NaN	-2.582219	51.480445	359664.895464	175881.865127	50	249.144737	54.493421	29.473430	2.853510
2	L81004	PORTISHEAD MEDICAL GROUP	BS20 6AQ	FCL91	MYGLOBE LTD	BS15 4ND	12.0	2017-01-01	False	...	NaN	-2.478870	51.459944	366826.940029	173549.785593	50	249.128205	30.569231	24.155556	2.061345
3	L81004	PORTISHEAD MEDICAL GROUP	BS20 6AQ	FCT35	LLOYDS PHARMACY LTD	BS20 7QA	661.0	2017-01-01	False	...	20180131.0	-2.758272	51.484474	347443.775309	176441.826480	50	369.081633	34.918367	1648.893267	1078.144522
4	L81004	PORTISHEAD MEDICAL GROUP	BS20 6AQ	FDD77	JOHN WARE LIMITED	BS20 6LT	4058.0	2017-01-01	False	...	NaN	-2.781349	51.481409	345837.744901	176117.839266	50	391.076923	29.092308	1194.123265	3570.014826

5 rows × 72 columns

# create a lists of the unique codes so we can subset the GP and PH point data
list_flow_GP = list(flow17['P_Code'].unique())
list_flow_PH = list(flow17['D_Code_new'].unique())

# get the pharmacies in
fl = './../../data/NHS/edispensary/PH_joined_daytimepop.geojson'
PH = gpd.read_file(fl)
PH = PH.to_crs(wgs84)
PH = PH[PH['D_Code_new'].isin(list_flow_PH)]
PH.plot();

C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\pyproj\crs\crs.py:53: FutureWarning:

'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6

png

# get the GP's points in
GP = gpd.read_file("./../../data/NHS/epraccur/GP_AVON.geojson")
GP = GP[['P_Code','P_Name', 'full_address', 'P_Postcode', 'Open_date', 
                                         'Close_date', 'Status_Code', 'Sub_type', 'Join', 'left', 'Setting', 'X', 'Y', 'geometry']]
GP = GP.rename(columns={'P_Postcode':'P_Postcode2','Open_date': 'P_Open_date', 'Close_date':'P_Close_date','full_address':'P_full_address' ,'Status_Code':'P_Status_Code', 'Sub_type':'P_Sub_type', 'Join':'P_Join', 'left':'P_Left', 'Setting':'P_Setting', 'X':'P_X', 'Y':'P_Y'}, errors="raise")
GP = GP.to_crs(wgs84)
GP = GP[GP['P_Code'].isin(list_flow_GP)]
GP.plot();

C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\pyproj\crs\crs.py:53: FutureWarning:

'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6

png

2. Create a graph

We now know that the graph is bipartite, has multiple edges and has a directions, which means that we need to create MultiDiGraph. We also have weight on edge, that is the number of prescriptions on flow and predicted items on flow for both of the used models.

G1 = nx.from_pandas_edgelist(flow17, source='P_Code', target='D_Code_new', edge_attr=['N_Items'], create_using=nx.MultiDiGraph())
print('Is the graph bipirtite?   ' + str(nx.is_bipartite(G1)))
print('Is the graph directed??   ' + str(nx.is_directed(G1)))

Is the graph bipirtite?   True
Is the graph directed??   True

G2 = nx.from_pandas_edgelist(flow17, source='P_Code', target='D_Code_new', edge_attr=['predict_items'], create_using=nx.MultiDiGraph())
print('Is the graph bipirtite?   ' + str(nx.is_bipartite(G2)))
print('Is the graph directed??   ' + str(nx.is_directed(G2)))

Is the graph bipirtite?   True
Is the graph directed??   True

G3 = nx.from_pandas_edgelist(flow17, source='P_Code', target='D_Code_new', edge_attr=['predict_items_doubly'], create_using=nx.MultiDiGraph())
print('Is the graph bipirtite?   ' + str(nx.is_bipartite(G3)))
print('Is the graph directed??   ' + str(nx.is_directed(G3)))

Is the graph bipirtite?   True
Is the graph directed??   True

3. Look at the properties of the graph

# Look at number on graph properties
nb_nodes = len(G1.nodes())
nb_arr = len(G1.edges())
print("Number of nodes : " + str(nb_nodes))
print("Number of edges : " + str(nb_arr))

Number of nodes : 485
Number of edges : 74668

degree_freq = np.array(nx.degree_histogram(G1)).astype('float')
plt.figure(figsize=(12, 8))
plt.stem(degree_freq)
plt.ylabel("Frequence")
plt.xlabel("Degre")
plt.show();

C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\ipykernel_launcher.py:3: UserWarning:

In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.

png

#degree_freq[degree_freq > 0]
x = degree_freq[degree_freq < 80]

plt.figure(figsize=(12, 8))
plt.stem(x)
plt.ylabel("Frequence")
plt.xlabel("Degre")
plt.show();

C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\ipykernel_launcher.py:2: UserWarning:

In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.

png


plt.figure(figsize=(14,14))
nx.draw_networkx(G1, with_labels=True, width=0.02)

4. Define the functions to calculate centrality and snap the resluts to the point data

The Bipirtate graph is very specific cas of graph and has its own dedicated centraliy functions that can be calculated with networkx package.

This is the closeness and degree centrality, however none of those contain any weight, they are only based on the number of flows coming into the nodes.

No centrality can be measured in weighted bipirtite graph

Other measures we can look at are the Link Analysis measures. Those are Page Rank and HIIT (hubs and authorities). However, only Page Rank include a weight on edges,

Page Rank is the only measure we can apply on weighted directed bipartite graph!

def close_bi(graph, nodes, name, joinsA):
    # calculate the closenest
    close = bi.closeness_centrality(G = graph, nodes = nodes)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    return df

def degree_bi(graph, nodes, name, joinsA, joinsB):
    # calculate the closenest
    close = bi.degree_centrality(G = graph, nodes = nodes)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

def pgr(graph, name, joinsA, joinsB, weight):
    # calculate the closenest
    close = nx.pagerank_scipy(G = graph, weight = weight, max_iter = 50)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

def pgr(graph, name, joinsA, joinsB, weight):
    # calculate the closenest
    close = nx.pagerank_scipy(G = graph, weight = weight, max_iter = 50)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

# eigenvector centrality does not work for multidigraphs
def eigen_ce(graph, name, joinsA, joinsB, weight):
    # calculate the closenest
    close = nx.eigenvector_centrality(G = graph, max_iter = 100, weight = weight)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

5. Apply the functions for Bipartate graph

GP2 = close_bi(graph = G1, nodes = list_flow_GP, name = 'close_G1', joinsA = GP)
GP2 = close_bi(graph = G2, nodes = list_flow_GP, name = 'close_G2', joinsA = GP2)
GP2 = close_bi(graph = G3, nodes = list_flow_GP, name = 'close_G3', joinsA = GP2)

GP2, PH2 = degree_bi(graph = G1, nodes = list_flow_GP, name = 'degree_G1', joinsA = GP2, joinsB = PH)
GP2, PH2 = degree_bi(graph = G2, nodes = list_flow_GP, name = 'degree_G2', joinsA = GP2, joinsB = PH2)
GP2, PH2 = degree_bi(graph = G3, nodes = list_flow_GP, name = 'degree_G3', joinsA = GP2, joinsB = PH2)

GP2, PH2 =  pgr(graph = G1, name = 'pgr_G1', weight = 'N_Items', joinsA = GP2, joinsB = PH2)
GP2, PH2 =  pgr(graph = G2, name = 'pgr_G2', weight = 'predict_items', joinsA = GP2, joinsB = PH2)
GP2, PH2 =  pgr(graph = G3, name = 'pgr_G3', weight = 'predict_items_doubly', joinsA = GP2, joinsB = PH2)

# eigenvector centrality does not work for multidigraphs
#GP2, PH2 =  eigen_ce(graph = G1, name = 'eig_G1', weight = 'N_Items', joinsA = GP2, joinsB = PH2)
#GP2, PH2 =  eigen_ce(graph = G2, name = 'eig_G2', weight = 'predict_items', joinsA = GP2, joinsB = PH2)
#GP2, PH2 =  eigen_ce(graph = G3, name = 'eig_G3', weight = 'predict_items_doubly', joinsA = GP2, joinsB = PH2)

GP2.drop(['index_x','index_y' ,'index'], axis=1, inplace=True)
PH2.drop(['index_x','index_y'], axis=1, inplace=True)

PH2.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 357 entries, 0 to 356
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   D_Code          357 non-null    object  
 1   D_Name          357 non-null    object  
 2   D_full_address  357 non-null    object  
 3   D_Postcode2     357 non-null    object  
 4   D_Open_date     357 non-null    int64   
 5   D_Close_date    27 non-null     float64 
 6   D_Status_Code   357 non-null    object  
 7   D_Sub_type      357 non-null    object  
 8   D_Join          356 non-null    float64 
 9   D_Left          27 non-null     float64 
 10  D_X             357 non-null    float64 
 11  D_Y             357 non-null    float64 
 12  D_E             357 non-null    float64 
 13  D_N             357 non-null    float64 
 14  D_Setting       357 non-null    int64   
 15  D_Code_new      357 non-null    object  
 16  residents       357 non-null    float64 
 17  Density         357 non-null    float64 
 18  geometry        357 non-null    geometry
 19  degree_G1       357 non-null    float64 
 20  degree_G2       357 non-null    float64 
 21  degree_G3       357 non-null    float64 
 22  pgr_G1          357 non-null    float64 
 23  pgr_G2          357 non-null    float64 
 24  pgr_G3          357 non-null    float64 
dtypes: float64(15), geometry(1), int64(2), object(7)
memory usage: 72.5+ KB

# look at the values
GP2.iloc[:,14:23].head()

	close_G1	close_G2	close_G3	degree_G1	degree_G2	degree_G3	pgr_G1	pgr_G2	pgr_G3
0	1.262397	1.262397	1.262397	1.456583	1.456583	1.456583	0.001684	0.001684	0.001684
1	1.262397	1.262397	1.262397	2.207283	2.207283	2.207283	0.001684	0.001684	0.001684
2	1.262397	1.262397	1.262397	1.834734	1.834734	1.834734	0.001684	0.001684	0.001684
3	1.262397	1.262397	1.262397	1.733894	1.733894	1.733894	0.001684	0.001684	0.001684
4	1.262397	1.262397	1.262397	0.806723	0.806723	0.806723	0.001684	0.001684	0.001684

PH2.iloc[:,19:25].head()

	degree_G1	degree_G2	degree_G3	pgr_G1	pgr_G2	pgr_G3
0	2.140625	2.140625	2.140625	0.002018	0.001812	0.001996
1	1.687500	1.687500	1.687500	0.002194	0.001957	0.002179
2	0.648438	0.648438	0.648438	0.001689	0.001774	0.001689
3	2.937500	2.937500	2.937500	0.002471	0.002330	0.002497
4	3.773438	3.773438	3.773438	0.001700	0.001867	0.001718

6. Changing graph structure

We can see that the degree centrality and closeness centraility is the same for all the graphs, because the wight is not included.

The Pagerank for the GP's is not changing either as the flows are leaving them and page rank focuses on the incoming flows only.

The Page Rank algorithm computes a ranking of the nodes in the graph G based on the structure of the incoming links, so we can only observe the importance of pharmacies.

plt.figure(figsize=(30, 6))
# Degree Centrality
f, axarr = plt.subplots(1,3, num=1)

plt.sca(axarr[0])
PH2['pgr_G1'].plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
axarr[0].set_title('Page Rank Graph 1', size=16)

plt.sca(axarr[1])
PH2['pgr_G2'].plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
axarr[1].set_title('Page Rank Graph 2', size=16)

plt.sca(axarr[2])
PH2['pgr_G3'].plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
axarr[2].set_title('Page Rank Graph 3', size=16);

png

# which one is the most important? the one that sticks out
PH2[PH2['pgr_G1'] > 0.0044]

	D_Code	D_Name	D_full_address	D_Postcode2	D_Open_date	D_Close_date	D_Status_Code	D_Sub_type	D_Join	D_Left	...	D_Code_new	residents	Density	geometry	degree_G1	degree_G2	degree_G3	pgr_G1	pgr_G2	pgr_G3
112	FLQ56	BOOTS UK LIMITED	BOOTS UK LIMITED,59 BROADMEAD,BRISTOL,nan,BS1 3ED	BS1 3ED	19480101	NaN	A	1	20040401.0	NaN	...	FLQ56	1021.5	202.028571	POINT (-2.58972 51.45774)	8.898438	8.898438	8.898438	0.004406	0.002691	0.003961

1 rows × 25 columns

ax = plt.figure(figsize=(10, 10))
plt.axis([PH2['pgr_G1'].min(), PH2['pgr_G1'].max(), PH2['pgr_G1'].min(), PH2['pgr_G1'].max()])
# plots
plt.scatter(PH2['pgr_G1'],PH2['pgr_G1'])
plt.scatter(PH2['pgr_G1'],PH2['pgr_G2'])
plt.plot(PH2['pgr_G1'].mean(),PH2['pgr_G1'].mean(), c = 'r', marker = 'X', markersize = 30)
# regression plots
sns.regplot(PH2['pgr_G1'],PH2['pgr_G1'])
sns.regplot(PH2['pgr_G1'],PH2['pgr_G2'])

plt.legend(['Mean','Page Rank from data', 'Page Rank from base gravity']);

png

PH2['pgr_G1'].mean()

0.002197291434997281

ax = plt.figure(figsize=(10, 10))
plt.axis([PH2['pgr_G1'].min(), PH2['pgr_G1'].max(), PH2['pgr_G1'].min(), PH2['pgr_G1'].max()])
# plots
plt.scatter(PH2['pgr_G1'],PH2['pgr_G1'])
plt.scatter(PH2['pgr_G1'],PH2['pgr_G3'])
plt.plot(PH2['pgr_G1'].mean(),PH2['pgr_G1'].mean(), c = 'r', marker = 'X', markersize = 30)
# regression plots
sns.regplot(PH2['pgr_G1'],PH2['pgr_G1'])
sns.regplot(PH2['pgr_G1'],PH2['pgr_G3'])

plt.legend(['Mean','Page Rank from data', 'Page Rank from doubly gravity model' ]);

png

Results

What we can see on the figure above is how the node centrality in the graph conctructed from prediction relates to a node centrality in the graph constructed from raw data. In ideal situation, the line of best fit and the observations would follow the blue line, however, this is not the case. From the first figure, which looks at the importance of the nodes based on base gravity model (very simple one), you can observe that where the actual importannce of the node is high, the importance based on predicted flows drop down drastically, and where the actual importance of the node is low, the importance based on predicted flows is higher. You can also notice line of zeros trying to hide behind the y axes. So although the model predicts flow with certain accuracy, th quality of the flow is not the same as from the data. More specifically the model is not really able to see the hierarchical structure of the nodes and creates predictions that are close to mean for all nodes.

Not suprisingly this is not the case for more complex gravity model such as doubly constrained gravity model. Here the hierarchy of the nodes is sustained by the prediction, however, that is expected as each node has it's own parameter in the model. Seems like a simple and smart solution, however, you need to remember the with doubly constrained model, where each node is considered as unique, predictions can be made only to existing nodes. So question like ‘What will happen with the flows if one of the GP's closes’ are hardly answerable. It's like a cheeky shortcut; if I am not able to meodel complicated flow structure, let's model each node separately.

Drop me a message on twitter if you have any comments :)

NHS Migration O-D Distance Decay