Comparing the graph structures of the patients flow

# we will first import all necessary libraries
import pandas as pd
import os
import numpy as np
import glob
from datetime import datetime
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from shapely.geometry import Point, LineString
#from mpl_toolkits.basemap import Basemap
import osmnx as ox
import networkx as nx
import matplotlib as mpl
from networkx.algorithms import bipartite as bi
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import glm
wgs84 = {'init':'EPSG:4326'}
bng = {'init':'EPSG:27700'}

Prescription flow model validation

Using graphs to detect changes in spatial structure of the prescription flow

Using a graphs to look at the properties of the flow network, we need to first think about what kind of graph the network actually is and what kind of connections it has.

Based on the number of edges between two points we recognize

  1. Simple graph - Undirected graph with no loops and one edge between two points
  2. Multiple graph - Undirected graph with no loops
  3. Pseudograph - Graph with loops

Based on the type of the link we recognise;

  1. Undirected graph - the flow does not have a direction
  2. Directed graph - the flow have a direction, flow from point A to point B

We also recognize;

  1. Incomplete graph - graph in which at least one set of nodes is missing connection
  2. Complete graph - graph in which all the nodes are connected to all the other nodes

Lastly, we can also recognize very special case of a graph called Bipartite. What is Bipartiv graph? Recently, “bipartivity” has been proposed as an important topological characteristic of complex networks. A network (graph) G = (V,E) is called bipartite if its vertex set V can be partitioned into two subsets V1 and V2 such that all edges have one endpoint in V1 and the other in V2.

The GP and pharmacy flow are in nature Directed Incomplete pseudograph. that is because the flow goes from GP to pharmacy, always one direction, it does not contain the flows between all the pairs and also contains some loops; Gp can also dispense small amount odf prescriptions. Nevertheles, as the loops are very small proportoin of the data and we can benefit from the graph as a bipartite, I have attached dummy GPs into a pharmacy points so we can treat the loops as normal flows.

NOTE: what is the implication of this? We are loosing the information about connection of dome flows to origin, however, we are gaining an information on the process which reminds zero-inflated or hurdle process;

  • Step 1 - patient come to GP but leaves without a prescriprion
  • Step 2 - patient come to GP get the prescription virtually and recieve the drug from GP directly
  • Step 3 - patient come to GP, get prescription, go to nearest pharmacy to get the drug
  • Step 4 - patient come to GP, get the prescription and go to pharmacy close to shopping opportunity to get the drug
  • Step 5 - patient come to GP, get the prescription and go to other pharmacy to get the drug

How do we represent the graph in Networkx?

Here are types of graphs in the networkx package

Networkx Class Type Self-loops allowed Parallel edges allowed
Graph undirected Yes No
DiGraph directed Yes No
MultiGraph undirected Yes Yes
MultiDiGraph directed Yes Yes

We know our graph is directed so Digraph should serve the purpose. However, the data consist from monthly prescription flow so for most of the nodes we can find 12 edges throughout the year, so we need to work with MultiDiGraph to make sure we can account for all of those.

1. Load the data

# load the output of the gravity models, the flow data
# set the path
file = r'./../../data/NHS/working/predicted_flow_17.csv'
# read the file
flow17 = pd.read_csv(file, index_col=None)
# plot the top of the table
flow17.head()

P_Code P_Name_x P_Postcode D_Code_x D_Name_x D_Postcode N_Items N_EPSItems Date loop ... D_Left_y D_X_y D_Y_y D_E_y D_N_y D_Setting_y residents Density predict_items predict_items_doubly
0 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FA251 HORIZON PHARMACY LTD BS34 6AS 130.0 0.0 2017-01-01 False ... 20180115.0 -2.562191 51.539195 361105.872423 182404.906640 50 456.517857 36.592857 29.316590 10.581379
1 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FA613 DOWNHAM FB LTD BS7 9JT 39.0 0.0 2017-01-01 False ... NaN -2.582219 51.480445 359664.895464 175881.865127 50 249.144737 54.493421 29.473430 2.853510
2 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FCL91 MYGLOBE LTD BS15 4ND 12.0 0.0 2017-01-01 False ... NaN -2.478870 51.459944 366826.940029 173549.785593 50 249.128205 30.569231 24.155556 2.061345
3 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FCT35 LLOYDS PHARMACY LTD BS20 7QA 661.0 0.0 2017-01-01 False ... 20180131.0 -2.758272 51.484474 347443.775309 176441.826480 50 369.081633 34.918367 1648.893267 1078.144522
4 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FDD77 JOHN WARE LIMITED BS20 6LT 4058.0 0.0 2017-01-01 False ... NaN -2.781349 51.481409 345837.744901 176117.839266 50 391.076923 29.092308 1194.123265 3570.014826

5 rows × 72 columns

# create a lists of the unique codes so we can subset the GP and PH point data
list_flow_GP = list(flow17['P_Code'].unique())
list_flow_PH = list(flow17['D_Code_new'].unique())
# get the pharmacies in
fl = './../../data/NHS/edispensary/PH_joined_daytimepop.geojson'
PH = gpd.read_file(fl)
PH = PH.to_crs(wgs84)
PH = PH[PH['D_Code_new'].isin(list_flow_PH)]
PH.plot();
C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\pyproj\crs\crs.py:53: FutureWarning:

'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6

png

# get the GP's points in
GP = gpd.read_file("./../../data/NHS/epraccur/GP_AVON.geojson")
GP = GP[['P_Code','P_Name', 'full_address', 'P_Postcode', 'Open_date', 
                                         'Close_date', 'Status_Code', 'Sub_type', 'Join', 'left', 'Setting', 'X', 'Y', 'geometry']]
GP = GP.rename(columns={'P_Postcode':'P_Postcode2','Open_date': 'P_Open_date', 'Close_date':'P_Close_date','full_address':'P_full_address' ,'Status_Code':'P_Status_Code', 'Sub_type':'P_Sub_type', 'Join':'P_Join', 'left':'P_Left', 'Setting':'P_Setting', 'X':'P_X', 'Y':'P_Y'}, errors="raise")
GP = GP.to_crs(wgs84)
GP = GP[GP['P_Code'].isin(list_flow_GP)]
GP.plot();
C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\pyproj\crs\crs.py:53: FutureWarning:

'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6

png

2. Create a graph

We now know that the graph is bipartite, has multiple edges and has a directions, which means that we need to create MultiDiGraph. We also have weight on edge, that is the number of prescriptions on flow and predicted items on flow for both of the used models.

G1 = nx.from_pandas_edgelist(flow17, source='P_Code', target='D_Code_new', edge_attr=['N_Items'], create_using=nx.MultiDiGraph())
print('Is the graph bipirtite?   ' + str(nx.is_bipartite(G1)))
print('Is the graph directed??   ' + str(nx.is_directed(G1)))
Is the graph bipirtite?   True
Is the graph directed??   True
G2 = nx.from_pandas_edgelist(flow17, source='P_Code', target='D_Code_new', edge_attr=['predict_items'], create_using=nx.MultiDiGraph())
print('Is the graph bipirtite?   ' + str(nx.is_bipartite(G2)))
print('Is the graph directed??   ' + str(nx.is_directed(G2)))
Is the graph bipirtite?   True
Is the graph directed??   True
G3 = nx.from_pandas_edgelist(flow17, source='P_Code', target='D_Code_new', edge_attr=['predict_items_doubly'], create_using=nx.MultiDiGraph())
print('Is the graph bipirtite?   ' + str(nx.is_bipartite(G3)))
print('Is the graph directed??   ' + str(nx.is_directed(G3)))
Is the graph bipirtite?   True
Is the graph directed??   True

3. Look at the properties of the graph

# Look at number on graph properties
nb_nodes = len(G1.nodes())
nb_arr = len(G1.edges())
print("Number of nodes : " + str(nb_nodes))
print("Number of edges : " + str(nb_arr))
Number of nodes : 485
Number of edges : 74668
degree_freq = np.array(nx.degree_histogram(G1)).astype('float')
plt.figure(figsize=(12, 8))
plt.stem(degree_freq)
plt.ylabel("Frequence")
plt.xlabel("Degre")
plt.show();
C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\ipykernel_launcher.py:3: UserWarning:

In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.

png

#degree_freq[degree_freq > 0]
x = degree_freq[degree_freq < 80]
plt.figure(figsize=(12, 8))
plt.stem(x)
plt.ylabel("Frequence")
plt.xlabel("Degre")
plt.show();
C:\Users\tk18583\AppData\Local\Continuum\anaconda3\envs\geonet\lib\site-packages\ipykernel_launcher.py:2: UserWarning:

In Matplotlib 3.3 individual lines on a stem plot will be added as a LineCollection instead of individual lines. This significantly improves the performance of a stem plot. To remove this warning and switch to the new behaviour, set the "use_line_collection" keyword argument to True.

png


plt.figure(figsize=(14,14))
nx.draw_networkx(G1, with_labels=True, width=0.02)

4. Define the functions to calculate centrality and snap the resluts to the point data

The Bipirtate graph is very specific cas of graph and has its own dedicated centraliy functions that can be calculated with networkx package.

This is the closeness and degree centrality, however none of those contain any weight, they are only based on the number of flows coming into the nodes.

No centrality can be measured in weighted bipirtite graph

Other measures we can look at are the Link Analysis measures. Those are Page Rank and HIIT (hubs and authorities). However, only Page Rank include a weight on edges,

Page Rank is the only measure we can apply on weighted directed bipartite graph!

def close_bi(graph, nodes, name, joinsA):
    # calculate the closenest
    close = bi.closeness_centrality(G = graph, nodes = nodes)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    return df

def degree_bi(graph, nodes, name, joinsA, joinsB):
    # calculate the closenest
    close = bi.degree_centrality(G = graph, nodes = nodes)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

def pgr(graph, name, joinsA, joinsB, weight):
    # calculate the closenest
    close = nx.pagerank_scipy(G = graph, weight = weight, max_iter = 50)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

def pgr(graph, name, joinsA, joinsB, weight):
    # calculate the closenest
    close = nx.pagerank_scipy(G = graph, weight = weight, max_iter = 50)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

# eigenvector centrality does not work for multidigraphs
def eigen_ce(graph, name, joinsA, joinsB, weight):
    # calculate the closenest
    close = nx.eigenvector_centrality(G = graph, max_iter = 100, weight = weight)
    # put it into a dataframe 
    x = pd.DataFrame.from_dict(close,  orient='index')
    x = x.reset_index()
    x = x.rename({0: name},axis = 1)

    df = joinsA.merge(x, left_on = 'P_Code', right_on = 'index', how = 'left')
    df2 = joinsB.merge(x, left_on = 'D_Code_new', right_on = 'index', how = 'left')
    return df, df2

5. Apply the functions for Bipartate graph

GP2 = close_bi(graph = G1, nodes = list_flow_GP, name = 'close_G1', joinsA = GP)
GP2 = close_bi(graph = G2, nodes = list_flow_GP, name = 'close_G2', joinsA = GP2)
GP2 = close_bi(graph = G3, nodes = list_flow_GP, name = 'close_G3', joinsA = GP2)

GP2, PH2 = degree_bi(graph = G1, nodes = list_flow_GP, name = 'degree_G1', joinsA = GP2, joinsB = PH)
GP2, PH2 = degree_bi(graph = G2, nodes = list_flow_GP, name = 'degree_G2', joinsA = GP2, joinsB = PH2)
GP2, PH2 = degree_bi(graph = G3, nodes = list_flow_GP, name = 'degree_G3', joinsA = GP2, joinsB = PH2)

GP2, PH2 =  pgr(graph = G1, name = 'pgr_G1', weight = 'N_Items', joinsA = GP2, joinsB = PH2)
GP2, PH2 =  pgr(graph = G2, name = 'pgr_G2', weight = 'predict_items', joinsA = GP2, joinsB = PH2)
GP2, PH2 =  pgr(graph = G3, name = 'pgr_G3', weight = 'predict_items_doubly', joinsA = GP2, joinsB = PH2)

# eigenvector centrality does not work for multidigraphs
#GP2, PH2 =  eigen_ce(graph = G1, name = 'eig_G1', weight = 'N_Items', joinsA = GP2, joinsB = PH2)
#GP2, PH2 =  eigen_ce(graph = G2, name = 'eig_G2', weight = 'predict_items', joinsA = GP2, joinsB = PH2)
#GP2, PH2 =  eigen_ce(graph = G3, name = 'eig_G3', weight = 'predict_items_doubly', joinsA = GP2, joinsB = PH2)

GP2.drop(['index_x','index_y' ,'index'], axis=1, inplace=True)
PH2.drop(['index_x','index_y'], axis=1, inplace=True)
PH2.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 357 entries, 0 to 356
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   D_Code          357 non-null    object  
 1   D_Name          357 non-null    object  
 2   D_full_address  357 non-null    object  
 3   D_Postcode2     357 non-null    object  
 4   D_Open_date     357 non-null    int64   
 5   D_Close_date    27 non-null     float64 
 6   D_Status_Code   357 non-null    object  
 7   D_Sub_type      357 non-null    object  
 8   D_Join          356 non-null    float64 
 9   D_Left          27 non-null     float64 
 10  D_X             357 non-null    float64 
 11  D_Y             357 non-null    float64 
 12  D_E             357 non-null    float64 
 13  D_N             357 non-null    float64 
 14  D_Setting       357 non-null    int64   
 15  D_Code_new      357 non-null    object  
 16  residents       357 non-null    float64 
 17  Density         357 non-null    float64 
 18  geometry        357 non-null    geometry
 19  degree_G1       357 non-null    float64 
 20  degree_G2       357 non-null    float64 
 21  degree_G3       357 non-null    float64 
 22  pgr_G1          357 non-null    float64 
 23  pgr_G2          357 non-null    float64 
 24  pgr_G3          357 non-null    float64 
dtypes: float64(15), geometry(1), int64(2), object(7)
memory usage: 72.5+ KB
# look at the values
GP2.iloc[:,14:23].head()

close_G1 close_G2 close_G3 degree_G1 degree_G2 degree_G3 pgr_G1 pgr_G2 pgr_G3
0 1.262397 1.262397 1.262397 1.456583 1.456583 1.456583 0.001684 0.001684 0.001684
1 1.262397 1.262397 1.262397 2.207283 2.207283 2.207283 0.001684 0.001684 0.001684
2 1.262397 1.262397 1.262397 1.834734 1.834734 1.834734 0.001684 0.001684 0.001684
3 1.262397 1.262397 1.262397 1.733894 1.733894 1.733894 0.001684 0.001684 0.001684
4 1.262397 1.262397 1.262397 0.806723 0.806723 0.806723 0.001684 0.001684 0.001684
PH2.iloc[:,19:25].head()

degree_G1 degree_G2 degree_G3 pgr_G1 pgr_G2 pgr_G3
0 2.140625 2.140625 2.140625 0.002018 0.001812 0.001996
1 1.687500 1.687500 1.687500 0.002194 0.001957 0.002179
2 0.648438 0.648438 0.648438 0.001689 0.001774 0.001689
3 2.937500 2.937500 2.937500 0.002471 0.002330 0.002497
4 3.773438 3.773438 3.773438 0.001700 0.001867 0.001718

6. Changing graph structure

We can see that the degree centrality and closeness centraility is the same for all the graphs, because the wight is not included.

The Pagerank for the GP's is not changing either as the flows are leaving them and page rank focuses on the incoming flows only.

The Page Rank algorithm computes a ranking of the nodes in the graph G based on the structure of the incoming links, so we can only observe the importance of pharmacies.

plt.figure(figsize=(30, 6))
# Degree Centrality
f, axarr = plt.subplots(1,3, num=1)

plt.sca(axarr[0])
PH2['pgr_G1'].plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
axarr[0].set_title('Page Rank Graph 1', size=16)

plt.sca(axarr[1])
PH2['pgr_G2'].plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
axarr[1].set_title('Page Rank Graph 2', size=16)

plt.sca(axarr[2])
PH2['pgr_G3'].plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
axarr[2].set_title('Page Rank Graph 3', size=16);

png

# which one is the most important? the one that sticks out
PH2[PH2['pgr_G1'] > 0.0044]

D_Code D_Name D_full_address D_Postcode2 D_Open_date D_Close_date D_Status_Code D_Sub_type D_Join D_Left ... D_Code_new residents Density geometry degree_G1 degree_G2 degree_G3 pgr_G1 pgr_G2 pgr_G3
112 FLQ56 BOOTS UK LIMITED BOOTS UK LIMITED,59 BROADMEAD,BRISTOL,nan,BS1 3ED BS1 3ED 19480101 NaN A 1 20040401.0 NaN ... FLQ56 1021.5 202.028571 POINT (-2.58972 51.45774) 8.898438 8.898438 8.898438 0.004406 0.002691 0.003961

1 rows × 25 columns

ax = plt.figure(figsize=(10, 10))
plt.axis([PH2['pgr_G1'].min(), PH2['pgr_G1'].max(), PH2['pgr_G1'].min(), PH2['pgr_G1'].max()])
# plots
plt.scatter(PH2['pgr_G1'],PH2['pgr_G1'])
plt.scatter(PH2['pgr_G1'],PH2['pgr_G2'])
plt.plot(PH2['pgr_G1'].mean(),PH2['pgr_G1'].mean(), c = 'r', marker = 'X', markersize = 30)
# regression plots
sns.regplot(PH2['pgr_G1'],PH2['pgr_G1'])
sns.regplot(PH2['pgr_G1'],PH2['pgr_G2'])

plt.legend(['Mean','Page Rank from data', 'Page Rank from base gravity']);

png

PH2['pgr_G1'].mean()
0.002197291434997281
ax = plt.figure(figsize=(10, 10))
plt.axis([PH2['pgr_G1'].min(), PH2['pgr_G1'].max(), PH2['pgr_G1'].min(), PH2['pgr_G1'].max()])
# plots
plt.scatter(PH2['pgr_G1'],PH2['pgr_G1'])
plt.scatter(PH2['pgr_G1'],PH2['pgr_G3'])
plt.plot(PH2['pgr_G1'].mean(),PH2['pgr_G1'].mean(), c = 'r', marker = 'X', markersize = 30)
# regression plots
sns.regplot(PH2['pgr_G1'],PH2['pgr_G1'])
sns.regplot(PH2['pgr_G1'],PH2['pgr_G3'])

plt.legend(['Mean','Page Rank from data', 'Page Rank from doubly gravity model' ]);

png

Results

What we can see on the figure above is how the node centrality in the graph conctructed from prediction relates to a node centrality in the graph constructed from raw data. In ideal situation, the line of best fit and the observations would follow the blue line, however, this is not the case. From the first figure, which looks at the importance of the nodes based on base gravity model (very simple one), you can observe that where the actual importannce of the node is high, the importance based on predicted flows drop down drastically, and where the actual importance of the node is low, the importance based on predicted flows is higher. You can also notice line of zeros trying to hide behind the y axes. So although the model predicts flow with certain accuracy, th quality of the flow is not the same as from the data. More specifically the model is not really able to see the hierarchical structure of the nodes and creates predictions that are close to mean for all nodes.

Not suprisingly this is not the case for more complex gravity model such as doubly constrained gravity model. Here the hierarchy of the nodes is sustained by the prediction, however, that is expected as each node has it's own parameter in the model. Seems like a simple and smart solution, however, you need to remember the with doubly constrained model, where each node is considered as unique, predictions can be made only to existing nodes. So question like ‘What will happen with the flows if one of the GP's closes’ are hardly answerable. It's like a cheeky shortcut; if I am not able to meodel complicated flow structure, let's model each node separately.

Drop me a message on twitter if you have any comments :)

Related