# Application of XGBoost regression for Spatial Interaction of Urban flow

Gravity based model with Gradient Boosting Trees

## Introduction

This notebook provides a simple example of the application of XGBoost Spatial Interaction model in python. For the purposes of this tutorial I will expect you to have some knowledge of Spatial Interaction Models and their purpose as well as some python skills. Yet, if you want to refresh a little or wish to learn, here is a list of relevant sources.

• Watch Maarten Vanhoof talking about what are the spatial Interaction Models and how they can be used here (6 min)
• Have a look at a list of some relevant literature in different fields here
• Look at the example of Spatial Interaction Model application in R from Adam Dennett Part1 and Part2
• Look at how you can apply the the same method in python with SpInt package

## So what is the XGBoost? and how does it work?

XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. Please have a look at this A Gentle Introduction to XGBoost for Applied Machine Learning and the Full documentation to XGBoost that also provides a tutorials. If you are like me and need something more visual and efficient explanation of the how the model works, I would highly recommend YouTube videos from Josh Starmer; Gradient Boost Part 1: Regression Main Ideas, Gradient Boost Part 2: Regression Details.

So after you gone through all the resources you know that there are two types of the model depending on what do you want to do with them. You can use them as a classifier, similar to decision trees, or as a regressor. Using the method as a classifier, we can predict discrete and categorical variables, and using it as a regressors, continuous variables can be predicted. Here we will be talking solely about Gradient Boosting as a regressor, because we are trying to model Urban flow, number of people that moves from one point to another.

### What is an Urban flow here?

In this example, I am looking at patients Urban flow. to be specific, this is a patients that are being treated at particular GP, receive a prescription and then go to one of the pharmacies to pick up the prescribed drug. This process of prescribing and dispensing a drug is closely monitored by the National Health Service, although the precise information about the patients is disclosed, the NHS Digital provides completely anonymised data of number of the prescriptions that has been prescribed and dispensed between each pair of GP and Pharmacy each month. Those can be translated as traces of patients movement between two points and together creates rich network of patients flow.

For further description of the data follow up on this notebook.

### Cross validation challenge

With any ordinary data, we would usually divide the data into testing set and training set by 80/20 split or any other proportion. However, we face more difficult case here.

The Urban flow networks are very particular case of network, which is basically a representation of the human behaviour in a space. And this is related to anything that affect our decision to choose a destination and the journey to it. Starting from maps, road or public transport structure and the type of vehicle, ending with our personal preference, gender, or family status. The network that the humans created by acting in it has very complex pattern in which the existence of a destination and its location matters.

Now, imagine we would divide the dataset into training and testing, how do we decide which locations will be for training and which ones will be for testing if each location is unique? For example, if we train the model on the flow data in a Scotland, we cannot test the same model on the data from London as people in both places has different behavioural patterns. There is more to it, read through this study that provides nice overview.

Luckily, the data I collected ranges from 2013 to mid-2019, which means that we can divide the data into blocks with the same timeframe. In this case I will use the data from 2017 for training and data from 2018 for testing. This block cross validation is only possible thanks to the consistent data.

## Let's get started

import xgboost as xgb # xgboost model library
import numpy as np # array manipulation
import pandas as pd # dataframe manipulation library
import geopandas as gpd # spatial data frame manipulatioon library
import statsmodels.api as sm
from sklearn.model_selection import train_test_split  # take out the validation packages from scikit.learn
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt # plotting
import seaborn as sns


### Load the 2017 train data

# define the path
file = r'./../.././data/NHS/working/year_data_flow/predicted_flow_2017.csv'

# define the path
file = r'./../.././data/NHS/working/year_data_flow/complete_predicted_flow_2017.csv'

fl17.head()


P_Code P_Name_x P_Postcode D_Code_x D_Name_x D_Postcode N_Items N_EPSItems Date loop ... P_full_address_y P_Status_Code_y P_Sub_type_y P_Setting_y P_X_y P_Y_y CODE POSTCODE NUMBER_OF_PATIENTS yhat
0 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FA251 HORIZON PHARMACY LTD BS34 6AS 130.0 0.0 2017-01-01 False ... PORTISHEAD MEDICAL GROUP,PORTISHEAD MEDICAL GR... A B 4 -2.767318 51.482805 L81004 BS20 6AQ 18363 0.044989
1 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FA613 DOWNHAM FB LTD BS7 9JT 39.0 0.0 2017-01-01 False ... PORTISHEAD MEDICAL GROUP,PORTISHEAD MEDICAL GR... A B 4 -2.767318 51.482805 L81004 BS20 6AQ 18363 0.240777
2 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FCL91 MYGLOBE LTD BS15 4ND 12.0 0.0 2017-01-01 False ... PORTISHEAD MEDICAL GROUP,PORTISHEAD MEDICAL GR... A B 4 -2.767318 51.482805 L81004 BS20 6AQ 18363 0.001761
3 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FCT35 LLOYDS PHARMACY LTD BS20 7QA 661.0 0.0 2017-01-01 False ... PORTISHEAD MEDICAL GROUP,PORTISHEAD MEDICAL GR... A B 4 -2.767318 51.482805 L81004 BS20 6AQ 18363 2393.873899
4 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FDD77 JOHN WARE LIMITED BS20 6LT 4058.0 0.0 2017-01-01 False ... PORTISHEAD MEDICAL GROUP,PORTISHEAD MEDICAL GR... A B 4 -2.767318 51.482805 L81004 BS20 6AQ 18363 2149.320471

5 rows × 77 columns

### Load the 2018 test data

# first get the 018 data in, those will be the test data
file = r'./../.././data/NHS/working/year_data_flow/predicted_flow_2018.csv'

# define the path
file = r'./../.././data/NHS/working/year_data_flow/complete_predicted_flow_2018.csv'

cfl18.head()


P_Code D_Code_new Date Density NUMBER_OF_PATIENTS N_Items Euc_distance great_circle_dist predicted
0 L81004 FA696 2018-01-01 36.696000 18615 8.0 0.224446 16.560437 0.000777
1 L81004 FCT35 2018-01-01 34.918367 18615 21.0 0.009198 0.653229 1963.680982
2 L81004 FDD77 2018-01-01 29.092308 18615 5309.0 0.014101 0.983999 1673.379704
3 L81004 FD125 2018-01-01 28.670000 18615 26.0 0.177172 12.816035 0.030142
4 L81004 FEA59 2018-01-01 30.833333 18615 1.0 0.292934 20.302017 0.000028

I have already selected the columns we need for the analysis but let me provide you with some explanation.

The gravity equation says that the flow, can be explained by gravitational power of two masses on origin and destination, and inversely proportional to the distance between them.

$$F_{od} = k\frac{M_o M_d}{D_{od}}$$

Where $F_{od}$ is the flow mass, $M_o$ is the mass on the origin, $M_d$ is mass on the destination and $D_{od}$ is a distance between them.

In our case the flow is the number of patient's between the origin (General Practitioner) and the destination (Pharmacy). The mass of the origin is represented by the number of patients registered at that particular GP in that year (mid-year estimate). The mass of the destination is the average daytime population in 1km radius from the pharmacy. And the distance is a great circle distance between the origin and destination. All of those has been calculated previously and I'm going to bore you with it another time.

In our datasets this is;

Variable
$F_{od}$ ‘N_Items’
$M_o$ ‘NUMBER_OF_PATIENTS’
$M_d$ ‘Density’
$D_{od}$ ‘great_circle_dist’

There are different spatial interaction models but for this purpose we only want to stick with the very basic form, one based on gravity model. We just need to translate the formula into an regression, Poisson regression to be more specific, because the flow is a count of the patients and skewed dis

## Spatial Interaction model definition

$$F_{od} = \exp( k + \mu \ln{M_o} + \alpha \ln{M_d} − \beta \ln{ D_{od}} )$$

Where $\lambda_{od}$ is the flow from $o$ to $d$, logarithmically linked or modelled by a linear combination of the logged independent vars. $\kappa ; \mu ; \alpha$ and $\beta$ are parameters to be estimated, $M_o$ is the mass on origin, $M_d$ is the mass on destination and $D_{od}$ is the distance between origin and destination.

this is in log-normal equation;

$$\ln{F_{od}} = k + \mu \ln{M_d} + \alpha \ln{M_d} − \beta \ln{ D_{od}}$$

and Poisson equation, which should have the same result;

$$F_{od} = k + \mu \ln{M_o} + \alpha \ln{M_d} − \beta \ln{ D_{od}}$$

Let's try both

1. log-normal model
2. Poisson model, correct definition

The XGBOOST offers

1. objective = “reg:squarederror”
2. objective = “count:poisson”

Note the regressor takes a all the data from data frame as they are, so we need to apply the transformations before we build to model as I understand.

# log all the variables
fl17['log_N_Items'] = np.log(fl17['N_Items'])
fl18['log_N_Items'] = np.log(fl18['N_Items'])

# log all the variables
fl17['1*great_circle_dist'] = 1*fl17['great_circle_dist']
fl18['1*great_circle_dist'] = 1*fl18['great_circle_dist']

# Do we need to do the 1-Distance???

fl17['log_NUMBER_OF_PATIENTS'] = np.log(fl17['NUMBER_OF_PATIENTS'])
fl18['log_NUMBER_OF_PATIENTS'] = np.log(fl18['NUMBER_OF_PATIENTS'])

fl17['log_Density'] = np.log(fl17['Density'])
fl18['log_Density'] = np.log(fl18['Density'])

# do the same thing for the complete data

# log all the variables
cfl17['1*great_circle_dist'] = 1*cfl17['great_circle_dist']
cfl18['1*great_circle_dist'] = 1*cfl18['great_circle_dist']

# Do we need to do the 1-Distance???

cfl17['log_NUMBER_OF_PATIENTS'] = np.log(cfl17['NUMBER_OF_PATIENTS'])
cfl18['log_NUMBER_OF_PATIENTS'] = np.log(cfl18['NUMBER_OF_PATIENTS'])

cfl17['log_Density'] = np.log(cfl17['Density'])
cfl18['log_Density'] = np.log(cfl18['Density'])

fl17log = fl17.iloc[:,[77,78,79,80]]
fl18log = fl18.iloc[:,[77,78,79,80]]

fl18log


log_N_Items 1*great_circle_dist log_NUMBER_OF_PATIENTS log_Density
0 2.079442 16.560437 9.831723 3.602668
1 3.044522 0.653229 9.831723 3.553013
2 8.577159 0.983999 9.831723 3.370474
3 3.258097 12.816035 9.831723 3.355851
4 0.000000 20.302017 9.831723 3.428596
... ... ... ... ...
71729 0.000000 0.973280 6.907755 5.144711
71730 0.000000 1.015876 6.907755 4.824431
71731 0.693147 1.507605 6.907755 4.806714
71732 1.098612 1.739077 6.907755 4.549942
71733 0.693147 1.578409 6.907755 4.358013

71734 rows × 4 columns

fl17log.hist(figsize = (10,10));


## Log-normal XGBoost SIM model

### Parameters

• booster = ‘dart’

XGBoost mostly combines a huge number of regression trees with a small learning rate. In this situation, trees added early are significant and trees added late are unimportant. Vinayak and Gilad-Bachrach proposed a new method to add dropout techniques from the deep neural net community to boosted trees and reported better results in some situations. Drop trees to solve the over-fitting. Trivial trees (to correct trivial errors) may be prevented. Because of the randomness introduced in the training, expect the following few differences: Training can be slower than gbtree because the random dropout prevents usage of the prediction buffer. The early stop might not be stable, due to the randomness.

Follow here to see all the parameters.

I also found this tutorial which has used the grid search from ‘scikit’ cross validation to test several versions of the model and it's parameters. This comes very handy, as we can instantly find the best performing model in instance. Maybe this can help to explain a bit more about that.

Learning rate = Step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.

max_depth = Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 is only accepted in lossguided growing policy when tree_method is set as hist and it indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. Range: [0,∞] (0 is only accepted in lossguided growing policy when tree_method is set as hist)

.

.

.

#### Note

Below, you can see an example for grid search for the best model within certain parameters. It was very efficient when I was working with the log-normal model only, however, using it within the Poisson version, the model converges after third of all the runs (depends how many options you give it)

# define the training and testing dataset
X_train, y_train = fl17log.loc[:,fl17log.columns != 'log_N_Items'], fl17log.loc[:,['log_N_Items']]
X_test, y_test = fl18log.loc[:,fl18log.columns != 'log_N_Items'], fl18log.loc[:,['log_N_Items']]

# transfer those into matrix as recommended class by in the package
DM_train = xgb.DMatrix(data = X_train,
label = y_train)

DM_test =  xgb.DMatrix(data = X_test,
label = y_test)

# define the parameters that we want to inspect
gbm_param_grid = {
'colsample_bytree': np.linspace(0.5, 0.9), # define fairly wide range
'n_estimators':[15, 20], # try different numbers
'max_depth': [ 15, 20] # different depths
}

# define the regressor model
gbm = xgb.XGBRegressor(objective = "reg:squarederror", booster = 'dart')

# define the grid search cross validation
grid_mse = GridSearchCV(estimator = gbm, param_grid = gbm_param_grid, scoring = 'neg_mean_squared_error', cv = 5, verbose = 1)

# fit the models and go through all the parameters
grid_mse.fit(X_train, y_train)

# print the best results
print("Best parameters found: ",grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

# define the training and testing dataset
X_train, y_train = fl17log.loc[:,fl17log.columns != 'log_N_Items'], fl17log.loc[:,['log_N_Items']]
X_test, y_test = fl18log.loc[:,fl18log.columns != 'log_N_Items'], fl18log.loc[:,['log_N_Items']]

# define the model (I already tested several version and stacked with the default setting and learning rate 0.4, the grid search is work in progress)
gbm = xgb.XGBRegressor(objective = "reg:squarederror", learning_rate= 0.4).fit(X_train, y_train)


pred = gbm.predict(X_test)

print("Root mean square error for test dataset: {}".format(np.round(np.sqrt(mean_squared_error(np.exp(y_test), np.exp(pred))), 2)))

Root mean square error for test dataset: 949.98

test = pd.DataFrame({"prediction": np.exp(pred), "observed": np.exp(y_test['log_N_Items'])})

lowess = sm.nonparametric.lowess

z = lowess(np.exp(pred).flatten(), np.exp(y_test['log_N_Items']))

test.plot(figsize = [14,8],
x ="prediction", y = "observed", kind = "scatter", color = 'darkred')

plt.title("Extreme Gradient Boosting: Prediction Vs Test Data", fontsize = 18, color = "darkgreen")

plt.xlabel("Predicted Power Output", fontsize = 18)

plt.ylabel("Observed Power Output", fontsize = 18)

plt.plot(z[:,0], z[:,1], color = "blue", lw= 3)

plt.show()



fl17['log'] =  np.exp(gbm.predict(X_train))

fl18['log'] =  np.exp(gbm.predict(X_test))


## Poisson XGBoost SIM model

# create new dataset for the Poisson
# this will be logged masses, 1*dist and no change to y
fl17poi = fl17.iloc[:,[6,78,79,80]]
fl18poi = fl18.iloc[:,[6,78,79,80]]
cfl17poi = cfl17.iloc[:,[9,10,11,5]]
cfl18poi = cfl18.iloc[:,[9,10,11,5]]

fl17poi.hist(figsize = (10,10));
cfl17poi.hist(figsize = (10,10));


### For published data only

# define the training and testing dataset
X_train2, y_train2 = fl17poi.loc[:,fl17poi.columns != 'N_Items'], fl17poi.loc[:,['N_Items']]
X_test2, y_test2 = fl18poi.loc[:,fl18poi.columns != 'N_Items'], fl18poi.loc[:,['N_Items']]

# define the model
gbm2 = xgb.XGBRegressor(objective = "count:poisson",  learning_rate= 0.4).fit(X_train2, y_train2)



### For complete data with all the 0 flows in

# define the training and testing dataset
X_train3, y_train3 = cfl17poi.loc[:,cfl17poi.columns != 'N_Items'], cfl17poi.loc[:,['N_Items']]
X_test3, y_test3 = cfl18poi.loc[:,cfl18poi.columns != 'N_Items'], cfl18poi.loc[:,['N_Items']]

gbm3 = xgb.XGBRegressor(objective = "count:poisson",  learning_rate= 0.4).fit(X_train3, y_train3)



Best parameters found: {‘colsample_bytree’: 0.8, ‘max_depth’: 40, ‘n_estimators’: 30} Lowest RMSE found: 1159.4761503247805

pred2 = gbm2.predict(X_test2)

print("Root mean square error for test dataset: {}".format(np.round(np.sqrt(mean_squared_error(y_test2, pred2)), 2)))

pred3 = gbm3.predict(X_test3)

print("Root mean square error for test dataset: {}".format(np.round(np.sqrt(mean_squared_error(y_test3, pred3)), 2)))

Root mean square error for test dataset: 704.17
Root mean square error for test dataset: 279.73

sns.regplot(np.exp(pred), np.exp(y_test['log_N_Items']), color = 'blue',line_kws = {'color':'red'})
plt.title('Predicted vs actual, Log-normal XGBOOST, published data only', fontsize=20);


sns.regplot(pred2, y_test2['N_Items'], color = 'blue',line_kws = {'color':'red'})
plt.title('Predicted vs actual, Poisson XGBOOST, published data only', fontsize=20);


sns.regplot(pred3, y_test3['N_Items'], color = 'blue',line_kws = {'color':'red'})
plt.title('Predicted vs actual, Poisson XGBOOST, complete data', fontsize=20);


#### The Poisson seems to be better when using the default parameter setting and learning rate 0.4

We already have the in sample and out of sample predictions from gravity model in the data, so now we need to add the in and out of sample prediction from the XGB model

# published
fl17['xgb'] =  gbm2.predict(X_train2)

fl18['xgb'] =  gbm2.predict(X_test2)

# complete
cfl17['xgb'] =  gbm2.predict(X_train3)

cfl18['xgb'] =  gbm2.predict(X_test3)



P_Code P_Name_x P_Postcode D_Code_x D_Name_x D_Postcode N_Items N_EPSItems Date loop ... CODE POSTCODE NUMBER_OF_PATIENTS predicted log_N_Items 1*great_circle_dist log_NUMBER_OF_PATIENTS log_Density log xgb
0 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FA696 THE BULLEN HEALTHCARE GROUP LIMITED BS32 4JT 8.0 0.0 2018-01-01 False ... L81004 BS20 6AQ 18615 0.021203 2.079442 16.560437 9.831723 3.602668 3.172393 2.952730
1 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FCT35 LLOYDS PHARMACY LTD BS20 7QA 21.0 0.0 2018-01-01 False ... L81004 BS20 6AQ 18615 2417.864485 3.044522 0.653229 9.831723 3.553013 163.713943 861.965210
2 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FDD77 JOHN WARE LIMITED BS20 6LT 5309.0 0.0 2018-01-01 False ... L81004 BS20 6AQ 18615 2170.860226 8.577159 0.983999 9.831723 3.370474 833.031494 3730.304199
3 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FD125 BOOTS UK LIMITED BS34 5UP 26.0 0.0 2018-01-01 False ... L81004 BS20 6AQ 18615 0.390401 3.258097 12.816035 9.831723 3.355851 6.255939 5.463578
4 L81004 PORTISHEAD MEDICAL GROUP BS20 6AQ FEA59 LLOYDS PHARMACY LTD BS16 7AE 1.0 0.0 2018-01-01 False ... L81004 BS20 6AQ 18615 0.001570 0.000000 20.302017 9.831723 3.428596 3.263482 1.575869

5 rows × 83 columns

## The accuracy measure

Lets organize this into a table;

R2 Gravity SIM R2 XGBOOST SIM RMSE Gravity SIM RMSE XGBOOST SIM
Models based on published data PGIr2 PXIr2 PGIrs PXIrs
PGOr2 PXOr2 PGOrs PXOrs
Models based on complete data CGIr2 CXIr2 CGIrs CXIrs
CGOr2 CXOr2 CGOrs CXOrs
# define R2 function
def get_r2_numpy_corrcoef(x, y):
return np.corrcoef(x, y)[0, 1]**2

# define RMSE function
def rmse(predictions, targets):
return np.sqrt(((predictions - targets) ** 2).mean())


## Other evaluations?

Some is available at Robinsons github, but requires matrices.

See to definition of the Common Part of Commuters measure. Source article

def cpc(actual, predicted):

# reshape the variables
predicted = np.reshape(np.array(predicted), (-1, 1))
actual = np.reshape(np.array(actual),(-1, 1))
# stack them together
YYhat = np.hstack([actual, predicted])
# define sum of the minimym of the predicted var
NCC = np.sum(np.min(YYhat, axis=1)) # , axis=1
# number of observations
N = len(actual)
# sum of all the actual numbers (reshaped)
NCY = np.sum(actual)
# sum of the predicted numbers (reshaped)
NCYhat = np.sum(predicted)

numerator = (N * NCC)
denominator = (NCY + NCYhat)

return numerator/denominator


$$\frac{ N_i * \sum \min(ij) }{ \sum_i + \sum_j }$$

def sorensen(actual, predicted):
# reshape the variables
predicted = np.reshape(np.array(predicted), (-1, 1))
actual = np.reshape(np.array(actual),(-1, 1))
# define length of the variables
N =  len(actual)
# stack them together in array
YYhat = np.hstack([actual, predicted])
# define 2 times minimum from both
num = 2.0 * np.min(YYhat, axis=1)
# define the differnece
den =  predicted + actual

return (1.0 / N) * (np.sum(num.reshape((-1, 1)) / den.reshape((-1, 1))))


$$\frac{1}{N} * \frac{\sum{2 * \min(i,j)} }{j-i}$$

# Let's do the result tables again with those measures

R2 Gravity SIM R2 XGBOOST SIM RMSE Gravity SIM RMSE XGBOOST SIM CPC Gravity SIM CPC XGBOOST SIM SORENSEN I Gravity SORENSEN I XGBOOST
Models based on published data PGIr2 PXIr2 PGIrs PXIrs PGIcpc PXIcpc PGIsn PXIsn
PGOr2 PXOr2 PGOrs PXOrs PGOcpc PXOcpc PGOsn PXOsn
Models based on complete data CGIr2 CXIr2 CGIrs CXIrs CGIcpc CXIcpc CGIsn CXIsn
CGOr2 CXOr2 CGOrs CXOrs CGOcpc CXOcpc CGOsn CXOsn
# R2
# published data
PGIr2 = round(get_r2_numpy_corrcoef(fl17['N_Items'], fl17['yhat']),3)
PGOr2 = round(get_r2_numpy_corrcoef(fl18['N_Items'], fl18['predicted']),3)

PXIr2 = round(get_r2_numpy_corrcoef(fl17['N_Items'], fl17['xgb']),3)
PXOr2 = round(get_r2_numpy_corrcoef(fl18['N_Items'], fl18['xgb']),3)

# complete data
CGIr2 = round(get_r2_numpy_corrcoef(cfl17['N_Items'], cfl17['yhat']),3)
CGOr2 = round(get_r2_numpy_corrcoef(cfl18['N_Items'], cfl18['predicted']),3)

CXIr2 = round(get_r2_numpy_corrcoef(cfl17['N_Items'], cfl17['xgb']),3)
CXOr2 = round(get_r2_numpy_corrcoef(cfl18['N_Items'], cfl18['xgb']),3)


# RMSE

PGIrs = round(rmse(fl17['N_Items'], fl17['yhat']),1)
PGOrs = round(rmse(fl18['N_Items'], fl18['predicted']),1)

PXIrs = round(rmse(fl17['N_Items'], fl17['xgb']),1)
PXOrs = round(rmse(fl18['N_Items'], fl18['xgb']),1)

CGIrs = round(rmse(cfl17['N_Items'], cfl17['yhat']),1)
CGOrs = round(rmse(cfl18['N_Items'], cfl18['predicted']),1)

CXIrs = round(rmse(cfl17['N_Items'], cfl17['xgb']),1)
CXOrs = round(rmse(cfl18['N_Items'], cfl18['xgb']),1)

# CPC

PGIcpc = round(cpc(fl17['N_Items'], fl17['yhat']),1)
PGOcpc = round(cpc(fl18['N_Items'], fl18['predicted']),1)

PXIcpc = round(cpc(fl17['N_Items'], fl17['xgb']),1)
PXOcpc = round(cpc(fl18['N_Items'], fl18['xgb']),1)

CGIcpc = round(cpc(cfl17['N_Items'], cfl17['yhat']),1)
CGOcpc = round(cpc(cfl18['N_Items'], cfl18['predicted']),1)

CXIcpc = round(cpc(cfl17['N_Items'], cfl17['xgb']),1)
CXOcpc = round(cpc(cfl18['N_Items'], cfl18['xgb']),1)

# Sorensen

PGIsn = round(sorensen(fl17['N_Items'], fl17['yhat']),3)
PGOsn = round(sorensen(fl18['N_Items'], fl18['predicted']),3)

PXIsn = round(sorensen(fl17['N_Items'], fl17['xgb']),3)
PXOsn = round(sorensen(fl18['N_Items'], fl18['xgb']),3)

CGIsn = round(sorensen(cfl17['N_Items'], cfl17['yhat']),3)
CGOsn = round(sorensen(cfl18['N_Items'], cfl18['predicted']),3)

CXIsn = round(sorensen(cfl17['N_Items'], cfl17['xgb']),3)
CXOsn = round(sorensen(cfl18['N_Items'], cfl18['xgb']),3)

# everything for the log-normal model
LXIr2 = round(get_r2_numpy_corrcoef(fl17['N_Items'], fl17['log']),3)
LXOr2 = round(get_r2_numpy_corrcoef(fl18['N_Items'], fl18['log']),3)

LXIrs = round(rmse(fl17['N_Items'], fl17['log']),1)
LXOrs = round(rmse(fl18['N_Items'], fl18['log']),1)

LXIcpc = round(cpc(fl17['N_Items'], fl17['log']),1)
LXOcpc = round(cpc(fl18['N_Items'], fl18['log']),1)

LXIsn = round(sorensen(fl17['N_Items'], fl17['log']),3)
LXOsn = round(sorensen(fl18['N_Items'], fl18['log']),3)

model_name = ['In-sample','Out-sample']
mG = [PGIr2, PGOr2]
mX = [PXIr2, PXOr2]
mF = [PGIrs, PGOrs]
mH = [PXIrs, PXOrs]
mD1 = [PGIcpc, PGOcpc]
mD2 = [PXIcpc, PXOcpc]
mK1 = [PGIsn, PGOsn]
mK2 = [PXIsn, PXOsn]

cols = {'Poisson models based on published data': model_name,'R2 Gravity SIM': mG, 'R2 XGBOOST SIM': mX, 'RMSE Gravity SIM': mF, 'RMSE XGBOOST SIM': mH, 'CPC Gravity SIM': mD1,
'CPC XGBOOST SIM': mD2,  'SORENSEN I Gravity': mK1, 'SORENSEN I XGBOOST': mK2}

table_published = pd.DataFrame(cols).set_index('Poisson models based on published data')

model_name = ['In-sample','Out-sample']
G = [CGIr2, CGOr2]
X = [CXIr2, CXOr2]
mF = [CGIrs, CGOrs]
mH = [CXIrs, CXOrs]
mD1 = [CGIcpc, CGOcpc]
mD2 = [CXIcpc, CXOcpc]
mK1 = [CGIsn, CGOsn]
mK2 = [CXIsn, CXOsn]

cols = {'Poisson models based on complete data': model_name,'R2 Gravity SIM': G, 'R2 XGBOOST SIM': X, 'RMSE Gravity SIM': mF, 'RMSE XGBOOST SIM': mH,  'CPC Gravity SIM': mD1,
'CPC XGBOOST SIM': mD2,  'SORENSEN I Gravity': mK1, 'SORENSEN I XGBOOST': mK2}

table_complete = pd.DataFrame(cols).set_index('Poisson models based on complete data')

NA = 'NA'
model_name = ['In-sample','Out-sample']
mG = [NA, NA]
mX = [LXIr2, LXOr2]
mF = [NA, NA]
mH = [LXIrs, LXOrs]
mD1 = [NA, NA]
mD2 = [LXIcpc, LXOcpc]
mK1 = [NA, NA]
mK2 = [LXIsn, LXOsn]

cols = {'Log-normal models based on published data': model_name,'R2 Gravity SIM': mG, 'R2 XGBOOST SIM': mX, 'RMSE Gravity SIM': mF, 'RMSE XGBOOST SIM': mH, 'CPC Gravity SIM': mD1,
'CPC XGBOOST SIM': mD2,  'SORENSEN I Gravity': mK1, 'SORENSEN I XGBOOST': mK2}

table_log = pd.DataFrame(cols).set_index('Log-normal models based on published data')

table_log


R2 Gravity SIM R2 XGBOOST SIM RMSE Gravity SIM RMSE XGBOOST SIM CPC Gravity SIM CPC XGBOOST SIM SORENSEN I Gravity SORENSEN I XGBOOST
Log-normal models based on published data
In-sample NA 0.560 NA 809.9 NA 25504.1 NA 0.686
Out-sample NA 0.381 NA 950.0 NA 21690.5 NA 0.620
# models based on published data only
table_published


R2 Gravity SIM R2 XGBOOST SIM RMSE Gravity SIM RMSE XGBOOST SIM CPC Gravity SIM CPC XGBOOST SIM SORENSEN I Gravity SORENSEN I XGBOOST
Poisson models based on published data
In-sample 0.321 0.736 976.2 611.1 17813.1 27736.8 0.318 0.513
Out-sample 0.330 0.653 981.4 704.2 17207.5 24790.0 0.314 0.490
# models based on complete data with all the zero flows included
table_complete


R2 Gravity SIM R2 XGBOOST SIM RMSE Gravity SIM RMSE XGBOOST SIM CPC Gravity SIM CPC XGBOOST SIM SORENSEN I Gravity SORENSEN I XGBOOST
Poisson models based on complete data
In-sample 0.287 0.691 380.6 249.9 106667.5 168990.5 0.045 0.070
Out-sample 0.297 0.605 385.3 288.8 100976.7 145788.9 0.046 0.068
x = table_log.append(table_published)
x = x.append(table_complete)
x['model_type'] = ['log-normal', 'log-normal', 'Poisson', 'Poisson', 'Poisson', 'Poisson']
x['data_type'] = ['published_only', 'published_only', 'published_only', 'published_only', 'complete', 'complete']
x['CV'] = ['In-sample', 'Out-sample','In-sample', 'Out-sample','In-sample', 'Out-sample']
x = x.set_index(['model_type', 'data_type'])
x


R2 Gravity SIM R2 XGBOOST SIM RMSE Gravity SIM RMSE XGBOOST SIM CPC Gravity SIM CPC XGBOOST SIM SORENSEN I Gravity SORENSEN I XGBOOST CV
model_type data_type
log-normal published_only NA 0.560 NA 809.9 NA 25504.1 NA 0.686 In-sample
published_only NA 0.381 NA 950.0 NA 21690.5 NA 0.620 Out-sample
Poisson published_only 0.321 0.736 976.2 611.1 17813.1 27736.8 0.318 0.513 In-sample
published_only 0.33 0.653 981.4 704.2 17207.5 24790.0 0.314 0.490 Out-sample
complete 0.287 0.691 380.6 249.9 106668 168990.5 0.045 0.070 In-sample
complete 0.297 0.605 385.3 288.8 100977 145788.9 0.046 0.068 Out-sample
x.iloc[[0,2,4],:]['R2 XGBOOST SIM'].plot()
x.iloc[[1,3,5],:]['R2 XGBOOST SIM'].plot()
plt.title('R2 XGBOOST', fontsize=20)
plt.legend(['In-sample', 'Out-sample'])
plt.xlabel(' ');


x.iloc[[0,2,4],:]['RMSE XGBOOST SIM'].plot()
x.iloc[[1,3,5],:]['RMSE XGBOOST SIM'].plot()
plt.title('RMSE  XGBOOST', fontsize=20)
plt.legend(['In-sample', 'Out-sample'])
plt.xlabel(' ');


x.iloc[[0,2,4],:]['CPC XGBOOST SIM'].plot()
x.iloc[[1,3,5],:]['CPC XGBOOST SIM'].plot()
plt.title('CPC XGBOOST', fontsize=20)
plt.legend(['In-sample', 'Out-sample'])
plt.xlabel(' ');


x.iloc[[0,2,4],:]['SORENSEN I XGBOOST'].plot()
x.iloc[[1,3,5],:]['SORENSEN I XGBOOST'].plot()
plt.title('SORENSEN XGBOOST', fontsize=20)
plt.legend(['In-sample', 'Out-sample'])
plt.xlabel(' ');


Sorensen similarity Index is bounded between values of 0 and 1 with values closer to 1 indicating a better model fit.

How come that the complete dataset has exceptionally low index, because of the zero issue?

cfl17.loc[:,['N_Items', 'xgb']]


N_Items xgb
0 130.0 11.200489
1 39.0 7.248779
2 12.0 3.174812
3 661.0 861.965210
4 4058.0 3730.304199
... ... ...
543739 0.0 8.880725
543740 0.0 5.127942
543741 0.0 5.786361
543742 0.0 7.492306
543743 0.0 18.535973

543744 rows × 2 columns

Prob it is caused by the excessive zeros, we would probably need to introduce an ensemble model to deal with it.

Some other is available from Tylor Oshan He has CPC as well as Sorensen index