Retrieving custom networks from OSM using Pyrosm and translating to Pandana and NetworkX


Pyrosm is a very clever python library, written by Henrikki Tenkane, that allows users to quickly and effectively retrieve data from OpenStreetMap, using Cython and Protocolbuffer Binary Format -files (PBFs). It is ultimately the most useful tool to retrieve OSM data within python and is fairly well documented.

Yet, there were few places which can be confusing and where extended description and tutorial is needed. This is specifically;

  • retrieving custom network (cycling example)
  • converting custom network to Pandana/Networkx graph

In this tutorial, I'll also repeat the distances calculation with pandana from this post and update on any changes.

I would have like to contribute to the Pyrosm library with the first point, however, Cython is not amongst my skills right now. So if someone knows how to do that, feel free.

Retrieving custom driving/walking/cycling network

The Pyrosm uses custom_filter to define what features should be extracted from the PBF, however, this filter is not available for the networks. Instead the Pyrosm offers get_data_by_custom_criteria() and keep vs exclude feature which allows you to define which keys to keep or exclude from the OSM. This function also allows you to choose if you want to keep the nodes in the data, but those are not the network nodes, those are point features that has the tags defined in your custom filter. This means, that you can get custom network, but only the edge list. However, in order to construct graph in Pandana we need both edge list and node list. Now you could just get the edge list, create a graph in Networkx and then translate to Pandana by exporting edgelist and node list out, but let's be realistic. If there is a way to avoid multiple conversions and installing another package, I would be much happier.

I found an easy way to retrieve both nodes and edges for custom created network. The process is as follows;

  • Retrieve complete network for the selected area
  • Filter out the edges features based on selected keys from OSM
  • List all the existing nodes in the edge dataframe
  • Filter the nodes by the list from the previous step
# import the necessary packages
import geopandas as gpd
import pandas as pd

# import pyrosm functions
from pyrosm import OSM
from pyrosm import get_data
# network libraries
import pandana
import networkx as nx
def get_network_custom(place,filter_type):
    place: string 
        name of the pbf to be downloaded 
    filter_type: string
        'cycling' only for now
    # connect to the data
    osm = OSM(get_data(place))
    ## if you have already downloaded pbf file and it’s in your file directory use following
    ## osm = OSM(place)
    ## where place is 'filepath.pbf'
    # retrieve the complete network 
    nodes,edges = osm.get_network(nodes=True, network_type="all")
    # define the filter
    if filter_type == 'cycling':
        highway = ['primary',
    else: raise ValueError('try cycling')
    # you could define other profiles such as driving or walking
    # choose only those edges that are inside the filter
    edges = edges[edges['highway'].isin(highway)]
    # list origin nodes
    list_origins = list(edges.u.unique())
    # list destination nodes
    list_destinations = list(edges.v.unique())
    # combine the to lists to see all nodes that exist in filtered network 
    list_nodes = list_origins + list_destinations
    # cut the nodes by the existing list
    nodes_gdf = nodes[] 
    return nodes_gdf,edges

nodes, edges = get_network_custom('Bristol','cycling')
ax = edges.plot()
nodes.plot(ax = ax, color = 'orange', markersize = 2);


Converting custom network to pandana/networkx graph

Converting the two geodataframes to graphs is then dead easy.
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 515871 entries, 0 to 522320
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   lon        515871 non-null  float64 
 1   lat        515871 non-null  float64 
 2   tags       20935 non-null   object  
 3   timestamp  515871 non-null  int8    
 4   version    515871 non-null  int8    
 5   changeset  515871 non-null  int8    
 6   id         515871 non-null  int64   
 7   geometry   515871 non-null  geometry
dtypes: float64(2), geometry(1), int64(1), int8(3), object(1)
memory usage: 25.1+ MB
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 551813 entries, 0 to 560619
Data columns (total 43 columns):
 #   Column          Non-Null Count   Dtype   
---  ------          --------------   -----   
 0   access          45542 non-null   object  
 1   area            10881 non-null   object  
 2   bicycle         40145 non-null   object  
 3   bridge          2960 non-null    object  
 4   busway          84 non-null      object  
 5   cycleway        7300 non-null    object  
 6   est_width       785 non-null     object  
 7   foot            87694 non-null   object  
 8   footway         5027 non-null    object  
 9   highway         551813 non-null  object  
 10  int_ref         0 non-null       object  
 11  junction        6390 non-null    object  
 12  lanes           30165 non-null   object  
 13  lit             49202 non-null   object  
 14  maxspeed        97150 non-null   object  
 15  motorcar        672 non-null     object  
 16  motorroad       8 non-null       object  
 17  motor_vehicle   8631 non-null    object  
 18  name            202888 non-null  object  
 19  oneway          49142 non-null   object  
 20  overtaking      61 non-null      object  
 21  path            835 non-null     object  
 22  passing_places  47 non-null      object  
 23  psv             873 non-null     object  
 24  ref             25646 non-null   object  
 25  service         45451 non-null   object  
 26  segregated      13048 non-null   object  
 27  sidewalk        38667 non-null   object  
 28  smoothness      2099 non-null    object  
 29  surface         145345 non-null  object  
 30  tracktype       5161 non-null    object  
 31  tunnel          1281 non-null    object  
 32  turn            7 non-null       object  
 33  width           3221 non-null    object  
 34  id              551813 non-null  int64   
 35  timestamp       551813 non-null  int64   
 36  version         551813 non-null  int8    
 37  tags            183067 non-null  object  
 38  osm_type        551813 non-null  object  
 39  geometry        551813 non-null  geometry
 40  u               551813 non-null  int64   
 41  v               551813 non-null  int64   
 42  length          551813 non-null  float64 
dtypes: float64(1), geometry(1), int64(4), int8(1), object(36)
memory usage: 181.6+ MB
# This might be necessary if you get the error 
## ValueError: Buffer dtype mismatch, expected 'long' but got 'double'
nodes2 = nodes.set_index('id')
# create pandana graph
pandana_graph = pandana.Network(nodes2['lon'], 
# create networkx graph
networkx_graph = nx.from_pandas_edgelist(edges,source = 'u', target = 'v', edge_attr = 'length')

Calculating shortest distance between set of points using pandana

pois = gpd.read_file("./points.geojson")
pois2 = pois.rename(columns = {'ID_code':'code_ID'})
import itertools
def get_shortest_path(points, polygons, point_id, poly_id, Graph):
    # check the projections
    if != 'EPSG:4326':
        raise ValueError("The point CRS is not Mercator, Use gpd.to_crs('EPSG4326') to transform the CRS")

    if != 'EPSG:4326':
        raise ValueError("The polygon CRS is not Mercator, Use gpd.to_crs('EPSG4326') to transform the CRS")

        Parses graph and two sets of locational datasets and returns dataframe with Origin destination ids, distances and times along the network.
        point : point geodataframe 
        polygons : polygon or second point geodataframe
        point_id : str
            unique ID column of points
        poly_id : str
            unique ID column of polygons or the second points
         Graph: object 
        dataframe with distances and times between each pair of points
        Return type

    # define the unique origins and destinations
    origins = list(points[point_id].unique())
    destinations = list(polygons[poly_id].unique())
    # create OD data
    od_data = pd.DataFrame(list(itertools.product(origins,destinations))).rename(columns = {0:point_id,1:poly_id})
    # add coordinates to the data
    # if the data is other then points, it would take centrooid of the geometry
    if points.type[0] == 'Point':
        point_xy = pd.DataFrame({point_id: points[point_id], 'X':points.geometry.x, 'Y':points.geometry.y}).reset_index(drop=True)
    elif points.type[0] == 'MultiPoint':
        point_xy = pd.DataFrame({point_id: points[point_id], 'X':points.geometry.x, 'Y':points.geometry.y}).reset_index(drop=True)
    else: point_xy = pd.DataFrame({point_id: points[point_id], 'X':points.geometry.centroid.x, 'Y':points.geometry.centroid.y}).reset_index(drop=True)
    if polygons.type[0] == 'Point':
        poly_xy = pd.DataFrame({poly_id: polygons[poly_id], 'X':polygons.geometry.x, 'Y':polygons.geometry.y}).reset_index(drop=True)
    elif polygons.type[0] == 'MultiPoint':
        poly_xy = pd.DataFrame({poly_id: polygons[poly_id], 'X':polygons.geometry.x, 'Y':polygons.geometry.y}).reset_index(drop=True)
    else: poly_xy = pd.DataFrame({poly_id: polygons[poly_id], 'X':polygons.geometry.centroid.x, 'Y':polygons.geometry.centroid.y}).reset_index(drop=True)
    # build origin destination data    
    od_data = od_data.merge(point_xy, on=point_id, how = 'left').merge(poly_xy, on = poly_id, how = 'left')

    # find nodes on the network
    origin_nodes = Graph.get_node_ids(od_data.X_x, od_data.Y_x).values
    dests_nodes = Graph.get_node_ids(od_data.X_y, od_data.Y_y).values

    # find the distances and times
    od_data['distance'] = pd.Series(Graph.shortest_path_lengths(origin_nodes, dests_nodes, 'length'))
    return od_data
import time

start = time.time()

dists = get_shortest_path(points = pois, polygons = pois2, point_id = 'ID_code', poly_id = 'code_ID', Graph = pandana_graph)

end = time.time()
print(end - start )


Notes: You can see that there is a suspicious distance that are bigger than 300 000 meters. This happens when the graph can't find the route for set of two points. This is most likely because one of the nodes is on disconnected graphs within the network and can't reach the other ones, or they could be outside the graph extent or something else. It would be smart to investigate the graph, but that’s not part of this post :)