How to integrate Geoapify into your Python data science toolbox

The geobatchpy package connects Geoapify with Python

Practically every business makes use of location data: for billing and delivery, supply chains, prospect targeting, and more. Does yours also make use of geospatial analytics? Don't worry if you have not started yet. It has never been more accessible, thanks to broadly adopted standards and a rich open-source and open-data community. The tutorial below will give you a kick-start using Python, the Geoapify Batch API, and just a few lines of code.

This tutorial introduces the geobatchpy package, developed by Paul Kinsvater. The package provides a simple Python interface for calling Geoapify batch requests. It is not the official SDK, but we find it helpful for Python developers.

Why Python for geospatial analytics?

The Python data science community is undoubtedly one of the richest, with the R community likely its strongest competitor. Don't stop here if your team prefers R; the examples below are easy to follow along with and translate.

Python comes with a long list of open-source packages for geospatial analytics. Most notable is GeoPandas, the workhorse of the geospatial ecosystem. It is the tool of choice for manipulating arrays of geometric objects: computing areas and distances, joining on neighborhood relations, and more. A recent addition to this list is my geobatchpy package. It wraps the Geoapify API with geospatial analytics first and foremost in mind, and it ships with a command-line interface for the Geoapify Batch API.

The following tutorial shows you how to put those tools into action. Do you want to follow along with your own data? Don't forget to subscribe to Geoapify if you have not already. The free plan comes with 3k credits daily, including commercial use. Thanks to the discounts in Batch API pricing, this plan covers our tutorial for almost 2k addresses.

A recipe to start with geospatial analytics in Python

We executed the code in this tutorial using Python 3.9.13 with main dependencies geobatchpy==0.2.1 and geopandas==0.11.1. This installation guide shows you how to set up a Conda virtual environment on your computer.
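If you prefer a quick start over the full guide, one possible way to create such an environment looks as follows (the environment name is our choice; the package versions are those stated above):

```shell
# Create and activate a fresh Conda environment, then install the two
# main dependencies used throughout this tutorial:
conda create --name geo-tutorial python=3.9 -y
conda activate geo-tutorial
pip install geobatchpy==0.2.1 geopandas==0.11.1
```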

Part 1 - prepare the address data

Our example consists of roughly one thousand address records of hospitals. But the code easily scales to hundreds of thousands more. Again, only your Geoapify subscription dictates the limit.

import pandas as pd

df = pd.read_csv('hospitals-sample.csv', usecols=['Street', 'Housenumber', 'Postcode', 'City', 'Country'])
print(f'Total number of records: {df.shape[0]}')
print(df.head().to_markdown())  # prints first 5 rows

The console output gives you an idea of what our example data looks like:

| | Street | Housenumber | Postcode | City | Country |
|---:|:---|:---|:---|:---|:---|
| 0 | Jan Tooropstraat | 164 | 1061AE | Amsterdam | Netherlands |
| 1 | Barbarastraße | 1 | 45964 | Gladbeck | Germany |
| 2 | Am Gesundheitspark | nan | 51375 | Leverkusen | Germany |
| 3 | Rue de Tivoli | nan | 57000 | Metz | France |
| 4 | Propst-Sievert-Weg | nan | 46325 | Borken | Germany |

The Geocoding API accepts both structured input and free-text search. We encounter data quality issues in practically every real-world dataset, like attributes ingested into the wrong columns. Free text is simply the safer, more convenient choice. For this, we need to parse each row into a single string:

import numpy as np

def concat_columns(df: pd.DataFrame, sep: str = ' ', fill_value: str = np.nan) -> pd.Series:
    """Concatenate all columns of df row-wise and return a series with the same index."""
    return (df.apply(lambda s: sep.join(s.dropna().astype(str).str.strip().replace('', np.nan).dropna()), axis=1)
            .replace('', fill_value))

df['AddressLine1'] = concat_columns(df=df[['Street', 'Housenumber']])
df['AddressLine2'] = concat_columns(df=df[['Postcode', 'City']])
addresses = concat_columns(df[['AddressLine1', 'AddressLine2', 'Country']], sep=', ', fill_value='')
print(addresses[:5].rename('Query').to_markdown())

Here is what the first five search queries look like:

| | Query |
|---:|:---|
| 0 | Jan Tooropstraat 164, 1061AE Amsterdam, Netherlands |
| 1 | Barbarastraße 1, 45964 Gladbeck, Germany |
| 2 | Am Gesundheitspark, 51375 Leverkusen, Germany |
| 3 | Rue de Tivoli, 57000 Metz, France |
| 4 | Propst-Sievert-Weg, 46325 Borken, Germany |
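How `concat_columns` handles missing values can be checked in isolation; a minimal, self-contained sketch that repeats the function from above on toy data:

```python
import numpy as np
import pandas as pd

def concat_columns(df: pd.DataFrame, sep: str = ' ', fill_value: str = np.nan) -> pd.Series:
    """Concatenate all columns of df row-wise, skipping missing values."""
    return (df.apply(lambda s: sep.join(s.dropna().astype(str).str.strip().replace('', np.nan).dropna()), axis=1)
            .replace('', fill_value))

# Toy data with missing house numbers and one fully empty row:
toy = pd.DataFrame({'Street': ['Rue de Tivoli', 'Main Street', None],
                    'Housenumber': [None, '12', None]})
print(concat_columns(toy, fill_value='').tolist())
# → ['Rue de Tivoli', 'Main Street 12', '']
```

Missing cells are dropped before joining, so partial addresses still yield clean query strings instead of strings with literal "nan" in them.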

Part 2 - batch geocode

You can choose between geobatchpy's Python client and the geobatch command-line interface to run batch jobs. We prefer the latter for larger inputs. That requires some preparation, which we again do with Python:

from geobatchpy.batch import parse_geocoding_inputs
from geobatchpy.utils import write_data_to_json_file

data = {
    'api': '/v1/geocode/search',  # This tells Geoapify which service to run.
    'inputs': parse_geocoding_inputs(locations=addresses),
    'params': {'limit': 1, 'format': 'geojson'},  # We recommend always choosing GeoJSON.
    'batch_len': 200,  # Results in math.ceil(1042 / 200) = 6 jobs in total.
    'id': 'tutorial-batch-geocode'  # Optional label for your own reference.
}
write_data_to_json_file(data=data, file_path='tutorial-geocode-input.json')

The last cell creates a .json file inside your active directory, which we use to submit jobs to Geoapify servers on the command line:

geobatch submit tutorial-geocode-input.json tutorial-geocode-urls.json --api-key <your-key>

Tip: You can omit the --api-key option if you set the GEOAPIFY_KEY environment variable.
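For example, in a bash session (the key value is a placeholder):

```shell
# Set once per shell session; subsequent geobatch calls read the key
# from the environment, so the --api-key option can be dropped.
export GEOAPIFY_KEY="your-api-key"
```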

The file created by the first command, tutorial-geocode-urls.json, is the input for the next:

geobatch receive tutorial-geocode-urls.json tutorial-geocode-results.json --api-key <your-key>

Processing on the Geoapify servers can take some time, depending on the size of your inputs, your subscription plan, and how busy the servers are. Pause here and continue once the servers have completed all jobs.

Remember, we chose GeoJSON as the output format. Next, we read the results and simplify them so that we end up with a list of GeoJSON-like Python dictionaries, one dictionary per address:

from geobatchpy.batch import simplify_batch_geocoding_results
from geobatchpy.utils import read_data_from_json_file

results = read_data_from_json_file('tutorial-geocode-results.json')['results']
results = simplify_batch_geocoding_results(results=results, input_format='geojson')

print(results[0])

We print the first of those to the console:

{
    "type": "Feature",
    "properties": {
        "datasource": {
            "sourcename": "openstreetmap",
            "attribution": "© OpenStreetMap contributors",
            "license": "Open Database License",
            "url": "https://www.openstreetmap.org/copyright",
        },
        "housenumber": "164",
        "street": "Jan Tooropstraat",
        "suburb": "Bos en Lommer",
        "city": "Amsterdam",
        "state": "North Holland",
        "country": "Netherlands",
        "postcode": "1061AE",
        "country_code": "nl",
        "lon": 4.8391891,
        "lat": 52.3712642,
        "formatted": "Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands",
        "address_line1": "Jan Tooropstraat 164",
        "address_line2": "1061 AE Amsterdam, Netherlands",
        "timezone": {
            "name": "Europe/Amsterdam",
            "name_alt": "Europe/Brussels",
            "offset_STD": "+01:00",
            "offset_STD_seconds": 3600,
            "offset_DST": "+02:00",
            "offset_DST_seconds": 7200,
            "abbreviation_STD": "CET",
            "abbreviation_DST": "CEST",
        },
        "result_type": "building",
        "rank": {
            "importance": 0.31100000000000005,
            "popularity": 7.836896301654736,
            "confidence": 1,
            "confidence_city_level": 1,
            "confidence_street_level": 1,
            "match_type": "full_match",
        },
        "place_id": "5181a32e63545b1340597a96d695852f4a40f00103f9013b29891b02000000c00203",
    },
    "geometry": {"type": "Point", "coordinates": [4.8391891, 52.3712642]},
    "bbox": [4.8391391, 52.3712142, 4.8392391, 52.3713142],
    "query": {
        "text": "Jan Tooropstraat 164, 1061AE Amsterdam, Netherlands",
        "parsed": {
            "housenumber": "164",
            "street": "jan tooropstraat",
            "postcode": "1061ae",
            "city": "amsterdam",
            "country": "netherlands",
            "expected_type": "building",
        },
    },
}
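Since each result is a plain GeoJSON-like dictionary, individual fields are easy to pick out without any extra library. A minimal sketch on a truncated copy of the feature above:

```python
# A GeoJSON-like geocoding result, truncated to the fields we access below:
feature = {
    "type": "Feature",
    "properties": {
        "city": "Amsterdam",
        "formatted": "Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands",
        "rank": {"confidence": 1, "match_type": "full_match"},
    },
    "geometry": {"type": "Point", "coordinates": [4.8391891, 52.3712642]},
}

lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: longitude first
confidence = feature["properties"]["rank"]["confidence"]
print(lon, lat, confidence)
# → 4.8391891 52.3712642 1
```

Note that GeoJSON stores coordinates in longitude-latitude order, which trips up many newcomers.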

And now you will see why GeoJSON is a perfect fit. It is a standard for storing geographic objects with metadata that is broadly supported by the community, and GeoPandas is no exception. One of its methods parses each geometry into a Shapely geometric object, puts all properties into separate columns, and ignores the rest:

import geopandas as gpd

df_geocodes = gpd.GeoDataFrame.from_features(results, crs='EPSG:4326')  # EPSG:4326 = lon, lat coordinates

show_cols = ['geometry', 'name', 'street', 'housenumber', 'postcode', 'city', 'suburb', 'state', 'country', 'rank']
print(df_geocodes[show_cols].head().to_markdown())

We print the first five rows to the console:

| | geometry | name | street | housenumber | postcode | city | suburb | state | country | rank |
|---:|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | POINT (4.8391891 52.3712642) | nan | Jan Tooropstraat | 164 | 1061AE | Amsterdam | Bos en Lommer | North Holland | Netherlands | {'importance': 0.31100000000000005, 'popularity': 7.836896301654736, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 1 | POINT (6.987436 51.574441) | nan | Barbarastraße | 1 | 45964 | Gladbeck | nan | North Rhine-Westphalia | Germany | {'popularity': 6.574173044673481, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 2 | POINT (7.0350426 51.0299791) | Am Gesundheitspark | Am Gesundheitspark | nan | 51375 | Leverkusen | Schlebusch | North Rhine-Westphalia | Germany | {'importance': 0.3, 'popularity': 6.8695683348015155, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 3 | POINT (6.1923058 49.09872) | Rue de Tivoli | Rue de Tivoli | nan | 57000 | Metz | Plantières-Queuleu | Grand Est | France | {'importance': 0.4, 'popularity': 6.673758921453906, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 4 | POINT (6.8600285 51.8408674) | Propst-Sievert-Weg | Propst-Sievert-Weg | nan | 46325 | Borken | Borken | North Rhine-Westphalia | Germany | {'importance': 0.4, 'popularity': 5.540347061941222, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |

Some properties are dictionaries that we can flatten using pandas.json_normalize:

df_rank = pd.json_normalize(df_geocodes['rank'])
print(df_rank.head().to_markdown())

| | importance | popularity | confidence | confidence_city_level | confidence_street_level | match_type |
|---:|---:|---:|---:|---:|---:|:---|
| 0 | 0.311 | 7.8369 | 1 | 1 | 1 | full_match |
| 1 | nan | 6.57417 | 1 | 1 | 1 | full_match |
| 2 | 0.3 | 6.86957 | 1 | 1 | 1 | full_match |
| 3 | 0.4 | 6.67376 | 1 | 1 | 1 | full_match |
| 4 | 0.4 | 5.54035 | 1 | 1 | 1 | full_match |

The df_rank table is a good starting point for investigating data quality issues in your original address data. Check out this Towards Data Science publication if you want to dive deeper into this topic.
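For instance, rows worth a manual review can be flagged with a simple filter. The 0.9 threshold and the match_type values in this toy table are illustrative choices, not an official recommendation:

```python
import pandas as pd

# Toy stand-in for the df_rank table from above:
df_rank = pd.DataFrame({
    'confidence': [1.0, 0.45, 1.0, 0.8],
    'match_type': ['full_match', 'match_by_street', 'full_match', 'match_by_postcode'],
})

# Flag everything that is either low-confidence or not a full match:
needs_review = df_rank[(df_rank['confidence'] < 0.9) | (df_rank['match_type'] != 'full_match')]
print(needs_review.index.tolist())
# → [1, 3]
```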

Part 3 - add building details

Geocodes, which are just points on a map, are not the only geometry we can get out of Geoapify's services. The Place Details API covers the shapes of buildings and more. And again, a batch version of this service is available at a discount. Requesting this service through the command line is almost identical to the previous example. The main differences we would like to stress are:

  1. The service accepts geocodes as input.
  2. Its output format is, by default, GeoJSON.
  3. We can choose from a long list of feature types.

The service responds with feature type 'details' by default. However, we also request type 'building' to illustrate how to deal with multiple feature types:

from geobatchpy.batch import parse_geocodes, simplify_batch_place_details_results
from geobatchpy.utils import write_data_to_json_file

geocodes = parse_geocodes(geocodes=[(item.x, item.y) for _, item in df_geocodes['geometry'].items()])
features = ','.join(['details', 'building'])
data = {
    'api': '/v2/place-details',  # See the Geoapify API docs for other batch APIs.
    'inputs': geocodes,
    'params': {'features': features},
    'batch_len': 200,
    'id': 'tutorial-batch-details'
}
write_data_to_json_file(data=data, file_path='tutorial-details-input.json')

The last cell creates a .json file inside your active directory. Next, we switch to the command line and submit jobs using this file as input:

geobatch submit tutorial-details-input.json tutorial-details-urls.json --api-key <your-key>

The output of the previous step is the input for the next:

geobatch receive tutorial-details-urls.json tutorial-details-results.json --api-key <your-key>

Pause here and continue after the servers have completed all your jobs. Again, this can take some time. Next, we simplify the results to a list of lists of Python dictionaries and print one example to the console:

results = read_data_from_json_file('tutorial-details-results.json')['results']
results = simplify_batch_place_details_results(results)

print(results[0])
[
    {
        "type": "Feature",
        "properties": {
            "feature_type": "details",
            "wiki_and_media": {"wikidata": "Q3269552"},
            "building": {"type": "hospital", "start_date": 1966},
            "categories": ["building", "building.healthcare"],
            "datasource": {
                "sourcename": "openstreetmap",
                "attribution": "© OpenStreetMap contributors",
                "license": "Open Database Licence",
                "url": "https://www.openstreetmap.org/copyright",
                "raw": {
                    "osm_id": -1837609,
                    "ref:bag": 363100012074757,
                    "building": "hospital",
                    "osm_type": "r",
                    "wikidata": "Q3269552",
                    "start_date": 1966,
                },
            },
            "place_id": "517c2f782daa5b1340594adc7e13872f4a40f00101f901290a1c0000000000",
        },
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [4.8381998, 52.370803499],
                    ...
                    [4.8404546, 52.370985099],
                ],
            ],
        },
    },
    {
        "type": "Feature",
        "properties": {
            "feature_type": "building",
            "categories": ["building", "building.healthcare"],
            "datasource": {
                "sourcename": "openstreetmap",
                "attribution": "© OpenStreetMap contributors",
                "license": "Open Database Licence",
                "url": "https://www.openstreetmap.org/copyright",
                "raw": {
                    "osm_id": -1837609,
                    "ref:bag": 363100012074757,
                    "building": "hospital",
                    "osm_type": "r",
                    "wikidata": "Q3269552",
                    "start_date": 1966,
                },
            },
            "wiki_and_media": {"wikidata": "Q3269552"},
            "building": {"type": "hospital", "start_date": 1966},
            "area": 15475,
            "place_id": "517c2f782daa5b1340594adc7e13872f4a40f00101f901290a1c0000000000",
        },
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [4.8381998, 52.370803499],
                    ...
                    [4.8404546, 52.370985099],
                ],
            ],
        },
    },
]

Every element of the outer list corresponds to a location - a single geocode. The inner list consists of GeoJSON-like dictionaries, ideally one per requested feature type. But not all locations are buildings; in some cases, even the details are unavailable.
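For quick inspection, one such inner list can be keyed by feature type with plain Python; the feature values below are made up for illustration:

```python
# One location's result: a list of GeoJSON-like features, one per feature type.
features = [
    {"type": "Feature", "properties": {"feature_type": "details", "name": "Example Clinic"}},
    {"type": "Feature", "properties": {"feature_type": "building", "area": 15475}},
]

# Index the features by their type for convenient lookups:
by_type = {f["properties"]["feature_type"]: f for f in features}
print(sorted(by_type), by_type["building"]["properties"]["area"])
# → ['building', 'details'] 15475
```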

print('Number of feature types per location:')
print(pd.Series([len(res) for res in results]).value_counts().rename('Count').to_markdown())

Number of feature types per location:

| | Count |
|---:|---:|
| 2 | 602 |
| 1 | 439 |
| 0 | 1 |

Since the results are again GeoJSON, we can use GeoPandas to bring the data into a tabular format, one row per location. This time it is a bit trickier, but still simple enough to do in a few lines of code:

from typing import List

def convert_place_details_results_to_dataframe(results: List[dict], index: pd.Index = None) -> pd.DataFrame:
    if index is None:
        index = pd.Index(range(len(results)))
    index_name = index.name if index.name is not None else 'index'
    df_details = pd.concat([gpd.GeoDataFrame.from_features(res).assign(**{index_name: idx})
                            for idx, res in zip(index, results)])
    
    dfs_details = [df_split.set_index(index_name) for _, df_split in df_details.groupby('feature_type')]
    dfs_details = [df_split.rename(columns={col: df_split['feature_type'][0] + '.' + col
                                            for col in df_split.columns})
                   for df_split in dfs_details]
    df_details = pd.concat(dfs_details, axis=1)
    
    prop_null_is_1 = df_details.isnull().mean().eq(1.)
    cols_all_null = prop_null_is_1[prop_null_is_1].index.values
    return df_details.drop([col for col in df_details.columns
                            if col.endswith('feature_type') or col in cols_all_null], axis=1)

We apply the conversion function to our data and consolidate the two geometries into a single column for simplicity:

df_details = convert_place_details_results_to_dataframe(results=results, index=df.index)

df_details['geometry'] = df_details['building.geometry']
df_details.loc[df_details['geometry'].isnull(), 'geometry'] = df_details['details.geometry']
df_details = df_details.drop(['building.geometry', 'details.geometry'], axis=1)

df_details.geometry = df_details.geometry.set_crs('EPSG:4326')

df_details is a wide table with a geometry, many atomic attributes, and a few dictionary columns. If you prefer having dictionaries spread across multiple columns, use pandas.json_normalize.
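A sketch of that flattening step, applied to toy values mirroring the building.building column:

```python
import pandas as pd

# Toy stand-in for the 'building.building' dictionary column:
building = [
    {'type': 'hospital', 'start_date': 1966},
    {'levels': 3, 'type': 'hospital', 'roof': {'levels': 1}},
    {'type': 'detached'},
]

# json_normalize flattens nested dictionaries into dotted column names:
flat = pd.json_normalize(building)
print(sorted(flat.columns), flat['type'].tolist())
# → ['levels', 'roof.levels', 'start_date', 'type'] ['hospital', 'hospital', 'detached']
```

Keys missing in a record simply become NaN in the corresponding row.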

show_cols = ['geometry', 'details.name', 'building.area', 'building.building']

# we manipulate the geometry column for pretty printing:
print(df_details
      .assign(geometry=df_details.geometry.astype(str).str.slice(stop=20) + '...')
      [show_cols].head().to_markdown())

| index | geometry | details.name | building.area | building.building |
|---:|:---|:---|:---|:---|
| 0 | POLYGON ((4.8382 52.... | nan | 15475 | {'type': 'hospital', 'start_date': 1966} |
| 1 | POLYGON ((6.986498 5... | nan | 7006 | {'levels': 3, 'type': 'hospital', 'roof': {'levels': 1}} |
| 2 | LINESTRING (7.035218... | Am Gesundheitspark | nan | nan |
| 3 | POLYGON ((6.192371 4... | Rue de Tivoli | 130 | {'type': 'detached'} |
| 4 | LINESTRING (6.860072... | Propst-Sievert-Weg | nan | nan |

We could continue with the Isolines API and extend our addresses by reachability regions. But we will stop the geospatial data preparation here and move on to some analytics.

Part 4 - manipulate geometries with GeoPandas

The first three parts were about data preparation, particularly of geometries. Next, we touch on uses of geometries beyond plotting them on a map. We start with the shapes of the buildings collected in df_details. Here is what the first building looks like according to our data:

df_details.geometry[0]
[Image: shape of the first building]

With GeoPandas, we can compute the occupied area and the total length of the boundaries. But first, we need to tell GeoPandas which coordinate reference system to use for these statistics. EPSG:3035 is a metric CRS and a good fit for European locations:

# crs=3035 is for the metric system, European locations
df_geometry_stats = pd.concat([df_details.geometry.to_crs(crs='EPSG:3035').area,
                               df_details.geometry.boundary.to_crs(crs='EPSG:3035').length], axis=1)
df_geometry_stats.columns = ['AreaSquareMeters', 'BoundaryLengthMeters']
print(df_geometry_stats.head().to_markdown())

| index | AreaSquareMeters | BoundaryLengthMeters |
|---:|---:|---:|
| 0 | 15501 | 1361.78 |
| 1 | 7016.58 | 697.259 |
| 2 | 0 | 0 |
| 3 | 130.552 | 46.5606 |
| 4 | 0 | 0 |

Expanding a building's shape to its convex hull may make more sense for some applications. GeoPandas also helps you with this and much more. Check out the GeoPandas documentation to learn more.
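What the convex hull does is easiest to see on a single Shapely polygon; here an L-shaped footprint, with coordinates made up for illustration, whose notch the hull fills in:

```python
from shapely.geometry import Polygon

# L-shaped building footprint: a 2x2 square with one 1x1 corner cut out.
poly = Polygon([(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)])
hull = poly.convex_hull  # smallest convex polygon containing the shape

print(poly.area, hull.area)
# → 3.0 3.5
```

GeoPandas exposes the same operation on whole columns via the `convex_hull` property of a GeoSeries.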

Our second illustration is about joining two spatial datasets. Traditionally, we join tables by exact matches on keys, and most relational database engines also support pattern matching. GeoPandas additionally allows you to match by proximity, which we illustrate using a dataset of airports from opentraveldata, published under the CC BY 4.0 license.

from shapely.geometry import Point

df_airports = pd.read_csv('airports.csv')
df_airports = (df_airports.loc[df_airports.type.isin(['medium_airport', 'large_airport'])]
               .rename(columns={'name': 'AirportName'}).reset_index(drop=True))

geometry = [Point(xy) for xy in zip(df_airports.longitude_deg, df_airports.latitude_deg)]
df_airports = gpd.GeoDataFrame(df_airports, geometry=geometry, crs='EPSG:4326')[['geometry', 'AirportName']]
print(df_airports.head().to_markdown())

| | geometry | AirportName |
|---:|:---|:---|
| 0 | POINT (-158.617996216 59.2826004028) | Aleknagik / New Airport |
| 1 | POINT (69.80734 33.284605) | Khost International Airport |
| 2 | POINT (160.054993 -9.428) | Honiara International Airport |
| 3 | POINT (157.263 -8.32797) | Munda Airport |
| 4 | POINT (102.35224 32.53154) | Hongyuan Airport |

For every address in our original input, you may ask: what is the closest airport, and how far away is it? We provide the answer below:

df_join = (df_geocodes.to_crs('EPSG:3035')
           .sjoin_nearest(df_airports.to_crs('EPSG:3035'),
                          how='left', distance_col='DistanceToAirport')
           .to_crs('EPSG:4326').rename(columns={'AirportName': 'ClosestAirport'})
           .drop('index_right', axis=1))

show_cols = ['formatted', 'ClosestAirport', 'DistanceToAirport']
print(df_join[show_cols].head().to_markdown())

| | formatted | ClosestAirport | DistanceToAirport |
|---:|:---|:---|---:|
| 0 | Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands | Amsterdam Airport Schiphol | 8659.8 |
| 1 | Barbarastraße 1, 45964 Gladbeck, Germany | Düsseldorf Airport | 35221.7 |
| 2 | Am Gesundheitspark, 51375 Leverkusen, Germany | Cologne Bonn Airport | 19763.5 |
| 3 | Rue de Tivoli, 57000 Metz, France | Metz-Nancy-Lorraine Airport | 13671.2 |
| 4 | Propst-Sievert-Weg, 46325 Borken, Germany | Twente Airport | 48446.2 |

A brief outlook on geospatial advanced analytics

We illustrated how to enrich address data with geometries (points and shapes), how to manipulate them, and how to join them with other geospatial datasets. Such data manipulation typically makes up 90%, if not more, of every data analytics problem we face. This last section gives an outlook on so-called "advanced" analytics problems using geospatial data.

The most widely used geospatial technique is spatial autocorrelation. Say that, along with every one of your address records, you also observe a numeric attribute like household income. With spatial autocorrelation, you can study whether a household's income correlates with that of its neighbors. This PySAL tutorial shows you how to evaluate such questions with formal statistical tests. Again, it integrates well with what we did in this tutorial: it uses a GeoPandas table as input to compute spatial weights that quantify geographic proximity.

Say you study customer churn with a classification model, predicting whether any given customer will stay with your business for a future period. Provided you also have your customers' addresses, have you considered integrating geospatial features into your model? As a first step, you can confirm their significance by studying spatial autocorrelation of customer churn. If it is present, incorporate average churn rates (or other customer signals) weighted over space and time, e.g., the average churn rate in the neighborhood of every current customer over the last 12 months.
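As a rough sketch of such a feature, here is an inverse-distance-weighted churn average over a customer's neighbors. The weighting scheme and all numbers are illustrative, not a recommendation:

```python
import numpy as np

# Churn labels (1 = churned) of four neighboring customers and their
# distances in meters to the customer we build the feature for:
churn = np.array([1.0, 0.0, 1.0, 0.0])
dist = np.array([100.0, 200.0, 400.0, 800.0])

weights = 1.0 / dist  # closer neighbors count more (illustrative choice)
feature = np.average(churn, weights=weights)
print(round(float(feature), 3))
# → 0.667
```

In practice, the distances would come from a GeoPandas join like the airport example above, and the weighting kernel is something to tune.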