How to integrate Geoapify into your Python data science toolbox

The geobatchpy package connects Geoapify with Python

Practically every business makes use of location data: for billing and delivery, supply chains, prospect targeting, and more. Does yours also make use of geospatial analytics? Don't worry if you have not started yet. It has never been more accessible, thanks to broadly adopted standards and a rich open-source and open-data community. The tutorial below will give you a kick-start using Python, the Geoapify Batch API, and just a few lines of code.

This tutorial introduces the geobatchpy package, developed by Paul Kinsvater. The package provides a simple Python interface for calling Geoapify batch requests. It is not the official SDK, but we find it helpful for Python developers.

Why Python for geospatial analytics?

The Python data science community is undoubtedly one of the richest, with the R community likely its strongest competitor. Don't stop here if your team prefers R; the examples below are easy to follow along with and translate.

Python comes with a long list of open-source packages for geospatial analytics. Most notable is GeoPandas, the workhorse of the geospatial ecosystem. It is the tool of choice for manipulating arrays of geometric objects: computing areas and distances, joining on neighborhood relations, and more. A recent addition to this list is my geobatchpy package. It wraps the Geoapify API with geospatial analytics first and foremost in mind, and it ships with a command-line interface for the Geoapify Batch API.

The following tutorial shows you how to put those tools into action. Do you want to follow along with your own data? Don't forget to subscribe to Geoapify if you have not already. The free plan comes with 3k credits daily, including commercial use. Thanks to the discounts in Batch API pricing, this plan covers our tutorial for almost 2k addresses.

A recipe to start with geospatial analytics in Python

We executed the code in this tutorial using Python 3.9.13 with main dependencies geobatchpy==0.2.1 and geopandas==0.11.1. This installation guide shows you how to set up a Conda virtual environment on your computer.
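If you prefer a quick start over the full guide, one possible way to create such an environment looks as follows (the environment name is our choice; the package versions are those stated above):

```shell
# Create and activate a fresh Conda environment, then install the two
# main dependencies used throughout this tutorial:
conda create --name geo-tutorial python=3.9 -y
conda activate geo-tutorial
pip install geobatchpy==0.2.1 geopandas==0.11.1
```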

Part 1 - prepare the address data

Our example consists of roughly one thousand address records of hospitals. But the code easily scales to hundreds of thousands more. Again, only your Geoapify subscription dictates the limit.

import pandas as pd

df = pd.read_csv('hospitals-sample.csv', usecols=['Street', 'Housenumber', 'Postcode', 'City', 'Country'])
print(f'Total number of records: {df.shape[0]}')
print(df.head().to_markdown())  # prints first 5 rows

The console output gives you an idea of what our example data looks like:

| | Street | Housenumber | Postcode | City | Country |
|---:|:---|:---|:---|:---|:---|
| 0 | Jan Tooropstraat | 164 | 1061AE | Amsterdam | Netherlands |
| 1 | Barbarastraße | 1 | 45964 | Gladbeck | Germany |
| 2 | Am Gesundheitspark | nan | 51375 | Leverkusen | Germany |
| 3 | Rue de Tivoli | nan | 57000 | Metz | France |
| 4 | Propst-Sievert-Weg | nan | 46325 | Borken | Germany |

The Geocoding API accepts both structured input and free-text search. We encounter data quality issues in practically every real-world dataset, like attributes ingested into the wrong columns. Free text is simply the safer, more convenient choice. For this, we need to parse each row into a single string:

import numpy as np

def concat_columns(df: pd.DataFrame, sep: str = ' ', fill_value: str = np.nan) -> pd.Series:
    """Concatenate all columns of df row-wise and return a series with the same index."""
    return (df.apply(lambda s: sep.join(s.dropna().astype(str).str.strip().replace('', np.nan).dropna()), axis=1)
            .replace('', fill_value))

df['AddressLine1'] = concat_columns(df=df[['Street', 'Housenumber']])
df['AddressLine2'] = concat_columns(df=df[['Postcode', 'City']])
addresses = concat_columns(df[['AddressLine1', 'AddressLine2', 'Country']], sep=', ', fill_value='')
print(addresses[:5].rename('Query').to_markdown())

Here is what the first five search queries look like:

| | Query |
|---:|:---|
| 0 | Jan Tooropstraat 164, 1061AE Amsterdam, Netherlands |
| 1 | Barbarastraße 1, 45964 Gladbeck, Germany |
| 2 | Am Gesundheitspark, 51375 Leverkusen, Germany |
| 3 | Rue de Tivoli, 57000 Metz, France |
| 4 | Propst-Sievert-Weg, 46325 Borken, Germany |
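How `concat_columns` handles missing values can be checked in isolation; a minimal, self-contained sketch that repeats the function from above on toy data:

```python
import numpy as np
import pandas as pd

def concat_columns(df: pd.DataFrame, sep: str = ' ', fill_value: str = np.nan) -> pd.Series:
    """Concatenate all columns of df row-wise, skipping missing values."""
    return (df.apply(lambda s: sep.join(s.dropna().astype(str).str.strip().replace('', np.nan).dropna()), axis=1)
            .replace('', fill_value))

# Toy data with missing house numbers and one fully empty row:
toy = pd.DataFrame({'Street': ['Rue de Tivoli', 'Main Street', None],
                    'Housenumber': [None, '12', None]})
print(concat_columns(toy, fill_value='').tolist())
# → ['Rue de Tivoli', 'Main Street 12', '']
```

Missing cells are dropped before joining, so partial addresses still yield clean query strings instead of strings with literal "nan" in them.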

Part 2 - batch geocode

You can choose between geobatchpy's Python client and the geobatch command-line interface to run batch jobs. We prefer the latter for larger inputs. That requires some preparation, which we again do with Python:

from geobatchpy.batch import parse_geocoding_inputs
from geobatchpy.utils import write_data_to_json_file

data = {
    'api': '/v1/geocode/search',  # This tells Geoapify which service to run.
    'inputs': parse_geocoding_inputs(locations=addresses),
    'params': {'limit': 1, 'format': 'geojson'},  # We recommend always choosing GeoJSON.
    'batch_len': 200,  # Results in math.ceil(1042 / 200) = 6 jobs in total.
    'id': 'tutorial-batch-geocode'  # Optional label for your own reference.
}
write_data_to_json_file(data=data, file_path='tutorial-geocode-input.json')

The last cell creates a .json file inside your active directory, which we use to submit jobs to Geoapify servers on the command line:

geobatch submit tutorial-geocode-input.json tutorial-geocode-urls.json --api-key <your-key>

Tip: You can omit the --api-key option if you set the GEOAPIFY_KEY environment variable.
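For example, in a bash session (the key value is a placeholder):

```shell
# Set once per shell session; subsequent geobatch calls read the key
# from the environment, so the --api-key option can be dropped.
export GEOAPIFY_KEY="your-api-key"
```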

The file created by the first command, tutorial-geocode-urls.json, is the input for the next:

geobatch receive tutorial-geocode-urls.json tutorial-geocode-results.json --api-key <your-key>

Processing on the Geoapify servers can take some time, depending on the size of your inputs, your subscription plan, and how busy the servers are. Pause here and continue once the servers have completed all jobs.

Remember, we chose GeoJSON as the output format. Next, we read the results and simplify them so that we end up with a list of GeoJSON-like Python dictionaries, one dictionary per address:

from geobatchpy.batch import simplify_batch_geocoding_results
from geobatchpy.utils import read_data_from_json_file

results = read_data_from_json_file('tutorial-geocode-results.json')['results']
results = simplify_batch_geocoding_results(results=results, input_format='geojson')

print(results[0])

We print the first of those to the console:

{
    "type": "Feature",
    "properties": {
        "datasource": {
            "sourcename": "openstreetmap",
            "attribution": "© OpenStreetMap contributors",
            "license": "Open Database License",
            "url": "https://www.openstreetmap.org/copyright",
        },
        "housenumber": "164",
        "street": "Jan Tooropstraat",
        "suburb": "Bos en Lommer",
        "city": "Amsterdam",
        "state": "North Holland",
        "country": "Netherlands",
        "postcode": "1061AE",
        "country_code": "nl",
        "lon": 4.8391891,
        "lat": 52.3712642,
        "formatted": "Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands",
        "address_line1": "Jan Tooropstraat 164",
        "address_line2": "1061 AE Amsterdam, Netherlands",
        "timezone": {
            "name": "Europe/Amsterdam",
            "name_alt": "Europe/Brussels",
            "offset_STD": "+01:00",
            "offset_STD_seconds": 3600,
            "offset_DST": "+02:00",
            "offset_DST_seconds": 7200,
            "abbreviation_STD": "CET",
            "abbreviation_DST": "CEST",
        },
        "result_type": "building",
        "rank": {
            "importance": 0.31100000000000005,
            "popularity": 7.836896301654736,
            "confidence": 1,
            "confidence_city_level": 1,
            "confidence_street_level": 1,
            "match_type": "full_match",
        },
        "place_id": "5181a32e63545b1340597a96d695852f4a40f00103f9013b29891b02000000c00203",
    },
    "geometry": {"type": "Point", "coordinates": [4.8391891, 52.3712642]},
    "bbox": [4.8391391, 52.3712142, 4.8392391, 52.3713142],
    "query": {
        "text": "Jan Tooropstraat 164, 1061AE Amsterdam, Netherlands",
        "parsed": {
            "housenumber": "164",
            "street": "jan tooropstraat",
            "postcode": "1061ae",
            "city": "amsterdam",
            "country": "netherlands",
            "expected_type": "building",
        },
    },
}
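Since each result is a plain GeoJSON-like dictionary, individual fields are easy to pick out without any extra library. A minimal sketch on a truncated copy of the feature above:

```python
# A GeoJSON-like geocoding result, truncated to the fields we access below:
feature = {
    "type": "Feature",
    "properties": {
        "city": "Amsterdam",
        "formatted": "Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands",
        "rank": {"confidence": 1, "match_type": "full_match"},
    },
    "geometry": {"type": "Point", "coordinates": [4.8391891, 52.3712642]},
}

lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: longitude first
confidence = feature["properties"]["rank"]["confidence"]
print(lon, lat, confidence)
# → 4.8391891 52.3712642 1
```

Note that GeoJSON stores coordinates in longitude-latitude order, which trips up many newcomers.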

And now you will see why GeoJSON is a perfect fit. It is a standard for storing geographic objects with metadata that is broadly supported by the community, and GeoPandas is no exception. One of its methods parses each geometry into a Shapely geometric object, puts all properties into separate columns, and ignores the rest:

import geopandas as gpd

df_geocodes = gpd.GeoDataFrame.from_features(results, crs='EPSG:4326')  # EPSG:4326 = lon, lat coordinates

show_cols = ['geometry', 'name', 'street', 'housenumber', 'postcode', 'city', 'suburb', 'state', 'country', 'rank']
print(df_geocodes[show_cols].head().to_markdown())

We print the first five rows to the console:

| | geometry | name | street | housenumber | postcode | city | suburb | state | country | rank |
|---:|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | POINT (4.8391891 52.3712642) | nan | Jan Tooropstraat | 164 | 1061AE | Amsterdam | Bos en Lommer | North Holland | Netherlands | {'importance': 0.31100000000000005, 'popularity': 7.836896301654736, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 1 | POINT (6.987436 51.574441) | nan | Barbarastraße | 1 | 45964 | Gladbeck | nan | North Rhine-Westphalia | Germany | {'popularity': 6.574173044673481, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 2 | POINT (7.0350426 51.0299791) | Am Gesundheitspark | Am Gesundheitspark | nan | 51375 | Leverkusen | Schlebusch | North Rhine-Westphalia | Germany | {'importance': 0.3, 'popularity': 6.8695683348015155, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 3 | POINT (6.1923058 49.09872) | Rue de Tivoli | Rue de Tivoli | nan | 57000 | Metz | Plantières-Queuleu | Grand Est | France | {'importance': 0.4, 'popularity': 6.673758921453906, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
| 4 | POINT (6.8600285 51.8408674) | Propst-Sievert-Weg | Propst-Sievert-Weg | nan | 46325 | Borken | Borken | North Rhine-Westphalia | Germany | {'importance': 0.4, 'popularity': 5.540347061941222, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |

Some properties are dictionaries that we can flatten using pandas.json_normalize:

df_rank = pd.json_normalize(df_geocodes['rank'])
print(df_rank.head().to_markdown())

| | importance | popularity | confidence | confidence_city_level | confidence_street_level | match_type |
|---:|---:|---:|---:|---:|---:|:---|
| 0 | 0.311 | 7.8369 | 1 | 1 | 1 | full_match |
| 1 | nan | 6.57417 | 1 | 1 | 1 | full_match |
| 2 | 0.3 | 6.86957 | 1 | 1 | 1 | full_match |
| 3 | 0.4 | 6.67376 | 1 | 1 | 1 | full_match |
| 4 | 0.4 | 5.54035 | 1 | 1 | 1 | full_match |

The df_rank table is a good starting point for investigating data quality issues in your original address data. Check out this Towards Data Science publication if you want to dive deeper into this topic.
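For instance, rows worth a manual review can be flagged with a simple filter. The 0.9 threshold and the match_type values in this toy table are illustrative choices, not an official recommendation:

```python
import pandas as pd

# Toy stand-in for the df_rank table from above:
df_rank = pd.DataFrame({
    'confidence': [1.0, 0.45, 1.0, 0.8],
    'match_type': ['full_match', 'match_by_street', 'full_match', 'match_by_postcode'],
})

# Flag everything that is either low-confidence or not a full match:
needs_review = df_rank[(df_rank['confidence'] < 0.9) | (df_rank['match_type'] != 'full_match')]
print(needs_review.index.tolist())
# → [1, 3]
```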

Part 3 - add building details

Geocodes, which are just points on a map, are not the only geometry we can get out of Geoapify's services. The Place Details API covers the shapes of buildings and more. And again, a batch version of this service is available at a discount. Requesting this service through the command line is almost identical to the previous example. The main differences we would like to stress are:

  1. The service accepts geocodes as input.
  2. Its output format is, by default, GeoJSON.
  3. We can choose from a long list of feature types.

The service responds with feature type 'details' by default. However, we also request type 'building' to illustrate how to deal with multiple feature types:

from geobatchpy.batch import parse_geocodes, simplify_batch_place_details_results
from geobatchpy.utils import write_data_to_json_file

geocodes = parse_geocodes(geocodes=[(item.x, item.y) for _, item in df_geocodes['geometry'].items()])
features = ','.join(['details', 'building'])
data = {
    'api': '/v2/place-details',  # See the Geoapify API docs for other batch APIs.
    'inputs': geocodes,
    'params': {'features': features},
    'batch_len': 200,
    'id': 'tutorial-batch-details'
}
write_data_to_json_file(data=data, file_path='tutorial-details-input.json')

The last cell creates a .json file inside your active directory. Next, we switch to the command line and submit jobs using this file as input:

geobatch submit tutorial-details-input.json tutorial-details-urls.json --api-key <your-key>

The output of the previous step is the input for the next:

geobatch receive tutorial-details-urls.json tutorial-details-results.json --api-key <your-key>

Pause here and continue after the servers have completed all your jobs. Again, this can take some time. Next, we simplify the results to a list of lists of Python dictionaries and print one example to the console:

results = read_data_from_json_file('tutorial-details-results.json')['results']
results = simplify_batch_place_details_results(results)

print(results[0])
[
    {
        "type": "Feature",
        "properties": {
            "feature_type": "details",
            "wiki_and_media": {"wikidata": "Q3269552"},
            "building": {"type": "hospital", "start_date": 1966},
            "categories": ["building", "building.healthcare"],
            "datasource": {
                "sourcename": "openstreetmap",
                "attribution": "© OpenStreetMap contributors",
                "license": "Open Database Licence",
                "url": "https://www.openstreetmap.org/copyright",
                "raw": {
                    "osm_id": -1837609,
                    "ref:bag": 363100012074757,
                    "building": "hospital",
                    "osm_type": "r",
                    "wikidata": "Q3269552",
                    "start_date": 1966,
                },
            },
            "place_id": "517c2f782daa5b1340594adc7e13872f4a40f00101f901290a1c0000000000",
        },
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [4.8381998, 52.370803499],
                    ...
                    [4.8404546, 52.370985099],
                ],
            ],
        },
    },
    {
        "type": "Feature",
        "properties": {
            "feature_type": "building",
            "categories": ["building", "building.healthcare"],
            "datasource": {
                "sourcename": "openstreetmap",
                "attribution": "© OpenStreetMap contributors",
                "license": "Open Database Licence",
                "url": "https://www.openstreetmap.org/copyright",
                "raw": {
                    "osm_id": -1837609,
                    "ref:bag": 363100012074757,
                    "building": "hospital",
                    "osm_type": "r",
                    "wikidata": "Q3269552",
                    "start_date": 1966,
                },
            },
            "wiki_and_media": {"wikidata": "Q3269552"},
            "building": {"type": "hospital", "start_date": 1966},
            "area": 15475,
            "place_id": "517c2f782daa5b1340594adc7e13872f4a40f00101f901290a1c0000000000",
        },
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [
                    [4.8381998, 52.370803499],
                    ...
                    [4.8404546, 52.370985099],
                ],
            ],
        },
    },
]

Every element of the outer list corresponds to a location - a single geocode. The inner list consists of GeoJSON-like dictionaries, ideally one per requested feature type. But not all locations are buildings; in some cases, even the details are unavailable.
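For quick inspection, one such inner list can be keyed by feature type with plain Python; the feature values below are made up for illustration:

```python
# One location's result: a list of GeoJSON-like features, one per feature type.
features = [
    {"type": "Feature", "properties": {"feature_type": "details", "name": "Example Clinic"}},
    {"type": "Feature", "properties": {"feature_type": "building", "area": 15475}},
]

# Index the features by their type for convenient lookups:
by_type = {f["properties"]["feature_type"]: f for f in features}
print(sorted(by_type), by_type["building"]["properties"]["area"])
# → ['building', 'details'] 15475
```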

print('Number of feature types per location:')
print(pd.Series([len(res) for res in results]).value_counts().rename('Count').to_markdown())

Number of feature types per location:

| | Count |
|---:|---:|
| 2 | 602 |
| 1 | 439 |
| 0 | 1 |

Since the results are again GeoJSON, we can use GeoPandas to bring the data into a tabular format, one row per location. This time it is a bit trickier, but still simple enough to do in a few lines of code:

from typing import List

def convert_place_details_results_to_dataframe(results: List[dict], index: pd.Index = None) -> pd.DataFrame:
    if index is None:
        index = pd.Index(range(len(results)))
    index_name = index.name if index.name is not None else 'index'
    df_details = pd.concat([gpd.GeoDataFrame.from_features(res).assign(**{index_name: idx})
                            for idx, res in zip(index, results)])
    
    dfs_details = [df_split.set_index(index_name) for _, df_split in df_details.groupby('feature_type')]
    dfs_details = [df_split.rename(columns={col: df_split['feature_type'][0] + '.' + col
                                            for col in df_split.columns})
                   for df_split in dfs_details]
    df_details = pd.concat(dfs_details, axis=1)
    
    prop_null_is_1 = df_details.isnull().mean().eq(1.)
    cols_all_null = prop_null_is_1[prop_null_is_1].index.values
    return df_details.drop([col for col in df_details.columns
                            if col.endswith('feature_type') or col in cols_all_null], axis=1)

We apply the conversion function to our data and consolidate the two geometries into a single column for simplicity:

df_details = convert_place_details_results_to_dataframe(results=results, index=df.index)

df_details['geometry'] = df_details['building.geometry']
df_details.loc[df_details['geometry'].isnull(), 'geometry'] = df_details['details.geometry']
df_details = df_details.drop(['building.geometry', 'details.geometry'], axis=1)

df_details.geometry = df_details.geometry.set_crs('EPSG:4326')

df_details is a wide table with a geometry, many atomic attributes, and a few dictionary columns. If you prefer having dictionaries spread across multiple columns, use pandas.json_normalize.
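A sketch of that flattening step, applied to toy values mirroring the building.building column:

```python
import pandas as pd

# Toy stand-in for the 'building.building' dictionary column:
building = [
    {'type': 'hospital', 'start_date': 1966},
    {'levels': 3, 'type': 'hospital', 'roof': {'levels': 1}},
    {'type': 'detached'},
]

# json_normalize flattens nested dictionaries into dotted column names:
flat = pd.json_normalize(building)
print(sorted(flat.columns), flat['type'].tolist())
# → ['levels', 'roof.levels', 'start_date', 'type'] ['hospital', 'hospital', 'detached']
```

Keys missing in a record simply become NaN in the corresponding row.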

show_cols = ['geometry', 'details.name', 'building.area', 'building.building']

# we manipulate the geometry column for pretty printing:
print(df_details
      .assign(geometry=df_details.geometry.astype(str).str.slice(stop=20) + '...')
      [show_cols].head().to_markdown())

| index | geometry | details.name | building.area | building.building |
|---:|:---|:---|:---|:---|
| 0 | POLYGON ((4.8382 52.... | nan | 15475 | {'type': 'hospital', 'start_date': 1966} |
| 1 | POLYGON ((6.986498 5... | nan | 7006 | {'levels': 3, 'type': 'hospital', 'roof': {'levels': 1}} |
| 2 | LINESTRING (7.035218... | Am Gesundheitspark | nan | nan |
| 3 | POLYGON ((6.192371 4... | Rue de Tivoli | 130 | {'type': 'detached'} |
| 4 | LINESTRING (6.860072... | Propst-Sievert-Weg | nan | nan |

We could continue with the Isolines API and extend our addresses by reachability regions. But we will stop the geospatial data preparation here and move on to some analytics.

Part 4 - manipulate geometries with GeoPandas

The first three parts were about data preparation, particularly of geometries. Next, we touch on uses of geometries beyond plotting them on a map. We start with the shapes of the buildings collected in df_details. Here is what the first building looks like according to our data:

df_details.geometry[0]
[Image: shape of the first building]

With GeoPandas, we can compute the occupied area and the total length of the boundaries. But first, we need to tell GeoPandas which coordinate reference system to use for these statistics. EPSG:3035 is a metric CRS and a good fit for European locations:

# crs=3035 is for the metric system, European locations
df_geometry_stats = pd.concat([df_details.geometry.to_crs(crs='EPSG:3035').area,
                               df_details.geometry.boundary.to_crs(crs='EPSG:3035').length], axis=1)
df_geometry_stats.columns = ['AreaSquareMeters', 'BoundaryLengthMeters']
print(df_geometry_stats.head().to_markdown())

| index | AreaSquareMeters | BoundaryLengthMeters |
|---:|---:|---:|
| 0 | 15501 | 1361.78 |
| 1 | 7016.58 | 697.259 |
| 2 | 0 | 0 |
| 3 | 130.552 | 46.5606 |
| 4 | 0 | 0 |

Expanding a building's shape to its convex hull may make more sense for some applications. GeoPandas also helps you with this and much more. Check out the GeoPandas documentation to learn more.
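What the convex hull does is easiest to see on a single Shapely polygon; here an L-shaped footprint, with coordinates made up for illustration, whose notch the hull fills in:

```python
from shapely.geometry import Polygon

# L-shaped building footprint: a 2x2 square with one 1x1 corner cut out.
poly = Polygon([(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)])
hull = poly.convex_hull  # smallest convex polygon containing the shape

print(poly.area, hull.area)
# → 3.0 3.5
```

GeoPandas exposes the same operation on whole columns via the `convex_hull` property of a GeoSeries.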

Our second illustration is about joining two spatial datasets. Traditionally, we join tables by exact matches on keys, and most relational database engines also support pattern matching. GeoPandas additionally allows you to match by proximity, which we illustrate using a dataset of airports from opentraveldata, published under the CC BY 4.0 license.

from shapely.geometry import Point

df_airports = pd.read_csv('airports.csv')
df_airports = (df_airports.loc[df_airports.type.isin(['medium_airport', 'large_airport'])]
               .rename(columns={'name': 'AirportName'}).reset_index(drop=True))

geometry = [Point(xy) for xy in zip(df_airports.longitude_deg, df_airports.latitude_deg)]
df_airports = gpd.GeoDataFrame(df_airports, geometry=geometry, crs='EPSG:4326')[['geometry', 'AirportName']]
print(df_airports.head().to_markdown())

| | geometry | AirportName |
|---:|:---|:---|
| 0 | POINT (-158.617996216 59.2826004028) | Aleknagik / New Airport |
| 1 | POINT (69.80734 33.284605) | Khost International Airport |
| 2 | POINT (160.054993 -9.428) | Honiara International Airport |
| 3 | POINT (157.263 -8.32797) | Munda Airport |
| 4 | POINT (102.35224 32.53154) | Hongyuan Airport |

For every address in our original input, you may ask: what is the closest airport, and how far away is it? We provide the answer below:

df_join = (df_geocodes.to_crs('EPSG:3035')
           .sjoin_nearest(df_airports.to_crs('EPSG:3035'),
                          how='left', distance_col='DistanceToAirport')
           .to_crs('EPSG:4326').rename(columns={'AirportName': 'ClosestAirport'})
           .drop('index_right', axis=1))

show_cols = ['formatted', 'ClosestAirport', 'DistanceToAirport']
print(df_join[show_cols].head().to_markdown())

| | formatted | ClosestAirport | DistanceToAirport |
|---:|:---|:---|---:|
| 0 | Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands | Amsterdam Airport Schiphol | 8659.8 |
| 1 | Barbarastraße 1, 45964 Gladbeck, Germany | Düsseldorf Airport | 35221.7 |
| 2 | Am Gesundheitspark, 51375 Leverkusen, Germany | Cologne Bonn Airport | 19763.5 |
| 3 | Rue de Tivoli, 57000 Metz, France | Metz-Nancy-Lorraine Airport | 13671.2 |
| 4 | Propst-Sievert-Weg, 46325 Borken, Germany | Twente Airport | 48446.2 |

A brief outlook on geospatial advanced analytics

We illustrated how to enrich address data with geometries (points and shapes), how to manipulate them, and how to join them with other geospatial datasets. Such data manipulation typically makes up 90%, if not more, of every data analytics problem we face. This last section gives an outlook on so-called "advanced" analytics problems using geospatial data.

The most widely used geospatial technique is spatial autocorrelation. Say that, along with every one of your address records, you also observe a numeric attribute like household income. With spatial autocorrelation, you can study whether a household's income correlates with that of its neighbors. This PySAL tutorial shows you how to evaluate such questions with formal statistical tests. Again, it integrates well with what we did in this tutorial: it uses a GeoPandas table as input to compute spatial weights that quantify geographic proximity.

Say you study customer churn with a classification model, predicting whether any given customer will stay with your business for a future period. Provided you also have your customers' addresses, have you considered integrating geospatial features into your model? As a first step, you can confirm their significance by studying spatial autocorrelation of customer churn. If it is present, incorporate average churn rates (or other customer signals) weighted over space and time, e.g., the average churn rate in the neighborhood of every current customer over the last 12 months.
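As a rough sketch of such a feature, here is an inverse-distance-weighted churn average over a customer's neighbors. The weighting scheme and all numbers are illustrative, not a recommendation:

```python
import numpy as np

# Churn labels (1 = churned) of four neighboring customers and their
# distances in meters to the customer we build the feature for:
churn = np.array([1.0, 0.0, 1.0, 0.0])
dist = np.array([100.0, 200.0, 400.0, 800.0])

weights = 1.0 / dist  # closer neighbors count more (illustrative choice)
feature = np.average(churn, weights=weights)
print(round(float(feature), 3))
# → 0.667
```

In practice, the distances would come from a GeoPandas join like the airport example above, and the weighting kernel is something to tune.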