Cities, Towns, Villages and Hamlets dataset: How to Use for Data Analysis

Uncover insights in our city and town dataset with JavaScript and Python code samples for actionable analysis

AI-Generated Urban Landscapes

We've recently acquired an extensive dataset from OpenStreetMap, packed with information about cities, towns, villages, and hamlets from all around the globe. But what's even more exciting is that we're not just here to talk about it; we're here to show you how to put it to practical use, focusing on US cities, towns, villages, and hamlets for this article. You can download datasets for any other country.

Data Insights from Settlements Dataset

Our dataset is derived from OpenStreetMap and exclusively comprises objects tagged with 'place=city', 'place=town', 'place=village', and 'place=hamlet'. It is distributed as NDJSON files (newline-delimited JSON, one object per line), with one file per place type. Each entry provides the following details:

  • Name: The name of the city, town, or village.
  • Other Names: Multilingual alternative names for the settlement, including names in languages such as Arabic, German, and Chinese.
  • Display Name: The official name and location of the settlement, including the county and state.
  • Address: Details about the settlement's administrative location, including city, county, state, postal code, and country.
  • Population: The population count for the respective settlement.
  • OSM Type and OSM ID: The OpenStreetMap object type and ID.
  • Location: Geographical coordinates (latitude and longitude) pinpointing the settlement's location.
  • Bounding Box (Bbox): The geographical coordinates defining the outer boundaries of the settlement.

Here's an example of a JSON object representing a city entry:

{
   "name": "San Diego",
   "other_names": {
      "name:ar": "سان دييغو",
      "name:en": "San Diego",
      "name:ja": "サンディエゴ",
      "name:ko": "샌디에이고",
      "name:ru": "Сан-Диего",
      "name:zh-Hans": "圣迭戈",
      "name:zh-Hant": "聖地牙哥",
      "name:he": "סן דייגו",
      "name:lt": "San Diegas",
      "name:oc": "San Diego",
      "name:pt": "São Diego",
      "name:uk": "Сан-Дієго",
      "name:zh": "聖地牙哥"
   },
   "display_name": "San Diego, San Diego County, California, United States",
   "address": {
      "city": "San Diego",
      "county": "San Diego County",
      "state": "California",
      "ISO3166-2-lvl4": "US-CA",
      "country": "United States",
      "country_code": "us"
   },
   "population": 1394928,
   "osm_type": "relation",
   "osm_id": 253832,
   "type": "administrative",
   "location": [-117.1627728, 32.7174202],
   "bbox": [-117.3098053, 32.5347737, -116.9057226, 33.114249]
}

This dataset is rich in geospatial and administrative information, making it a valuable resource for various analytical and research applications, particularly for data analysts, urban planners, and researchers interested in demographic studies, urban development, and more. Here are some practical examples of how this dataset can be valuable for data analysts:

  • Grouping by Administrative Divisions: Data analysts can group settlements by their respective administrative divisions, such as counties and states. This allows for regional analysis and comparison, making it possible to assess the distribution of cities, towns, villages, and hamlets within larger geographic units. This can be invaluable for regional planning, resource allocation, and understanding the urban-rural divide within specific administrative regions.
  • Obtaining Settlement Boundaries: Data analysts can use the geographical coordinates and other data attributes to define the boundaries of each settlement. This information is essential for mapping the precise areas of cities, towns, villages, and hamlets. It enables the creation of accurate boundary maps, which can be utilized in various applications, such as land-use planning, zoning regulations, and urban development projects.
  • Identifying the Largest and Most Populated Areas: Data analysts can easily identify the largest and most populated cities or towns by sorting the dataset based on population figures. This is useful for various purposes, such as market prioritization, resource allocation, and targeting specific areas for services.
  • Multilingual Place Names: The dataset provides multilingual place names, allowing data analysts to access international names for cities and towns. This is particularly useful for businesses and organizations operating in diverse linguistic regions, enabling them to display location names in the preferred languages of their target audience.
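
To illustrate the last point, here's a minimal lookup sketch in JavaScript. The getLocalizedName helper is hypothetical (it's not part of the dataset or any library); it assumes a settlement object shaped like the San Diego example above:

// Hypothetical helper: returns a settlement's name in the requested
// language, falling back to the default name when no translation exists.
function getLocalizedName(settlement, languageCode) {
  const otherNames = settlement.other_names || {};
  return otherNames["name:" + languageCode] || settlement.name;
}

// With the San Diego entry above:
// getLocalizedName(sanDiego, "ja") => "サンディエゴ"
// getLocalizedName(sanDiego, "fr") => "San Diego" (no French entry, so it falls back)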

Loading and Reading the Dataset

To tap into the valuable information in the dataset of cities, towns, villages, and hamlets, you'll first need to load and read the data. Since the files can be quite large, it's best to do this on a server or on your local computer so that processing runs smoothly without straining your resources.

You can use programming languages like Python, JavaScript (Node.js), or PHP to read and manipulate the dataset.

JavaScript (Node.js)

Here's a simple example using JavaScript (Node.js):

const fs = require("fs");

// Load data for cities, towns, villages, and hamlets
const cities = getLocationsData("place-city.ndjson");
const towns = getLocationsData("place-town.ndjson");
const villages = getLocationsData("place-village.ndjson");
const hamlets = getLocationsData("place-hamlet.ndjson");

// Merge data into one array
const allData = [...cities, ...towns, ...villages, ...hamlets];

// [Optionally] Write the combined data to a JSON file
//fs.writeFileSync("output/all-settlements.json", JSON.stringify(allData), "utf-8");
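// [Optionally] Create the 'output' folder first if it doesn't exist
// (a sketch; the recursive option requires Node.js >= 10.12):
//fs.mkdirSync("output", { recursive: true });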

// Function to process data from a file
function getLocationsData(filename) {
  // NDJSON files contain one JSON object per line,
  // so split on newlines and parse each record individually
  const sourceData = fs.readFileSync(filename, "utf-8");
  let jsonArr = sourceData
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));

  // [Optionally] Preprocess open data by removing duplicates
  //jsonArr = removeDuplicates(jsonArr);
  return jsonArr;
}
  1. Save the JavaScript code as 'script.js' (or any preferred filename) in the same directory as the following NDJSON files: 'place-city.ndjson', 'place-town.ndjson', 'place-village.ndjson', and 'place-hamlet.ndjson'.
  2. Install Node.js if required and run the code by executing the command: node script.js.
  3. Ensure that you have created an 'output' folder in advance if you plan to save the resulting JSON object (or uncomment the fs.mkdirSync line in the script to create it automatically).
  4. As the datasets are based on open data, it's important to note that they may contain inconsistencies. For instance, duplicates can be present, where multiple records represent the same city. These duplicates have been retained in their original state because they may serve different purposes.

For example, the dataset contains two separate OpenStreetMap (OSM) objects that both refer to the city "New London". These objects can contain similar or identical information, such as the city name, location, and population. As a result, when processing the dataset, the two entries would be identified as potential duplicates.

In such cases, the removeDuplicates function aims to retain the entry with the largest geographical area, ensuring that only one representation of "New London" remains in the dataset while discarding the redundant entry. This process helps maintain data consistency and efficiency by eliminating unnecessary duplicates.

// Removes duplicates from an array of settlement data.
function removeDuplicates(arr) {
  const arrNoDup = [];
  const checked = new Set();

  arr.forEach((value, index) => {
    if (checked.has(index)) {
      return;
    }

    // Find potential duplicates for the current value:
    // same name and population, same display name, or
    // same name within the same county and state.
    const duplicates = arr.filter((possibleDuplicate, possibleDuplicateIndex) => {
      const isDup =
        (possibleDuplicate.name === value.name && possibleDuplicate.population && possibleDuplicate.population === value.population) ||
        (possibleDuplicate.display_name === value.display_name) ||
        (possibleDuplicate.name === value.name && possibleDuplicate.address.county === value.address.county && possibleDuplicate.address.state === value.address.state);

      if (isDup && possibleDuplicateIndex !== index) {
        checked.add(possibleDuplicateIndex);
      }

      return isDup;
    });

    // Keep the duplicate covering the largest bounding-box area.
    const largest = duplicates.reduce((p, v) => {
      const areaV = Math.abs(v.bbox[0] - v.bbox[2]) * Math.abs(v.bbox[1] - v.bbox[3]);
      const areaP = Math.abs(p.bbox[0] - p.bbox[2]) * Math.abs(p.bbox[1] - p.bbox[3]);
      return areaP > areaV ? p : v;
    });

    arrNoDup.push(largest);
  });

  return arrNoDup;
}

This function takes an array of settlement data as input and removes duplicates based on specific criteria. It goes through the array, identifies potential duplicates, and keeps the largest area settlement while discarding the others. The function returns an array with duplicates removed, making the data more manageable and consistent.

It's important to mention that this duplicate-removal approach compares every entry against every other entry, so it runs in quadratic time and may take a significant amount of time on large datasets.
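
If run time becomes a problem, a common alternative is a single pass over the data using a composite lookup key. The sketch below is a simplified variant of the same idea, assuming that name, county, and state identify a settlement well enough (it drops the population and display_name checks used above), and keeps the largest bounding box per group:

// Approximate single-pass deduplication: roughly linear time instead of quadratic.
function removeDuplicatesFast(arr) {
  const byKey = new Map();

  arr.forEach((settlement) => {
    // Composite key; entries missing county or state still get a stable key
    const key = [settlement.name, settlement.address.county, settlement.address.state].join("|");

    const area =
      Math.abs(settlement.bbox[0] - settlement.bbox[2]) *
      Math.abs(settlement.bbox[1] - settlement.bbox[3]);

    // Keep only the entry with the largest bounding-box area per key
    const existing = byKey.get(key);
    if (!existing || area > existing.area) {
      byKey.set(key, { settlement, area });
    }
  });

  return Array.from(byKey.values()).map((entry) => entry.settlement);
}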

Python

Here's the equivalent code in Python using the json module to work with JSON data:

import json

def get_locations_data(filename):
    # NDJSON files contain one JSON object per line,
    # so parse each non-empty line individually
    with open(filename, 'r', encoding='utf-8') as file:
        json_arr = [json.loads(line) for line in file if line.strip()]
    json_arr = remove_duplicates(json_arr)
    return json_arr

def remove_duplicates(arr):
    arr_no_dup = []
    checked = set()

    for index, value in enumerate(arr):
        if index in checked:
            continue

        # Find potential duplicates for the current value:
        # same name and population, same display name, or
        # same name within the same county and state.
        duplicates = []
        for possible_duplicate_index, possible_duplicate in enumerate(arr):
            is_dup = (
                (possible_duplicate.get('name') == value.get('name') and
                 possible_duplicate.get('population') and
                 possible_duplicate.get('population') == value.get('population')) or
                possible_duplicate.get('display_name') == value.get('display_name') or
                (possible_duplicate.get('name') == value.get('name') and
                 possible_duplicate['address'].get('county') == value['address'].get('county') and
                 possible_duplicate['address'].get('state') == value['address'].get('state'))
            )
            if is_dup:
                duplicates.append(possible_duplicate)
                # Mark positions in the original array as processed
                if possible_duplicate_index != index:
                    checked.add(possible_duplicate_index)

        # Keep the duplicate covering the largest bounding-box area
        largest = max(duplicates, key=lambda x: abs(x['bbox'][0] - x['bbox'][2]) * abs(x['bbox'][1] - x['bbox'][3]))

        arr_no_dup.append(largest)

    return arr_no_dup

def main():
    cities = get_locations_data("place-city.ndjson")
    towns = get_locations_data("place-town.ndjson")
    villages = get_locations_data("place-village.ndjson")
    hamlets = get_locations_data("place-hamlet.ndjson")

    all_data = cities + towns + villages + hamlets

    with open("output/all-settlements.json", 'w', encoding='utf-8') as file:
        json.dump(all_data, file)

if __name__ == "__main__":
    main()

To run the provided Python code, follow these steps:

  1. Install Python: If you don't already have Python installed on your system, you can download it from the official Python website and follow the installation instructions for your operating system.

  2. Prepare Input Files: Place your input NDJSON files (place-city.ndjson, place-town.ndjson, place-village.ndjson, and place-hamlet.ndjson) in the same directory as the Python script.

  3. Run the Python Script:

    • Open a command prompt or terminal.
    • Navigate to the directory where you saved the Python script and input files.
    • Run the script by executing the following command:
      python script.py

    Replace script.py with the actual name of your Python script if you named it differently.

  4. Processing: The script will process the input files, remove duplicates, and create the all-settlements.json output file in the 'output' folder (create this folder in advance if it doesn't exist).

Make sure you have the necessary permissions to read and write files in the directory where you're running the script. The output file, all-settlements.json, will contain the processed data with duplicates removed.

Example: Getting Settlements in a US State

You can use the provided dataset of cities, towns, villages, and hamlets to extract information about settlements within a particular US state. This is particularly useful for geographic analysis or when you need data specific to a region.

The JavaScript and Python code samples illustrate a straightforward approach to group settlements by state. Utilizing this code allows you to easily organize the dataset into separate collections, each representing the settlements within a specific state. This makes it convenient for state-specific analyses.

JavaScript (Node.js)

const fs = require("fs");

const settlements = JSON.parse(fs.readFileSync("output/all-settlements.json", "utf-8"));
const settlementsByState = settlements.reduce((byStatesMap, settlement) => {
    byStatesMap[settlement.address.state] = byStatesMap[settlement.address.state] || [];
    byStatesMap[settlement.address.state].push(settlement);
    return byStatesMap;
}, {});


console.log(settlementsByState['California'].length);

This JavaScript code sample demonstrates how to efficiently group settlements by their respective states. The code starts by reading the preprocessed data from a JSON file, which contains information about various settlements. It then employs a reduction process to categorize settlements into distinct groups based on their associated states.

The key function here is reduce(), which iterates through the settlements, allocating each settlement to its corresponding state within the settlementsByState object. The result is a structured collection where each state serves as a key, and its associated settlements are stored as values in an array.

The example shows how to access the number of settlements in California, which is just one application of this code. The code's flexibility allows data analysts to organize and access settlement data by state, making state-specific analysis more accessible and efficient.
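
As a quick extension, the same settlementsByState object can be used to rank states by settlement count; a minimal sketch building on the code above:

// Rank states by the number of settlements they contain
const stateCounts = Object.entries(settlementsByState)
  .map(([state, list]) => ({ state, count: list.length }))
  .sort((a, b) => b.count - a.count);

// Display the five states with the most settlements
console.log(stateCounts.slice(0, 5));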

Python

import json

# Load the preprocessed data from the JSON file
with open("output/all-settlements.json", "r", encoding="utf-8") as file:
    settlements = json.load(file)

# Create a dictionary to group settlements by state
settlements_by_state = {}
for settlement in settlements:
    state = settlement["address"].get("state")
    if state not in settlements_by_state:
        settlements_by_state[state] = []
    settlements_by_state[state].append(settlement)

# Access the number of settlements in California as an example
california_settlements_count = len(settlements_by_state.get("California", []))
print(california_settlements_count)

This Python code accomplishes the same task as the JavaScript code. It loads the JSON data, organizes settlements into state-specific lists, and allows for easy access to the number of settlements in a specific state, such as California.

Example: Finding the Most Populated Cities in the US

The code samples below demonstrate how to identify the most populated cities in the US. By sorting the dataset by population, you can obtain a ranked list of cities. The samples read from 'output/cities.json'; you can create this file by writing the deduplicated city data (loaded from 'place-city.ndjson' in the earlier example) to disk, just as 'all-settlements.json' was written.

JavaScript (Node.js)

const fs = require("fs");

// Load the preprocessed city data from the JSON file
const cities = JSON.parse(fs.readFileSync("output/cities.json", "utf-8"));

// Sort the cities based on population in descending order
cities.sort((city1, city2) => {
  return (city2.population || 0) - (city1.population || 0);
});

// Display the names of the top 100 most populated cities
console.log(cities.slice(0, 100).map(city => city.name));
  • The code loads preprocessed city data from a JSON file using the fs (File System) module.
  • The cities are then sorted based on their population in descending order. This ensures that the cities with the highest populations are positioned at the beginning of the sorted list.
  • Then the code extracts and displays the names of the top 100 most populous cities. This provides a concise list of the largest urban centers, making it convenient for further analysis.

Python

import json

# Load the preprocessed city data from the JSON file
with open("output/cities.json", "r", encoding="utf-8") as file:
    cities = json.load(file)

# Sort the cities based on population in descending order
cities.sort(key=lambda city: city.get("population", 0), reverse=True)

# Display the names of the top 100 most populated cities
top_cities = [city["name"] for city in cities[:100]]
print(top_cities)

This Python code performs the same tasks as the JavaScript code: it loads preprocessed city data from a JSON file, sorts the cities based on population in descending order, and displays the names of the top 100 most populated cities.

Example: Finding Nearby Cities

One of the valuable features of this dataset is that it includes location coordinates for each city, town, village, and hamlet. This geographical data opens up various possibilities, such as finding the cities closest to a given location.

This code example demonstrates how to find the cities nearest to a specific geographical point using the dataset of cities:

JavaScript (Node.js)

const fs = require("fs");

// Load the preprocessed city data from the JSON file
const cities = JSON.parse(fs.readFileSync("output/cities.json", "utf-8"));

// Define the reference point as longitude and latitude
const lonLatPosition = [-121.122384, 44.006523];

// Sort cities based on proximity to the reference point
cities.sort((city1, city2) => {
  // Calculate the distance using the Haversine formula
  const distance1 = getDistanceFromLatLonInKm(lonLatPosition[1], lonLatPosition[0], city1.location[1], city1.location[0]);
  const distance2 = getDistanceFromLatLonInKm(lonLatPosition[1], lonLatPosition[0], city2.location[1], city2.location[0]);
  return distance1 - distance2;
});

// Haversine formula to calculate distance in kilometers
// https://stackoverflow.com/a/27943
function getDistanceFromLatLonInKm(lat1, lon1, lat2, lon2) {
  var R = 6371; // Radius of the Earth in kilometers
  var dLat = deg2rad(lat2 - lat1); // Convert latitude difference to radians
  var dLon = deg2rad(lon2 - lon1); // Convert longitude difference to radians
  var a =
    Math.sin(dLat / 2) * Math.sin(dLat / 2) +
    Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2)) * Math.sin(dLon / 2) * Math.sin(dLon / 2);
  var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a)); // Calculate angular distance
  var d = R * c; // Calculate distance in kilometers
  return d;
}

// Utility function to convert degrees to radians
function deg2rad(deg) {
  return deg * (Math.PI / 180);
}

// Display the names of the top 100 cities nearest to the reference point
console.log(cities.slice(0, 100).map(city => city.name));
  • The code begins by loading the preprocessed city data from the JSON file "cities.json".
  • The cities are sorted based on their proximity to the reference point. The cities.sort function calculates the distance between the reference point and each city using the Haversine formula. This formula computes the great-circle distance between two points on the Earth's surface given their longitude and latitude.
  • Then the code displays the names of the top 100 cities that are nearest to the reference point. The cities.slice(0, 100) operation retrieves the 100 closest cities based on the calculated distances, and the .map(city => city.name) extracts and displays their names.
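
A closely related task is finding all cities within a fixed radius rather than the nearest 100. Reusing the getDistanceFromLatLonInKm function from the code above, a minimal sketch (the 50 km radius is just an example value):

// Keep only the cities within 50 km of the reference point
const radiusKm = 50;
const citiesWithinRadius = cities.filter((city) => {
  const distance = getDistanceFromLatLonInKm(
    lonLatPosition[1], lonLatPosition[0],
    city.location[1], city.location[0]
  );
  return distance <= radiusKm;
});

console.log(citiesWithinRadius.map((city) => city.name));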

Python

import json
import math

# Haversine formula to calculate distance in kilometers
def get_distance_from_lat_lon_km(lat1, lon1, lat2, lon2):
    R = 6371  # Radius of the Earth in kilometers
    d_lat = deg2rad(lat2 - lat1)
    d_lon = deg2rad(lon2 - lon1)
    a = (
        math.sin(d_lat / 2) * math.sin(d_lat / 2) +
        math.cos(deg2rad(lat1)) * math.cos(deg2rad(lat2)) *
        math.sin(d_lon / 2) * math.sin(d_lon / 2)
    )
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    return distance

# Utility function to convert degrees to radians
def deg2rad(deg):
    return deg * (math.pi / 180)

# Load the preprocessed city data from the JSON file
with open("output/cities.json", "r", encoding="utf-8") as file:
    cities = json.load(file)

# Define the reference point as longitude and latitude
lon_lat_position = [-121.122384, 44.006523]

# Sort cities based on proximity to the reference point
cities.sort(key=lambda city: get_distance_from_lat_lon_km(lon_lat_position[1], lon_lat_position[0], city["location"][1], city["location"][0]))

# Display the names of the top 100 cities nearest to the reference point
print([city.get("name") for city in cities[:100]])

This Python code performs the same tasks as the JavaScript code described earlier. It loads city data from a JSON file, calculates the distances between cities and a reference point using the Haversine formula, and then sorts and displays the names of the top 100 cities closest to the reference point.

Conclusion

In conclusion, the dataset containing cities, towns, villages, and hamlets is a valuable resource for data analysts seeking to explore and analyze geographic data. Whether you're interested in understanding population distributions, identifying administrative boundaries, or finding the closest settlements to a specific location, this dataset offers a wealth of information.

We've demonstrated how to work with this dataset using both JavaScript and Python, providing code samples to simplify the data loading and analysis process. Because the files can be large, it's best to process them on a server or on your local machine.

The dataset can assist data analysts in a variety of ways. You can categorize settlements by state, identify the most populated cities in the United States, and even find the cities closest to a given location. These practical examples showcase the versatility and utility of the dataset for data analysis tasks.

Moreover, Geoapify's APIs, such as the Places API and Boundary API, provide even more powerful tools to streamline the process of accessing and utilizing geographic data. These APIs can enhance your analytical capabilities and make working with settlement data more efficient and user-friendly.

As you continue your journey into the world of data analysis, remember that the dataset of cities, towns, villages, and hamlets is a valuable asset, and with the right tools and techniques, you can unlock valuable insights and information about the world's settlements.