Big data and cartography#
In this tutorial, we are going to use Python to explore and visualize big geospatial data. We will work with the following datasets:
A point dataset of Flickr posts in Finland 2019–2023, acquired through the platform’s API. The posts have been stripped of all identifiable information and the exact locations obfuscated using the Laplace noise algorithm implemented in the GeoPriv QGIS plugin; a rough sketch of this idea is shown after this list. The only attribute information remaining is a pseudo-id and a month-year timestamp per post. (161,544 data points)
An origin-destination line dataset of student mobilities to Germany.
Network speeds in the first quarter of 2023, fetched from Ookla Speedtest.
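The obfuscation of the Flickr points was done beforehand with the GeoPriv QGIS plugin, so we do not repeat it here. As a rough, minimal sketch of the underlying idea only, the snippet below shifts point coordinates by Laplace-distributed noise; the function name, the noise scale, and the use of NumPy are illustrative assumptions and not the plugin's actual implementation.

import numpy as np
from shapely.geometry import Point

def laplace_obfuscate(gdf, scale=0.001, seed=42):
    # Illustrative only: shift each point by Laplace-distributed noise (in degrees).
    # The GeoPriv plugin's real algorithm and parameters may differ.
    rng = np.random.default_rng(seed)
    dx = rng.laplace(0.0, scale, size=len(gdf))
    dy = rng.laplace(0.0, scale, size=len(gdf))
    out = gdf.copy()
    out["geometry"] = [Point(p.x + ex, p.y + ey)
                       for p, ex, ey in zip(gdf.geometry, dx, dy)]
    return out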
In addition to the Python libraries we have already worked with, we will get familiar with some additional Python libraries that are handy for handling and visualizing big geospatial data.
Flickr data from Finland#
As always, let’s start by exploring our data a little bit: load the data, make a quick plot, count the number of points, and print the head of the dataset.
[1]:
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
# Load the data
file_path = 'data/flickr_finland_2019_2023_obfuscated.gpkg'
flickr_data = gpd.read_file(file_path)
flickr_data.plot(markersize=2, edgecolor='none')
print(flickr_data.head())
print(f"\nThere are {len(flickr_data)} points in our dataset\n")
pseudo_id month_year geometry
0 2 2020-02-01 POINT (22.45380 62.69018)
1 3 2020-02-01 POINT (22.45371 62.69033)
2 4 2020-06-01 POINT (29.76744 62.59329)
3 5 2020-06-01 POINT (21.60417 63.09216)
4 6 2020-06-01 POINT (22.85247 63.22937)
There are 161544 points in our dataset

As we can see, our data is from Finland and we have 161,544 points altogether, each representing the location of a photo. The data also contains a column named month_year
which, as the name suggests, holds the month and year when the photo was taken. Let’s quickly explore the temporal distribution of the data using a time series plot.
[2]:
# Convert 'month_year' to datetime format for better handling
flickr_data['month_year'] = pd.to_datetime(flickr_data['month_year'])
# Group by 'month_year' and count occurrences
time_series = flickr_data.groupby('month_year').size()
# Plotting the time series
plt.figure(figsize=(12, 6))
plt.plot(time_series, marker='o', linestyle='-', color='b')
plt.title('Number of Photos per Month (2019-2023)')
plt.xlabel('Month and Year')
plt.ylabel('Number of Photos')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout() # Adjusts plot to ensure everything fits without overlap
plt.show()

Hexbin map#
Hexagons are increasingly favored in the visualization of large geospatial datasets for several reasons:
Efficient Coverage and Less Wastage: Of the three regular shapes that tile a plane without gaps or overlaps (triangles, squares, and hexagons), hexagons have the lowest perimeter-to-area ratio. When visualizing large areas, this means less cell edge per unit of area and a more uniform representation of the underlying data.
Uniformity of Data Representation: Each hexagon has six sides of equal length, which results in a more uniform distance between the center of each hexagon and any point on its boundary. This uniformity ensures that each data point within a hexagon is equally representative of its center. In contrast, square grids have varying distances from the center to the middle of the edges versus the corners, potentially introducing bias in how data is represented spatially.
Better Nearest Neighbor Analysis: Hexagons have a unique advantage due to their six equidistant neighbors, which is beneficial for nearest neighbor analysis. This property ensures that each cell interacts more symmetrically with its surroundings, providing a more natural flow of data and smoother transitions across the grid (the short sketch after this list shows this six-neighbor structure in practice). Squares, on the other hand, connect to their neighbors at four sides and interact less directly with their diagonal neighbors, which can complicate analyses involving spatial relationships.
Visual Appeal and Reduced Visual Errors: Hexagonal grids tend to be more visually appealing and easier for the human eye to follow. This can enhance the overall readability and interpretation of maps. Furthermore, the equidistant properties of hexagons can reduce visual distortions that often occur with square grids, where the clustering of data points might appear more intense at the corners compared to the center.
Efficient Computation: Despite a seemingly complex shape, algorithms for processing hexagonal grids are often more straightforward and computationally efficient for spatial data analyses than those for squares. This is due to the consistent distances and connectivity, which simplify the computation of spatial relationships and aggregations.
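To make the neighbor argument concrete, here is a quick sketch using the h3 library that we use below (v3 API); the example location in Helsinki and the resolution are arbitrary choices for illustration.

import h3

# An arbitrary point in central Helsinki, indexed at resolution 8 (illustrative values)
cell = h3.geo_to_h3(60.17, 24.94, 8)

# hex_ring(cell, 1) returns the cells exactly one step away;
# for an ordinary (non-pentagon) cell this is always six neighbors
neighbors = h3.hex_ring(cell, 1)
print(len(neighbors))  # 6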
[3]:
import h3
from shapely.geometry import Polygon
import matplotlib.pyplot as plt
import mapclassify
# for zooming in to the Helsinki area, if needed
helsinki_bounds = {
"min_lon": 24.50,
"max_lon": 25.50,
"min_lat": 60.00,
"max_lat": 60.50
}
# Function to assign hexagon using H3
def assign_hexagon(row, resolution=8):
return h3.geo_to_h3(row.geometry.y, row.geometry.x, resolution)
# Apply function to data
flickr_data['hex_id'] = flickr_data.apply(assign_hexagon, axis=1)
# Aggregate data within each hexagon
hex_counts = flickr_data['hex_id'].value_counts().reset_index()
hex_counts.columns = ['hex_id', 'count']
# Generate hexagon geometries
hex_counts['geometry'] = hex_counts['hex_id'].apply(lambda x: h3.h3_to_geo_boundary(x, geo_json=True))
# Convert hexagon geometries to GeoDataFrame
def hex_to_polygon(hex):
return Polygon([(lon, lat) for lon, lat in hex])
hex_counts['geometry'] = hex_counts['geometry'].apply(hex_to_polygon)
hex_gdf = gpd.GeoDataFrame(hex_counts, geometry='geometry')
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
# Set the bounds for the plot to zoom into Helsinki, if desired
#ax.set_xlim(helsinki_bounds["min_lon"], helsinki_bounds["max_lon"])
#ax.set_ylim(helsinki_bounds["min_lat"], helsinki_bounds["max_lat"])
hex_gdf.plot(ax=ax, column="count", scheme="Natural_Breaks", k=5, cmap="RdYlBu", legend=True, legend_kwds={'loc': 'lower right'})
[3]:
<Axes: >

Here is a breakdown of what we are doing in the block of code above:
Define a Function to Assign Hexagons:
def assign_hexagon(row, resolution=8):
    return h3.geo_to_h3(row.geometry.y, row.geometry.x, resolution)
Purpose: This function converts geographic coordinates into H3 indices. Each H3 index represents a hexagonal cell on the globe at a specified resolution.
Parameters:
row: A row of a DataFrame, expected to have geometry attributes (y for latitude and x for longitude).
resolution: The granularity of the hexagonal grid. Higher values create smaller hexagons (a short sketch of how cell size varies with resolution follows this breakdown).
Functionality: The function h3.geo_to_h3() takes latitude (row.geometry.y), longitude (row.geometry.x), and resolution to generate a unique H3 index (hexagon ID) for that location.
Apply Function to Data:
flickr_data['hex_id'] = flickr_data.apply(assign_hexagon, axis=1)
Purpose: To assign an H3 hexagon ID to each record in the flickr_data GeoDataFrame.
Method: DataFrame.apply() applies the assign_hexagon function along the DataFrame’s rows (axis=1).
Aggregate Data Within Each Hexagon:
hex_counts = flickr_data['hex_id'].value_counts().reset_index()
hex_counts.columns = ['hex_id', 'count']
Purpose: To count occurrences of each unique H3 index in flickr_data.
Functionality:
value_counts(): counts how many times each unique value appears in the hex_id column.
reset_index(): converts the Series returned by value_counts() into a DataFrame.
Renaming columns: the columns are renamed to hex_id and count for clarity.
Generate Hexagon Geometries:
hex_counts['geometry'] = hex_counts['hex_id'].apply(lambda x: h3.h3_to_geo_boundary(x, geo_json=True))
Purpose: To convert each H3 index back into a polygon that represents the hexagonal cell on a map.
Method: The h3.h3_to_geo_boundary() function retrieves the vertices of the hexagon associated with each H3 index, formatted for use in GeoJSON.
Convert Hexagon Geometries to GeoDataFrame:
def hex_to_polygon(hex):
    return Polygon([(lon, lat) for lon, lat in hex])
hex_counts['geometry'] = hex_counts['geometry'].apply(hex_to_polygon)
hex_gdf = gpd.GeoDataFrame(hex_counts, geometry='geometry')
Purpose: To transform the lists of hexagon vertices into shapely Polygon objects suitable for spatial operations and visualization.
Functionality:
hex_to_polygon: a function that converts a list of (longitude, latitude) tuples into a Polygon.
The conversion function is applied to each hexagon’s vertices.
Finally, a GeoDataFrame is created from the result.
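The resolution parameter mentioned above controls the hexagon size. Assuming the same h3 v3 API used in this tutorial, the quick sketch below prints the approximate average cell area and edge length for a few resolutions, which can help when choosing a resolution for your own data; the resolutions listed are arbitrary examples.

import h3

# Approximate average cell size per H3 resolution (h3 v3 API)
for res in [6, 7, 8, 9]:
    area = h3.hex_area(res, unit='km^2')
    edge = h3.edge_length(res, unit='km')
    print(f"resolution {res}: ~{area:.2f} km^2 per cell, edge ~{edge:.2f} km")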
We will learn how to make interactive plots later in this course, but let’s already have a quick, interactive look at our data using Plotly:
[4]:
import plotly.express as px
import mapclassify as mc
classifier = mc.NaturalBreaks(hex_gdf['count'], k=10)
hex_gdf['count_class'] = classifier.yb # Assign class labels to the hex_gdf
# Plot with Plotly
fig = px.choropleth_mapbox(data_frame=hex_gdf, geojson=hex_gdf.__geo_interface__,
locations="hex_id", color='count_class',
color_continuous_scale="Viridis",
featureidkey="properties.hex_id",
mapbox_style="carto-positron",
zoom=5, center={"lat": 64.5, "lon": 26.0},
opacity=0.5, labels={'count':'Data Points'},
hover_data={'count': True}
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
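As a side note on the classification step above: mapclassify classifiers such as NaturalBreaks expose the class index of each observation in .yb (used above) and the upper class boundaries in .bins, so a quick way to check what the ten classes actually mean is:

# Inspect the NaturalBreaks classification computed above
print(classifier.bins)    # upper bound of each of the k=10 classes
print(classifier.counts)  # number of hexagons falling into each class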