Using Google’s AlphaEarth Foundations Embeddings for Building Analysis: From Satellite Data to Predictive Urban Models
!pip install earthengine-api pyarrow scikit-learn pandas geopandas clustergram keplergl umap-learn seaborn shap duckdb plotly # install packages
Learning Objectives
This training pack demonstrates how to extract and analyse Google’s AlphaEarth Foundations satellite embeddings at the building level, progressing from basic data extraction through advanced predictive modeling. The tutorial uses Liverpool City Region as a case study to illustrate a complete analytic workflow that combines satellite AI with urban datasets. Note: you will need a Google Account to access these data.
All of the data used for this tutorial can be downloaded from the Geographic Data Service. These should be placed in a folder called data in your working directory.
By the end of this training pack you will be able to:
- Data Access and Extraction:
- Describe what satellite embeddings are and how they function as numerical “fingerprints” of geographic locations
- Understand the structure and capabilities of Google’s AlphaEarth Foundations model and its 64-dimensional embeddings at 10-meter resolution
- Setup, authenticate and initialise Google Earth Engine API from Python
- Access Google’s AlphaEarth satellite embedding collection for the time periods 2017-2024
- Extract pixel-level embeddings for large geographic regions and specific building locations
- Clustering and Pattern Recognition:
- Apply clustergram analysis to determine the optimal number of clusters
- Perform k-means clustering on satellite embedding data to identify building typologies
- Interpret cluster patterns and validate results using external datasets
- Advanced Analytic Techniques:
- Use UMAP (Uniform Manifold Approximation and Projection) to visualise high-dimensional embeddings in 2D
- Calculate cosine similarity to identify buildings with similar characteristics
- Integrate satellite embeddings with administrative datasets (Energy Performance Certificates)
- Create propensity indices to characterise cluster distributions across building attributes
- Predictive Modeling and Evaluation:
- Build a Random Forest model to predict building characteristics using satellite embeddings
- Evaluate model performance using confusion matrices, classification reports, and feature importance analysis
- Understand the limitations and appropriate applications of embedding-based prediction
- Visualisation and Communication:
- Export results to various formats (GeoPackage, Parquet) for use in GIS software
- Generate publication-ready plots and statistical summaries
Conceptual Understanding
AI-powered satellite analysis represents a fundamental shift from traditional remote sensing approaches. Unlike conventional methods that analyse individual satellite images in isolation, systems like AlphaEarth Foundations create unified digital representations by integrating multiple data sources into a set of embeddings. These embeddings aim to capture complex spatial and temporal patterns and support a wide range of applications in urban environments.
What are AlphaEarth Embeddings?
AlphaEarth Foundations (Brown et al. 2025) is an artificial intelligence model developed by Google DeepMind that functions as a comprehensive Earth observation system. Rather than requiring users to process raw satellite imagery, the model provides pre-computed embeddings1 that distill complex environmental information into 64 numerical values for every 10×10 meter pixel across Earth’s terrestrial and coastal zones.
1 Satellite embeddings are AI-generated feature representations of satellite imagery. Think of them as numerical “fingerprints” that capture the visual characteristics of a location - things like vegetation patterns, building density, water presence, and land use. Google provides pre-computed embeddings for the entire Earth, saving you from processing raw satellite images yourself.
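To make the idea of a numerical “fingerprint” concrete, the short sketch below builds two hypothetical 64-value vectors and compares them with cosine similarity, a measure used later in this tutorial. The values here are random and purely illustrative; real embeddings are pre-computed by Google and extracted in the steps that follow.
import numpy as np
# Two hypothetical 64-dimensional "fingerprints" (random, illustrative only -
# real AlphaEarth embeddings are extracted from Earth Engine later on)
rng = np.random.default_rng(0)
location_a = rng.normal(size=64)
location_b = rng.normal(size=64)
# Normalise to unit length so the dot product equals the cosine similarity
location_a /= np.linalg.norm(location_a)
location_b /= np.linalg.norm(location_b)
print(f"Values per location: {location_a.size}")
print(f"Cosine similarity between the two locations: {location_a @ location_b:.3f}")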
Setup
We recommend using at least Python 3.11, and there are a number of prerequisites and libraries required for this tutorial.
- earthengine-api is the official client library for Google Earth Engine. It allows you to access, process, and analyse geographic datasets hosted on Earth Engine directly from Python code.
- pandas is used for data manipulation and analysis and provides data structures like DataFrame.
- geopandas makes working with geographic data in pandas easier. It extends pandas DataFrames to support geometry columns (points, lines, polygons), enabling spatial operations, reading/writing GIS file formats, and easy mapping.
- pyarrow provides a columnar in-memory data platform that enables fast reading and writing of various data formats (especially Parquet files) between different systems, and accelerates analytic operations on large datasets.
- scikit-learn is a machine learning library for data mining and analysis.
- clustergram helps visualise the most effective partitioning of data using cluster analysis (k-means).
- umap-learn - a dimensionality reduction technique that creates visualisations of high-dimensional data.
- seaborn - a data visualisation library.
- plotly - an interactive plotting library.
- matplotlib - a fundamental plotting library.
- numpy - used for numerical operations and mathematical calculations.
- requests - downloading images from Google Earth Engine URLs and other web requests.
- PIL (Python Imaging Library) - used for image processing and display.
- duckdb - querying multiple files as if they were a single database.
- keplergl - creates the interactive maps.
import io # Import for handling input/output operations
from io import StringIO # allows you to treat strings as file-like objects
import ee # Imports the Earth Engine API functionality
import pandas as pd # Import the pandas functionality
import geopandas as gpd # Import the geopandas functionality
from sklearn.cluster import KMeans # Import KMeans
import matplotlib.pyplot as plt # Import for plotting
from clustergram import Clustergram # Import for visualising cluster structure
import numpy as np # Import for numerical operations
from IPython.display import Image, display # Import image and display utilities
import duckdb # Enables duckdb functionality
import plotly.express as px # Import for interactive plotting and visualisation
import umap # Import for dimensionality reduction using UMAP
import seaborn as sns # Import for statistical data visualisation
import requests # Import for making HTTP requests
from PIL import Image as PILImage # Import PIL Image functionality
import os # Import for operating system interface
from sklearn.ensemble import RandomForestClassifier # Import Random Forest classifier
from sklearn.model_selection import train_test_split # Import for splitting datasets
from sklearn.metrics import classification_report, confusion_matrix # Import for model evaluation metrics
Next we need to initiate the authentication process for the Google Earth Engine (GEE) Python API. When you run this command, it prompts you to log in to your Google account and grant the necessary permissions for the script or application to access your Earth Engine resources.
ee.Authenticate() # Follow the prompts to log in with your Google account
Setup and Pixel Level Analysis
The first part of this tutorial will guide you through the various steps to extract embeddings for the Liverpool City Region Combined Authority (LCRCA) area and examine the structure of these data using a simple cluster analysis.
Step 1: Initialise Earth Engine
The following code initialises the Google Earth Engine API.
# Initialise Earth Engine (this project is called - alpha-tutorial-477812)
ee.Initialize(project='alpha-tutorial-477812') # You may also need to specify your project, e.g. ee.Initialize(project='your_project_ID_here')
Step 2: Import the Location Data
The following code creates a GeoPandas GeoDataFrame with the location of each property and reprojects all geometries into the WGS84 (EPSG:4326).
# Read the GeoParquet file containing building locations
buildings = gpd.read_parquet('./data/lcrca_buildings.parquet')
# Convert buildings to WGS84 coordinate system (EPSG:4326)
buildings = buildings.to_crs(epsg=4326)
buildings["lon"] = buildings.geometry.x
buildings["lat"] = buildings.geometry.yStep 3: Access Google’s Satellite Embedding Collection
The following code queries Google Earth Engine’s satellite embedding collection to check if satellite image embeddings are available for a specified year (2017-2024). The processing loop iterates through each available year from 2017 to 2024, applying a temporal filter to isolate embeddings for the specified 12-month period.
The year_images dictionary serves as a container where each year maps to its corresponding embedding image, making it straightforward to access specific years’ data later in the workflow.
# Read GeoPackage into a GeoDataFrame and convert to WGS84 (EPSG:4326)
lcrca = gpd.read_file('./data/lcrca.gpkg')
# Reproject to WGS84
lcrca = lcrca.to_crs(epsg=4326)
# Get the boundary coordinates
coords = lcrca.geometry.iloc[0].__geo_interface__
# Convert to Earth Engine geometry
lcrca_geometry = ee.Geometry(coords)
# Access the embedding collection
embedding_collection = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
# Work with each year separately
years = range(2017, 2025)
year_images = {}
# Loop through each year from 2017 to 2024 that overlap with the LCRCA boundary
for year in years:
year_filtered = embedding_collection \
.filterDate(f'{year}-01-01', f'{year}-12-31') \
.filterBounds(lcrca_geometry)
# Check if any images exist for this year (avoid processing empty collections)
if year_filtered.size().getInfo() > 0:
# Create a mosaic from all filtered images and clip to the LCRCA boundary
        year_images[year] = year_filtered.mosaic().clip(lcrca_geometry)
Step 4: Explore the Structure of the Embeddings at the Pixel Level
Using Earth Engine, we can quantify the scale of data within our study region by calculating the total number of 10-meter pixels. For the Liverpool City Region in 2024, this analysis reveals approximately 12.26 million pixels within the boundary. Since each pixel contains 64 embedding values, this represents a dataset with over 784 million individual data points (12.26 million × 64 dimensions). This scale illustrates both the opportunity and challenge of working with satellite embeddings.
# Select 2024 as the year
emb_img = year_images[2024]
# Calculate the total number of pixels in the region
pixel_count = emb_img.select(0).reduceRegion(
reducer=ee.Reducer.count(), # Count the number of valid pixels
geometry=lcrca_geometry, # Within the LCRCA boundary
scale=10, # At 10-meter resolution
maxPixels=1e9 # Allow up to 1 billion pixels to be processed
)
# Get and print the count
total_pixels = pixel_count.getInfo()
total_count = list(total_pixels.values())[0]
print(f"Total pixels in region at 10m scale: {total_count:,}")Total pixels in region at 10m scale: 12,259,151
The following code block demonstrates how to extract pixel-level embeddings directly from Google Earth Engine, but we do not run this code in the tutorial. If executed, this function would:
- Randomly sample pixels from within the Liverpool City Region boundary in batches of 50,000 (the example call below requests 1.2 million pixels in total)
- Extract all 64 embedding values for each sampled pixel location
- Export the results as CSV files to your Google Drive (in a folder called “EE_exports”)
Why we skip this step: With 12.26 million pixels in the region, processing even a sample of 50,000 pixels through Earth Engine’s export system is time-consuming and requires managing multiple large CSV files.
What we provide instead: To keep this tutorial focused on analysis techniques rather than data processing, we provide the sampled pixel data as a ready-to-use Parquet file that you can load directly.
When you might use this code: If you’re working with your own study area or need a different sample size, you can adapt this function for your specific requirements. The code is included here for transparency and to support your independent applications.
def export_in_batches(image, geometry, total_pixels, batch_size=50000):
"""Export data in smaller batches"""
num_batches = (total_pixels + batch_size - 1) // batch_size
for i in range(num_batches):
# Use different seeds for each batch to get different samples
samples = image.sample(
region=geometry,
scale=10,
numPixels=min(batch_size, total_pixels - i * batch_size),
seed=42 + i, # Different seed for each batch
geometries=True
)
export_task = ee.batch.Export.table.toDrive(
collection=samples,
description=f'sampled_data_export_batch_{i+1}',
folder='EE_exports',
fileNamePrefix=f'lcrca_samples_batch_{i+1}',
fileFormat='CSV'
)
export_task.start()
print(f"Started batch {i+1}/{num_batches}")
# Usage
export_in_batches(emb_img, lcrca_geometry, 1200000, batch_size=50000)
Data Consolidation Process (Not Executed in Tutorial)
Once the CSV export batches are complete, they need to be consolidated into a single dataset for analysis. The code below demonstrates how DuckDB can efficiently combine multiple CSV files into one DataFrame and export the result as a Parquet file.
LCRCA_Samples = duckdb.sql("SELECT * FROM read_csv_auto('./data/samples/lcrca_samples_batch_*.csv')").df() # Create a single DataFrame
LCRCA_Samples = LCRCA_Samples.iloc[:, 1:-1] # Remove the first and last column
LCRCA_Samples.to_parquet('./data/LCRCA_Samples.parquet') # Export to a Parquet file
Why DuckDB is useful here: DuckDB treats multiple CSV files as if they were a single database table, eliminating the need to manually loop through files or manage memory for large datasets. This approach is particularly valuable when working with the batch export system from Google Earth Engine, which creates multiple files that need to be recombined.
Note: Like the export code above, this consolidation step is not executed in the tutorial since we provide the pre-processed Parquet file. However, this pattern is essential for real-world applications where you generate your own export batches.
We can read the parquet file into a DataFrame as follows and continue with our analysis.
LCRCA_Samples = pd.read_parquet('./data/LCRCA_Samples.parquet')
Using the sample we can examine the structure of the embeddings by fitting a Clustergram. This is a visualisation technique that shows how cluster assignments change as you increase the number of clusters (k). This helps you to understand the structure in very high-dimensional space in the following ways:
- Optimal k selection: Helps you to determine the right number of clusters by visualising how cleanly clusters separate
- Cluster stability: Shows which clusters persist across different k values (stable long lines) vs. those which are artifacts of over-clustering (short, erratic lines)
- Split patterns: Reveals the natural hierarchy in the data by showing how clusters subdivide
From the diagram, a five-cluster solution looks to be an optimal initial choice.
# Create clustergram to visualise optimal number of clusters
cgram = Clustergram(range(1, 12), init="random", n_init=10, random_state=42, verbose=False)
cgram.fit(LCRCA_Samples)
cgram.plot(figsize=(8, 5))
With the clustergram indicating that five clusters provide an optimal partitioning of the data, we can now apply K-means clustering to visualise the spatial distribution of building types. The following approach implements clustering directly within Google Earth Engine, which offers significant computational advantages over downloading the full dataset locally2.
2 This server-side approach is computationally efficient because it processes the high-dimensional data where it resides (on Google’s servers) rather than transferring millions of 64-dimensional vectors to your local machine. The result is a single-band image where each pixel is assigned a cluster label (0-4), which can then be visualised and exported as needed.
The code uses Earth Engine’s built-in clustering capabilities to:
1. Sample 100,000 training pixels from across the region to build the clustering model
2. Apply the trained K-means classifier to all 12.26 million pixels in the study area
3. Return only the final cluster assignments rather than the raw embedding values
# Sample pixels across the region - this builds an initial training model which is then applied to all pixels
training = emb_img.sample(
region=lcrca_geometry, # Sample within the LCRCA boundary
scale=10, # Use 10-meter pixel resolution (matches embedding data)
numPixels=100000, # Extract 100,000 random pixels for training
seed=42 # Set random seed for reproducible sampling
)
# Train K-Means with 5 clusters
clusterer = ee.Clusterer.wekaKMeans(5).train(training)
# Apply to the image
result = emb_img.cluster(clusterer).clip(lcrca_geometry)
# Define colours for visualising the five clusters
cluster_colours = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
# Create a palette for the clusters (GEE format)
palette = cluster_colours
# Set visualisation parameters
vis_params = {
'min': 0,
'max': 4,
'palette': palette
}
# Export as static image
url = result.getThumbURL({
'dimensions': 1024, # Image width/height in pixels
'region': lcrca_geometry,
'min': 0,
'max': 4,
'palette': palette,
'format': 'png',
'crs': 'EPSG:27700'
})
# Download the image locally
response = requests.get(url)
response.raise_for_status() # Raise an exception for bad status codes
# Create a local filename
local_filename = "./images/lcrca_clusters.png"
# Save the image locally
with open(local_filename, 'wb') as f:
f.write(response.content)
# Display the local image
display(Image(filename=local_filename, width=750))
Building Cluster Analysis
The following section extends the analysis to illustrate how geographic locations (in this case buildings) can be appended with pixel-level embeddings across a regional extent. In this example we are going to use a database of TOIDs (Topographic Identifiers) from the Ordnance Survey3.
3 A TOID (Topographic Identifier) is a unique and persistent identifier for each and every feature found in OS MasterMap products. Because the raw dataset contains identifiers for a wide range of landscape and built environment features, these were limited to only those TOIDs with an associated UPRN (Unique Property Reference Number), an authoritative identifier used to uniquely identify addressable locations.
Step 5: Extract Pixel Values for Each Building
Having imported the building locations earlier, we can now extract pixel-level embeddings for each individual building. This process requires careful handling of the large dataset: with over 640,000 building points, attempting to process all locations simultaneously would exceed Google Earth Engine’s processing limits.
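Before running the batch extraction it is worth confirming the size of the building table and the columns relied on below; this quick check assumes the lcrca_buildings.parquet file includes the TOID identifier used later in the tutorial.
# Check the number of building points and preview the identifier and coordinates
print(f"Total buildings: {len(buildings):,}")
print(buildings[['TOID', 'lon', 'lat']].head())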
Batch Processing Strategy:
The code below implements a batch processing approach that:
1. Divides the buildings into manageable batches of 1,000 points each
2. Converts each batch into an Earth Engine FeatureCollection for server-side processing
3. Uses reduceRegions() with a first() reducer to extract the embedding values at each building’s coordinate
4. Combines results from all batches into a single dataset
Each building point extracts the embedding value from the single 10×10 meter pixel that contains that coordinate. This means the “building” embedding actually represents the spectral characteristics of whatever land cover exists within that pixel, which may include the building roof, surrounding vegetation, pavement, or other features depending on the building size and the precise coordinate location within the pixel.
This spatial relationship is crucial to understand when interpreting results: smaller buildings may be represented by pixels that capture significant non-building features, while larger buildings are more likely to have pixels dominated by roof characteristics.
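If this single-pixel behaviour is a concern for your own application, one option (not used in this tutorial) is to average the embedding over a small buffer around each point rather than taking the one containing pixel. The sketch below illustrates the idea for a single example building; the 15-metre buffer distance is simply a suggestion.
# Illustrative alternative (not run in this tutorial): average the embedding
# over a 15 m buffer around the first building rather than sampling the
# single pixel that contains its coordinate
example_point = ee.Geometry.Point([buildings['lon'].iloc[0], buildings['lat'].iloc[0]])
example_fc = ee.FeatureCollection([ee.Feature(example_point.buffer(15), {'index': 0})])
mean_embedding = year_images[2024].reduceRegions(
    collection=example_fc,
    reducer=ee.Reducer.mean(),  # average the 64 band values within the buffer
    scale=10
)
print(mean_embedding.first().getInfo()['properties'])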
# Parameters
year = 2024
batch_size = 1000 # batch size
# Get embedding image
embedding_collection = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
embedding_image = embedding_collection.filterDate(f"{year}-01-01", f"{year}-12-31").mosaic()
# Storage for results
all_results = []
# Split into batches
n_batches = int(np.ceil(len(buildings) / batch_size))
print(f"Total points: {len(buildings)}, processing in {n_batches} batches of {batch_size}")
for i in range(n_batches):
# Select the current batch of buildings for processing
start = i * batch_size
end = min((i + 1) * batch_size, len(buildings))
batch = buildings.iloc[start:end]
print(f"Processing batch {i+1}/{n_batches} with {len(batch)} points...")
# Build FeatureCollection
features = [] # Initialise an empty list to store Earth Engine features
for idx, row in batch.iterrows(): # Iterate over each row
point = ee.Geometry.Point([row['lon'], row['lat']]) # Create a Point geometry
feature = ee.Feature(point, {"index": idx}) # Create an Earth Engine Feature
features.append(feature) # Add the feature to the list
# Convert the list of features to an Earth Engine FeatureCollection
fc = ee.FeatureCollection(features)
# Extract embeddings for each building point in the batch
extracted = embedding_image.reduceRegions(
collection=fc, # The FeatureCollection of building points
reducer=ee.Reducer.first(), # Extract the first value for each embedding band at each point
scale=10, # Use 10-meter pixel resolution (matches embedding data)
tileScale=4 # Increase tileScale to handle larger regions and avoid memory errors
)
# Convert to DataFrame
results = extracted.getInfo() # Retrieve the results into a Python dictionary
embedding_dict = {} # Initialise an empty dictionary
for feature in results['features']: # Iterate over each feature (building point) in the results
props = feature['properties'] # Extract the properties (embedding values and index) for this feature
idx = props.pop("index") # Remove and get the 'index' property (original building index)
band_values = {k: v for k, v in props.items() if not k.startswith("system:")} # Filter out system fields, keep only embedding bands
embedding_dict[idx] = band_values # Store the embedding values in the dictionary, keyed by building index
# Convert the embedding_dict (indexed by building index) to a DataFrame
batch_df = pd.DataFrame.from_dict(embedding_dict, orient="index")
batch_df.index.name = "index" # Set the index name for merging with the batch
# Merge with original batch
merged = batch.reset_index().merge(batch_df, left_on="index", right_index=True, how="left").drop(columns="index")
all_results.append(merged)
# Combine all batches
result = pd.concat(all_results, ignore_index=True)
# Output parquet file
result.to_parquet('./data/buildings_with_embeddings.parquet', index=False)
Step 6: Identify the Number of Clusters and Fit K-Means Clustering to the Buildings
The batch processing code above was executed to create a comprehensive dataset linking each building location with its corresponding satellite embeddings. The results are stored in a Parquet file, which is loaded as follows.
buildings_with_embeddings = pd.read_parquet('./data/buildings_with_embeddings.parquet')
To determine whether the five-cluster solution identified from regional pixel sampling also applies to our building-specific dataset, we apply the same clustergram analysis to the building embeddings. This comparison is important because building locations represent a spatial subset of the broader landscape and may exhibit different clustering patterns than randomly sampled pixels.
# Create clustergram to visualise optimal number of clusters
embedding_cols = [f"A{str(i).zfill(2)}" for i in range(64)]
cgram = Clustergram(range(1, 12), n_init=10, random_state=42, verbose=False)
cgram.fit(buildings_with_embeddings[embedding_cols])
cgram.plot(figsize=(8, 5))
Results: The Clustergram analysis confirms that five clusters remain an appropriate partitioning for the building-level data, with cluster assignments showing stability across different values of k and clear separation patterns similar to those observed in the regional pixel analysis. This consistency suggests that the embedding patterns we identified from random sampling effectively capture the dominant building types present in the Liverpool City Region.
Given the smaller size of the building data we will cluster locally within Python. K-means is therefore applied directly to buildings_with_embeddings, using 1,000 random initialisations with an objective of finding 5 clusters. The repeated initialisations help ensure we find the global optimum rather than getting trapped in local minima, which is particularly important given the 64-dimensional embedding space where K-means can be sensitive to initial centroid placement.
# Fit K-Means with 5 clusters and 1000 initialisations
kmeans = KMeans(n_clusters=5, init="random", n_init=1000, random_state=42)
labels = kmeans.fit_predict(buildings_with_embeddings[embedding_cols])
# Map numeric labels to letters
label_map = {i: chr(65 + i) for i in range(5)} # 0->A, 1->B, etc.
letter_labels = [label_map[label] for label in labels]
# Create output table with TOID and assigned cluster letter
output_table = buildings_with_embeddings[['TOID']].copy()
output_table['cluster'] = letter_labels
output_table.to_parquet('./data/output_table.parquet', index=False) # Export the clustering results
The output from the cluster analysis can then be used to create a geographic file that can be mapped.
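Before creating the geographic file, a quick check of how the buildings are distributed across the five clusters can be useful; a minimal sketch using the output_table created above:
# Count how many buildings fall into each cluster (A-E)
print(output_table['cluster'].value_counts().sort_index())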
# Create a GeoDataFrame with the clustering results
gdf = gpd.GeoDataFrame(
buildings_with_embeddings[['TOID', 'lon', 'lat']].copy(),
geometry=gpd.points_from_xy(buildings_with_embeddings['lon'], buildings_with_embeddings['lat']),
crs="EPSG:4326"
)
# Merge with output_table on TOID
gdf = gdf.merge(output_table, on='TOID', how='left')
Step 7: Map the Results
The following code creates a GeoPackage file that can be mapped in QGIS.
gdf.to_file('./data/buildings_with_clusters.gpkg', layer='buildings_clusters', driver='GPKG')
We can also visualise these results on the following interactive map:
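The interactive maps in this tutorial are created with keplergl (see the package list above). A minimal sketch of how such a map can be generated inside a notebook is shown below; the layer name and output filename are suggestions, and the styling of the published map will differ.
from keplergl import KeplerGl  # Jupyter widget for interactive maps
# Create an interactive map and add the clustered buildings as a layer
building_map = KeplerGl(height=600)
building_map.add_data(data=gdf.copy(), name='Building clusters')
# Save a standalone copy that can be opened in a browser, then display the widget
building_map.save_to_html(file_name='./images/buildings_clusters_map.html')
building_map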
Explore the Structure of the Embeddings and Clusters using UMAP
UMAP (Uniform Manifold Approximation and Projection)4 is a dimensionality reduction technique that converts high-dimensional data into 2D or 3D representations for visualisation (McInnes, Healy, and Melville 2020). In our case, UMAP takes the 64-dimensional embedding vectors and projects them onto a 2D plot while preserving the relationships between data points—locations that are similar in the original 64-dimensional space will appear close together in the 2D visualisation.
4 For a really nice visual explanation of UMAP vs. t-SNE, an alternative method, see this blog.
This technique is particularly valuable for understanding cluster quality and separation. Well-separated clusters in the UMAP plot indicate that the clustering algorithm has identified genuinely distinct building types, while overlapping clusters suggest that some building types may be too similar to distinguish reliably using the satellite embeddings.
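Alongside the visual UMAP check described below, cluster separation can also be quantified directly in the original 64-dimensional space using a silhouette score. The sketch below uses the k-means labels fitted earlier; the cosine metric and the 10,000-building subsample are suggestions to keep the runtime short.
from sklearn.metrics import silhouette_score
# Silhouette scores range from -1 to 1, with higher values indicating
# better-separated clusters; computed on a random subsample for speed
score = silhouette_score(
    buildings_with_embeddings[embedding_cols],
    labels,
    metric='cosine',
    sample_size=10000,
    random_state=42
)
print(f"Silhouette score (cosine, 10,000-building subsample): {score:.3f}")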
First we create a new DataFrame that contains the building embeddings and the assigned clusters that will be used in the UMAP visualisation.
# Prepare a DataFrame for UMAP visualisation:
umap_DF = buildings_with_embeddings[[f"A{str(i).zfill(2)}" for i in range(64)] + ["TOID"]].merge(
gdf[["TOID", "cluster"]], on="TOID", how="left"
)
Step 8: Fit UMAP and Visualise the Results
The first stage is to apply UMAP using a number of different parameters that control various aspects of how the model is applied:
- n_neighbors - Controls the balance between local and global structure in the data. Small values emphasise local clusters, whereas large values (e.g. 50+) preserve more global structure but may blur small clusters.
- min_dist - The minimum distance between points in the low-dimensional space. Low values (close to 0) allow for very tight clusters, whereas higher values (e.g. 0.5) enforce more spread-out embeddings.
- n_components - The number of output dimensions (i.e. 2 for 2D, 3 for 3D).
- random_state - Sets a random seed for reproducibility, which ensures that running UMAP multiple times gives the same layout.
- metric - The distance metric used to measure similarity between data points in the input space. Cosine works well for embeddings (text/image) since it cares about direction, not magnitude.
- init - How to initialise the embedding. spectral sets a Laplacian Eigenmap (Belkin and Niyogi 2003) initialisation5.
- n_epochs - The number of training iterations, with more epochs producing a more stable embedding but taking a longer runtime.
- spread - Works with min_dist to control how spread out the clusters are. A larger spread makes clusters take up more space in the embedding.
- verbose - If True, prints progress information while running.
5 Laplacian eigenmap initialisation works as follows in simple terms:
- Build a graph - Connect each data point to its nearest neighbors, creating a network where nearby points are linked
- Create a Laplacian matrix - This captures the structure of the neighbourhood graph in a mathematical form
- Find the eigenvectors - Compute the smallest eigenvectors of this Laplacian matrix
- Use as coordinates - These eigenvectors become the initial coordinates for the data points in the lower-dimensional space
This works because the eigenvectors corresponding to the smallest eigenvalues naturally arrange points so that neighbors in the original high-dimensional space remain close together in the new low-dimensional representation.
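The code below is not what UMAP runs internally, but it illustrates the same idea using scikit-learn’s SpectralEmbedding class, which implements Laplacian eigenmaps: a nearest-neighbour graph is built, its Laplacian formed, and the smallest eigenvectors are used as low-dimensional coordinates. The toy data are random and purely illustrative.
import numpy as np
from sklearn.manifold import SpectralEmbedding  # a Laplacian eigenmap implementation
# Toy data: 200 random points in a 64-dimensional space (illustrative only)
rng = np.random.default_rng(42)
toy_points = rng.normal(size=(200, 64))
# Build a nearest-neighbour graph, form its Laplacian and use the smallest
# eigenvectors as 2D coordinates - the same idea as UMAP's spectral initialisation
spectral = SpectralEmbedding(n_components=2, n_neighbors=15, random_state=42)
coords_2d = spectral.fit_transform(toy_points)
print(coords_2d.shape)  # (200, 2)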
# Extract the 64 embedding columns
embedding_cols = [f"A{str(i).zfill(2)}" for i in range(64)]
embeddings_data = umap_DF[embedding_cols].values
# Apply UMAP to reduce 64 dimensions to 2D
umap_reducer = umap.UMAP(
    n_neighbors=30, # Number of neighbours
    min_dist=0.0, # Allow points to be closer together
    n_components=2, # Reduce to 2D for visualisation
    random_state=42, # For reproducible results
    metric='cosine', # Cosine similarity works well for embeddings
    init='random', # Use random initialisation
    n_epochs=500, # More epochs for better convergence
    spread=1.0, # Controls how tightly points are packed
    verbose=False # Set to True to print progress information
)
# Fit UMAP and transform the data
umap_embedding = umap_reducer.fit_transform(embeddings_data)
# Create a DataFrame with UMAP coordinates and existing cluster labels
umap_results = pd.DataFrame({
'UMAP1': umap_embedding[:, 0],
'UMAP2': umap_embedding[:, 1],
'Cluster': umap_DF['cluster'], # Use existing cluster column
'TOID': umap_DF['TOID'] # Use existing TOID column
})
# Save the UMAP results
umap_results.to_parquet('./data/buildings_umap_results.parquet', index=False)
We can then visualise the results on an interactive plot:
# Define colours for each cluster - match the earlier map
colours = {
'A': '#8dd3c7',
'B': '#ffffb3',
'C': '#bebada',
'D': '#fb8072',
'E': '#80b1d3'
}
# Create interactive plot
fig_interactive = px.scatter(
umap_results,
x='UMAP1',
y='UMAP2',
color='Cluster',
color_discrete_map=colours,
hover_data=['TOID']
)
fig_interactive.update_traces(marker=dict(size=2, opacity=0.7))
fig_interactive.show()
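If you want a standalone copy of this interactive plot to share outside the notebook, plotly figures can be exported to HTML; the output path below is just a suggestion.
# Save the interactive UMAP scatter plot as a standalone HTML file
fig_interactive.write_html('./images/umap_clusters.html')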