Using Google’s AlphaEarth Foundations Embeddings for Building Analysis: From Satellite Data to Predictive Urban Models
!pip install earthengine-api pyarrow scikit-learn pandas geopandas clustergram keplergl umap-learn seaborn shap duckdb plotly # install packages
Learning Objectives
This training pack demonstrates how to extract and analyse Google’s AlphaEarth Foundations satellite embeddings at the building level, progressing from basic data extraction through advanced predictive modeling. The tutorial uses Liverpool City Region as a case study to illustrate a complete analytic workflow that combines satellite AI with urban datasets. Note: you will need a Google Account to access these data.
All of the data used for this tutorial can be downloaded from the Geographic Data Service. These should be placed in a folder called data in your working directory.
By the end of this training pack you will be able to:
- Data Access and Extraction:
- Describe what satellite embeddings are and how they function as numerical “fingerprints” of geographic locations
- Understand the structure and capabilities of Google’s AlphaEarth Foundations model and its 64-dimensional embeddings at 10-meter resolution
- Setup, authenticate and initialise Google Earth Engine API from Python
- Access Google’s AlphaEarth satellite embedding collection for the time periods 2017-2024
- Extract pixel-level embeddings for large geographic regions and specific building locations
- Clustering and Pattern Recognition:
- Apply clustergram analysis to determine the optimal number of clusters
- Perform k-means clustering on satellite embedding data to identify building typologies
- Interpret cluster patterns and validate results using external datasets
- Advanced Analytic Techniques:
- Use UMAP (Uniform Manifold Approximation and Projection) to visualise high-dimensional embeddings in 2D
- Calculate cosine similarity to identify buildings with similar characteristics
- Integrate satellite embeddings with administrative datasets (Energy Performance Certificates)
- Create propensity indices to characterise cluster distributions across building attributes
- Predictive Modeling and Evaluation:
- Build a Random Forest model to predict building characteristics using satellite embeddings
- Evaluate model performance using confusion matrices, classification reports, and feature importance analysis
- Understand the limitations and appropriate applications of embedding-based prediction
- Visualisation and Communication:
- Export results to various formats (GeoPackage, Parquet) for use in GIS software
- Generate publication-ready plots and statistical summaries
Conceptual Understanding
AI-powered satellite analysis represents a fundamental shift from traditional remote sensing approaches. Unlike conventional methods that analyse individual satellite images in isolation, systems like AlphaEarth Foundations create unified digital representations by integrating multiple data sources into a set of embeddings. These embeddings aim to capture complex spatial and temporal patterns and support a wide range of applications in urban environments.
What are AlphaEarth Embeddings?
AlphaEarth Foundations (Brown et al. 2025) is an artificial intelligence model developed by Google DeepMind that functions as a comprehensive Earth observation system. Rather than requiring users to process raw satellite imagery, the model provides pre-computed embeddings1 that distill complex environmental information into 64 numerical values for every 10×10 meter pixel across Earth’s terrestrial and coastal zones.
1 Satellite embeddings are AI-generated feature representations of satellite imagery. Think of them as numerical “fingerprints” that capture the visual characteristics of a location - things like vegetation patterns, building density, water presence, and land use. Google provides pre-computed embeddings for the entire Earth, saving you from processing raw satellite images yourself.
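To make the idea of a numerical “fingerprint” concrete, the short sketch below builds two hypothetical 64-value vectors and compares them with cosine similarity, a measure used later in this tutorial. The values here are random and purely illustrative; real embeddings are pre-computed by Google and extracted in the steps that follow.
import numpy as np
# Two hypothetical 64-dimensional "fingerprints" (random, illustrative only -
# real AlphaEarth embeddings are extracted from Earth Engine later on)
rng = np.random.default_rng(0)
location_a = rng.normal(size=64)
location_b = rng.normal(size=64)
# Normalise to unit length so the dot product equals the cosine similarity
location_a /= np.linalg.norm(location_a)
location_b /= np.linalg.norm(location_b)
print(f"Values per location: {location_a.size}")
print(f"Cosine similarity between the two locations: {location_a @ location_b:.3f}")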
Setup
We recommend using at least Python 3.11, and there are a number of prerequisites and libraries required for this tutorial.
- earthengine-api is the official client library for Google Earth Engine. It allows you to access, process, and analyse geographic datasets hosted on Earth Engine directly from Python code.
- pandas is used for data manipulation and analysis and provides data structures like DataFrame.
- geopandas makes working with geographic data in pandas easier. It extends pandas DataFrames to support geometry columns (points, lines, polygons), enabling spatial operations, reading/writing GIS file formats, and easy mapping.
- pyarrow provides a columnar in-memory data platform that enables fast reading and writing of various data formats (especially Parquet files) between different systems, and accelerates analytic operations on large datasets.
- scikit-learn is a machine learning library for data mining and analysis.
- clustergram helps visualise the most effective partitioning of data using cluster analysis (k-means).
- umap-learn - a dimensionality reduction technique that creates visualisations of high-dimensional data.
- seaborn - a data visualisation library.
- plotly - an interactive plotting library.
- matplotlib - a fundamental plotting library.
- numpy - used for numerical operations and mathematical calculations.
- requests - downloading images from Google Earth Engine URLs and other web requests.
- PIL (Python Imaging Library) - used for image processing and display.
- duckdb - querying multiple files as if they were a single database.
- keplergl - creates the interactive maps.
import io # Import for handling input/output operations
from io import StringIO # allows you to treat strings as file-like objects
import ee # Imports the Earth Engine API functionality
import pandas as pd # Import the pandas functionality
import geopandas as gpd # Import the geopandas functionality
from sklearn.cluster import KMeans # Import KMeans
import matplotlib.pyplot as plt # Import for plotting
from clustergram import Clustergram # Import for visualising cluster structure
import numpy as np # Import for numerical operations
from IPython.display import Image, display # Import image and display utilities
import duckdb # Enables duckdb functionality
import plotly.express as px # Import for interactive plotting and visualisation
import umap # Import for dimensionality reduction using UMAP
import seaborn as sns # Import for statistical data visualisation
import requests # Import for making HTTP requests
from PIL import Image as PILImage # Import PIL Image functionality
import os # Import for operating system interface
from sklearn.ensemble import RandomForestClassifier # Import Random Forest classifier
from sklearn.model_selection import train_test_split # Import for splitting datasets
from sklearn.metrics import classification_report, confusion_matrix # Import for model evaluation metrics
Next we need to initiate the authentication process for the Google Earth Engine (GEE) Python API. When you run this command, it prompts you to log in to your Google account and grant the necessary permissions for the script or application to access your Earth Engine resources.
ee.Authenticate() # Follow the prompts to log in with your Google account
Setup and Pixel Level Analysis
The first part of this tutorial will guide you through the various steps to extract embeddings for the Liverpool City Region Combined Authority (LCRCA) area and examine the structure of these data using a simple cluster analysis.
Step 1: Initialise Earth Engine
The following code initialises the Google Earth Engine API.
# Initialise Earth Engine (this project is called - alpha-tutorial-477812)
ee.Initialize(project='alpha-tutorial-477812') # You may also need to specify your project, e.g. ee.Initialize(project='your_project_ID_here')
Step 2: Import the Location Data
The following code creates a GeoPandas GeoDataFrame with the location of each property and reprojects all geometries into the WGS84 (EPSG:4326).
# Read the GeoParquet file containing building locations
buildings = gpd.read_parquet('./data/lcrca_buildings.parquet')
# Convert buildings to WGS84 coordinate system (EPSG:4326)
buildings = buildings.to_crs(epsg=4326)
buildings["lon"] = buildings.geometry.x
buildings["lat"] = buildings.geometry.yStep 3: Access Google’s Satellite Embedding Collection
The following code queries Google Earth Engine’s satellite embedding collection to check if satellite image embeddings are available for a specified year (2017-2024). The processing loop iterates through each available year from 2017 to 2024, applying a temporal filter to isolate embeddings for the specified 12-month period.
The year_images dictionary serves as a container where each year maps to its corresponding embedding image, making it straightforward to access specific years’ data later in the workflow.
# Read GeoPackage into a GeoDataFrame and convert to WGS84 (EPSG:4326)
lcrca = gpd.read_file('./data/lcrca.gpkg')
# Reproject to WGS84
lcrca = lcrca.to_crs(epsg=4326)
# Get the boundary coordinates
coords = lcrca.geometry.iloc[0].__geo_interface__
# Convert to Earth Engine geometry
lcrca_geometry = ee.Geometry(coords)
# Access the embedding collection
embedding_collection = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
# Work with each year separately
years = range(2017, 2025)
year_images = {}
# Loop through each year from 2017 to 2024 that overlap with the LCRCA boundary
for year in years:
year_filtered = embedding_collection \
.filterDate(f'{year}-01-01', f'{year}-12-31') \
.filterBounds(lcrca_geometry)
# Check if any images exist for this year (avoid processing empty collections)
if year_filtered.size().getInfo() > 0:
# Create a mosaic from all filtered images and clip to the LCRCA boundary
        year_images[year] = year_filtered.mosaic().clip(lcrca_geometry)
Step 4: Explore the Structure of the Embeddings at the Pixel Level
Using Earth Engine, we can quantify the scale of data within our study region by calculating the total number of 10-meter pixels. For the Liverpool City Region in 2024, this analysis reveals approximately 12.26 million pixels within the boundary. Since each pixel contains 64 embedding values, this represents a dataset with over 784 million individual data points (12.26 million × 64 dimensions). This scale illustrates both the opportunity and challenge of working with satellite embeddings.
# Select 2024 as the year
emb_img = year_images[2024]
# Calculate the total number of pixels in the region
pixel_count = emb_img.select(0).reduceRegion(
reducer=ee.Reducer.count(), # Count the number of valid pixels
geometry=lcrca_geometry, # Within the LCRCA boundary
scale=10, # At 10-meter resolution
maxPixels=1e9 # Allow up to 1 billion pixels to be processed
)
# Get and print the count
total_pixels = pixel_count.getInfo()
total_count = list(total_pixels.values())[0]
print(f"Total pixels in region at 10m scale: {total_count:,}")Total pixels in region at 10m scale: 12,259,151
The following code block demonstrates how to extract pixel-level embeddings directly from Google Earth Engine, but we do not run this code in the tutorial. If executed, this function would:
- Randomly sample pixels from within the Liverpool City Region boundary in batches of 50,000 (the example call below requests 1.2 million pixels in total)
- Extract all 64 embedding values for each sampled pixel location
- Export the results as CSV files to your Google Drive (in a folder called “EE_exports”)
Why we skip this step: With 12.26 million pixels in the region, processing even a sample of 50,000 pixels through Earth Engine’s export system is time-consuming and requires managing multiple large CSV files.
What we provide instead: To keep this tutorial focused on analysis techniques rather than data processing, we provide the sampled pixel data as a ready-to-use Parquet file that you can load directly.
When you might use this code: If you’re working with your own study area or need a different sample size, you can adapt this function for your specific requirements. The code is included here for transparency and to support your independent applications.
def export_in_batches(image, geometry, total_pixels, batch_size=50000):
"""Export data in smaller batches"""
num_batches = (total_pixels + batch_size - 1) // batch_size
for i in range(num_batches):
# Use different seeds for each batch to get different samples
samples = image.sample(
region=geometry,
scale=10,
numPixels=min(batch_size, total_pixels - i * batch_size),
seed=42 + i, # Different seed for each batch
geometries=True
)
export_task = ee.batch.Export.table.toDrive(
collection=samples,
description=f'sampled_data_export_batch_{i+1}',
folder='EE_exports',
fileNamePrefix=f'lcrca_samples_batch_{i+1}',
fileFormat='CSV'
)
export_task.start()
print(f"Started batch {i+1}/{num_batches}")
# Usage
export_in_batches(emb_img, lcrca_geometry, 1200000, batch_size=50000)
Data Consolidation Process (Not Executed in Tutorial)
Once the CSV export batches are complete, they need to be consolidated into a single dataset for analysis. The code below demonstrates how DuckDB can efficiently combine multiple CSV files into one DataFrame and export the result as a Parquet file.
LCRCA_Samples = duckdb.sql("SELECT * FROM read_csv_auto('./data/samples/lcrca_samples_batch_*.csv')").df() # Create a single DataFrame
LCRCA_Samples = LCRCA_Samples.iloc[:, 1:-1] # Remove the first and last column
LCRCA_Samples.to_parquet('./data/LCRCA_Samples.parquet') # Export to a Parquet file
Why DuckDB is useful here: DuckDB treats multiple CSV files as if they were a single database table, eliminating the need to manually loop through files or manage memory for large datasets. This approach is particularly valuable when working with the batch export system from Google Earth Engine, which creates multiple files that need to be recombined.
Note: Like the export code above, this consolidation step is not executed in the tutorial since we provide the pre-processed Parquet file. However, this pattern is essential for real-world applications where you generate your own export batches.
We can read the parquet file into a DataFrame as follows and continue with our analysis.
LCRCA_Samples = pd.read_parquet('./data/LCRCA_Samples.parquet')
Using the sample we can examine the structure of the embeddings by fitting a Clustergram. This is a visualisation technique that shows how cluster assignments change as you increase the number of clusters (k). This helps you to understand the structure in very high-dimensional space in the following ways:
- Optimal k selection: Helps you to determine the right number of clusters by visualising how cleanly clusters separate
- Cluster stability: Shows which clusters persist across different k values (stable long lines) vs. those which are artifacts of over-clustering (short, erratic lines)
- Split patterns: Reveals the natural hierarchy in the data by showing how clusters subdivide
From the diagram, a five-cluster solution looks to be an optimal initial choice.
# Create clustergram to visualise optimal number of clusters
cgram = Clustergram(range(1, 12), init="random", n_init=10, random_state=42, verbose=False)
cgram.fit(LCRCA_Samples)
cgram.plot(figsize=(8, 5))
With the clustergram indicating that five clusters provide an optimal partitioning of the data, we can now apply K-means clustering to visualise the spatial distribution of building types. The following approach implements clustering directly within Google Earth Engine, which offers significant computational advantages over downloading the full dataset locally2.
2 This server-side approach is computationally efficient because it processes the high-dimensional data where it resides (on Google’s servers) rather than transferring millions of 64-dimensional vectors to your local machine. The result is a single-band image where each pixel is assigned a cluster label (0-4), which can then be visualised and exported as needed.
The code uses Earth Engine’s built-in clustering capabilities to:
1. Sample 100,000 training pixels from across the region to build the clustering model
2. Apply the trained K-means classifier to all 12.26 million pixels in the study area
3. Return only the final cluster assignments rather than the raw embedding values
# Sample pixels across the region - this builds an initial training model which is then applied to all pixels
training = emb_img.sample(
region=lcrca_geometry, # Sample within the LCRCA boundary
scale=10, # Use 10-meter pixel resolution (matches embedding data)
numPixels=100000, # Extract 100,000 random pixels for training
seed=42 # Set random seed for reproducible sampling
)
# Train K-Means with 5 clusters
clusterer = ee.Clusterer.wekaKMeans(5).train(training)
# Apply to the image
result = emb_img.cluster(clusterer).clip(lcrca_geometry)
# Define colours for visualising the five clusters
cluster_colours = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
# Create a palette for the clusters (GEE format)
palette = cluster_colours
# Set visualisation parameters
vis_params = {
'min': 0,
'max': 4,
'palette': palette
}
# Export as static image
url = result.getThumbURL({
'dimensions': 1024, # Image width/height in pixels
'region': lcrca_geometry,
'min': 0,
'max': 4,
'palette': palette,
'format': 'png',
'crs': 'EPSG:27700'
})
# Download the image locally
response = requests.get(url)
response.raise_for_status() # Raise an exception for bad status codes
# Create a local filename
local_filename = "./images/lcrca_clusters.png"
# Save the image locally
with open(local_filename, 'wb') as f:
f.write(response.content)
# Display the local image
display(Image(filename=local_filename, width=750))
Building Cluster Analysis
The following section extends the analysis to illustrate how geographic locations (in this case buildings) can be appended with pixel-level embeddings across a regional extent. In this example we are going to use a database of TOIDs (Topographic Identifiers) from the Ordnance Survey3.
3 A TOID (Topographic Identifier) is a unique and persistent identifier for each and every feature found in OS MasterMap products. Because the raw dataset contains identifiers for a wide range of landscape and built environment features, these were limited to only those TOIDs with an associated UPRN (Unique Property Reference Number), an authoritative identifier used to uniquely identify addressable locations.
Step 5: Extract Pixel Values for Each Building
Having imported the building locations earlier, we can now extract pixel-level embeddings for each individual building. This process requires careful handling of the large dataset: with over 640,000 building points, attempting to process all locations simultaneously would exceed Google Earth Engine’s processing limits.
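Before running the batch extraction it is worth confirming the size of the building table and the columns relied on below; this quick check assumes the lcrca_buildings.parquet file includes the TOID identifier used later in the tutorial.
# Check the number of building points and preview the identifier and coordinates
print(f"Total buildings: {len(buildings):,}")
print(buildings[['TOID', 'lon', 'lat']].head())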
Batch Processing Strategy:
The code below implements a batch processing approach that:
1. Divides the buildings into manageable batches of 1,000 points each
2. Converts each batch into an Earth Engine FeatureCollection for server-side processing
3. Uses reduceRegions() with a first() reducer to extract the embedding values at each building’s coordinate
4. Combines results from all batches into a single dataset
Each building point extracts the embedding value from the single 10×10 meter pixel that contains that coordinate. This means the “building” embedding actually represents the spectral characteristics of whatever land cover exists within that pixel, which may include the building roof, surrounding vegetation, pavement, or other features depending on the building size and the precise coordinate location within the pixel.
This spatial relationship is crucial to understand when interpreting results: smaller buildings may be represented by pixels that capture significant non-building features, while larger buildings are more likely to have pixels dominated by roof characteristics.
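If this single-pixel behaviour is a concern for your own application, one option (not used in this tutorial) is to average the embedding over a small buffer around each point rather than taking the one containing pixel. The sketch below illustrates the idea for a single example building; the 15-metre buffer distance is simply a suggestion.
# Illustrative alternative (not run in this tutorial): average the embedding
# over a 15 m buffer around the first building rather than sampling the
# single pixel that contains its coordinate
example_point = ee.Geometry.Point([buildings['lon'].iloc[0], buildings['lat'].iloc[0]])
example_fc = ee.FeatureCollection([ee.Feature(example_point.buffer(15), {'index': 0})])
mean_embedding = year_images[2024].reduceRegions(
    collection=example_fc,
    reducer=ee.Reducer.mean(),  # average the 64 band values within the buffer
    scale=10
)
print(mean_embedding.first().getInfo()['properties'])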
# Parameters
year = 2024
batch_size = 1000 # batch size
# Get embedding image
embedding_collection = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
embedding_image = embedding_collection.filterDate(f"{year}-01-01", f"{year}-12-31").mosaic()
# Storage for results
all_results = []
# Split into batches
n_batches = int(np.ceil(len(buildings) / batch_size))
print(f"Total points: {len(buildings)}, processing in {n_batches} batches of {batch_size}")
for i in range(n_batches):
# Select the current batch of buildings for processing
start = i * batch_size
end = min((i + 1) * batch_size, len(buildings))
batch = buildings.iloc[start:end]
print(f"Processing batch {i+1}/{n_batches} with {len(batch)} points...")
# Build FeatureCollection
features = [] # Initialise an empty list to store Earth Engine features
for idx, row in batch.iterrows(): # Iterate over each row
point = ee.Geometry.Point([row['lon'], row['lat']]) # Create a Point geometry
feature = ee.Feature(point, {"index": idx}) # Create an Earth Engine Feature
features.append(feature) # Add the feature to the list
# Convert the list of features to an Earth Engine FeatureCollection
fc = ee.FeatureCollection(features)
# Extract embeddings for each building point in the batch
extracted = embedding_image.reduceRegions(
collection=fc, # The FeatureCollection of building points
reducer=ee.Reducer.first(), # Extract the first value for each embedding band at each point
scale=10, # Use 10-meter pixel resolution (matches embedding data)
tileScale=4 # Increase tileScale to handle larger regions and avoid memory errors
)
# Convert to DataFrame
results = extracted.getInfo() # Retrieve the results into a Python dictionary
embedding_dict = {} # Initialise an empty dictionary
for feature in results['features']: # Iterate over each feature (building point) in the results
props = feature['properties'] # Extract the properties (embedding values and index) for this feature
idx = props.pop("index") # Remove and get the 'index' property (original building index)
band_values = {k: v for k, v in props.items() if not k.startswith("system:")} # Filter out system fields, keep only embedding bands
embedding_dict[idx] = band_values # Store the embedding values in the dictionary, keyed by building index
# Convert the embedding_dict (indexed by building index) to a DataFrame
batch_df = pd.DataFrame.from_dict(embedding_dict, orient="index")
batch_df.index.name = "index" # Set the index name for merging with the batch
# Merge with original batch
merged = batch.reset_index().merge(batch_df, left_on="index", right_index=True, how="left").drop(columns="index")
all_results.append(merged)
# Combine all batches
result = pd.concat(all_results, ignore_index=True)
# Output parquet file
result.to_parquet('./data/buildings_with_embeddings.parquet', index=False)
Step 6: Identify the Number of Clusters and Fit K-Means Clustering to the Buildings
The batch processing code above was executed to create a comprehensive dataset linking each building location with its corresponding satellite embeddings. The results are stored in a Parquet file, which is loaded as follows.
buildings_with_embeddings = pd.read_parquet('./data/buildings_with_embeddings.parquet')
To determine whether the five-cluster solution identified from regional pixel sampling also applies to our building-specific dataset, we apply the same clustergram analysis to the building embeddings. This comparison is important because building locations represent a spatial subset of the broader landscape and may exhibit different clustering patterns than randomly sampled pixels.
# Create clustergram to visualise optimal number of clusters
embedding_cols = [f"A{str(i).zfill(2)}" for i in range(64)]
cgram = Clustergram(range(1, 12), n_init=10, random_state=42, verbose=False)
cgram.fit(buildings_with_embeddings[embedding_cols])
cgram.plot(figsize=(8, 5))
Results: The Clustergram analysis confirms that five clusters remain an appropriate partitioning for the building-level data, with cluster assignments showing stability across different values of k and clear separation patterns similar to those observed in the regional pixel analysis. This consistency suggests that the embedding patterns we identified from random sampling effectively capture the dominant building types present in the Liverpool City Region.
Given the smaller size of the building data we will cluster locally within Python. K-means is therefore applied directly to buildings_with_embeddings, using 1,000 random initialisations with an objective of finding 5 clusters. The repeated initialisations help ensure we find the global optimum rather than getting trapped in local minima, which is particularly important given the 64-dimensional embedding space where K-means can be sensitive to initial centroid placement.
# Fit K-Means with 5 clusters and 1000 initialisations
kmeans = KMeans(n_clusters=5, init="random", n_init=1000, random_state=42)
labels = kmeans.fit_predict(buildings_with_embeddings[embedding_cols])
# Map numeric labels to letters
label_map = {i: chr(65 + i) for i in range(5)} # 0->A, 1->B, etc.
letter_labels = [label_map[label] for label in labels]
# Create output table with TOID and assigned cluster letter
output_table = buildings_with_embeddings[['TOID']].copy()
output_table['cluster'] = letter_labels
output_table.to_parquet('./data/output_table.parquet', index=False) # Export the clustering results
The output from the cluster analysis can then be used to create a geographic file that can be mapped.
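Before creating the geographic file, a quick check of how the buildings are distributed across the five clusters can be useful; a minimal sketch using the output_table created above:
# Count how many buildings fall into each cluster (A-E)
print(output_table['cluster'].value_counts().sort_index())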
# Create a GeoDataFrame with the clustering results
gdf = gpd.GeoDataFrame(
buildings_with_embeddings[['TOID', 'lon', 'lat']].copy(),
geometry=gpd.points_from_xy(buildings_with_embeddings['lon'], buildings_with_embeddings['lat']),
crs="EPSG:4326"
)
# Merge with output_table on TOID
gdf = gdf.merge(output_table, on='TOID', how='left')
Step 7: Map the Results
The following code creates a GeoPackage file that can be mapped in QGIS.
gdf.to_file('./data/buildings_with_clusters.gpkg', layer='buildings_clusters', driver='GPKG')
We can also visualise these results on the following interactive map:
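The interactive maps in this tutorial are created with keplergl (see the package list above). A minimal sketch of how such a map can be generated inside a notebook is shown below; the layer name and output filename are suggestions, and the styling of the published map will differ.
from keplergl import KeplerGl  # Jupyter widget for interactive maps
# Create an interactive map and add the clustered buildings as a layer
building_map = KeplerGl(height=600)
building_map.add_data(data=gdf.copy(), name='Building clusters')
# Save a standalone copy that can be opened in a browser, then display the widget
building_map.save_to_html(file_name='./images/buildings_clusters_map.html')
building_map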
Explore the Structure of the Embeddings and Clusters using UMAP
UMAP (Uniform Manifold Approximation and Projection)4 is a dimensionality reduction technique that converts high-dimensional data into 2D or 3D representations for visualisation (McInnes, Healy, and Melville 2020). In our case, UMAP takes the 64-dimensional embedding vectors and projects them onto a 2D plot while preserving the relationships between data points—locations that are similar in the original 64-dimensional space will appear close together in the 2D visualisation.
4 For a really nice visual explanation of UMAP vs. t-SNE, an alternative method, see this blog.
This technique is particularly valuable for understanding cluster quality and separation. Well-separated clusters in the UMAP plot indicate that the clustering algorithm has identified genuinely distinct building types, while overlapping clusters suggest that some building types may be too similar to distinguish reliably using the satellite embeddings.
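Alongside the visual UMAP check described below, cluster separation can also be quantified directly in the original 64-dimensional space using a silhouette score. The sketch below uses the k-means labels fitted earlier; the cosine metric and the 10,000-building subsample are suggestions to keep the runtime short.
from sklearn.metrics import silhouette_score
# Silhouette scores range from -1 to 1, with higher values indicating
# better-separated clusters; computed on a random subsample for speed
score = silhouette_score(
    buildings_with_embeddings[embedding_cols],
    labels,
    metric='cosine',
    sample_size=10000,
    random_state=42
)
print(f"Silhouette score (cosine, 10,000-building subsample): {score:.3f}")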
First we create a new DataFrame that contains the building embeddings and the assigned clusters that will be used in the UMAP visualisation.
# Prepare a DataFrame for UMAP visualisation:
umap_DF = buildings_with_embeddings[[f"A{str(i).zfill(2)}" for i in range(64)] + ["TOID"]].merge(
gdf[["TOID", "cluster"]], on="TOID", how="left"
)
Step 8: Fit UMAP and Visualise the Results
The first stage is to apply UMAP using a number of different parameters that control various aspects of how the model is applied:
- n_neighbors - Controls the balance between local and global structure in the data. Small values emphasise local clusters, whereas large values (e.g. 50+) preserve more global structure but may blur small clusters.
- min_dist - The minimum distance between points in the low-dimensional space. Low values (close to 0) allow for very tight clusters, whereas higher values (e.g. 0.5) enforce more spread-out embeddings.
- n_components - The number of output dimensions (i.e. 2 for 2D, 3 for 3D).
- random_state - Sets a random seed for reproducibility, which ensures that running UMAP multiple times gives the same layout.
- metric - The distance metric used to measure similarity between data points in the input space. Cosine works well for embeddings (text/image) since it cares about direction, not magnitude.
- init - How to initialise the embedding. spectral sets a Laplacian Eigenmap (Belkin and Niyogi 2003) initialisation5.
- n_epochs - The number of training iterations, with more epochs producing a more stable embedding but taking a longer runtime.
- spread - Works with min_dist to control how spread out the clusters are. A larger spread makes clusters take up more space in the embedding.
- verbose - If True, prints progress information while running.
5 Laplacian eigenmap initialisation works as follows in simple terms:
- Build a graph - Connect each data point to its nearest neighbors, creating a network where nearby points are linked
- Create a Laplacian matrix - This captures the structure of the neighbourhood graph in a mathematical form
- Find the eigenvectors - Compute the smallest eigenvectors of this Laplacian matrix
- Use as coordinates - These eigenvectors become the initial coordinates for the data points in the lower-dimensional space
This works because the eigenvectors corresponding to the smallest eigenvalues naturally arrange points so that neighbors in the original high-dimensional space remain close together in the new low-dimensional representation.
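The code below is not what UMAP runs internally, but it illustrates the same idea using scikit-learn’s SpectralEmbedding class, which implements Laplacian eigenmaps: a nearest-neighbour graph is built, its Laplacian formed, and the smallest eigenvectors are used as low-dimensional coordinates. The toy data are random and purely illustrative.
import numpy as np
from sklearn.manifold import SpectralEmbedding  # a Laplacian eigenmap implementation
# Toy data: 200 random points in a 64-dimensional space (illustrative only)
rng = np.random.default_rng(42)
toy_points = rng.normal(size=(200, 64))
# Build a nearest-neighbour graph, form its Laplacian and use the smallest
# eigenvectors as 2D coordinates - the same idea as UMAP's spectral initialisation
spectral = SpectralEmbedding(n_components=2, n_neighbors=15, random_state=42)
coords_2d = spectral.fit_transform(toy_points)
print(coords_2d.shape)  # (200, 2)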
# Extract the 64 embedding columns
embedding_cols = [f"A{str(i).zfill(2)}" for i in range(64)]
embeddings_data = umap_DF[embedding_cols].values
# Apply UMAP to reduce 64 dimensions to 2D
umap_reducer = umap.UMAP(
    n_neighbors=30, # Number of neighbours
    min_dist=0.0, # Allow points to be closer together
    n_components=2, # Reduce to 2D for visualisation
    random_state=42, # For reproducible results
    metric='cosine', # Cosine similarity works well for embeddings
    init='random', # Use random initialisation
    n_epochs=500, # More epochs for better convergence
    spread=1.0, # Controls how tightly points are packed
    verbose=False # Set to True to print progress information
)
# Fit UMAP and transform the data
umap_embedding = umap_reducer.fit_transform(embeddings_data)
# Create a DataFrame with UMAP coordinates and existing cluster labels
umap_results = pd.DataFrame({
'UMAP1': umap_embedding[:, 0],
'UMAP2': umap_embedding[:, 1],
'Cluster': umap_DF['cluster'], # Use existing cluster column
'TOID': umap_DF['TOID'] # Use existing TOID column
})
# Save the UMAP results
umap_results.to_parquet('./data/buildings_umap_results.parquet', index=False)
We can then visualise the results on an interactive plot:
# Define colours for each cluster - match the earlier map
colours = {
'A': '#8dd3c7',
'B': '#ffffb3',
'C': '#bebada',
'D': '#fb8072',
'E': '#80b1d3'
}
# Create interactive plot
fig_interactive = px.scatter(
umap_results,
x='UMAP1',
y='UMAP2',
color='Cluster',
color_discrete_map=colours,
hover_data=['TOID']
)
fig_interactive.update_traces(marker=dict(size=2, opacity=0.7))
fig_interactive.show()
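If you want a standalone copy of this interactive plot to share outside the notebook, plotly figures can be exported to HTML; the output path below is just a suggestion.
# Save the interactive UMAP scatter plot as a standalone HTML file
fig_interactive.write_html('./images/umap_clusters.html')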