GeoCroissant Metadata

GeoCroissant GeoMind

GeoCroissant is the geospatial extension of the MLCommons Croissant standard for describing machine learning datasets. GeoMind generates GeoCroissant-compliant JSON-LD metadata for every Sentinel-2 scene it retrieves, making satellite imagery immediately ready for AI/ML pipelines - with a single natural-language prompt.


Introduction & Overview

Croissant is a metadata standard designed to describe datasets in a structured, machine-actionable way, improving how data is discovered, understood, and consumed by automated tools and AI/ML pipelines. Building on this foundation, GeoCroissant extends Croissant with GeoAI-specific concepts, including:

  • Spatial and temporal extent
  • Coordinate reference systems (CRS)
  • Tiling and grids
  • Geospatial assets (multi-band rasters, Zarr groups, STAC items)

These additions better support GeoAI and Earth Observation (EO) workflows, where interoperability and precise geospatial context are essential.

With the growing availability of EO data, there is increasing emphasis on making datasets FAIR (Findable, Accessible, Interoperable, Reusable) and machine-actionable for automated discovery and reuse. Consistently structured metadata is central to this goal, capturing key geospatial characteristics (spatial/temporal extent, CRS, resolution) alongside provenance and licensing to enable interoperability across platforms.

Geospatial initiatives such as the Group on Earth Observations (GEO) Data Sharing Principles, as well as broader frameworks like the European Open Science Cloud (EOSC), reinforce transparent, responsible data stewardship and cross-platform interoperability.

Prior standardisation efforts - DCAT, schema.org/Dataset, and CSV on the Web - established important foundations. GeoCroissant builds on all of them while adding the geospatial semantics that Earth observation data demands.


Why Metadata Matters

Metadata plays a critical role in making data meaningful and actionable. Without it, datasets can be challenging to interpret, leading to misunderstandings or underutilisation. Metadata provides context, ensures users understand the origin, purpose, and structure of data, enables efficient search and discovery, and fosters collaboration across different platforms and systems.

Metadata in ML-Ready Datasets - How Croissant Helps

Datasets are the foundation of Machine Learning. However, the lack of standardisation in how ML datasets are described has created significant challenges:

  • Only a small fraction of available datasets are widely used - many others remain undiscovered.
  • Without clear descriptions, loading datasets across frameworks (PyTorch, TensorFlow, JAX) requires custom glue code.
  • Reproducibility, portability, and responsible AI practices are hard to enforce without structured metadata.

Croissant addresses this by providing a standardised vocabulary based on schema.org that:

  • Streamlines dataset loading across popular ML frameworks
  • Improves discoverability - because Croissant builds on schema.org, dataset search engines can index the metadata
  • Supports portability and reproducibility through stable, versioned schema definitions
  • Promotes Responsible AI (RAI) practices via the Croissant RAI extension, covering bias, provenance, and licensing

Geospatial AI (GeoAI)

Geospatial Artificial Intelligence (GeoAI) applies AI techniques to geospatial data for location-based analysis, mapping, and decision-making. GeoAI leverages diverse data streams from satellites, airborne platforms, in-situ sensors, and ground observations - resulting in rich, high-volume datasets with complex spatio-temporal structures.

Several considerations are critical in GeoAI-ready datasets:

| Consideration | Why It Matters |
| --- | --- |
| Accurate location | Geolocation errors or coarse annotations can directly compromise model predictions |
| Sampling strategy | With petabyte-scale datasets, careful sampling avoids class imbalance and regional bias |
| Data lifecycle | Temporally mismatched data reduces model relevance and generalisability |
| Cloud-based access | Cloud-optimised formats enable efficient training and collaborative, scalable computation |
| End-to-end integration | Metadata-rich formats like GeoCroissant allow seamless ingestion into modern AI workflows |

Croissant and GeoAI Datasets

While Croissant introduces a strong foundation for ML metadata, it lacks specific support for the unique characteristics of geospatial datasets. Earth observation data exhibits high dimensionality, temporal complexity, and heterogeneity across formats (raster, vector, point cloud) - and demands spatial context, quality indicators, and privacy-aware attributes. GeoCroissant addresses these gaps:

| Geospatial Dataset Type | Gap vs. Generic Datasets | How GeoCroissant Addresses It |
| --- | --- | --- |
| EO imagery (multi-band, optical/SAR) | Band semantics and modality-specific acquisition parameters | Standardised sensor/band descriptors and ML-task metadata |
| Spatiotemporal time-series | Time indexing + spatiotemporal coverage consistency | Consistent temporal modelling and time-series support |
| Complex geo formats (NetCDF / HDF5 / Zarr) | Nested variables, chunking, multiple assets per sample | Clear mapping from raw containers to AI-ready datasets |
| Mixed geometry data (vector, raster, point clouds) | Heterogeneous geometry types and spatial reference handling | Uniform spatial semantics and discovery/query support |
| Human-labelled / crowdsourced datasets | Spatial representativeness and sampling bias | Explicit provenance and spatial bias documentation via Croissant RAI |

What Is GeoCroissant?

GeoCroissant is Croissant + geospatial semantics. It enhances the core Croissant framework with metadata elements essential for Geo-ML datasets:

  • Coordinate Reference Systems (CRS) - tells consumers which projection the data is in
  • Spatial resolution - the ground sampling distance per pixel
  • Band configuration - ordered band names and total band count
  • Spectral band metadata - center wavelength and bandwidth per band
  • Spatial and temporal coverage - bounding box and acquisition time
  • Record endpoints - URLs for programmatic data access

These extensions make datasets easily discoverable and accessible with ML frameworks such as PyTorch, TensorFlow, Keras, and HuggingFace for tasks such as land cover classification, climate modelling, and extreme weather forecasting.

The vocabulary is defined under the namespace geocr: <http://mlcommons.org/croissant/geo/> in the file geocroissant.ttl included in this repository.


Pre-requisites & Namespaces

The GeoCroissant vocabulary builds on schema.org/Dataset and uses the following namespaces:

| Prefix | IRI | Description |
| --- | --- | --- |
| sc | https://schema.org/ | The schema.org namespace |
| cr | http://mlcommons.org/croissant/ | MLCommons Croissant base namespace |
| geocr | http://mlcommons.org/croissant/geo/ | GeoCroissant extension namespace |
| dct | http://purl.org/dc/terms/ | Dublin Core Terms |

The GeoCroissant specification is versioned. The current version URI is:

http://mlcommons.org/croissant/geo/1.0

Datasets that conform to GeoCroissant must declare conformance at the dataset level:

"dct:conformsTo": [
  "http://mlcommons.org/croissant/1.1",
  "http://mlcommons.org/croissant/geo/1.0"
]
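
The declaration above can be checked before a file is handed to a loader. The following is a minimal sketch, not part of GeoMind: the helper name `declares_geocroissant` is hypothetical, and it assumes the metadata has already been parsed into a Python dict. It accepts both the prefixed key `dct:conformsTo` and the short `conformsTo` alias, since either may appear depending on the JSON-LD context in use.

```python
CROISSANT_URI = "http://mlcommons.org/croissant/1.1"
GEOCROISSANT_URI = "http://mlcommons.org/croissant/geo/1.0"


def declares_geocroissant(metadata: dict) -> bool:
    """Return True if the dataset declares both conformance URIs."""
    # Look for the prefixed key first, then the short alias.
    declared = metadata.get("dct:conformsTo", metadata.get("conformsTo", []))
    # JSON-LD allows a single string in place of a one-element list.
    if isinstance(declared, str):
        declared = [declared]
    return {CROISSANT_URI, GEOCROISSANT_URI} <= set(declared)


example = {"dct:conformsTo": [CROISSANT_URI, GEOCROISSANT_URI]}
print(declares_geocroissant(example))  # True
```

This only inspects the declaration itself; it does not verify that the rest of the file actually follows either specification.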

Stable Vocabulary URIs

While the specification is versioned, the geocr: namespace itself is not. Vocabulary terms retain stable URIs as the specification evolves, supporting machine-actionable FAIR compliance.
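
Because the namespace is unversioned, a compact term such as `geocr:spatialResolution` always expands to the same IRI. The sketch below illustrates that expansion with a small prefix map; the `expand` helper is hypothetical and covers only a subset of the prefixes, and a real consumer would normally rely on a JSON-LD processor instead.

```python
# Subset of the prefixes listed above (illustrative, not exhaustive).
PREFIXES = {
    "cr": "http://mlcommons.org/croissant/",
    "geocr": "http://mlcommons.org/croissant/geo/",
    "dct": "http://purl.org/dc/terms/",
}


def expand(term: str) -> str:
    """Expand a prefixed term to a full IRI; return other terms unchanged."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    # Unknown prefix, absolute IRI, or plain keyword: leave as-is.
    return term


print(expand("geocr:spatialResolution"))
# http://mlcommons.org/croissant/geo/spatialResolution
```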


Why Generate GeoCroissant Metadata?

| Without GeoCroissant | With GeoCroissant |
| --- | --- |
| Scene URL + manual inspection | Fully described dataset in one JSON file |
| CRS and resolution buried in headers | geocr:coordinateReferenceSystem and geocr:spatialResolution at dataset level |
| No standard citation format | citeAs BibTeX entry auto-generated |
| Dataset loaders break on structure changes | Stable cr:RecordSet and cr:Field schema |
| Manual ML cataloguing | Validated by mlcroissant validate in seconds |
| Undiscoverable in search engines | Indexed by any Croissant-aware catalogue or search engine |
| Band info locked in binary headers | Explicit geocr:SpectralBand entries with wavelength + bandwidth |

Any ML framework, data loader, or search engine that understands Croissant can interpret your satellite dataset without custom parsing.


GeoMind + GeoCroissant: End-to-End

GeoMind is the AI agent layer on top of GeoCroissant. It removes every manual step between a plain-English request and a validated, standards-compliant metadata file:

User: "Get me a recent Iceland Sentinel-2 GeoCroissant file"
  GeoMind Agent (LLM + tool orchestration)
         ├─ list_recent_imagery("Iceland")   ← geocodes + queries STAC
         └─ generate_croissant_metadata(item_id, "Iceland")
                  ├─ Fetches STAC item (bbox, datetime, assets)
                  ├─ Builds JSON-LD with Croissant + GeoCroissant context
                  ├─ Maps Zarr asset groups → cr:FileObject + cr:Field
                  ├─ Injects geocr:* properties (CRS, resolution, bands)
                  └─ Saves + validates → outputs/croissant_*.json

What you get in one prompt:

  • A scene selected by location and recency - no manual STAC querying
  • A fully structured JSON-LD file conforming to Croissant 1.1 and GeoCroissant 1.0
  • Every Sentinel-2 band mapped to a typed cr:Field with its Zarr extraction path
  • Spatial coverage, CRS, resolution, and temporal coverage all populated
  • A citeAs BibTeX block auto-generated
  • A file that passes mlcroissant validate with zero errors

This is the core value proposition: GeoMind turns natural language into FAIR, ML-ready EO metadata.


GeoCroissant Vocabulary (geocroissant.ttl)

The GeoMind repository ships the full GeoCroissant vocabulary as a Turtle (RDF) file. The namespace is:

@prefix geocr: <http://mlcommons.org/croissant/geo/> .

Classes

| Class | Description |
| --- | --- |
| geocr:BandConfiguration | Raster band organisation and semantics (band count + ordered names) |
| geocr:SpectralBand | Per-band spectral metadata entry (center wavelength, bandwidth) |
| geocr:MultiWavelengthConfiguration | Multi-wavelength channel config for Space Weather / heliophysics datasets |
| geocr:SolarInstrumentCharacteristics | Solar/heliophysics instrument and observatory characteristics |

Dataset-Level Properties

| Property | Range | Description |
| --- | --- | --- |
| geocr:coordinateReferenceSystem | schema:Text | CRS identifier, e.g. "EPSG:4326" |
| geocr:spatialResolution | schema:QuantitativeValue or Text | Nominal ground sampling distance |
| geocr:bandConfiguration | geocr:BandConfiguration | Band count and ordered band names |
| geocr:spectralBandMetadata | geocr:SpectralBand | Per-band center wavelength and bandwidth |
| geocr:recordEndpoint | schema:URL | Programmatic access endpoint (e.g. STAC endpoint) |
| geocr:spatialIndex | schema:Text | Spatial index token (DGGS cell ID, geohash) |
| geocr:spatialBias | schema:Text | Spatial representativeness limitations |
| geocr:samplingStrategy | schema:Text | Chip/window selection strategy description |
| geocr:temporalResolution | schema:QuantitativeValue or Text | Temporal cadence of the dataset |
| geocr:multiWavelengthConfiguration | geocr:MultiWavelengthConfiguration | Multi-wavelength config for heliophysics |
| geocr:solarInstrumentCharacteristics | geocr:SolarInstrumentCharacteristics | Solar instrument characteristics |

RecordSet-Level Properties

| Property | Range | Description |
| --- | --- | --- |
| geocr:spatialResolution | schema:QuantitativeValue | Resolution when it varies per record |
| geocr:spatialIndex | schema:Text | Spatial index per record |
| geocr:temporalResolution | schema:QuantitativeValue | Cadence when it varies per record |
| geocr:timeSeriesIndex | cr:Field | Field used to index time series observations |

BandConfiguration Properties

| Property | Range | Description |
| --- | --- | --- |
| geocr:totalBands | schema:Integer | Total number of bands |
| geocr:bandNamesList | schema:Text | Ordered list of band names |

SpectralBand Properties

| Property | Range | Description |
| --- | --- | --- |
| geocr:centerWavelength | schema:QuantitativeValue | Center wavelength (µm or nm) |
| geocr:bandwidth | schema:QuantitativeValue | Spectral bandwidth |
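
To make the BandConfiguration and SpectralBand shapes concrete, here is a minimal sketch that assembles both as plain Python dicts. The helper names are hypothetical, the `name` key on the spectral band entry is an assumption rather than a documented property, and the wavelength figures for B04 are approximate nominal values used purely for illustration.

```python
def band_configuration(band_names: list) -> dict:
    """Build a geocr:BandConfiguration whose count always matches the list."""
    return {
        "@type": "geocr:BandConfiguration",
        "geocr:totalBands": len(band_names),
        "geocr:bandNamesList": list(band_names),
    }


def spectral_band(name: str, center_nm: float, bandwidth_nm: float) -> dict:
    """Build a geocr:SpectralBand entry with values expressed in nanometres."""

    def nm(value: float) -> dict:
        return {"@type": "QuantitativeValue", "value": value, "unitText": "nm"}

    return {
        "@type": "geocr:SpectralBand",
        "name": name,  # assumption: a plain name is used to identify the band
        "geocr:centerWavelength": nm(center_nm),
        "geocr:bandwidth": nm(bandwidth_nm),
    }


config = band_configuration(["B02", "B03", "B04"])
red = spectral_band("B04", 665.0, 30.0)  # approximate, illustrative values
print(config["geocr:totalBands"], red["geocr:centerWavelength"]["value"])
```

Deriving `geocr:totalBands` from the list length, rather than setting it by hand, keeps the count and the ordered names from drifting apart.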

How GeoMind Builds GeoCroissant Metadata

When you ask GeoMind for GeoCroissant metadata, it runs the generate_croissant_metadata tool:

flowchart LR
    A["User query\ne.g. 'Iceland geocroissant'"] --> B["Agent calls\nlist_recent_imagery"]
    B --> C["STAC returns\nitem_id + asset URLs"]
    C --> D["Agent calls\ngenerate_croissant_metadata(item_id)"]
    D --> E["Fetch item details\nfrom STAC API"]
    E --> F["Build JSON-LD\nskeleton"]
    F --> G["Map assets →\ncr:FileObject + cr:Field"]
    G --> H["Add geocr:*\nproperties"]
    H --> I["Save to\noutputs/croissant_*.json"]

Steps in detail:

  1. Fetch item details - calls the EODC STAC API for the given item ID (bbox, datetime, assets)
  2. Build JSON-LD context - maps all Croissant and GeoCroissant prefixes
  3. Set dataset-level fields - name, description, spatialCoverage, temporalCoverage, license, citeAs
  4. Map assets to cr:FileObject - the product Zarr URL becomes the single distribution file
  5. Map bands to cr:Field - each asset sub-path becomes a typed field with source extraction path
  6. Add geocr: extensions - coordinateReferenceSystem, spatialResolution, bandConfiguration
  7. Validate and save - output written to outputs/croissant_<item_id>_<id>.json
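
Steps 2 to 6 can be sketched in a few lines of Python. This is an illustrative approximation, not GeoMind's actual implementation: the function name `build_geocroissant` is hypothetical, the STAC item is passed in as an already-fetched dict, the asset key `product` and the sample values are assumptions, and the band fields, license, and citeAs are omitted for brevity.

```python
def build_geocroissant(item: dict, place: str) -> dict:
    """Assemble a GeoCroissant JSON-LD skeleton from a STAC item dict."""
    # STAC bbox order is [west, south, east, north].
    west, south, east, north = item["bbox"]
    when = item["properties"]["datetime"]
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "geocr": "http://mlcommons.org/croissant/geo/",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "sc:Dataset",
        "@id": item["id"],
        "name": item["id"],
        "description": f"Sentinel-2 L2A Imagery for {place}",
        "spatialCoverage": {
            "@type": "Place",
            # Box written as "south west north east" (latitude first).
            "geo": {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"},
        },
        # A single scene covers one instant, so start and end coincide.
        "temporalCoverage": f"{when}/{when}",
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "asset_product",
                "contentUrl": item["assets"]["product"]["href"],  # assumed key
                "encodingFormat": "application/vnd+zarr",
            }
        ],
        "geocr:coordinateReferenceSystem": "EPSG:4326",
    }


# Made-up item for illustration only; real items come from the STAC API.
sample_item = {
    "id": "EXAMPLE_ITEM",
    "bbox": [-18.89, 64.77, -16.41, 65.80],
    "properties": {"datetime": "2026-03-01T12:52:59Z"},
    "assets": {"product": {"href": "https://example.org/scene.zarr"}},
}
metadata = build_geocroissant(sample_item, "Iceland")
print(metadata["spatialCoverage"]["geo"]["box"])
```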

Output Structure

A GeoMind GeoCroissant file is valid JSON-LD conforming to both Croissant 1.1 and GeoCroissant 1.0:

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "cr":    "http://mlcommons.org/croissant/",
    "geocr": "http://mlcommons.org/croissant/geo/",
    "dct":   "http://purl.org/dc/terms/",
    "sc":    "https://schema.org/"
    ...
  },
  "@type": "sc:Dataset",
  "@id":   "S2B_MSIL2A_20260301T125259_N0512_R138_T27WXN_20260301T163056",
  "name":  "S2B_MSIL2A_20260301T125259_N0512_R138_T27WXN_20260301T163056",
  "description": "Sentinel-2 L2A Imagery for Iceland",
  "conformsTo": [
    "http://mlcommons.org/croissant/1.1",
    "http://mlcommons.org/croissant/geo/1.0"
  ],
  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoShape",
      "box": "64.77 -18.89 65.80 -16.41"
    }
  },
  "geocr:coordinateReferenceSystem": "EPSG:4326",
  "geocr:spatialResolution": {
    "@type": "QuantitativeValue",
    "value": 10.0,
    "unitText": "meters"
  },
  "geocr:bandConfiguration": {
    "@type": "geocr:BandConfiguration",
    "geocr:totalBands": 16,
    "geocr:bandNamesList": ["SR_10m", "SR_20m", "SR_60m", ...]
  },
  "temporalCoverage": "2026-03-01T12:52:59Z/2026-03-01T12:52:59Z",
  "distribution": [
    {
      "@type":          "cr:FileObject",
      "@id":            "asset_product",
      "contentUrl":     "https://objects.eodc.eu/.../S2B...zarr",
      "encodingFormat": "application/vnd+zarr"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id":   "record_set_imagery_bands",
      "field": [
        {
          "@type":    "cr:Field",
          "@id":      "field_B02_10m",
          "name":     "B02_10m",
          "description": "Blue (band 2) - 10m",
          "dataType": "sc:Float",
          "source": {
            "fileObject": { "@id": "asset_product" },
            "extract":    { "column": "measurements/reflectance/r10m/b02" }
          }
        }
      ]
    }
  ]
}
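
A consumer can read this structure with nothing more than the standard library. The sketch below uses a trimmed inline copy of the file rather than a real output, and the helper names are hypothetical. It maps each field name to its Zarr extraction path and parses the bounding box, assuming the "south west north east" order seen in the example.

```python
import json

# Trimmed excerpt of a GeoCroissant file, inlined for illustration.
text = """
{
  "spatialCoverage": {"@type": "Place",
    "geo": {"@type": "GeoShape", "box": "64.77 -18.89 65.80 -16.41"}},
  "recordSet": [{
    "@type": "cr:RecordSet",
    "@id": "record_set_imagery_bands",
    "field": [{
      "@type": "cr:Field",
      "@id": "field_B02_10m",
      "name": "B02_10m",
      "source": {"fileObject": {"@id": "asset_product"},
                 "extract": {"column": "measurements/reflectance/r10m/b02"}}
    }]
  }]
}
"""


def field_paths(metadata: dict) -> dict:
    """Map each field name to the path it is extracted from."""
    paths = {}
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            paths[field["name"]] = field["source"]["extract"]["column"]
    return paths


def parse_box(metadata: dict) -> tuple:
    """Return (south, west, north, east) as floats."""
    box = metadata["spatialCoverage"]["geo"]["box"]
    south, west, north, east = (float(part) for part in box.split())
    return south, west, north, east


doc = json.loads(text)
print(field_paths(doc))
print(parse_box(doc))
```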

Sentinel-2 Bands in GeoCroissant

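Each Sentinel-2 band maps to a cr:Field in the record_set_imagery_bands RecordSet. GeoMind includes all available bands from the EODC Zarr product, along with the derived layers (surface reflectance groups, AOT, SCL, TCI) listed at the end of the table: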

| Field ID | Band | Description | Resolution |
| --- | --- | --- | --- |
| field_B01_20m | B01 | Coastal aerosol | 20 m |
| field_B02_10m | B02 | Blue | 10 m |
| field_B03_10m | B03 | Green | 10 m |
| field_B04_10m | B04 | Red | 10 m |
| field_B05_20m | B05 | Red Edge 1 | 20 m |
| field_B06_20m | B06 | Red Edge 2 | 20 m |
| field_B07_20m | B07 | Red Edge 3 | 20 m |
| field_B08_10m | B08 | NIR | 10 m |
| field_B8A_20m | B8A | Narrow NIR | 20 m |
| field_B09_60m | B09 | Water Vapour | 60 m |
| field_B11_20m | B11 | SWIR 1 | 20 m |
| field_B12_20m | B12 | SWIR 2 | 20 m |
| field_SR_10m | - | Surface Reflectance composite 10m | 10 m |
| field_SR_20m | - | Surface Reflectance composite 20m | 20 m |
| field_SR_60m | - | Surface Reflectance composite 60m | 60 m |
| field_AOT_10m | - | Aerosol Optical Thickness | 10 m |
| field_SCL_20m | - | Scene Classification Map | 20 m |
| field_TCI_10m | - | True Colour Image | 10 m |
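
The field IDs in the table follow a `field_<name>_<resolution>m` pattern, which makes it straightforward to group layers by resolution before loading them. The sketch below infers that pattern from the table; it is not a documented contract, and the helper name is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the table above: field_<name>_<resolution>m
FIELD_ID = re.compile(r"^field_(?P<name>[A-Za-z0-9]+)_(?P<res>\d+)m$")


def group_by_resolution(field_ids: list) -> dict:
    """Group layer names by their resolution in metres."""
    groups = defaultdict(list)
    for field_id in field_ids:
        match = FIELD_ID.match(field_id)
        if match is None:
            raise ValueError(f"Unexpected field ID: {field_id!r}")
        groups[int(match["res"])].append(match["name"])
    return dict(groups)


ids = ["field_B02_10m", "field_B8A_20m", "field_B09_60m", "field_SCL_20m"]
print(group_by_resolution(ids))
# {10: ['B02'], 20: ['B8A', 'SCL'], 60: ['B09']}
```

Grouping by resolution is useful because layers at the same resolution share a pixel grid and can be stacked without resampling.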

Querying and Validation

For step-by-step interactive queries, Python API usage, and mlcroissant validation walkthroughs, please see the GeoCroissant Metadata Examples in our Usage Examples.


Standards Conformance

| Standard | Version | Namespace or usage |
| --- | --- | --- |
| MLCommons Croissant | 1.1 | http://mlcommons.org/croissant/ |
| GeoCroissant | 1.0 | http://mlcommons.org/croissant/geo/ |
| Schema.org | - | https://schema.org/ |
| Dublin Core | - | http://purl.org/dc/terms/ |
| W3C JSON-LD | 1.1 | - |
| EPSG | - | CRS identifiers (e.g. EPSG:4326) |