GeoCroissant Metadata
GeoCroissant is the geospatial extension of the MLCommons Croissant standard for describing machine learning datasets. GeoMind generates GeoCroissant-compliant JSON-LD metadata for every Sentinel-2 scene it retrieves, so imagery requested with a single natural-language prompt arrives ready for AI/ML pipelines.
Introduction & Overview
Croissant is a metadata standard designed to describe datasets in a structured, machine-actionable way, improving how data is discovered, understood, and consumed by automated tools and AI/ML pipelines. Building on this foundation, GeoCroissant extends Croissant with GeoAI-specific concepts, including:
- Spatial and temporal extent
- Coordinate reference systems (CRS)
- Tiling and grids
- Geospatial assets (multi-band rasters, Zarr groups, STAC items)
These additions better support GeoAI and Earth Observation (EO) workflows, where interoperability and precise geospatial context are essential.
With the growing availability of EO data, there is increasing emphasis on making datasets FAIR (Findable, Accessible, Interoperable, Reusable) and machine-actionable for automated discovery and reuse. Consistently structured metadata is central to this goal, capturing key geospatial characteristics (spatial/temporal extent, CRS, resolution) alongside provenance and licensing to enable interoperability across platforms.
Geospatial initiatives such as the Group on Earth Observations (GEO) Data Sharing Principles, as well as broader frameworks like the European Open Science Cloud (EOSC), reinforce transparent, responsible data stewardship and cross-platform interoperability.
Prior standardization efforts - DCAT, schema.org/Dataset, and CSV on the Web - established important foundations. GeoCroissant builds on all of them while adding the geospatial semantics that Earth observation data demands.
Why Metadata Matters
Metadata plays a critical role in making data meaningful and actionable. Without it, datasets can be challenging to interpret, leading to misunderstandings or underutilisation. Metadata provides context, ensures users understand the origin, purpose, and structure of data, enables efficient search and discovery, and fosters collaboration across different platforms and systems.
Metadata in ML-Ready Datasets - How Croissant Helps
Datasets are the foundation of Machine Learning. However, the lack of standardisation in how ML datasets are described has created significant challenges:
- Only a small fraction of available datasets are widely used - the rest remain hard to discover.
- Without clear descriptions, loading datasets across frameworks (PyTorch, TensorFlow, JAX) requires custom glue code.
- Reproducibility, portability, and responsible AI practices are hard to enforce without structured metadata.
Croissant addresses this by providing a standardised vocabulary based on schema.org that:
- Streamlines dataset loading across popular ML frameworks
- Improves discoverability - because Croissant builds on schema.org, dataset search engines can index the metadata
- Supports portability and reproducibility through stable, versioned schema definitions
- Promotes Responsible AI (RAI) practices via the Croissant RAI extension, covering bias, provenance, and licensing
Geospatial AI (GeoAI)
Geospatial Artificial Intelligence (GeoAI) applies AI techniques to geospatial data for location-based analysis, mapping, and decision-making. GeoAI leverages diverse data streams from satellites, airborne platforms, in-situ sensors, and ground observations - resulting in rich, high-volume datasets with complex spatio-temporal structures.
Several considerations are critical in GeoAI-ready datasets:
| Consideration | Why It Matters |
|---|---|
| Accurate location | Geolocation errors or coarse annotations can directly compromise model predictions |
| Sampling strategy | With petabyte-scale datasets, careful sampling avoids class imbalance and regional bias |
| Data lifecycle | Temporally mismatched data reduces model relevance and generalisability |
| Cloud-based access | Cloud-optimised formats enable efficient training and collaborative, scalable computation |
| End-to-end integration | Metadata-rich formats like GeoCroissant allow seamless ingestion into modern AI workflows |
Croissant and GeoAI Datasets
While Croissant introduces a strong foundation for ML metadata, it lacks specific support for the unique characteristics of geospatial datasets. Earth observation data exhibits high dimensionality, temporal complexity, and heterogeneity across formats (raster, vector, point cloud) - and demands spatial context, quality indicators, and privacy-aware attributes. GeoCroissant addresses these gaps:
| Geospatial Dataset Type | Gap vs. Generic Datasets | How GeoCroissant Addresses It |
|---|---|---|
| EO imagery (multi-band, optical/SAR) | Band semantics and modality-specific acquisition parameters | Standardised sensor/band descriptors and ML-task metadata |
| Spatiotemporal time-series | Time indexing + spatiotemporal coverage consistency | Consistent temporal modelling and time-series support |
| Complex geo formats (NetCDF / HDF5 / Zarr) | Nested variables, chunking, multiple assets per sample | Clear mapping from raw containers to AI-ready datasets |
| Mixed geometry data (vector, raster, point clouds) | Heterogeneous geometry types and spatial reference handling | Uniform spatial semantics and discovery/query support |
| Human-labelled / crowdsourced datasets | Spatial representativeness and sampling bias | Explicit provenance and spatial bias documentation via Croissant RAI |
What Is GeoCroissant?
GeoCroissant is Croissant + geospatial semantics. It enhances the core Croissant framework with metadata elements essential for Geo-ML datasets:
- Coordinate Reference Systems (CRS) - tells consumers which projection the data is in
- Spatial resolution - the ground sampling distance per pixel
- Band configuration - ordered band names and total band count
- Spectral band metadata - center wavelength and bandwidth per band
- Spatial and temporal coverage - bounding box and acquisition time
- Record endpoints - URLs for programmatic data access
These extensions make datasets easier to discover and to load with ML frameworks such as PyTorch, TensorFlow, Keras, and Hugging Face, for tasks including land cover classification, climate modelling, and extreme weather forecasting.
The vocabulary is defined under the namespace geocr: <http://mlcommons.org/croissant/geo/> in the file geocroissant.ttl included in this repository.
Pre-requisites & Namespaces
The GeoCroissant vocabulary builds on schema.org/Dataset and uses the following namespaces:
| Prefix | IRI | Description |
|---|---|---|
| `sc` | `http://schema.org/` | The schema.org namespace |
| `cr` | `http://mlcommons.org/croissant/` | MLCommons Croissant base namespace |
| `geocr` | `http://mlcommons.org/croissant/geo/` | GeoCroissant extension namespace |
| `dct` | `http://purl.org/dc/terms/` | Dublin Core Terms |
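To make the prefix table concrete, the following sketch expands a compact IRI such as `geocr:spatialResolution` into its full form. The `PREFIXES` dictionary copies the table; the `expand` helper is a hypothetical illustration and is not part of GeoMind or mlcroissant.

```python
# Prefix-to-IRI mapping copied from the namespace table.
PREFIXES = {
    "sc": "http://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "geocr": "http://mlcommons.org/croissant/geo/",
    "dct": "http://purl.org/dc/terms/",
}


def expand(compact_iri: str) -> str:
    """Expand 'prefix:term' to a full IRI; return anything else unchanged."""
    prefix, sep, term = compact_iri.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + term
    return compact_iri  # unknown prefix, bare term, or already a full IRI


print(expand("geocr:spatialResolution"))
print(expand("dct:conformsTo"))
```

A JSON-LD processor performs this expansion using the `@context` of the file, which is why the prefixes need to be declared there.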
The GeoCroissant specification is versioned. The current version URI is http://mlcommons.org/croissant/geo/1.0.
Datasets that conform to GeoCroissant must declare conformance at the dataset level:
"dct:conformsTo": [
  "http://mlcommons.org/croissant/1.1",
  "http://mlcommons.org/croissant/geo/1.0"
]
Stable Vocabulary URIs
While the specification is versioned, the geocr: namespace itself is not. Vocabulary terms retain stable URIs as the specification evolves, supporting machine-actionable FAIR compliance.
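As a rough illustration of the conformance rule, the sketch below checks whether a parsed metadata dictionary lists both version URIs. The function name `declares_geocroissant` is hypothetical, and accepting the bare `conformsTo` key is an assumption based on the output example on this page; this is not a substitute for `mlcroissant validate`.

```python
CROISSANT_URI = "http://mlcommons.org/croissant/1.1"
GEOCROISSANT_URI = "http://mlcommons.org/croissant/geo/1.0"


def declares_geocroissant(metadata: dict) -> bool:
    """Return True if the dataset lists both conformance URIs."""
    # Accept the prefixed key and the bare key (assumed to be aliased in @context).
    declared = metadata.get("dct:conformsTo", metadata.get("conformsTo", []))
    if isinstance(declared, str):  # a single URI may be given as a plain string
        declared = [declared]
    return {CROISSANT_URI, GEOCROISSANT_URI} <= set(declared)


example = {"dct:conformsTo": [CROISSANT_URI, GEOCROISSANT_URI]}
print(declares_geocroissant(example))
```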
Why Generate GeoCroissant Metadata?
| Without GeoCroissant | With GeoCroissant |
|---|---|
| Scene URL + manual inspection | Fully described dataset in one JSON file |
| CRS and resolution buried in headers | geocr:coordinateReferenceSystem and geocr:spatialResolution at dataset level |
| No standard citation format | citeAs BibTeX entry auto-generated |
| Dataset loaders break on structure changes | Stable cr:RecordSet and cr:Field schema |
| Manual ML cataloguing | Validated by mlcroissant validate in seconds |
| Undiscoverable in search engines | Indexed by any Croissant-aware catalogue or search engine |
| Band info locked in binary headers | Explicit geocr:SpectralBand entries with wavelength + bandwidth |
Any ML framework, data loader, or search engine that understands Croissant automatically understands your satellite dataset.
GeoMind + GeoCroissant: End-to-End
GeoMind is the AI agent layer on top of GeoCroissant. It removes every manual step between a plain-English request and a validated, standards-compliant metadata file:
User: "Get me a recent Iceland Sentinel-2 GeoCroissant file"
        │
        ▼
GeoMind Agent (LLM + tool orchestration)
        │
        ├─ list_recent_imagery("Iceland")   ← geocodes + queries STAC
        │
        └─ generate_croissant_metadata(item_id, "Iceland")
                │
                ├─ Fetches STAC item (bbox, datetime, assets)
                ├─ Builds JSON-LD with Croissant + GeoCroissant context
                ├─ Maps Zarr asset groups → cr:FileObject + cr:Field
                ├─ Injects geocr:* properties (CRS, resolution, bands)
                └─ Saves + validates → outputs/croissant_*.json
What you get in one prompt:
- A scene selected by location and recency - no manual STAC querying
- A fully structured JSON-LD file conforming to Croissant 1.1 and GeoCroissant 1.0
- Every Sentinel-2 band mapped to a typed `cr:Field` with its Zarr extraction path
- Spatial coverage, CRS, resolution, and temporal coverage all populated
- A `citeAs` BibTeX block auto-generated
- A file that passes `mlcroissant validate` with zero errors
This is the core value proposition: GeoMind turns natural language into FAIR, ML-ready EO metadata.
GeoCroissant Vocabulary (geocroissant.ttl)
The GeoMind repository ships the full GeoCroissant vocabulary as a Turtle (RDF) file. The namespace is http://mlcommons.org/croissant/geo/ (prefix geocr:).
Classes
| Class | Description |
|---|---|
| `geocr:BandConfiguration` | Raster band organisation and semantics (band count + ordered names) |
| `geocr:SpectralBand` | Per-band spectral metadata entry (center wavelength, bandwidth) |
| `geocr:MultiWavelengthConfiguration` | Multi-wavelength channel config for Space Weather / heliophysics datasets |
| `geocr:SolarInstrumentCharacteristics` | Solar/heliophysics instrument and observatory characteristics |
Dataset-Level Properties
| Property | Range | Description |
|---|---|---|
| `geocr:coordinateReferenceSystem` | `schema:Text` | CRS identifier, e.g. "EPSG:4326" |
| `geocr:spatialResolution` | `schema:QuantitativeValue` or Text | Nominal ground sampling distance |
| `geocr:bandConfiguration` | `geocr:BandConfiguration` | Band count and ordered band names |
| `geocr:spectralBandMetadata` | `geocr:SpectralBand` | Per-band center wavelength and bandwidth |
| `geocr:recordEndpoint` | `schema:URL` | Programmatic access endpoint (e.g. STAC endpoint) |
| `geocr:spatialIndex` | `schema:Text` | Spatial index token (DGGS cell ID, geohash) |
| `geocr:spatialBias` | `schema:Text` | Spatial representativeness limitations |
| `geocr:samplingStrategy` | `schema:Text` | Chip/window selection strategy description |
| `geocr:temporalResolution` | `schema:QuantitativeValue` or Text | Temporal cadence of the dataset |
| `geocr:multiWavelengthConfiguration` | `geocr:MultiWavelengthConfiguration` | Multi-wavelength config for heliophysics |
| `geocr:solarInstrumentCharacteristics` | `geocr:SolarInstrumentCharacteristics` | Solar instrument characteristics |
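The two most commonly populated dataset-level properties can be attached as in the sketch below. The helper name `add_geo_properties` is hypothetical, and the `unitText` value of "meters" follows the output example on this page rather than any requirement confirmed by the vocabulary.

```python
def add_geo_properties(dataset: dict, crs: str, resolution_m: float) -> dict:
    """Return a copy of `dataset` with CRS and spatial resolution attached."""
    enriched = dict(dataset)  # shallow copy so the caller's dict is untouched
    enriched["geocr:coordinateReferenceSystem"] = crs
    enriched["geocr:spatialResolution"] = {
        "@type": "QuantitativeValue",
        "value": float(resolution_m),
        "unitText": "meters",  # assumption: unit label taken from the example output
    }
    return enriched


base = {"@type": "sc:Dataset", "name": "example-scene"}
print(add_geo_properties(base, "EPSG:4326", 10))
```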
RecordSet-Level Properties
| Property | Range | Description |
|---|---|---|
| `geocr:spatialResolution` | `schema:QuantitativeValue` | Resolution when it varies per record |
| `geocr:spatialIndex` | `schema:Text` | Spatial index per record |
| `geocr:temporalResolution` | `schema:QuantitativeValue` | Cadence when it varies per record |
| `geocr:timeSeriesIndex` | `cr:Field` | Field used to index time series observations |
BandConfiguration Properties
| Property | Range | Description |
|---|---|---|
| `geocr:totalBands` | `schema:Integer` | Total number of bands |
| `geocr:bandNamesList` | `schema:Text` | Ordered list of band names |
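Because `geocr:totalBands` and `geocr:bandNamesList` describe the same list, one way to keep them consistent is to derive the count from the names. The helper below is a hypothetical sketch of that idea, not GeoMind's actual implementation.

```python
def band_configuration(band_names: list[str]) -> dict:
    """Build a geocr:BandConfiguration node from an ordered list of band names."""
    if len(set(band_names)) != len(band_names):
        raise ValueError("band names must be unique")
    return {
        "@type": "geocr:BandConfiguration",
        "geocr:totalBands": len(band_names),      # derived, so it cannot drift from the list
        "geocr:bandNamesList": list(band_names),  # order is preserved
    }


print(band_configuration(["B02_10m", "B03_10m", "B04_10m"]))
```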
SpectralBand Properties
| Property | Range | Description |
|---|---|---|
| `geocr:centerWavelength` | `schema:QuantitativeValue` | Center wavelength (µm or nm) |
| `geocr:bandwidth` | `schema:QuantitativeValue` | Spectral bandwidth |
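A `geocr:SpectralBand` entry could be assembled as sketched below. The helper `spectral_band` and the use of a `name` key to identify the band are assumptions made for illustration, and the wavelengths are approximate nominal Sentinel-2 MSI values, not figures taken from GeoMind output.

```python
def spectral_band(name: str, center_nm: float, bandwidth_nm: float) -> dict:
    """Build one geocr:SpectralBand entry with both values expressed in nanometres."""

    def nanometres(value: float) -> dict:
        return {"@type": "QuantitativeValue", "value": value, "unitText": "nm"}

    return {
        "@type": "geocr:SpectralBand",
        "name": name,  # assumption: schema.org `name` identifies the band
        "geocr:centerWavelength": nanometres(center_nm),
        "geocr:bandwidth": nanometres(bandwidth_nm),
    }


# Approximate nominal values, for illustration only: (band, center nm, bandwidth nm).
APPROX_BANDS = [("B02", 490, 65), ("B03", 560, 35), ("B04", 665, 30)]
spectral_metadata = [spectral_band(*row) for row in APPROX_BANDS]
print(spectral_metadata[0])
```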
How GeoMind Builds GeoCroissant Metadata
When you ask GeoMind for GeoCroissant metadata, it runs the generate_croissant_metadata tool:
flowchart LR
A["User query\ne.g. 'Iceland geocroissant'"] --> B["Agent calls\nlist_recent_imagery"]
B --> C["STAC returns\nitem_id + asset URLs"]
C --> D["Agent calls\ngenerate_croissant_metadata(item_id)"]
D --> E["Fetch item details\nfrom STAC API"]
E --> F["Build JSON-LD\nskeleton"]
F --> G["Map assets →\ncr:FileObject + cr:Field"]
G --> H["Add geocr:*\nproperties"]
H --> I["Save to\noutputs/croissant_*.json"]
Steps in detail:
- Fetch item details - calls the EODC STAC API for the given item ID (bbox, datetime, assets)
- Build JSON-LD context - maps all Croissant and GeoCroissant prefixes
- Set dataset-level fields - `name`, `description`, `spatialCoverage`, `temporalCoverage`, `license`, `citeAs`
- Map assets to `cr:FileObject` - the product Zarr URL becomes the single distribution file
- Map bands to `cr:Field` - each asset sub-path becomes a typed field with source extraction path
- Add `geocr:` extensions - `coordinateReferenceSystem`, `spatialResolution`, `bandConfiguration`
- Validate and save - output written to `outputs/croissant_<item_id>_<id>.json`
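The dataset-level step has to translate STAC conventions into schema.org ones. The sketch below shows one plausible conversion, assuming a 2D STAC `bbox` ordered `[west, south, east, north]` and the latitude-first `box` string seen in the Output Structure section. Both helper names are hypothetical, and GeoMind's own code may differ.

```python
from datetime import datetime, timezone


def stac_bbox_to_box(bbox) -> str:
    """Convert a STAC bbox [west, south, east, north] to a schema.org box string."""
    west, south, east, north = bbox
    # Latitude comes before longitude: "south west north east".
    return f"{south} {west} {north} {east}"


def instant_to_temporal_coverage(stamp: str) -> str:
    """Turn one ISO 8601 acquisition time into a 'start/end' interval string."""
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    iso = parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{iso}/{iso}"  # a single scene starts and ends at the same instant


print(stac_bbox_to_box([-18.89, 64.77, -16.41, 65.8]))
print(instant_to_temporal_coverage("2026-03-01T12:52:59Z"))
```

Swapping the axis order is the easy mistake here: STAC follows GeoJSON (longitude first), while the `box` string in the example output lists latitude first.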
Output Structure
A GeoMind GeoCroissant file is valid JSON-LD conforming to both Croissant 1.1 and GeoCroissant 1.0:
{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "geocr": "http://mlcommons.org/croissant/geo/",
    "dct": "http://purl.org/dc/terms/",
    "sc": "https://schema.org/"
    ...
  },
  "@type": "sc:Dataset",
  "@id": "S2B_MSIL2A_20260301T125259_N0512_R138_T27WXN_20260301T163056",
  "name": "S2B_MSIL2A_20260301T125259_N0512_R138_T27WXN_20260301T163056",
  "description": "Sentinel-2 L2A Imagery for Iceland",
  "conformsTo": [
    "http://mlcommons.org/croissant/1.1",
    "http://mlcommons.org/croissant/geo/1.0"
  ],
  "spatialCoverage": {
    "@type": "Place",
    "geo": {
      "@type": "GeoShape",
      "box": "64.77 -18.89 65.80 -16.41"
    }
  },
  "geocr:coordinateReferenceSystem": "EPSG:4326",
  "geocr:spatialResolution": {
    "@type": "QuantitativeValue",
    "value": 10.0,
    "unitText": "meters"
  },
  "geocr:bandConfiguration": {
    "@type": "geocr:BandConfiguration",
    "geocr:totalBands": 16,
    "geocr:bandNamesList": ["SR_10m", "SR_20m", "SR_60m", ...]
  },
  "temporalCoverage": "2026-03-01T12:52:59Z/2026-03-01T12:52:59Z",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "asset_product",
      "contentUrl": "https://objects.eodc.eu/.../S2B...zarr",
      "encodingFormat": "application/vnd+zarr"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "record_set_imagery_bands",
      "field": [
        {
          "@type": "cr:Field",
          "@id": "field_B02_10m",
          "name": "B02_10m",
          "description": "Blue (band 2) - 10m",
          "dataType": "sc:Float",
          "source": {
            "fileObject": { "@id": "asset_product" },
            "extract": { "column": "measurements/reflectance/r10m/b02" }
          }
        }
      ]
    }
  ]
}
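Since the output is plain JSON, it can be inspected with the standard library alone. The sketch below uses a trimmed inline copy of the structure above (`EXAMPLE`) to list each field's Zarr extraction path and to flag fields whose source refers to a `cr:FileObject` that is missing from `distribution`. Both helper names are hypothetical, and the check complements `mlcroissant validate` rather than replacing it.

```python
import json

# Trimmed copy of the example output; a real file would be read with json.load().
EXAMPLE = json.loads("""
{
  "distribution": [{"@type": "cr:FileObject", "@id": "asset_product"}],
  "recordSet": [{
    "@type": "cr:RecordSet",
    "@id": "record_set_imagery_bands",
    "field": [{
      "@type": "cr:Field",
      "@id": "field_B02_10m",
      "source": {
        "fileObject": {"@id": "asset_product"},
        "extract": {"column": "measurements/reflectance/r10m/b02"}
      }
    }]
  }]
}
""")


def field_paths(metadata: dict) -> dict:
    """Map each field @id to its extraction path inside the source file."""
    paths = {}
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            paths[field["@id"]] = field["source"]["extract"]["column"]
    return paths


def dangling_sources(metadata: dict) -> list:
    """List field @ids whose source FileObject is absent from `distribution`."""
    known = {item["@id"] for item in metadata.get("distribution", [])}
    return [
        field["@id"]
        for record_set in metadata.get("recordSet", [])
        for field in record_set.get("field", [])
        if field["source"]["fileObject"]["@id"] not in known
    ]


print(field_paths(EXAMPLE))
print(dangling_sources(EXAMPLE))
```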
Sentinel-2 Bands in GeoCroissant
Each Sentinel-2 band maps to a cr:Field in the imagery_bands RecordSet. GeoMind includes all available bands from the EODC Zarr product:
| Field ID | Band | Description | Resolution |
|---|---|---|---|
| `field_B01_20m` | B01 | Coastal aerosol | 20 m |
| `field_B02_10m` | B02 | Blue | 10 m |
| `field_B03_10m` | B03 | Green | 10 m |
| `field_B04_10m` | B04 | Red | 10 m |
| `field_B05_20m` | B05 | Red Edge 1 | 20 m |
| `field_B06_20m` | B06 | Red Edge 2 | 20 m |
| `field_B07_20m` | B07 | Red Edge 3 | 20 m |
| `field_B08_10m` | B08 | NIR | 10 m |
| `field_B8A_20m` | B8A | Narrow NIR | 20 m |
| `field_B09_60m` | B09 | Water Vapour | 60 m |
| `field_B11_20m` | B11 | SWIR 1 | 20 m |
| `field_B12_20m` | B12 | SWIR 2 | 20 m |
| `field_SR_10m` | - | Surface Reflectance composite 10m | 10 m |
| `field_SR_20m` | - | Surface Reflectance composite 20m | 20 m |
| `field_SR_60m` | - | Surface Reflectance composite 60m | 60 m |
| `field_AOT_10m` | - | Aerosol Optical Thickness | 10 m |
| `field_SCL_20m` | - | Scene Classification Map | 20 m |
| `field_TCI_10m` | - | True Colour Image | 10 m |
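The rows for the numbered reflectance bands follow a regular naming pattern, so a `cr:Field` entry can be generated from three inputs. The sketch below reproduces the `field_B02_10m` entry from the Output Structure section. The helper `band_field` is hypothetical, and the Zarr path pattern is an assumption generalised from the single B02 example on this page, so other bands and the non-band layers (SR, AOT, SCL, TCI) may use different paths.

```python
def band_field(band: str, resolution_m: int, label: str) -> dict:
    """Build one cr:Field entry for a reflectance band of the product Zarr."""
    field_name = f"{band}_{resolution_m}m"
    # Assumed path pattern, generalised from the B02 example.
    zarr_path = f"measurements/reflectance/r{resolution_m}m/{band.lower()}"
    band_number = band[1:].lstrip("0")  # "B02" -> "2", "B8A" -> "8A"
    return {
        "@type": "cr:Field",
        "@id": f"field_{field_name}",
        "name": field_name,
        "description": f"{label} (band {band_number}) - {resolution_m}m",
        "dataType": "sc:Float",
        "source": {
            "fileObject": {"@id": "asset_product"},
            "extract": {"column": zarr_path},
        },
    }


print(band_field("B02", 10, "Blue"))
```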
Querying and Validation
For step-by-step interactive queries, Python API usage, and mlcroissant validation walkthroughs, please see the GeoCroissant Metadata Examples in our Usage Examples.
Standards Conformance
| Standard | Version | Namespace |
|---|---|---|
| MLCommons Croissant | 1.1 | http://mlcommons.org/croissant/ |
| GeoCroissant | 1.0 | http://mlcommons.org/croissant/geo/ |
| Schema.org | - | https://schema.org/ |
| Dublin Core | - | http://purl.org/dc/terms/ |
| W3C JSON-LD | 1.1 | - |
| EPSG | - | CRS identifiers (e.g. EPSG:4326) |