AI-ready Dataset Metadata as a Service: Google Summer of Code 2025
August 31, 2025 • Harsh Shinde
Brief Description
The project aims to add native GeoCroissant metadata support to the ZOO-Project so that geospatial datasets can be published in an AI-ready form. It will provide tools for generating and validating metadata, integrations with STAC, Earth Engine, and HuggingFace, and data-centric AI workflows for improving dataset quality.
State of the Project Before GSoC
Before GSoC, the ZOO-Project already offered solid support for OGC-compliant geoprocessing, but it had no built-in support for GeoCroissant, a metadata standard designed specifically for AI-ready geospatial datasets. ZOO had no tools for creating or validating this kind of metadata, and no easy way to connect with STAC catalogs, Earth Engine, or machine learning hubs such as HuggingFace and Kaggle. It also lacked workflows for checking the quality of training data or fixing common issues such as annotation errors and bias. This project aims to fill those gaps.
Deliverables
- Integration of GeoCroissant metadata support into OGC API – Processes
- Services for metadata generation, validation, and conversion from STAC, Earth Engine, HuggingFace, and Kaggle
- REST endpoints for metadata hosting and JSON-LD-based service chaining
- Implementation of Data-Centric AI workflows using Cleanlab for label noise and bias detection
- Interoperability tools for STAC, OGC TrainingDML, and MLCommons Croissant formats
- Full test suite, example datasets, and usage tutorials
- Comprehensive documentation and project wiki with deployment guides
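
To make the metadata generation, conversion, and validation deliverables more concrete, here is a minimal sketch of what converting a STAC Item into Croissant-style JSON-LD might look like. It is an illustration under stated assumptions, not the project's actual implementation: the function names `stac_item_to_croissant` and `validate`, the `REQUIRED_KEYS` check, the abbreviated `@context`, and the sample item are all hypothetical, and GeoCroissant-specific properties are not shown.

```python
import json

# Abbreviated JSON-LD context (assumption): only the schema.org and
# MLCommons Croissant namespaces, not the full Croissant context.
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}

# Hypothetical minimal set of keys this sketch checks for.
REQUIRED_KEYS = ("@context", "@type", "name", "distribution")


def stac_item_to_croissant(item: dict) -> dict:
    """Map a STAC-Item-like dict to a minimal Croissant-style JSON-LD dict."""
    # Each STAC asset becomes one file entry in the distribution list.
    distribution = [
        {
            "@type": "cr:FileObject",
            "@id": key,
            "name": key,
            "contentUrl": asset["href"],
            "encodingFormat": asset.get("type", "application/octet-stream"),
        }
        for key, asset in item.get("assets", {}).items()
    ]
    doc = {
        "@context": CONTEXT,
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": item["id"],
        "distribution": distribution,
    }
    bbox = item.get("bbox")
    if bbox and len(bbox) == 4:
        # STAC orders a 2D bbox as [west, south, east, north];
        # a schema.org GeoShape box is "south west north east".
        west, south, east, north = bbox
        doc["spatialCoverage"] = {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"},
        }
    return doc


def validate(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    for index, file_object in enumerate(doc.get("distribution", [])):
        if not file_object.get("contentUrl"):
            problems.append(f"distribution[{index}] has no contentUrl")
    return problems


# Made-up sample item, for illustration only.
sample_item = {
    "id": "example-scene",
    "bbox": [72.8, 18.9, 73.0, 19.3],
    "assets": {
        "visual": {"href": "https://example.com/visual.tif", "type": "image/tiff"},
    },
}

if __name__ == "__main__":
    croissant = stac_item_to_croissant(sample_item)
    print(json.dumps(croissant, indent=2))
    print(validate(croissant))
```

Returning a list of problems from `validate`, rather than raising on the first one, lets a service report every issue in a single response. A real validator would check far more than these four keys.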
Detailed Proposal
Check out the full project proposal: Detailed Proposal Link (GitHub Wiki)
Participants
| Title | Name | GitHub Handle |
|---|---|---|
| 1st Mentor | Chetan Mahajan | @cOsprey |
| 2nd Mentor | Gérald Fenoy | @gfenoy |
| Student Developer | Harsh Shinde | @HarshShinde0 |