Training Sets

This page provides links and guides to data for training and testing algorithms, and to other available data that may be of interest for astrobiological applications.

General Purpose Training Sets and Materials

Overview and guide to training data principles and pitfalls.

Google’s Dataset Search tool (akin to Google Scholar).

Large repository of many different types of large training sets used in industry, from images to news articles to robotics.

Large dataset repository at Kaggle.

The famous(!) MNIST handwritten digit image training set.
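As a starting point for working with MNIST, the sketch below reads the IDX binary layout in which the original files are commonly distributed: a big-endian header (magic number 2051 for images, 2049 for labels) followed by unsigned bytes. The function names `parse_idx_images` and `parse_idx_labels` are hypothetical, and the buffers here are tiny synthetic stand-ins rather than real MNIST data. It assumes the file has already been decompressed (for example with `gzip.open`).

```python
import struct

def parse_idx_images(raw: bytes):
    """Parse an IDX image buffer into a list of images (each a list of pixel rows)."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number for images: {magic}")
    pixels = raw[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("pixel data length does not match header")
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        # Split the flat byte run into rows of `cols` pixel values (0-255).
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images

def parse_idx_labels(raw: bytes):
    """Parse an IDX label buffer into a list of integer labels."""
    # Header: two big-endian unsigned 32-bit integers.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number for labels: {magic}")
    labels = list(raw[8:8 + count])
    if len(labels) != count:
        raise ValueError("label data length does not match header")
    return labels

# Synthetic example: two 2x3 "images" and two labels (not real MNIST data).
image_bytes = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
label_bytes = struct.pack(">II", 2049, 2) + bytes([7, 3])

images = parse_idx_images(image_bytes)
labels = parse_idx_labels(label_bytes)
```

Checking the magic number and the payload length before use catches truncated or mislabeled files early, which is one small part of the curation the page recommends.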

Research Datasets

Warning: All training data must be carefully curated; the datasets linked here may or may not be ‘ready to use’.
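What "carefully curated" means depends on the dataset, but a few basic checks apply almost everywhere: missing values, exact duplicates, and label balance. The sketch below is one minimal way to run them, assuming the data has been loaded as a list of dictionaries. The function name `audit_records`, the field names, and the sample records are all hypothetical.

```python
from collections import Counter

def audit_records(records, label_key="label"):
    """Return simple curation statistics for a list of dict records."""
    seen = set()
    duplicates = 0
    missing = 0
    labels = Counter()
    for rec in records:
        # Count records in which any field is None or an empty string.
        if any(v is None or v == "" for v in rec.values()):
            missing += 1
        # Build a hashable, order-independent key to spot exact duplicates.
        key = tuple(sorted((k, repr(v)) for k, v in rec.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        labels[rec.get(label_key)] += 1
    return {
        "total": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "label_counts": dict(labels),
    }

# Hypothetical sample records, for illustration only.
sample = [
    {"mass": 1.2, "label": "biotic"},
    {"mass": 1.2, "label": "biotic"},    # exact duplicate of the first record
    {"mass": None, "label": "abiotic"},  # missing measurement
    {"mass": 3.4, "label": "abiotic"},
]

report = audit_records(sample)
print(report)
```

A report like this does not make a dataset ready to use, but it flags the most common problems before any training begins.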

The NASA Astrobiology Habitable Environments Database (AHED) has a growing archive of data for a wide range of astrobiological purposes.

The NASA Exoplanet Archive is a major repository of data for detecting and characterizing stellar and planetary properties. It includes data from the Kepler and TESS missions.

Training data and code for agnostic molecular biosignature research, from the project reported by Cleaves et al. (2023).

Dataset of QSI biosignatures from terrestrial Mars analogue systems.

The NASA GeneLab Open Science Data Repository contains a wealth of data from space bioscience research.