Data Processing Overview
GeneLab is home to a variety of omics data types (epigenomics, genomics, transcriptomics, proteomics, metabolomics, etc.) derived from model organisms that have been exposed to the space environment, either through spaceflight or on Earth using space analogs. To make these precious data accessible to a broad audience, it is crucial to organize and provide all associated metadata necessary to understand the experimental design, the details of each study organism, how samples were isolated and prepared for omics data generation, and any nuances that arose during the course of each study. This process is known as data and metadata curation. Additionally, since omics data in their raw form can typically be interpreted only by specialized scientists known as bioinformaticians, it is necessary to process raw omics data into higher level data, such as gene expression changes, and to provide a means of visualizing these data. Thus, GeneLab is equipped with a data processing team composed of both curators and bioinformaticians to overcome the accessibility issues associated with space-relevant omics data.
GeneLab Curation
Metadata, the information used to describe and characterize a set of data, are essential for data interpretation. Curation (more specifically, biocuration in the context of biological data) is the process of translating and integrating the biological information associated with experimental data so that the data can be understood, interrogated, analyzed manually or computationally, interpreted, and accessed more easily. With the exponential growth of biological data generation and advances in computational technologies, it is necessary to organize and present all the essential information relevant to a dataset in both human- and machine-readable formats. The meticulous curation and organization of data hosted on GeneLab allows users to easily navigate the datasets and identify the key information necessary for accurate interpretation of results. It also enables users to reuse the data, thereby generating new knowledge about how terrestrial biology is modified by the space environment. GeneLab’s curation objectives are guided by NASA’s Open-Source Science Initiative and the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles.
Data and metadata curation begins with establishing normalization standards based on input from subject matter experts (SMEs) and Analysis Working Group (AWG) members. At GeneLab, such standardization efforts are applied to three main categories: 1) sample level metadata for specific organisms, 2) study level metadata for various study types such as spaceflight, parabolic flight, radiation, altered gravity studies, etc., and 3) assay level metadata for specific assay types such as RNAseq, metagenomics, amplicon sequencing, spatial transcriptomics, etc. The curation team employs community-developed controlled vocabularies and ontologies accepted by the larger scientific community and the Open Biological and Biomedical Ontology (OBO) Foundry. This type of semantic annotation supports the integration of background knowledge, facilitating data discovery and knowledge synthesis. Additionally, GeneLab leverages the Investigation-Study-Assay (ISA) framework which provides rich descriptions of the experimental metadata, enabling reproducibility and reusability. However, considering the nuanced nature of space biology, sometimes terminology must be extended beyond OBO. The addition of new ontology terms is done in collaboration with GeneLab and ALSDA AWGs, and the International Standards for Space Omics Processing (ISSOP) consortium in a manner that supports controlled integration between datasets and metadata sources.
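The Investigation-Study-Assay hierarchy mentioned above can be pictured as nested records: an investigation groups studies, and each study groups assays. The following is a minimal illustrative sketch only, not GeneLab's schema or the ISA tooling API. All class names, field names, and identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    # Hypothetical fields: what was measured and with which technology.
    measurement_type: str
    technology_type: str
    sample_names: list = field(default_factory=list)


@dataclass
class Study:
    # Hypothetical fields: a study carries its factors and its assays.
    identifier: str
    title: str
    factors: list = field(default_factory=list)
    assays: list = field(default_factory=list)


@dataclass
class Investigation:
    identifier: str
    studies: list = field(default_factory=list)

    def assay_types(self):
        """Return the distinct measurement types across all studies, sorted."""
        return sorted({a.measurement_type for s in self.studies for a in s.assays})


# Usage with made-up identifiers and values.
study = Study(
    identifier="EXAMPLE-STUDY-1",
    title="Hypothetical spaceflight study",
    factors=["Spaceflight"],
    assays=[
        Assay("transcription profiling", "RNAseq", ["sample_1", "sample_2"]),
        Assay("metagenomic profiling", "metagenomics", ["sample_3"]),
    ],
)
investigation = Investigation(identifier="EXAMPLE-INV-1", studies=[study])
print(investigation.assay_types())
```

Keeping sample, study, and assay information in separate but linked records mirrors the three metadata categories described above.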
Space-related omics datasets that meet GeneLab’s standards for metadata quality follow a standard curation pipeline that includes programmatic raw data quality checks and manual review by SMEs. Additionally, the curation team (in collaboration with GeneLab’s data systems team) is extending these guidelines to normalize data and metadata associated with existing datasets in the Open Science Data Repository (OSDR). This work uses a phase-based approach to fill in missing information, remove duplicated fields, and harmonize the differing terms used for the same factors and table column names. All these metadata normalization efforts enhance search capability within the OSDR and allow for more efficient and complete cross-dataset analyses, further enabling the generation of new knowledge from space-relevant studies.
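The three normalization steps described above (filling in missing information, removing duplicated fields, and harmonizing terms) can be sketched as follows. This is an illustration under stated assumptions, not the curation team's actual implementation. The synonym tables, column names, and the "Not Provided" placeholder are all hypothetical.

```python
# Hypothetical synonym tables; the real controlled vocabularies are not listed here.
COLUMN_SYNONYMS = {
    "organism name": "Organism",
    "species": "Organism",
    "flight status": "Spaceflight",
    "condition": "Spaceflight",
}
TERM_SYNONYMS = {
    "flt": "Space Flight",
    "flight": "Space Flight",
    "gc": "Ground Control",
    "ground": "Ground Control",
}
MISSING = "Not Provided"  # hypothetical placeholder for absent values


def normalize_record(record):
    """Return a copy of one sample record with canonical column names and terms."""
    normalized = {}
    for raw_name, raw_value in record.items():
        # Map differing column names for the same factor onto one canonical name.
        name = COLUMN_SYNONYMS.get(raw_name.strip().lower(), raw_name.strip())
        # Map differing terms for the same value onto one canonical term.
        value = (raw_value or "").strip()
        value = TERM_SYNONYMS.get(value.lower(), value)
        # Duplicated fields collapse onto one name; keep the first non-empty value.
        if not normalized.get(name):
            normalized[name] = value
    # Flag missing information explicitly instead of leaving blanks.
    return {name: value or MISSING for name, value in normalized.items()}


record = {
    "Flight Status": "FLT",
    "condition": "",
    "organism name": "Mus musculus",
    "Strain": None,
}
print(normalize_record(record))
```

Once every dataset uses the same column names and terms, a single query can match the same factor across studies, which is what makes cross-dataset search and analysis practical.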
To aid curation automation, metadata normalization standards have been built into our online submission portal, the Biological Data Management Environment (BDME). The BDME provides a guided, step-by-step process for submitters to populate the essential sample and assay level metadata for their study, using the built-in search for preferred ontology classes when applicable. A validation check has also been integrated to inform submitters when their study meets the validation criteria, before they submit it for curation review. The study is publicly released once all data and metadata fields are finalized.
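A pre-submission validation check of this kind could look like the sketch below. It is illustrative only and does not reflect the BDME's actual criteria or code. The required field names are hypothetical, and the preferred assay terms are simply the assay types named earlier in this overview.

```python
# Hypothetical required fields; the BDME's real validation criteria are not given here.
REQUIRED_FIELDS = ("Organism", "Material Type", "Assay Type")

# Preferred terms per field; values borrowed from the assay types named in this overview.
PREFERRED_TERMS = {
    "Assay Type": {
        "RNAseq",
        "metagenomics",
        "amplicon sequencing",
        "spatial transcriptomics",
    },
}


def validate_submission(metadata):
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    for field_name in REQUIRED_FIELDS:
        # Treat absent, None, and whitespace-only values as missing.
        if not (metadata.get(field_name) or "").strip():
            problems.append(f"missing required field: {field_name}")
    for field_name, allowed in PREFERRED_TERMS.items():
        value = metadata.get(field_name)
        if value and value not in allowed:
            problems.append(f"{field_name}: '{value}' is not a preferred term")
    return problems


submission = {"Organism": "Mus musculus", "Assay Type": "RNA-seq"}
for problem in validate_submission(submission):
    print(problem)
```

Returning the full list of problems, rather than stopping at the first one, lets a submitter fix everything in a single pass before requesting curation review.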
GeneLab Data Processing
To ensure omics data hosted on the GeneLab Data Repository are accessible to and interpretable by a broad scientific audience, a small group of bioinformaticians at GeneLab work in collaboration with the scientific community, via our AWG members, to develop and standardize consensus processing pipelines for each omics data type. Standardizing pipelines is necessary to minimize variation in data processing, which enables the integration of data from the diverse array of spaceflight and analog experiments hosted on GeneLab. Once baselined, the processing pipelines are made publicly available through the GeneLab Data Processing GitHub Repository. All pipelines are created using open-source software and are packaged into workflows to ensure transparency and reproducibility. Instructions for installing and running GeneLab workflows are also available on the GitHub repository. Since the bioinformatics field is continuously advancing, the data processing team revisits each baselined pipeline annually and updates pipeline versions as necessary to ensure the pipelines remain up to date with the scientific community’s best practices.
Raw data from each dataset hosted on GeneLab are processed through the GeneLab standard processing pipeline for the respective omics data type. All data processing is performed on the GeneLab internal compute cluster, which hosts bioinformatics software including the GeneLab standard pipelines and their respective workflows. Processed data products from each step of the pipelines are published along with the raw data for each GeneLab dataset hosted on the OSDR, under ‘Study Files’, thereby enhancing accessibility of these data to a broader audience. Furthermore, machine-readable files are generated to enable visualization of the processed data through the GeneLab data visualization portal. Interactive data visualization plots and tables allow users at any level to interrogate GeneLab data and generate new hypotheses about the effects of spaceflight on terrestrial biology.
Similar to the services provided by the state-of-the-art Sample Processing Laboratory (SPL), GeneLab also provides data processing services to NASA-funded principal investigators (PIs). Using these services allows eligible PIs to take advantage of GeneLab’s bioinformatics expertise and to harmonize their data by using GeneLab’s standard pipelines when applicable. Interested PIs can request GeneLab’s data processing service by filling out and submitting the GeneLab sequencing and data processing quote request form.
To date, GeneLab has baselined pipelines for the following omics data types (click on a data type link to learn more):