Australian Human Omics Data Discovery Portal

A searchable catalogue for finding human omics datasets with Australian relevance.

The portal indexes public metadata from existing repositories, starting with EGA and dbGaP, and presents those records through a common search and filtering interface based on Overture.bio. In future, we will also add other repositories, as well as publications that describe controlled-access cohorts that sit outside EGA and dbGaP.

It helps researchers find datasets, compare records across repositories, understand access constraints, and decide whether a dataset is worth pursuing, or whether the access process is too onerous or the data does not contain what they need.

A strong focus is on showing which countries control these datasets and which countries participate in them.

Why this project exists

Human omics datasets are spread across archives, repository platforms, disease-specific data commons, cohort portals, and institutional pages. A researcher may know that a relevant dataset exists, but still struggle to find it or to understand how complex access will be.

In an Australian context, we have three research questions of interest:

Where do Australian controlled-access human genomics datasets end up?
How many datasets are shared through repositories and how many through other means (found via publication)?
How many genomic datasets are produced but not shared, or are shared but not maintained and so become inaccessible?

Who this is for

The first audience is researchers looking for controlled-access human genomic or multi-omics datasets that may be relevant to Australian research questions.

The portal is also useful for BioCommons, institutional data stewards, research infrastructure teams, and policy groups that need to understand where human omics datasets are recorded, how they are governed, and whether Australian samples, institutions, or access committees are involved.

Current methodology

The companion omics_portal Python package harvests and normalises metadata for the catalogue.

The current implementation is EGA-first. It mirrors public EGA Metadata API records into a local JSONL store and preserves the raw payload, source endpoint, accession, retrieval timestamp, HTTP status, and payload hash. Derived catalogue rows are generated from those raw records rather than replacing them.
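The raw-record store described above can be sketched as follows. This is a minimal illustration, not the omics_portal implementation: the function names, the field names (such as `payload_sha256` and `retrieved_at`), the placeholder accession, and the example.org endpoint are all assumptions made for the example, and no real EGA request is issued.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def make_raw_record(accession, source_endpoint, http_status, payload):
    """Wrap an unmodified payload with provenance fields (names are illustrative)."""
    # Hash a canonical serialisation so the same payload always gives the same digest,
    # regardless of key order in the response.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "accession": accession,
        "source_endpoint": source_endpoint,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "http_status": http_status,
        "payload_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "payload": payload,  # raw payload is stored as received
    }


def append_jsonl(path, record):
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl(path):
    """Read every non-empty line of a JSONL file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Placeholder accession and endpoint, used only to exercise the functions.
    record = make_raw_record(
        "EGAD00000000000",
        "https://example.org/datasets/EGAD00000000000",
        200,
        {"title": "Example dataset"},
    )
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "raw.jsonl"
        append_jsonl(store, record)
        print(len(read_jsonl(store)), "record(s) stored")
```

Storing the hash alongside the payload makes it cheap to detect when a repository record has changed between harvests without comparing full payloads.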

The first indexed EGA record types are:

studies
datasets
DACs
policies
study-to-dataset links
dataset-to-policy links
policy-to-DAC links
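As a rough sketch of why the link records matter, the three link types can be chained to answer questions such as "which DAC governs this dataset?". The identifiers below are placeholders rather than real accessions, and the helper names are hypothetical, not part of omics_portal.

```python
from collections import defaultdict


def index_links(pairs):
    """Build a one-to-many lookup from (left, right) accession pairs."""
    index = defaultdict(set)
    for left, right in pairs:
        index[left].add(right)
    return dict(index)


def dacs_for_dataset(dataset_id, dataset_to_policy, policy_to_dac):
    """Follow dataset -> policy -> DAC links and return the DACs found."""
    dacs = set()
    for policy_id in dataset_to_policy.get(dataset_id, set()):
        dacs.update(policy_to_dac.get(policy_id, set()))
    return sorted(dacs)


def studies_for_dataset(dataset_id, study_dataset_pairs):
    """Reverse the study-to-dataset links to list the studies behind a dataset."""
    return sorted(study for study, dataset in study_dataset_pairs if dataset == dataset_id)


# Placeholder link records, one list per link type.
STUDY_DATASET = [("STUDY-A", "DATASET-A"), ("STUDY-B", "DATASET-A")]
DATASET_POLICY = [("DATASET-A", "POLICY-A")]
POLICY_DAC = [("POLICY-A", "DAC-A")]

if __name__ == "__main__":
    dataset_to_policy = index_links(DATASET_POLICY)
    policy_to_dac = index_links(POLICY_DAC)
    print(dacs_for_dataset("DATASET-A", dataset_to_policy, policy_to_dac))
    print(studies_for_dataset("DATASET-A", STUDY_DATASET))
```

A dataset with no policy link simply yields an empty list, which keeps missing governance information visible instead of hiding it.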

Catalogue generation then adds conservative derived fields for discovery. These include identifiers, titles, summaries, linked studies, omics and assay terms, phenotype and disease candidates, country-role annotations, access conditions, DAC information, and source provenance.
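One way to picture this step is a function that reads a raw record and returns a separate catalogue row, leaving the raw record untouched. The sketch below covers only a few of the derived fields, and it assumes a raw-record shape and payload keys (`title`, `description`) that are illustrative guesses, not the actual EGA schema or the omics_portal data model.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CatalogueRow:
    """A small, hypothetical subset of the derived discovery fields."""
    accession: str
    title: str
    summary: str
    linked_studies: list = field(default_factory=list)
    source_repository: str = "EGA"
    source_payload_sha256: str = ""  # pointer back to the raw record


def derive_row(raw_record, linked_studies=()):
    """Build a catalogue row from a raw record without modifying the raw record."""
    payload = raw_record["payload"]
    return CatalogueRow(
        accession=raw_record["accession"],
        title=(payload.get("title") or "").strip(),
        summary=(payload.get("description") or "").strip(),
        linked_studies=sorted(linked_studies),
        source_payload_sha256=raw_record["payload_sha256"],
    )


if __name__ == "__main__":
    # Placeholder raw record; the hash value is a dummy string.
    raw = {
        "accession": "DATASET-A",
        "payload_sha256": "0" * 64,
        "payload": {"title": "  Example dataset ", "description": "Example summary."},
    }
    before = copy.deepcopy(raw)
    row = derive_row(raw, linked_studies={"STUDY-B", "STUDY-A"})
    print(row.title, row.linked_studies, raw == before)
```

Carrying the payload hash on each row means any derived value can be traced back to the exact raw record it came from.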

The enrichment approach is deliberately conservative:

parse structured fields directly where the repository provides them
preserve the original source value before adding a normalised value
map disease and phenotype terms through a local ontology-backed gazetteer where possible
extract assay, platform, and country-role candidates from controlled rules
attach evidence, method, confidence, and review flags
keep uncertain or conflicting records available for manual review
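The principles above can be sketched for one case, country-role candidates. This is a simplified illustration under stated assumptions: the two regular-expression rules, the `Annotation` fields, the confidence labels, and the rule that more than one matched country triggers review are all hypothetical choices for the example, not the portal's actual rule set.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    value: str          # normalised value, here a two-letter country code
    role: str           # for example "dac_country" (illustrative role name)
    source_field: str   # where the text came from
    source_value: str   # original text, preserved unchanged
    evidence: str       # the exact span that triggered the rule
    method: str
    confidence: str
    needs_review: bool


# A deliberately tiny rule set; a real one would be larger and curated.
COUNTRY_PATTERNS = {
    "AU": re.compile(r"\bAustralia(n)?\b", re.IGNORECASE),
    "GB": re.compile(r"\b(United Kingdom|UK)\b", re.IGNORECASE),
}


def country_candidates(text, role, source_field):
    """Return one annotation per matched country; flag conflicts for manual review."""
    matches = []
    for code, pattern in COUNTRY_PATTERNS.items():
        found = pattern.search(text)
        if found:
            matches.append((code, found.group(0)))
    conflicting = len(matches) > 1  # several countries for one role is treated as uncertain
    return [
        Annotation(
            value=code,
            role=role,
            source_field=source_field,
            source_value=text,
            evidence=span,
            method="regex_rule",
            confidence="low" if conflicting else "medium",
            needs_review=conflicting,
        )
        for code, span in matches
    ]


if __name__ == "__main__":
    for note in country_candidates("Joint UK and Australian cohort", "participant_country", "description"):
        print(note.value, note.evidence, note.confidence, note.needs_review)
```

Conflicting candidates are kept and flagged, not dropped, so a reviewer can see both the competing values and the text that produced them.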

The aim is not to claim perfect curation. The aim is to expose useful discovery fields while making the evidence trail visible.

Portal software

The portal is being prototyped with Overture components where they fit the discovery use case.

Software	Role in this project
Maestro	Indexes catalogue metadata into Elasticsearch.
Arranger	Provides the search API and faceted search model.
Stage	Provides the React-based portal scaffold and user interface.
omics_portal	Harvests, stores, normalises, and exports repository metadata for ingestion.

This platform has been used by a number of other portals, including: