A searchable catalogue for finding human omics datasets with Australian relevance.
The portal indexes public metadata from existing repositories, starting with EGA and dbGaP, and presents those records through a common search and filtering interface based on Overture.bio. In future, we plan to add other repositories, as well as publications that describe controlled access to cohorts that sit outside EGA and dbGaP.
It helps researchers find datasets, compare records across repositories, understand access constraints, and decide whether a dataset is worth pursuing, or whether the access process is too onerous or the data does not contain what they need.
A particular focus is showing which countries control and participate in each dataset.
Human omics datasets are spread across archives, repository platforms, disease-specific data commons, cohort portals, and institutional pages. A researcher may know that a relevant dataset exists but still struggle to find it or to understand how access works.
In an Australian context, we have three research questions of interest:
The first audience is researchers looking for controlled-access human genomic or multi-omics datasets that may be relevant to Australian research questions.
The portal is also useful for BioCommons, institutional data stewards, research infrastructure teams, and policy groups that need to understand where human omics datasets are recorded, how they are governed, and whether Australian samples, institutions, or access committees are involved.
The companion omics_portal Python package harvests and normalises metadata for the catalogue.
The current implementation is EGA-first. It mirrors public EGA Metadata API records into a local JSONL store and preserves the raw payload, source endpoint, accession, retrieval timestamp, HTTP status, and payload hash. Derived catalogue rows are generated from those raw records rather than replacing them.
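As a rough illustration of the raw-record store described above, the sketch below wraps one API response in a provenance envelope and appends it to a JSONL file. The field names, the `make_raw_record` and `append_jsonl` helpers, and the placeholder endpoint and accession are assumptions made for this example; they are not the actual omics_portal schema or API.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def make_raw_record(accession, source_endpoint, http_status, payload):
    """Wrap one API response in a provenance envelope (hypothetical field names)."""
    # Serialise deterministically so the same payload always yields the same hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "accession": accession,
        "source_endpoint": source_endpoint,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "http_status": http_status,
        "payload_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "payload": payload,  # raw payload stored unchanged
    }


def append_jsonl(path, record):
    """Append one record to the store as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


# Example usage with placeholder values (not a real endpoint or accession):
# record = make_raw_record(
#     accession="EGAD00000000000",
#     source_endpoint="https://example.org/datasets/EGAD00000000000",
#     http_status=200,
#     payload={"title": "Example dataset"},
# )
# append_jsonl("raw_records.jsonl", record)
```

Because the store is append-only and each line carries its own hash and timestamp, derived rows can be regenerated at any time without touching the mirrored records.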
The first indexed EGA record types are:
Catalogue generation then adds conservative derived fields for discovery. These include identifiers, titles, summaries, linked studies, omics and assay terms, phenotype and disease candidates, country-role annotations, access conditions, DAC information, and source provenance.
The enrichment approach is deliberately conservative:
The aim is not to claim perfect curation. The aim is to expose useful discovery fields while making the evidence trail visible.
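One way to keep the evidence trail visible is to store each derived value together with the source field and text that produced it. The sketch below does this for country mentions only. The `country_candidates` function, the `COUNTRY_TERMS` vocabulary, the payload field names, and the matching rule are all assumptions made for illustration; the sketch does not assign country roles and is not the project's actual enrichment logic.

```python
import re

# Hypothetical, deliberately tiny vocabulary; a real catalogue would use a curated list.
COUNTRY_TERMS = {"Australia": ["Australia", "Australian"]}


def country_candidates(accession, payload, fields=("title", "description")):
    """Return candidate country mentions, each with the evidence that produced it."""
    candidates = []
    for field in fields:
        text = payload.get(field) or ""  # treat missing or null fields as empty
        for country, terms in COUNTRY_TERMS.items():
            for term in terms:
                # Whole-word match only, so "Austrian" never matches "Australia".
                match = re.search(rf"\b{re.escape(term)}\b", text)
                if match:
                    candidates.append({
                        "accession": accession,
                        "country": country,
                        "status": "candidate",  # never asserted as confirmed
                        "evidence_field": field,
                        "evidence_text": match.group(0),
                    })
                    break  # one candidate per field and country is enough
    return candidates
```

Each result is labelled a candidate and records where the match came from, so a curator can check the source text before relying on the annotation.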
The portal is being prototyped with Overture components where they fit the discovery use case.
| Software | Role in this project |
|---|---|
| Maestro | Indexes catalogue metadata into Elasticsearch. |
| Arranger | Provides the search API and faceted search model. |
| Stage | Provides the React-based portal scaffold and user interface. |
| omics_portal | Harvests, stores, normalises, and exports repository metadata for ingestion. |
Overture has been used to build a number of other portals, including: