Data Sources
ICg-CaST defaults to synthetic data. Optional data-source adapters are local-file loaders for future calibration or validation workflows. They do not download files, call remote APIs, or handle controlled-access data.
For the calibration prototype that consumes these adapters to override pieces of the synthetic simulator and theory graph, see docs/calibration.md.
Each adapter returns a DataSourceBundle with:
- data: loaded pandas.DataFrame.
- provenance: source name, version, retrieval date, local file, license notes, citation notes, and SHA-256 digest.
- metadata: adapter-specific row counts and validation details.
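
The three-part bundle described above can be pictured with a minimal sketch. The class names `ProvenanceSketch` and `BundleSketch`, their field names, and the helper `sha256_of_file` are hypothetical and chosen for illustration; the package's actual DataSourceBundle definition may differ.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass(frozen=True)
class ProvenanceSketch:
    """Hypothetical provenance record; field names are assumptions."""
    source_name: str
    version: str
    retrieval_date: str
    local_file: str
    license_notes: str
    citation_notes: str
    sha256: str


@dataclass
class BundleSketch:
    """Illustrative stand-in for the bundle an adapter returns."""
    data: Any  # a pandas.DataFrame in the package; left untyped here
    provenance: ProvenanceSketch
    metadata: Dict[str, Any] = field(default_factory=dict)


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a local file in chunks and return a lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing in chunks keeps memory use flat for large exports, and `hexdigest()` already returns lowercase hex, which matches the digest form described in the Provenance section.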
Supported Stub Adapters
| Source | Adapter | Expected input | Package behavior |
|---|---|---|---|
| AOP-Wiki | | CSV/TSV/JSON edge or node export | Loads local file; can map |
| EPA AOP-DB | | CSV/TSV/JSON/SQLite-derived table | Loads local table and records provenance. |
| EPA ToxCast/CompTox | | Local summary table plus optional mapping table | Keeps assay summaries separate from KCC mapping metadata. |
| COSMIC Mutational Signatures | | Local 96-channel SBS matrix | Validates 96 contexts and non-negative signature columns. |
| SigProfilerExtractor | | Local activity table | Loads exported activity matrix; no optional dependency import. |
| LINCS L1000 | | Local signature table plus optional metadata | Records perturbagen metadata row counts when provided. |
| CTD | | Local CTD export table | Loads local curated relationship table. |
| NCI GDC/TCGA/CPTAC | | Local manifest or open metadata table | Warns against committing controlled-access data. |
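
As an illustration of the COSMIC row in the table, the check for 96 contexts and non-negative signature columns might look like the sketch below. It works on a plain mapping rather than a DataFrame to stay dependency-free. The function names and the `A[C>A]A` label format are assumptions, not the adapter's confirmed behavior.

```python
from itertools import product
from typing import List, Mapping

BASES = "ACGT"
# Six pyrimidine-centred substitution classes used for SBS channels.
SUBSTITUTIONS = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def expected_sbs96_contexts() -> List[str]:
    """Return the 96 trinucleotide labels, e.g. 'A[C>A]A' (assumed format)."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, BASES)
    ]


def validate_sbs_matrix(matrix: Mapping[str, Mapping[str, float]]) -> List[str]:
    """Collect problems for a context -> {signature: weight} mapping."""
    problems: List[str] = []
    expected = set(expected_sbs96_contexts())
    found = set(matrix)
    missing = sorted(expected - found)
    extra = sorted(found - expected)
    if missing:
        problems.append(f"missing contexts: {len(missing)}")
    if extra:
        problems.append(f"unexpected contexts: {len(extra)}")
    for context, row in matrix.items():
        for signature, weight in row.items():
            if weight < 0:
                problems.append(f"negative value for {signature} at {context}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a user-supplied file at once.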
Provenance
Use materials/provenance_template.json as the minimum payload shape for any
user-supplied source. Runtime validation in icg_cast.data_sources enforces
the same required fields, a local-file path, and a lowercase SHA-256 digest.
The machine-readable schema is
materials/calibration_provenance.schema.json.
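
The three runtime checks named above (required fields, a local-file path, a lowercase SHA-256 digest) can be sketched as follows. This is not the code in icg_cast.data_sources. The key names in `REQUIRED_FIELDS` are assumptions derived from the provenance fields listed earlier, and the URL test is a deliberately simple heuristic.

```python
import re
from typing import Any, List, Mapping

# Hypothetical key names; the real template may spell these differently.
REQUIRED_FIELDS = (
    "source_name",
    "version",
    "retrieval_date",
    "local_file",
    "license_notes",
    "citation_notes",
    "sha256",
)
# Exactly 64 lowercase hexadecimal characters.
SHA256_LOWER = re.compile(r"[0-9a-f]{64}")


def validate_provenance(record: Mapping[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    local_file = str(record.get("local_file", ""))
    if "://" in local_file:
        errors.append("local_file must be a local path, not a URL")
    digest = str(record.get("sha256", ""))
    if digest and not SHA256_LOWER.fullmatch(digest):
        errors.append("sha256 must be 64 lowercase hex characters")
    return errors
```

An uppercase digest is rejected rather than silently lowercased, so the stored value always matches what a tool such as `hashlib` would print.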
icg-cast calibrate writes calibration_provenance.json with
schema_version: "0.1", one provenance record per adapter, and optional
coefficient_updates when --apply-coefficients is used. The maintainer must
fill in source version, retrieval date, license terms, and citation before
using real data in analysis.
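
To make the file shape concrete, here is a minimal sketch of assembling such a payload. Only `schema_version: "0.1"` and the optional `coefficient_updates` key come from the description above. The `records` key name and the function `build_calibration_provenance` are hypothetical, and the authoritative layout is whatever materials/calibration_provenance.schema.json defines.

```python
import json
from typing import Any, Dict, Mapping, Optional, Sequence


def build_calibration_provenance(
    records: Sequence[Mapping[str, Any]],
    coefficient_updates: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a payload with one provenance record per adapter."""
    payload: Dict[str, Any] = {
        "schema_version": "0.1",
        # "records" is an assumed key name for the per-adapter entries.
        "records": [dict(record) for record in records],
    }
    # Included only when coefficient updates were actually applied.
    if coefficient_updates is not None:
        payload["coefficient_updates"] = dict(coefficient_updates)
    return payload


def to_json(payload: Mapping[str, Any]) -> str:
    """Serialize with stable key order so repeated runs diff cleanly."""
    return json.dumps(payload, indent=2, sort_keys=True)
```

Omitting `coefficient_updates` entirely, rather than writing an empty object, keeps a run without --apply-coefficients distinguishable from one that applied zero updates.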
Access and License Notes
Public availability does not mean unrestricted reuse. COSMIC, CTD, LINCS, EPA, GDC, and other resources may have distinct citation, license, and access terms. Controlled-access human genomic data must not be committed to this repository.