ICg-Bench
ICg-Bench is the package’s public causal benchmark: a versioned set of synthetic data-generating processes (DGPs) with full ground truth plus four scored tasks. The DGPs are synthetic on purpose — that is the only setting in biology where causal estimands can be evaluated exactly. Real-data calibration is handled separately through the data sources and calibration layers and is out of scope for benchmark scoring.
Source: src/icg_cast/benchmark/.
DGP variants
Registered in src/icg_cast/benchmark/dgp.py
and listed via icg-cast bench list. Each variant has a stable SHA-256 hash
derived from its dataclass fields, recorded on every leaderboard row for
reproducibility.
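As an illustration of how a hash can be derived from dataclass fields, here is a minimal sketch. The ToyVariant class, its fields, and the JSON serialization are assumptions made for this example; the real derivation in dgp.py may serialize fields differently, so these digests will not match the ones printed by icg-cast bench list.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyVariant:
    """Hypothetical stand-in for a registered DGP variant (not the real class)."""
    name: str
    coupling: str
    mask_fraction: float


def variant_hash(variant) -> str:
    """Return the SHA-256 hex digest of the variant's dataclass fields."""
    # Sorting keys makes the serialization independent of field order.
    payload = json.dumps(asdict(variant), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = ToyVariant("toy", "linear", 0.0)
b = ToyVariant("toy", "linear", 0.0)
c = ToyVariant("toy", "linear", 0.3)

print(variant_hash(a)[:12])                # short form, as stored on leaderboard entries
print(variant_hash(a) == variant_hash(b))  # identical fields give identical hashes
print(variant_hash(a) == variant_hash(c))  # changing any field changes the hash
```

Because the digest depends only on field values, two runs that report the same hash were generated from the same variant definition.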
| Variant | Description |
|---|---|
| | Linear KCC→state coupling, low host heterogeneity, discrete archetypes, full multi-omics observability. The easy baseline. |
| | Non-linear coupling, continuous KCC mixtures, high host heterogeneity. Stresses latent recovery. |
| | Non-linear coupling with random per-subject masking (30%) of transcriptomic and epigenomic modules. Stresses robustness to missing modalities. |
| | Linear KCC→state coupling with a non-linear, multiplicatively-interacting observation operator. Stresses stage-1 latent recovery while keeping stage-2 simple. |
| | |
| | Two simultaneous sign flips. Stress test for whether unconstrained recovery survives multiple prior errors. |
Four scored tasks
Implemented in src/icg_cast/benchmark/tasks.py.
Tasks are deliberately model-agnostic: task_risk_prediction only requires
predict_proba, while task_latent_recovery and
task_intervention_conformity additionally require predict_bottleneck /
intervene (i.e. an MB-CNet-shaped model).
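The capability split described above can be sketched with structural typing. The protocol names, the eligible_tasks helper, and the argument names of predict_proba and predict_bottleneck are assumptions for illustration; only the method names and the intervene(unit, scale) call come from the text.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RiskModel(Protocol):
    """Anything exposing predict_proba (argument name is an assumption)."""
    def predict_proba(self, X): ...


@runtime_checkable
class BottleneckModel(RiskModel, Protocol):
    """An MB-CNet-shaped model: risk plus bottleneck and intervention access."""
    def predict_bottleneck(self, X): ...
    def intervene(self, unit, scale): ...


class ConstantRisk:
    """Toy model that only predicts a constant risk."""
    def predict_proba(self, X):
        return [0.5 for _ in X]


def eligible_tasks(model) -> list:
    """Hypothetical helper: list the tasks a model's interface supports."""
    tasks = []
    if isinstance(model, RiskModel):
        tasks.append("task_risk_prediction")
    if isinstance(model, BottleneckModel):
        tasks += ["task_latent_recovery", "task_intervention_conformity"]
    return tasks


print(eligible_tasks(ConstantRisk()))
```

A runtime-checkable protocol only verifies that the methods exist, not their signatures or return shapes, so this kind of check is a coarse gate rather than a full contract.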
task_risk_prediction
Held-out discrimination plus calibration on a single variant. Returns
auroc, auprc, brier, mean_proba, event_rate.
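For readers unfamiliar with the calibration-side outputs, the following sketch computes brier, mean_proba, and event_rate using their standard definitions. The calibration_summary function is hypothetical, and whether tasks.py computes these quantities in exactly this way is an assumption.

```python
def calibration_summary(y_true, proba):
    """Standard-definition calibration numbers for binary outcomes."""
    n = len(y_true)
    # Brier score: mean squared difference between predicted risk and outcome.
    brier = sum((p - y) ** 2 for p, y in zip(proba, y_true)) / n
    return {
        "brier": brier,
        "mean_proba": sum(proba) / n,   # average predicted risk
        "event_rate": sum(y_true) / n,  # observed fraction of events
    }


summary = calibration_summary([0, 0, 1, 1], [0.1, 0.3, 0.6, 0.8])
print(summary)
```

Comparing mean_proba with event_rate gives a quick check of calibration-in-the-large: a well-calibrated model predicts, on average, the observed event frequency.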
task_latent_recovery
Per-state R² between the model’s predicted bottleneck and the true qAOP
state (which is known because the DGP wrote it). Returns one r2__<state>
entry per state plus r2_mean and n_states. This task is what makes the
benchmark causal rather than purely predictive: a model can hit a high
AUROC without recovering the latent state, and that asymmetry is exactly the
contribution ICg-Bench tries to make visible.
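A minimal sketch of the per-state R² computation follows. The state names and the latent_recovery helper are hypothetical; only the output keys (r2__<state>, r2_mean, n_states) are taken from the description above.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination; undefined when y_true is constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def latent_recovery(true_states, pred_states):
    """Hypothetical helper: one R² per named state plus the summary fields."""
    out = {
        f"r2__{name}": r2_score(true_states[name], pred_states[name])
        for name in true_states
    }
    out["r2_mean"] = sum(out.values()) / len(true_states)
    out["n_states"] = len(true_states)
    return out


true_states = {"state_a": [0.0, 1.0, 2.0, 3.0], "state_b": [0.0, 1.0, 2.0, 3.0]}
pred_states = {
    "state_a": [0.0, 1.0, 2.0, 3.0],  # perfectly recovered
    "state_b": [1.5, 1.5, 1.5, 1.5],  # predicts only the mean
}
result = latent_recovery(true_states, pred_states)
print(result)
```

The second state shows the asymmetry the task is designed to expose: a bottleneck that collapses to the mean scores an R² of 0 regardless of how well the downstream risk head discriminates.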
task_intervention_conformity
For each do_* intervention in the registry, the model is forced through
the intervention via intervene(unit, scale), and the mean change in
predicted risk is compared against the expected sign. The CLI’s bench-run
reports three sub-metrics so prior-fragility is exposed:
- prior_conformity: fraction matching the structural-prior direction.
- dgp_conformity: fraction matching the true DGP direction (this differs from the prior in the misspecified_signs* variants).
- responsive_dgp_conformity: fraction matching the DGP direction with |Δ risk| ≥ responsive_threshold (default 0.005). This closes the loophole where a sign constraint can drive a coefficient to zero, registering a technically correct sign without any intervention response.
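The three sub-metrics can be sketched as follows. The intervention names, the input dictionaries, and the conformity_metrics function are hypothetical; treating a zero change as matching neither direction is also an assumption of this sketch.

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def conformity_metrics(deltas, prior_signs, dgp_signs, responsive_threshold=0.005):
    """deltas maps each intervention to its mean change in predicted risk."""
    names = list(deltas)
    n = len(names)
    prior_hits = sum(sign(deltas[k]) == prior_signs[k] for k in names)
    dgp_hits = sum(sign(deltas[k]) == dgp_signs[k] for k in names)
    # Count only responses that are both correctly signed and large enough.
    responsive_hits = sum(
        sign(deltas[k]) == dgp_signs[k] and abs(deltas[k]) >= responsive_threshold
        for k in names
    )
    return {
        "prior_conformity": prior_hits / n,
        "dgp_conformity": dgp_hits / n,
        "responsive_dgp_conformity": responsive_hits / n,
    }


deltas = {"do_a": 0.04, "do_b": -0.02, "do_c": 0.001}
prior_signs = {"do_a": 1, "do_b": 1, "do_c": 1}  # the prior is wrong about do_b
dgp_signs = {"do_a": 1, "do_b": -1, "do_c": 1}
metrics = conformity_metrics(deltas, prior_signs, dgp_signs)
print(metrics)
```

In this toy case the model follows the true DGP on every intervention, yet do_c moves risk by only 0.001, so it counts toward dgp_conformity but not toward responsive_dgp_conformity.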
task_cross_host_generalization
Source vs. target AUROC under a host-susceptibility distribution shift.
Returns auroc_source, auroc_target, and transfer_gap = auroc_source - auroc_target, so a positive gap means discrimination degrades under the shift. Re-fitting is forbidden by the task contract.
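A small worked example of the transfer gap follows. The auroc implementation (pairwise comparison of positive and negative scores) and the cross_host_summary helper are illustrative assumptions, not the code in tasks.py.

```python
def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_host_summary(y_source, p_source, y_target, p_target):
    """Score one already-fitted model on both cohorts; no re-fitting happens here."""
    a_source = auroc(y_source, p_source)
    a_target = auroc(y_target, p_target)
    return {
        "auroc_source": a_source,
        "auroc_target": a_target,
        "transfer_gap": a_source - a_target,
    }


gap = cross_host_summary(
    [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9],  # source cohort: perfectly ranked
    [0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9],  # target cohort: one misordered pair
)
print(gap)
```

Here the source AUROC is 1.0 and the target AUROC is 0.75, giving a transfer_gap of 0.25.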
Scoring and the leaderboard
Aggregation is implemented in src/icg_cast/benchmark/scoring.py and src/icg_cast/benchmark/leaderboard.py.
Each BenchmarkResult is a single (variant, model, package_version) row.
score_summary emits five headline numbers plus a composite that is an
arithmetic mean of (auroc, r2_mean, conformity_score) whenever those are
finite. The composite is explicitly not the canonical metric and is
provided only for ranking convenience.
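One possible reading of the composite is sketched below. It assumes that non-finite components are skipped and the mean is taken over the remaining ones; scoring.py may instead leave the composite undefined unless all three are finite. The composite_score function name is hypothetical.

```python
import math


def composite_score(auroc, r2_mean, conformity_score):
    """Arithmetic mean of the finite components (assumed handling of NaN/inf)."""
    finite = [v for v in (auroc, r2_mean, conformity_score) if math.isfinite(v)]
    # With no finite component there is nothing to average.
    return sum(finite) / len(finite) if finite else math.nan


full = composite_score(0.8, 0.5, 0.9)          # all three components available
partial = composite_score(0.8, math.nan, 0.6)  # e.g. a model without a bottleneck
print(full, partial)
```

Under this reading, a model that cannot be scored on latent recovery is averaged over fewer components, which is one reason the composite is a ranking convenience and not a canonical metric.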
Leaderboard files are append-only CSV plus a full-history JSON
(leaderboard.csv / leaderboard.json). Every entry carries:
- schema_version (currently "0.1"),
- submitted_at (UTC ISO timestamp),
- variant_name + variant_hash (first 12 chars of the SHA-256 of the variant’s dataclass fields),
- model_name,
- package_version,
- per-task summary scores,
- free-form notes.
This means a leaderboard entry can be re-run from its CSV row alone.
The leaderboard reader and writers enforce schema_version == "0.1" at
runtime. Future schema changes should add explicit migrations in
migrate_leaderboard_entries; unsupported versions fail closed instead of
being silently reinterpreted.
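The fail-closed behaviour can be illustrated with a minimal sketch. The check_schema function, the exception type, and the dict-shaped entry are assumptions for this example; they are not the actual reader or the signature of migrate_leaderboard_entries.

```python
SUPPORTED_SCHEMA_VERSION = "0.1"


def check_schema(entry: dict) -> dict:
    """Reject any entry whose schema_version is missing or unsupported."""
    version = entry.get("schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        # Fail closed: refuse to guess how an unknown layout should be read.
        raise ValueError(f"unsupported leaderboard schema_version: {version!r}")
    return entry


ok = check_schema({"schema_version": "0.1", "model_name": "toy"})

try:
    check_schema({"schema_version": "0.2", "model_name": "toy"})
    rejected = False
except ValueError:
    rejected = True

print(ok["model_name"], rejected)
```

Raising on unknown versions keeps older readers from misinterpreting columns whose meaning changed in a later schema.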
CLI
icg-cast bench list # registered variants and hashes
icg-cast bench info misspecified_signs # inspect one variant, prints flipped signs
icg-cast bench run --cohort linear_lowhet --variant v0_1 --seed 7 # one (cohort, variant, seed) experiment
icg-cast bench audit --cohort linear_lowhet --variant sign_constrained --seed 7
icg-cast bench sweep --outdir outputs/bottleneck_v0_5 # full 5×3×3 sweep
icg-cast bench plots --inputdir outputs/bottleneck_v0_5 --outdir outputs/figures
bench run includes a wall_clock_seconds block in its JSON output. The
canonical sweep writes the same timing fields (generate_seconds,
fit_seconds, total_seconds, etc.) to per_seed.csv and summarizes total
runtime in summary.csv.
The standalone sweep and plotting scripts accept the same --outdir /
--inputdir paths. Their defaults still point at outputs/..., but directory
creation now checks write access and benchmark/figure outputs fall back to a
temporary directory when the default location is unavailable.
The current canonical sweep lives at outputs/bottleneck_v0_5/ and the figures it drives at outputs/figures/. Both are reproducible via the bench sweep and bench plots commands above.
Example script
examples/run_icg_bench.py runs a tiny end-to-end benchmark on three variants and writes a small leaderboard.
python examples/run_icg_bench.py
It uses n = 200 subjects and months = 24 so it finishes in well under a
minute on a laptop and never touches real data. To reproduce the
manuscript-grade numbers in full, run icg-cast bench sweep instead.
Submission
To add a result to the leaderboard, run the benchmark on one or more variants
and write the resulting LeaderboardEntry objects via
benchmark.leaderboard.append_entry(entry, outdir). Entries are intended to
be reviewed via PR so the model name, package version, variant hash, and
notes can all be inspected before merge.
The benchmark is synthetic by construction; entries should not be interpreted as carcinogenicity classifications or as predictions about real human cohorts. See docs/ethics_and_limitations.md.