ICg-Bench

ICg-Bench is the package’s public causal benchmark: a versioned set of synthetic data-generating processes (DGPs) with full ground truth, plus four scored tasks. The DGPs are synthetic on purpose: a fully specified generating process is the only setting in biology where causal estimands can be evaluated exactly. Real-data calibration is handled separately through the data sources and calibration layers and is out of scope for benchmark scoring.

Source: src/icg_cast/benchmark/.

DGP variants

Registered in src/icg_cast/benchmark/dgp.py and listed via icg-cast bench list. Each variant has a stable SHA-256 hash derived from its dataclass fields, recorded on every leaderboard row for reproducibility.
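As an illustration of how a stable hash can be derived from dataclass fields, here is a minimal sketch. The ToyVariant class, its field names, and the canonical-JSON encoding are assumptions made for this example; the actual encoding used in dgp.py may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyVariant:
    """Hypothetical stand-in for a registered DGP variant (field names are assumptions)."""
    name: str
    coupling: str
    host_heterogeneity: float
    mask_fraction: float = 0.0


def variant_hash(variant) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON encoding of the fields."""
    payload = json.dumps(asdict(variant), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = ToyVariant("linear_lowhet", "linear", 0.1)
b = ToyVariant("linear_lowhet", "linear", 0.1)
c = ToyVariant("partial_observability", "nonlinear", 0.1, mask_fraction=0.3)

print(variant_hash(a) == variant_hash(b))  # identical fields give an identical hash
print(variant_hash(a) == variant_hash(c))  # a changed field gives a different hash
print(variant_hash(a)[:12])                # a 12-character short form
```

Sorting the keys before hashing keeps the digest independent of field declaration order, which is one way to make the hash stable across refactors.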

| Variant | Description |
| --- | --- |
| linear_lowhet | Linear KCC→state coupling, low host heterogeneity, discrete archetypes, full multi-omics observability. The easy baseline. |
| nonlinear_mixhost | Non-linear coupling, continuous KCC mixtures, high host heterogeneity. Stresses latent recovery. |
| partial_observability | Non-linear coupling with random per-subject masking (30%) of transcriptomic and epigenomic modules. Stresses robustness to missing modalities. |
| nonlinear_obs | Linear KCC→state coupling with a non-linear, multiplicatively-interacting observation operator. Stresses stage-1 latent recovery while keeping stage-2 simple. |
| misspecified_signs | linear_lowhet base with one flipped sign in the latent_risk DGP relative to the structural prior. Falsification cohort for sign-constrained and intervention-augmented MB-CNet variants. |
| misspecified_signs_v2 | Two simultaneous sign flips. Stress test for whether unconstrained recovery survives multiple prior errors. |
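To make the partial_observability idea concrete, the sketch below masks modules per subject. It assumes that each subject independently loses each of the two named modules with probability 0.30; the source only states "random per-subject masking (30%)", so the exact masking rule, the dictionary layout, and the "metabolomic" module name are all hypothetical.

```python
import random

MASKED_MODULES = ("transcriptomic", "epigenomic")  # modules named for this variant
MASK_PROB = 0.30                                   # assumed per-subject, per-module probability


def mask_modules(subject: dict, rng: random.Random) -> dict:
    """Return a copy in which each maskable module is set to None with MASK_PROB."""
    out = dict(subject)
    for module in MASKED_MODULES:
        if module in out and rng.random() < MASK_PROB:
            out[module] = None
    return out


rng = random.Random(7)
cohort = [
    {"id": i, "transcriptomic": [0.1, 0.2], "epigenomic": [0.3], "metabolomic": [0.4]}
    for i in range(1000)
]
masked = [mask_modules(s, rng) for s in cohort]

frac = sum(s["transcriptomic"] is None for s in masked) / len(masked)
print(f"transcriptomic module masked for {frac:.0%} of subjects")
```

A model evaluated on such a cohort has to cope with subjects whose modality set differs from one row to the next, which is the robustness property this variant is meant to stress.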

Four scored tasks

Implemented in src/icg_cast/benchmark/tasks.py. Tasks are model-agnostic where possible: task_risk_prediction requires only predict_proba, while task_latent_recovery and task_intervention_conformity additionally require predict_bottleneck and intervene, respectively (i.e. an MB-CNet-shaped model).

task_risk_prediction

Held-out discrimination plus calibration on a single variant. Returns auroc, auprc, brier, mean_proba, event_rate.
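The calibration-related fields can be illustrated with a small pure-Python sketch. The key names follow the list above; how tasks.py actually computes them is an assumption, and the toy labels and probabilities are made up for the example. AUROC and AUPRC are omitted here.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_summary(y_true, y_prob):
    """Brier score plus the two numbers needed for a calibration-in-the-large check."""
    n = len(y_true)
    return {
        "brier": brier_score(y_true, y_prob),
        "mean_proba": sum(y_prob) / n,  # average predicted risk
        "event_rate": sum(y_true) / n,  # observed fraction of events
    }


# Toy held-out labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

summary = calibration_summary(y_true, y_prob)
print(summary)
```

Comparing mean_proba with event_rate gives a quick read on systematic over- or under-prediction: here the average predicted risk (about 0.41) sits below the observed event rate (0.5).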

task_latent_recovery

Per-state R² between the model’s predicted bottleneck and the true qAOP state (which is known because the DGP wrote it). Returns one r2__<state> entry per state plus r2_mean and n_states. This task is what makes the benchmark causal rather than purely predictive: a model can hit a high AUROC without recovering the latent state, and that asymmetry is exactly the contribution ICg-Bench tries to make visible.
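A minimal sketch of the per-state R² computation follows. The output keys mirror the ones listed above; the state names ("inflammation", "dna_damage") and the dictionary-of-lists layout are hypothetical and chosen only for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot


def latent_recovery(true_states: dict, pred_states: dict) -> dict:
    """One r2__<state> entry per state, plus r2_mean and n_states."""
    per_state = {f"r2__{name}": r2(true_states[name], pred_states[name])
                 for name in true_states}
    out = dict(per_state)
    out["r2_mean"] = sum(per_state.values()) / len(per_state)
    out["n_states"] = len(per_state)
    return out


# Hypothetical state names; the true values are known because the DGP wrote them.
true_states = {"inflammation": [0.0, 1.0, 2.0, 3.0], "dna_damage": [1.0, 2.0, 3.0, 4.0]}
pred_states = {"inflammation": [0.0, 1.0, 2.0, 3.0],   # perfectly recovered
               "dna_damage": [2.5, 2.5, 2.5, 2.5]}     # constant mean prediction

out = latent_recovery(true_states, pred_states)
print(out)
```

In this toy case one state is recovered exactly (R² of 1) and the other is no better than predicting the mean (R² of 0), so r2_mean alone would hide which state failed; the per-state entries keep that visible.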

task_intervention_conformity

For each do_* intervention in the registry, the model is forced through the intervention via intervene(unit, scale), and the mean change in predicted risk is compared against the expected sign. The CLI’s bench run reports three sub-metrics so that prior fragility is exposed:

  • prior_conformity: fraction of interventions matching the structural-prior direction.

  • dgp_conformity: fraction matching the true DGP direction (this differs from the prior in the misspecified_signs* variants).

  • responsive_dgp_conformity: fraction matching the DGP direction and also changing mean predicted risk by at least responsive_threshold in absolute value, i.e. |Δrisk| ≥ responsive_threshold (default 0.005). This closes the loophole where a sign constraint can drive a coefficient to zero, registering a technically-correct sign without any intervention response.
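The three sub-metrics can be sketched as follows. The intervention names, the sign tables, and the rule that a zero change counts as "not contradicting" the expected sign are assumptions for this example; only the metric names and the 0.005 default come from the text above.

```python
def matches(delta: float, expected_sign: int) -> bool:
    """True if the risk change does not contradict the expected sign (zero counts as a match)."""
    return delta * expected_sign >= 0


def conformity(deltas, prior_signs, dgp_signs, responsive_threshold=0.005):
    """Fractions of interventions agreeing with the prior, the DGP, and the DGP with a real response."""
    names = list(deltas)
    n = len(names)
    prior = sum(matches(deltas[k], prior_signs[k]) for k in names) / n
    dgp = sum(matches(deltas[k], dgp_signs[k]) for k in names) / n
    responsive = sum(
        matches(deltas[k], dgp_signs[k]) and abs(deltas[k]) >= responsive_threshold
        for k in names
    ) / n
    return {"prior_conformity": prior,
            "dgp_conformity": dgp,
            "responsive_dgp_conformity": responsive}


# Hypothetical interventions: mean change in predicted risk under each do_*.
deltas = {"do_a": -0.04, "do_b": 0.02, "do_c": 0.0, "do_d": -0.001}
prior_signs = {"do_a": -1, "do_b": -1, "do_c": 1, "do_d": -1}
dgp_signs = {"do_a": -1, "do_b": 1, "do_c": 1, "do_d": -1}  # do_b is flipped relative to the prior

scores = conformity(deltas, prior_signs, dgp_signs)
print(scores)
```

In this toy case do_c and do_d agree in sign with the DGP but barely move risk, so they count toward dgp_conformity and not toward responsive_dgp_conformity, which is the loophole the third metric is described as closing.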

task_cross_host_generalization

Source vs. target AUROC under a host-susceptibility distribution shift. Returns auroc_source, auroc_target, transfer_gap = auroc_source - auroc_target. Re-fitting is forbidden by the task contract.
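A small sketch of the transfer-gap computation, using a pairwise (Mann-Whitney) AUROC and a frozen scoring function that is applied to both cohorts without refitting. The toy one-feature logistic model and the data are hypothetical; the returned keys follow the names above.

```python
import math


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_host_generalization(predict_proba, source, target):
    """source and target are (X, y) pairs; predict_proba is used as-is, never refitted."""
    auroc_source = auroc(source[1], predict_proba(source[0]))
    auroc_target = auroc(target[1], predict_proba(target[0]))
    return {"auroc_source": auroc_source,
            "auroc_target": auroc_target,
            "transfer_gap": auroc_source - auroc_target}


def frozen_model(X):
    """Toy already-fitted model: a fixed logistic function of one feature."""
    return [1.0 / (1.0 + math.exp(-x)) for x in X]


source = ([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])  # training-like host distribution
target = ([-1.0, 0.5, 0.0, 2.0], [0, 0, 1, 1])   # shifted host susceptibility

result = cross_host_generalization(frozen_model, source, target)
print(result)
```

A positive transfer_gap means discrimination degrades on the shifted cohort; here the toy model separates the source cohort perfectly but misorders one positive-negative pair in the target.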

Scoring and the leaderboard

Aggregation is implemented in src/icg_cast/benchmark/scoring.py and src/icg_cast/benchmark/leaderboard.py.

Each BenchmarkResult is a single (variant, model, package_version) row. score_summary emits five headline numbers plus a composite that is an arithmetic mean of (auroc, r2_mean, conformity_score) whenever those are finite. The composite is explicitly not the canonical metric and is provided only for ranking convenience.
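One possible reading of the composite is sketched below: average whichever of the three components are finite, and return NaN if none are. This reading is an assumption; scoring.py may instead require all three to be finite.

```python
import math


def composite(auroc: float, r2_mean: float, conformity_score: float) -> float:
    """Arithmetic mean over the finite components; NaN if no component is finite."""
    parts = [v for v in (auroc, r2_mean, conformity_score) if math.isfinite(v)]
    return sum(parts) / len(parts) if parts else float("nan")


full = composite(0.9, 0.6, 0.75)              # all three components available
partial = composite(0.8, float("nan"), 0.6)   # e.g. a model without predict_bottleneck
empty = composite(float("nan"), float("nan"), float("nan"))

print(full, partial, empty)
```

Under this reading a predict_proba-only model is averaged over fewer components than an MB-CNet-shaped one, so composites are not directly comparable across model families, which is consistent with treating the composite as a ranking convenience only.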

Leaderboard files are append-only CSV plus a full-history JSON (leaderboard.csv / leaderboard.json). Every entry carries:

  • schema_version (currently "0.1"),

  • submitted_at (UTC ISO timestamp),

  • variant_name + variant_hash (first 12 chars of the SHA-256 of the variant’s dataclass fields),

  • model_name, package_version,

  • per-task summary scores,

  • free-form notes.

This means a leaderboard entry can be re-run from its CSV row alone.
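The append-only CSV plus full-history JSON layout can be sketched with the standard library. The function name append_row_sketch, the choice of score columns, the model names, and all values are placeholders for illustration, not the package's writer and not real results.

```python
import csv
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Column set is an assumption based on the fields listed above.
FIELDS = ["schema_version", "submitted_at", "variant_name", "variant_hash",
          "model_name", "package_version", "auroc", "r2_mean", "notes"]


def append_row_sketch(row: dict, outdir: Path) -> None:
    """Append one row to leaderboard.csv and to the leaderboard.json history."""
    outdir.mkdir(parents=True, exist_ok=True)
    csv_path = outdir / "leaderboard.csv"
    new_file = not csv_path.exists()
    with csv_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()  # header only once; later calls only append
        writer.writerow(row)
    json_path = outdir / "leaderboard.json"
    history = json.loads(json_path.read_text()) if json_path.exists() else []
    history.append(row)
    json_path.write_text(json.dumps(history, indent=2))


def make_row(model_name: str) -> dict:
    """Placeholder entry; hash and scores are dummy values."""
    return {"schema_version": "0.1",
            "submitted_at": datetime.now(timezone.utc).isoformat(),
            "variant_name": "linear_lowhet", "variant_hash": "0123456789ab",
            "model_name": model_name, "package_version": "0.0.0",
            "auroc": 0.5, "r2_mean": 0.0, "notes": "placeholder row"}


with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp) / "leaderboard"
    for model in ("toy_model_a", "toy_model_b"):
        append_row_sketch(make_row(model), outdir)
    with (outdir / "leaderboard.csv").open(newline="") as fh:
        rows = list(csv.DictReader(fh))
    history = json.loads((outdir / "leaderboard.json").read_text())

print(len(rows), len(history))
```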

The leaderboard reader and writers enforce schema_version == "0.1" at runtime. Future schema changes should add explicit migrations in migrate_leaderboard_entries; unsupported versions fail closed instead of being silently reinterpreted.
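A fail-closed version check might look like the sketch below. The exception class and function name are hypothetical; only the "0.1" value and the fail-closed behavior come from the text above, and migrations are deliberately left out.

```python
SUPPORTED_SCHEMA_VERSION = "0.1"


class UnsupportedSchemaError(ValueError):
    """Raised when an entry carries a schema version this code does not understand."""


def check_schema(entries):
    """Return entries unchanged if every one is on the supported schema; otherwise raise."""
    for i, entry in enumerate(entries):
        version = entry.get("schema_version")
        if version != SUPPORTED_SCHEMA_VERSION:
            raise UnsupportedSchemaError(
                f"entry {i}: schema_version={version!r}, "
                f"expected {SUPPORTED_SCHEMA_VERSION!r}"
            )
    return entries


ok_entries = check_schema([{"schema_version": "0.1", "model_name": "toy"}])

try:
    check_schema([{"schema_version": "0.2", "model_name": "toy"}])
    rejected = False
except UnsupportedSchemaError as exc:
    rejected = True
    print(f"rejected: {exc}")
```

Raising on an unknown or missing version, rather than guessing a mapping, keeps old readers from silently misreading rows written under a newer schema.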

CLI

icg-cast bench list                                                # registered variants and hashes
icg-cast bench info misspecified_signs                             # inspect one variant, prints flipped signs
icg-cast bench run --cohort linear_lowhet --variant v0_1 --seed 7  # one (cohort, variant, seed) experiment
icg-cast bench audit --cohort linear_lowhet --variant sign_constrained --seed 7
icg-cast bench sweep --outdir outputs/bottleneck_v0_5              # full 5×3×3 sweep
icg-cast bench plots --inputdir outputs/bottleneck_v0_5 --outdir outputs/figures

bench run includes a wall_clock_seconds block in its JSON output. The canonical sweep writes the same timing fields (generate_seconds, fit_seconds, total_seconds, etc.) to per_seed.csv and summarizes total runtime in summary.csv.

The standalone sweep and plotting scripts accept the same --outdir / --inputdir paths. Their defaults still point at outputs/..., but directory creation now checks write access and benchmark/figure outputs fall back to a temporary directory when the default location is unavailable.
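The write-access check with a temporary-directory fallback can be sketched as follows. The function name resolve_outdir and the "icg_bench_" prefix are hypothetical, and the scripts' actual logic may differ in detail.

```python
import os
import tempfile
from pathlib import Path


def resolve_outdir(preferred: str) -> Path:
    """Return the preferred directory if it can be created and written, else a temp dir."""
    path = Path(preferred)
    try:
        path.mkdir(parents=True, exist_ok=True)
        if os.access(path, os.W_OK):
            return path
    except OSError:
        pass  # e.g. read-only filesystem, or a parent that is a regular file
    return Path(tempfile.mkdtemp(prefix="icg_bench_"))


# Demo inside a scratch directory so nothing is written to the working tree.
base = Path(tempfile.mkdtemp(prefix="icg_demo_"))

ok = resolve_outdir(str(base / "outputs" / "bottleneck_v0_5"))

blocker = base / "not_a_dir"
blocker.write_text("x")                              # a file where a directory is expected
fallback = resolve_outdir(str(blocker / "figures"))  # cannot be created, so falls back

print(ok)
print(fallback)
```

When a fallback of this kind is in play, it is worth logging the resolved path, since results will not be where the default location suggests.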

The current canonical sweep lives at outputs/bottleneck_v0_5/ and the figures it drives at outputs/figures/. Both are reproducible via the bench sweep and bench plots commands above.

Example script

examples/run_icg_bench.py runs a tiny end-to-end benchmark on three variants and writes a small leaderboard.

python examples/run_icg_bench.py

It uses n = 200 subjects and months = 24, so it finishes in well under a minute on a laptop and never touches real data. To reproduce the manuscript-grade numbers in full, run icg-cast bench sweep instead.

Submission

To add a result to the leaderboard, run the benchmark on one or more variants and write the resulting LeaderboardEntry objects via benchmark.leaderboard.append_entry(entry, outdir). Entries are intended to be reviewed via PR so the model name, package version, variant hash, and notes can all be inspected before merge.

The benchmark is synthetic by construction; entries should not be interpreted as carcinogenicity classifications or as predictions about real human cohorts. See docs/ethics_and_limitations.md.