Simulation
The synthetic cohort generator lives in src/icg_cast/simulator.py and is
configured through SimConfig.
Data-Generating Process
For each synthetic cohort, ICg-CaST draws one coefficient-prior realization
if SimConfig.coefficient_mode="prior_sample". Then, for each synthetic
sample, it:
Samples a chemical archetype from the default or user-provided archetype prior.
Draws a noisy ten-dimensional KCC vector around the archetype profile.
Samples a bounded log-normal dose.
Samples host susceptibility factors.
Simulates monthly qAOP-like state dynamics for months time steps.
Summarizes each state by final value and time-normalized area under the curve.
Generates transcriptomic, epigenomic, and mutational-signature features from the states, KCCs, and archetype.
Samples the synthetic future event label from the final latent risk.
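Two of the steps above, the bounded log-normal dose and the per-state summary, can be illustrated with a minimal standard-library sketch. It is not the code in simulator.py: the function names, the parameter names, and the choice to bound the dose by clipping (rather than, say, rejection sampling) are assumptions made for illustration only.

```python
import random


def sample_bounded_lognormal(rng, log_mean, log_sd, lower, upper):
    """Draw a log-normal value and clip it to [lower, upper].

    Clipping is an assumed way of bounding the dose; the bounds are hypothetical.
    """
    raw = rng.lognormvariate(log_mean, log_sd)
    return min(max(raw, lower), upper)


def summarize_trajectory(values):
    """Return (final value, time-normalized AUC) for one monthly state trajectory."""
    if len(values) < 2:
        raise ValueError("need at least two time points")
    # Trapezoidal rule with unit (one-month) spacing between points.
    auc = sum((a + b) / 2.0 for a, b in zip(values, values[1:]))
    # Dividing by the elapsed time turns the area into a mean level.
    return values[-1], auc / (len(values) - 1)


rng = random.Random(7)
dose = sample_bounded_lognormal(rng, log_mean=0.0, log_sd=0.5, lower=0.1, upper=10.0)
final_value, normalized_auc = summarize_trajectory([0.0, 0.5, 1.0, 1.0])
```

Normalizing the area by elapsed time keeps the summary comparable across runs with different months values, which is presumably why a time-normalized form is used.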
The cohort table includes generated features, endpoint columns, the
coefficient_seed metadata column, and a high_risk_transition_state label
derived from the latent-risk distribution. Feature-set builders exclude
endpoint-derived columns before training.
Reproducibility
Every cohort is controlled by SimConfig.seed. Coefficient-prior draws are
controlled by SimConfig.coefficient_seed, which defaults to seed when
omitted. Reusing the same configuration and package version should produce the
same cohort table.
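The seed fallback described above can be sketched as follows. SeedSketch and draw_coefficients are hypothetical stand-ins, not the real SimConfig or simulator internals; only the field names seed and coefficient_seed come from the documentation.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeedSketch:
    """Hypothetical stand-in for the two seed fields described above."""

    seed: int = 0
    coefficient_seed: Optional[int] = None

    def resolved_coefficient_seed(self) -> int:
        # Fall back to the cohort seed when no coefficient seed is given.
        if self.coefficient_seed is None:
            return self.seed
        return self.coefficient_seed


def draw_coefficients(cfg, n=3):
    # A dedicated generator, seeded only from the resolved coefficient seed.
    rng = random.Random(cfg.resolved_coefficient_seed())
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_a = draw_coefficients(SeedSketch(seed=7))
same_b = draw_coefficients(SeedSketch(seed=7, coefficient_seed=7))
different = draw_coefficients(SeedSketch(seed=7, coefficient_seed=42))
```

Here same_a and same_b are identical because the omitted coefficient seed resolves to 7, while different uses its own stream.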
SimConfig.validate() rejects invalid public inputs before simulation starts:
non-positive n / months, non-finite dose parameters, negative dose spread,
bad coefficient modes, invalid simulator backends, and malformed
archetype_prior weights. The lower-level simulate_state_trajectory() also
validates KCC vectors, dose, month count, and host-susceptibility values against
the active registry bounds.
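As a rough illustration of this kind of up-front validation, the sketch below checks a few of the listed conditions on a small dataclass. It is not SimConfig.validate() itself: the dose field names, the "fixed" mode label, and the use of ValueError are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical subset of configuration fields, for illustration only."""

    n: int = 100
    months: int = 72
    dose_log_mean: float = 0.0  # hypothetical field name
    dose_log_sd: float = 0.5  # hypothetical field name
    coefficient_mode: str = "fixed"  # "fixed" is an assumed label

    def validate(self) -> None:
        if self.n <= 0 or self.months <= 0:
            raise ValueError("n and months must be positive")
        if not (math.isfinite(self.dose_log_mean) and math.isfinite(self.dose_log_sd)):
            raise ValueError("dose parameters must be finite")
        if self.dose_log_sd < 0:
            raise ValueError("dose spread must be non-negative")
        if self.coefficient_mode not in {"fixed", "prior_sample"}:
            raise ValueError(f"unknown coefficient_mode: {self.coefficient_mode!r}")


def is_valid(cfg):
    """Return True if validate() accepts the configuration."""
    try:
        cfg.validate()
    except ValueError:
        return False
    return True
```

Failing before any sampling starts means a bad configuration never consumes random draws or produces a partial cohort.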
Example:
from icg_cast import SimConfig, simulate_cohort

# Cohort with the default coefficient mode.
cohort, trajectories = simulate_cohort(SimConfig(n=120, months=72, seed=7))

# Same cohort seed, plus one coefficient-prior realization with its own seed.
uncertain, _ = simulate_cohort(
    SimConfig(
        n=120,
        months=72,
        seed=7,
        coefficient_mode="prior_sample",
        coefficient_seed=42,
    )
)
For large sensitivity sweeps, SimConfig.simulator_backend="vectorized" runs
the monthly qAOP recurrence across the cohort with NumPy arrays. The default
"python" backend is retained for backward-compatible random streams and
step-by-step inspection.
fast_cohort, _ = simulate_cohort(
    SimConfig(n=10_000, months=72, seed=7, simulator_backend="vectorized")
)
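The difference between the two backends can be pictured with a toy recurrence. The update rule, its decay and gain parameters, and the function names below are invented for illustration and are not the qAOP equations used by ICg-CaST. The sketch is also deterministic, so it says nothing about how each backend consumes random numbers.

```python
import numpy as np


def step(state, dose, decay=0.9, gain=0.05):
    # Toy linear update; works on plain floats and on NumPy arrays alike.
    return decay * state + gain * dose


def simulate_loop(doses, months):
    """Advance each sample separately, one month at a time."""
    finals = []
    for dose in doses:
        state = 0.0
        for _ in range(months):
            state = step(state, dose)
        finals.append(state)
    return np.array(finals)


def simulate_vectorized(doses, months):
    """Advance the whole cohort at once; only the month loop remains."""
    state = np.zeros_like(doses, dtype=float)
    for _ in range(months):
        state = step(state, doses)
    return state


doses = np.array([0.5, 1.0, 2.0])
loop_result = simulate_loop(doses, months=72)
vector_result = simulate_vectorized(doses, months=72)
```

Both functions return the same final states here. Once per-step noise is involved, a vectorized implementation typically draws random numbers in a different order, which is consistent with the "python" backend being kept for backward-compatible random streams.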
Main Outputs
icg-cast simulate writes:
synthetic_icg_cohort.csv
simulation_metadata.json
example_state_trajectories.png, unless plots are disabled
Important cohort field groups are documented in
materials/data_dictionary.csv.
CLI output directories are checked for write access before any files are
written. Demo and benchmark helper commands fall back to a temporary output
directory with a warning if the default outputs/... location is unavailable.
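A write-access check with a temporary-directory fallback might look like the sketch below. The function name resolve_output_dir and the icg_cast_ prefix are hypothetical, and the actual CLI may implement the check differently. Per the text above, only the demo and benchmark helpers use the fallback.

```python
import os
import tempfile
import warnings
from pathlib import Path


def resolve_output_dir(preferred):
    """Return a writable directory, falling back to a temporary one."""
    path = Path(preferred)
    try:
        path.mkdir(parents=True, exist_ok=True)
        if os.access(path, os.W_OK):
            return path
    except OSError:
        pass  # Creation failed; fall through to the temporary directory.
    fallback = Path(tempfile.mkdtemp(prefix="icg_cast_"))
    warnings.warn(f"{path} is not writable; using {fallback} instead")
    return fallback
```

Running the check before any file is written avoids leaving a partial set of outputs behind when the target location turns out to be read-only.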
Assumptions
The recurrence equations, weights, archetype KCC vectors, and intervention expectations are simulation assumptions. They are designed to be inspectable and configurable, not to encode validated biological truth.