Simulation

The synthetic cohort generator lives in src/icg_cast/simulator.py and is configured through SimConfig.

Data-Generating Process

When SimConfig.coefficient_mode="prior_sample", ICg-CaST first draws one coefficient-prior realization per synthetic cohort. Then, for each synthetic sample, it:

  1. Samples a chemical archetype from the default or user-provided archetype prior.

  2. Draws a noisy ten-dimensional KCC vector around the archetype profile.

  3. Samples a bounded log-normal dose.

  4. Samples host susceptibility factors.

  5. Simulates monthly qAOP-like state dynamics for SimConfig.months time steps.

  6. Summarizes each state by final value and time-normalized area under the curve.

  7. Generates transcriptomic, epigenomic, and mutational-signature features from the states, KCCs, and archetype.

  8. Samples the synthetic future event label from the final latent risk.
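
Steps 3 and 6 above can be illustrated with a minimal standard-library sketch. It is not the code in src/icg_cast/simulator.py: the function names, the default parameters, and the choice to bound the dose by clipping (rather than by truncated resampling) are assumptions made for illustration only.

```python
import random


def sample_bounded_lognormal(rng, mu=0.0, sigma=0.5, low=0.1, high=10.0):
    """Draw a log-normal dose and clip it to [low, high].

    Clipping is an assumed bounding rule; the real simulator may bound
    the dose differently. All defaults here are hypothetical.
    """
    dose = rng.lognormvariate(mu, sigma)
    return min(max(dose, low), high)


def time_normalized_auc(values):
    """Trapezoidal area under a monthly series, divided by its duration.

    With unit (monthly) spacing the duration is len(values) - 1, so a
    constant series returns that constant.
    """
    if len(values) < 2:
        raise ValueError("need at least two time points")
    area = sum((a + b) / 2.0 for a, b in zip(values, values[1:]))
    return area / (len(values) - 1)


rng = random.Random(7)
dose = sample_bounded_lognormal(rng)

# Hypothetical state trajectory over three monthly time points.
trajectory = [0.0, 1.0, 2.0]
summary = {
    "final": trajectory[-1],
    "auc": time_normalized_auc(trajectory),  # 1.0 for this series
}
```

Normalizing the area by duration keeps the summary on the same scale as the state itself, so cohorts simulated with different month counts stay comparable.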

The cohort table includes generated features, endpoint columns, the coefficient_seed metadata column, and a high_risk_transition_state label derived from the latent-risk distribution. Feature-set builders exclude endpoint-derived columns before training.

Reproducibility

Every cohort is controlled by SimConfig.seed. Coefficient-prior draws are controlled by SimConfig.coefficient_seed, which defaults to seed when omitted. Reusing the same configuration and package version should produce the same cohort table.
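
The seed-defaulting rule can be sketched as follows. SketchConfig and draw_streams are hypothetical stand-ins rather than part of icg_cast, and the use of two independent random.Random streams is an assumption about how the cohort and coefficient draws are separated.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the two seed fields of SimConfig."""

    seed: int = 0
    coefficient_seed: Optional[int] = None

    def resolved_coefficient_seed(self) -> int:
        # Fall back to the cohort seed when no coefficient seed is given.
        if self.coefficient_seed is None:
            return self.seed
        return self.coefficient_seed


def draw_streams(config, n=3):
    """Return (cohort draws, coefficient draws) from two separate streams."""
    cohort_rng = random.Random(config.seed)
    coef_rng = random.Random(config.resolved_coefficient_seed())
    cohort = [cohort_rng.random() for _ in range(n)]
    coefs = [coef_rng.gauss(0.0, 1.0) for _ in range(n)]
    return cohort, coefs


# Same configuration, same draws.
assert draw_streams(SketchConfig(seed=7)) == draw_streams(SketchConfig(seed=7))

# Changing only coefficient_seed leaves the cohort stream untouched.
base_cohort, base_coefs = draw_streams(SketchConfig(seed=7))
alt_cohort, alt_coefs = draw_streams(SketchConfig(seed=7, coefficient_seed=42))
assert base_cohort == alt_cohort
assert base_coefs != alt_coefs
```

Keeping the two streams separate is what would allow a sensitivity sweep to vary coefficient uncertainty while holding the sampled cohort fixed.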

SimConfig.validate() rejects invalid public inputs before simulation starts: non-positive n or months, non-finite dose parameters, negative dose spread, bad coefficient modes, invalid simulator backends, and malformed archetype_prior weights. The lower-level simulate_state_trajectory() also validates KCC vectors, dose, month count, and host-susceptibility values against the active registry bounds.
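
A fail-fast check of this kind might look like the sketch below. It is not the body of SimConfig.validate(): the function name, the dose_mu and dose_sigma parameter names, and the exact rule for well-formed archetype_prior weights (finite, non-negative, positive sum) are assumptions. The coefficient-mode check is omitted because the full set of allowed modes is not listed here.

```python
import math

# The two backends named in this document.
ALLOWED_BACKENDS = ("python", "vectorized")


def validate_sketch(n, months, dose_mu, dose_sigma,
                    backend="python", archetype_prior=None):
    """Raise ValueError on the first invalid input; return None otherwise."""
    if n <= 0 or months <= 0:
        raise ValueError("n and months must be positive")
    if not (math.isfinite(dose_mu) and math.isfinite(dose_sigma)):
        raise ValueError("dose parameters must be finite")
    if dose_sigma < 0:
        raise ValueError("dose spread must be non-negative")
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unknown simulator backend: {backend!r}")
    if archetype_prior is not None:
        weights = list(archetype_prior.values())
        # Assumed rule: finite, non-negative weights with a positive sum.
        if not weights:
            raise ValueError("archetype_prior must not be empty")
        if any(not math.isfinite(w) or w < 0 for w in weights):
            raise ValueError("archetype_prior weights must be finite and >= 0")
        if sum(weights) <= 0:
            raise ValueError("archetype_prior weights must sum to > 0")


# A valid configuration passes silently.
validate_sketch(n=120, months=72, dose_mu=0.0, dose_sigma=0.5)

# An invalid one fails before any simulation work would start.
try:
    validate_sketch(n=120, months=0, dose_mu=0.0, dose_sigma=0.5)
except ValueError as err:
    message = str(err)
```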

Example:

from icg_cast import SimConfig, simulate_cohort

cohort, trajectories = simulate_cohort(SimConfig(n=120, months=72, seed=7))

uncertain, _ = simulate_cohort(
    SimConfig(
        n=120,
        months=72,
        seed=7,
        coefficient_mode="prior_sample",
        coefficient_seed=42,
    )
)

For large sensitivity sweeps, SimConfig.simulator_backend="vectorized" runs the monthly qAOP recurrence across the cohort with NumPy arrays. The default "python" backend is retained for backward-compatible random streams and step-by-step inspection.

fast_cohort, _ = simulate_cohort(
    SimConfig(n=10_000, months=72, seed=7, simulator_backend="vectorized")
)

Main Outputs

icg-cast simulate writes:

  • synthetic_icg_cohort.csv

  • simulation_metadata.json

  • example_state_trajectories.png unless plots are disabled

Important cohort field groups are documented in materials/data_dictionary.csv.

CLI output directories are checked for write access before any files are written. Demo and benchmark helper commands fall back to a temporary output directory with a warning if the default outputs/... location is unavailable.
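
The check-then-fall-back behavior can be sketched with the standard library. The helper name resolve_output_dir, the temporary-directory prefix, and the probe-file technique are assumptions, not the CLI's actual implementation.

```python
import tempfile
import warnings
from pathlib import Path


def resolve_output_dir(preferred: Path) -> Path:
    """Return `preferred` if it is writable, else a fresh temporary directory.

    Writability is probed by creating and removing a scratch file, which
    catches read-only mounts and permission problems before real output
    is written.
    """
    try:
        preferred.mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=preferred):
            pass  # the probe file is deleted when the context exits
        return preferred
    except OSError:
        fallback = Path(tempfile.mkdtemp(prefix="icg_cast_"))
        warnings.warn(
            f"{preferred} is not writable; writing to {fallback} instead"
        )
        return fallback


# Usage: a writable location is returned unchanged.
with tempfile.TemporaryDirectory() as scratch:
    target = Path(scratch) / "outputs" / "demo"
    chosen = resolve_output_dir(target)
    chosen_is_target = chosen == target
```

Probing up front means a long simulation is not run only to fail at the final write step.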

Assumptions

The recurrence equations, weights, archetype KCC vectors, and intervention expectations are simulation assumptions. They are designed to be inspectable and configurable, not to encode validated biological truth.