# Mechanism-Bottleneck Causal Networks (MB-CNet)

MB-CNet is ICg-CaST's primary methodological contribution. It is a two-stage
model whose risk predictions are forced to flow through a hidden layer pinned
to the qAOP latent state vector. Pinning the bottleneck turns the package's
coherence claim from a **post-hoc evaluation** (predict, then ask whether the
explanation matches the biology) into a **by-construction constraint**: every
risk prediction must factor through the bottleneck units, so do-operations on
those units are well-defined interventions on the model itself rather than on
the input features.

Source: [src/icg_cast/bottleneck.py](../src/icg_cast/bottleneck.py).

## Architecture

```text
stage 1: g_phi : omics_features  ->  hat{qAOP_state}      (multi-output regressor)
stage 2: h_theta: hat{qAOP_state} ->  risk probability    (calibrated classifier)
```
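
The two-stage flow can be sketched with plain scikit-learn estimators. This is a minimal illustration on synthetic data, not the code in `bottleneck.py`: the arrays `X`, `S`, and `y` are made-up stand-ins for omics features, latent states, and the binary outcome, and for brevity stage 2 is fitted on in-sample stage-1 predictions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
n, n_features, n_states = 300, 12, 3

# Synthetic stand-ins (assumptions): X ~ omics features, S ~ latent states.
X = rng.normal(size=(n, n_features))
S = X[:, :n_states] + 0.1 * rng.normal(size=(n, n_states))
y = (S.sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

# Stage 1 (g_phi): omics features -> estimated latent states.
stage1 = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
stage1.fit(X, S)
S_hat = stage1.predict(X)

# Stage 2 (h_theta): estimated states -> calibrated risk probability.
stage2 = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=3)
stage2.fit(S_hat, y)

# Risk is computed only from the bottleneck estimate, never from X directly.
risk = stage2.predict_proba(stage1.predict(X))[:, 1]
```

Because stage 2 sees nothing but `S_hat`, any change to a bottleneck column propagates to the risk prediction through stage 2 alone.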

The default bottleneck units are the nine `state_final_*` qAOP states
(`DEFAULT_BOTTLENECK_UNITS` in `bottleneck.py`):

```text
state_final_DNA_adducts
state_final_ROS
state_final_inflammation
state_final_epigenetic_age
state_final_proliferation
state_final_mutation_rate
state_final_clone_fraction
state_final_driver_count_proxy
state_final_immune_clearance
```

Stage 1 is a `MultiOutputRegressor` (default: `RandomForestRegressor`). Pass a
custom sklearn-compatible `stage1_estimator` to swap in a stronger latent-state
recoverer while preserving the bottleneck contract:

```python
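from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

from icg_cast.bottleneck import MechanismBottleneckClassifier

# Wrap the single-output regressor so it predicts every bottleneck unit.
model = MechanismBottleneckClassifier(
    stage1_estimator=MultiOutputRegressor(HistGradientBoostingRegressor()),
)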
```

For the default stage-1 pipeline, NaN-valued omics features are imputed with
`SimpleImputer(strategy="mean", add_indicator=True)`. The added indicators make
modality dropout observable to the stage-1 regressor instead of silently
collapsing masked values onto the feature mean. After fitting, call
`model.missingness_report()` to audit per-feature missing fractions by modality.
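
The effect of `add_indicator=True` is easiest to see on a toy matrix. The sketch below uses only scikit-learn's `SimpleImputer`; the three-column array is invented for illustration, and the last line is a hand-rolled missing-fraction calculation, not the output of `missingness_report()`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (assumption): columns 1 and 2 each have one missing value.
X = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 5.0, np.nan],
    [3.0, 7.0, 9.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imp = imputer.fit_transform(X)

# Columns 0-2 hold the mean-imputed values; the appended columns are binary
# flags, one per feature that had missing values during fit.
flagged_features = imputer.indicator_.features_

# Per-feature missing fraction, computed directly from the raw matrix.
missing_fraction = np.isnan(X).mean(axis=0)
```

Without the indicator columns, the imputed row `[1.0, 6.0, 3.0]` would be indistinguishable from a sample that genuinely measured 6.0 for the second feature.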

Stage 2 has three available kinds:

- **`calibrated_logistic`** (v0.1 default) — unconstrained logistic regression
  with isotonic calibration.
- **`sign_constrained`** — `SignConstrainedLogisticRegression`; coefficients
  per bottleneck unit are bounded to the half-line implied by coefficient-card
  `effect_direction` metadata. `STRUCTURAL_SIGNS` remains as a derived
  compatibility export, not the source of truth.
- **`sign_constrained_augmented`** — the same sign constraints, but stage 2
  is additionally trained on intervention-implied synthetic samples drawn
  through `augment_with_interventions` using the simulator's
  `starter_kit_latent_risk` structural equation. This closes the loophole in
  which a sign constraint is satisfied by driving a coefficient to zero, so
  the model never responds to the corresponding intervention.
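
One way to impose per-coefficient sign bounds is box-constrained optimisation of the logistic loss. The sketch below is an assumption about how such a constraint could be implemented (SciPy's L-BFGS-B with bounds), not a description of `SignConstrainedLogisticRegression`; the helper name `fit_sign_constrained_logit` and the synthetic data are hypothetical. The sign for the third unit is deliberately mis-specified to show a coefficient pinned at zero.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sign_constrained_logit(S, y, signs, l2=1e-3):
    """Logistic regression with sign bounds (hypothetical helper).

    signs[j] = +1 forces coef_j >= 0, -1 forces coef_j <= 0, 0 leaves it free.
    """
    n, p = S.shape

    def loss(params):
        w, b = params[:p], params[p]
        z = S @ w + b
        # Mean negative log-likelihood, log(1 + exp(z)) - y * z, computed stably.
        nll = np.mean(np.logaddexp(0.0, z) - y * z)
        return nll + 0.5 * l2 * np.dot(w, w)

    bounds = []
    for s in signs:
        if s > 0:
            bounds.append((0.0, None))
        elif s < 0:
            bounds.append((None, 0.0))
        else:
            bounds.append((None, None))
    bounds.append((None, None))  # the intercept is unconstrained

    result = minimize(loss, np.zeros(p + 1), method="L-BFGS-B", bounds=bounds)
    return result.x[:p], result.x[p]

rng = np.random.default_rng(0)
S = rng.normal(size=(500, 3))
# Synthetic truth: unit 0 raises risk, units 1 and 2 lower it.
logits = 1.5 * S[:, 0] - 1.0 * S[:, 1] - 0.8 * S[:, 2]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

# Unit 2 is wrongly constrained to be non-negative, so its bound binds.
coef, intercept = fit_sign_constrained_logit(S, y, signs=[+1, -1, +1])
```

Here `coef[2]` ends up at the bound, so scaling unit 2 would leave the predicted risk unchanged. That is the situation the augmented variant is meant to address.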

The module uses **scikit-learn only** — no `torch`, `jax`, or
`pytorch-geometric`. The differentiable, end-to-end, neural-ODE / UDE version
is explicitly deferred (PLAN.md §7.4).

## Do-operations on bottleneck units

```python
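from icg_cast.bottleneck import MechanismBottleneckClassifier

# X_train / X_test: omics features; S_train: the `state_final_*` columns.
# Both are assumed here to be pandas DataFrames sharing an index.
model = MechanismBottleneckClassifier(stage2_kind="sign_constrained_augmented")
model.fit(X_train.join(S_train), y_train)

base = model.predict_proba(X_test)[:, 1]                    # risk without intervention
model.intervene(unit="state_final_DNA_adducts", scale=0.5)  # do(DNA_adducts := 0.5 * DNA_adducts)
after = model.predict_proba(X_test)[:, 1]                   # risk under the do-operation
model.clear_interventions()                                 # restore the unintervened model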
```

Because the intervention is applied to the *bottleneck row* rather than to
the input features, the resulting prediction is the model's analogue of a
structural do-operation on the qAOP state. Whether the induced change in
predicted risk matches the direction implied by the simulator's
data-generating process (DGP) is what ICg-Bench's
[task_intervention_conformity](../src/icg_cast/benchmark/tasks.py) measures.
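
The mechanics of a scale-type do-operation can be sketched with a toy two-stage model. `TinyBottleneck` below is a hypothetical stand-in, not `MechanismBottleneckClassifier`: it takes the state targets as a separate `fit` argument, uses linear stages, and runs on synthetic data. It only illustrates that the intervention rewrites the stage-1 output before stage 2 sees it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

class TinyBottleneck:
    """Hypothetical two-stage model with scale interventions on bottleneck units."""

    def __init__(self, unit_names):
        self.unit_names = list(unit_names)
        self.stage1 = LinearRegression()
        self.stage2 = LogisticRegression()
        self.scales = {}

    def fit(self, X, S, y):
        self.stage1.fit(X, S)
        self.stage2.fit(self.stage1.predict(X), y)
        return self

    def intervene(self, unit, scale):
        self.scales[unit] = scale

    def clear_interventions(self):
        self.scales.clear()

    def predict_proba(self, X):
        S_hat = self.stage1.predict(X).copy()
        # Apply do-operations to the bottleneck estimate, not to X.
        for unit, scale in self.scales.items():
            S_hat[:, self.unit_names.index(unit)] *= scale
        return self.stage2.predict_proba(S_hat)

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 4))
# Two positive-valued latent states driven by the first two features.
S = 2.0 + 0.5 * X[:, :2] + 0.05 * rng.normal(size=(n, 2))
logits = 2.0 * (S[:, 0] - 2.0) - 1.0 * (S[:, 1] - 2.0)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = TinyBottleneck(["unit_a", "unit_b"]).fit(X, S, y)
base = model.predict_proba(X)[:, 1]
model.intervene("unit_a", scale=0.5)   # do(unit_a := 0.5 * unit_a)
after = model.predict_proba(X)[:, 1]
model.clear_interventions()
restored = model.predict_proba(X)[:, 1]
```

In this synthetic setup `unit_a` raises risk, so halving it should lower the mean predicted risk. A conformity check compares the sign of `after.mean() - base.mean()` with that expected direction.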

## Survival outcomes

[src/icg_cast/survival.py](../src/icg_cast/survival.py) supplies the
time-to-threshold variant of the outcome:

- `time_to_event(trajectory, column="latent_risk", threshold=0.5)` — returns
  `(time_index, event_observed)`, right-censored at the trajectory horizon.
- `add_survival_columns(cohort, trajectories, threshold, horizon)` — appends
  `time_to_high_risk_threshold` and `event_observed` to a cohort.
- `restricted_mean_survival(times, events, horizon)` — RMST via the
  step-function Kaplan-Meier integral. No `lifelines` dependency.
- `counterfactual_rmst_difference(model, cohort, intervention, horizon)` —
  bootstrap-CI estimate of `RMST(after) - RMST(before)` under a callable
  intervention on the cohort.
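
The first-crossing rule and the Kaplan-Meier integral can be written in a few lines of NumPy. The functions below are simplified sketches with hypothetical names and signatures (they take plain arrays rather than trajectory frames), and the `>=` crossing rule is an assumption; they are not the implementations in `survival.py`.

```python
import numpy as np

def time_to_event_sketch(risk_path, threshold=0.5):
    """Return (time_index, event_observed) for the first crossing of `threshold`."""
    risk_path = np.asarray(risk_path, dtype=float)
    crossed = np.flatnonzero(risk_path >= threshold)
    if crossed.size:
        return int(crossed[0]), True
    # No crossing: right-censor at the last observed index.
    return len(risk_path) - 1, False

def rmst_sketch(times, events, horizon):
    """Area under the Kaplan-Meier step function on [0, horizon]."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    survival, previous_time, area = 1.0, 0.0, 0.0
    for t in np.unique(times[events]):
        if t > horizon:
            break
        # Survival is constant between event times, so add one rectangle.
        area += survival * (t - previous_time)
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & events)
        survival *= 1.0 - deaths / at_risk
        previous_time = t
    return area + survival * (horizon - previous_time)

crossing = time_to_event_sketch([0.1, 0.3, 0.6, 0.9])
censored = time_to_event_sketch([0.1, 0.2])
rmst = rmst_sketch(times=[2, 4, 4, 6], events=[True, True, False, False], horizon=6)
```

For the toy sample, survival drops to 0.75 at t=2 and to 0.5 at t=4, giving an RMST of 2 + 1.5 + 1 = 4.5 over a horizon of 6.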

The binary `future_cancer_transition_event` column from
`simulate_cohort` is preserved unchanged; the survival columns are purely
additive.

## Acceptance criteria

- `bottleneck_recovery.csv` reports per-state recovery R² — see
  [outputs/bottleneck_v0_5/per_state_recovery_r2.csv](../outputs/bottleneck_v0_5/per_state_recovery_r2.csv).
- Mean recovery R² ≥ 0.60 across the nine qAOP states on the default cohort.
- Intervention-conformity score ≥ 0.85 across the seven `do_*` interventions
  defined in `cli.py` (`_INTERVENTIONS`).
- AUROC within 0.03 of the best unconstrained multi-omics baseline.
- `bottleneck.py` does not import torch, jax, or any heavy ML dependency.
- Survival outcomes reproduce the binary event under threshold equivalence;
  RMST is finite for every archetype on the default cohort.
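
The recovery criterion can be checked with scikit-learn's `r2_score`. The sketch below runs on invented arrays (`S_true`, `S_pred`) rather than on the package's outputs, and only illustrates how a per-state R² vector and its mean relate to the 0.60 threshold.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
S_true = rng.normal(size=(200, 3))

# Invented predictions: two well-recovered states and one that is pure noise.
S_pred = S_true.copy()
S_pred[:, :2] += 0.3 * rng.normal(size=(200, 2))
S_pred[:, 2] = rng.normal(size=200)

# One R² per state, then the mean that the acceptance threshold applies to.
per_state_r2 = r2_score(S_true, S_pred, multioutput="raw_values")
mean_r2 = per_state_r2.mean()
passes = mean_r2 >= 0.60
```

A single poorly recovered state can pull the mean below the threshold even when the others are recovered well, which is why the per-state values are reported alongside the mean.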

When a result falls below an acceptance threshold, it is reported as
`failed_directionality_test` (or the corresponding flag) rather than as a
software failure; see PLAN.md §7.3.

## CLI usage

```bash
# train one cohort/variant/seed
icg-cast bench run --cohort linear_lowhet --variant sign_constrained_augmented --seed 7

# audit which structural signs actually bind (relax one at a time)
icg-cast bench audit --cohort linear_lowhet --variant sign_constrained --seed 7

# full sweep + figures (delegated to scripts/bottleneck_proof_of_concept.py)
icg-cast bench sweep
icg-cast bench plots
```

See also [docs/benchmark.md](benchmark.md) for the ICg-Bench scoring tasks
that MB-CNet is the reference participant for.