# Validation

ICg-CaST separates predictive performance from mechanistic coherence.
Synthetic AUROC is useful software evidence, but it is not biological
validation.

The helpers in [src/icg_cast/validation/](../src/icg_cast/validation/) group
the three families of checks below:

```python
from icg_cast.validation import (
    biological_coherence_score,
    calibration_curve,
    expected_calibration_error,
    human_relevance_transfer_index,
    pathway_attribution_consistency,
)
```

## Predictive Metrics

Baseline training and evaluation report:

- ROC AUC.
- Average precision.
- Brier score.
- Event rate.
- Mean predicted risk.
- Calibration bins and expected calibration error.

`validation.calibration` adds two leaner entry points:

- `expected_calibration_error(y, proba, n_bins=10)` returns the ECE scalar.
- `calibration_curve(y, proba, n_bins=10)` returns
  `(mean_predicted, observed_fraction, counts)` arrays for reliability plots.

## Mechanistic Checks

The package includes counterfactual directionality tests for mechanism-linked
feature perturbations. A model can score well predictively while failing a
directionality test. Such failures are reported as biological-coherence
diagnostics, not as software errors.

The biological-coherence score is:

```text
correct_direction_count / tested_intervention_count
```

`validation.biological_coherence` provides:

- `biological_coherence_score(counterfactual_table)` returns the scalar
  directly.
- `pathway_attribution_consistency(importance, pathway_map)` aggregates
  per-feature permutation importance into per-pathway shares, so feature
  weight can be inspected at the modality / pathway level.

For *by-construction* (rather than post-hoc) coherence, see
[docs/bottleneck.md](bottleneck.md) and the
`task_intervention_conformity` task in
[docs/benchmark.md](benchmark.md).

## Cross-Species Human Relevance

`validation.cross_species.human_relevance_transfer_index` implements the HRTI
estimate from PLAN.md §9.4:

```text
HRTI = conserved_human_KE_activation
      / (conserved_human_KE_activation + rodent_specific_KE_activation)
```

It takes an explicit table with `key_event`, `conservation`,
`human_activation`, and `rodent_activation` columns and returns an
`HRTIResult` with the score, contributing counts, and per-key-event reason
strings. It does **not** wrap a classifier and does not look up KE conservation
databases automatically — the caller supplies the conservation labels. This is
intentional: HRTI is a transparent ratio, not a regulatory conclusion.

## Simulator Sanity Checks

Internal consistency checks should focus on synthetic-world expectations:

- Inert controls should usually have lower risk than active archetypes.
- Genotoxic archetypes should elevate DNA-damage and mutational features.
- ROS archetypes should elevate oxidative and inflammatory features.
- Receptor-mediated archetypes should elevate proliferative modules.
- Immunosuppressive archetypes should reduce clearance-related features.

These are checks on the simulator assumptions, not real-world claims.

## External Validation

Real-data validation is future work and requires local files, provenance,
appropriate permissions, and domain review. Human genomic data can be
identifying. Controlled-access datasets require the proper approvals before any
analysis.