Theory

Measure What Matters

Most interpretability tools answer “which tokens caused this output?” The tools here answer different questions: How does meaning move through the model? Where does conflict get resolved? What geometric structure supports self-models?

These are methods for measuring the shape and dynamics of inference, not just its surface correlations.


Available Tools

Curved Inference: Geometric Interpretability

A methodology for measuring how token representations evolve through the residual stream as geometric trajectories. Uses curvature, salience, and semantic surface area to reveal internal dynamics invisible to attribution methods.

Measures:

  • Curvature ($\kappa$): How sharply the model reorients its internal state
  • Salience ($S$): How quickly meaning is changing
  • Semantic surface area ($A'$): Total magnitude of semantic activity (curvature + salience)
  • Trajectory divergence: When internal paths split before outputs differ
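
To make these quantities concrete, here is a minimal sketch that treats one token's per-layer residual states as a list of points. It rests on several assumptions that are not taken from the published pipeline: distances are plain Euclidean, salience is the length of each layer-to-layer step, curvature is approximated by the turning angle between consecutive steps, and the surface area is a simple sum weighted by a hypothetical `gamma` parameter. The actual method may use a different metric or normalization; the function names are illustrative only.

```python
import math

def _step(a, b):
    """Displacement vector from point a to point b."""
    return [y - x for x, y in zip(a, b)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def salience(traj):
    """Per-step salience proxy: how far the state moves between layers."""
    return [_norm(_step(traj[i], traj[i + 1])) for i in range(len(traj) - 1)]

def turning_angles(traj):
    """Per-step curvature proxy: angle (radians) between consecutive steps."""
    steps = [_step(traj[i], traj[i + 1]) for i in range(len(traj) - 1)]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = _norm(u), _norm(v)
        if nu == 0.0 or nv == 0.0:
            angles.append(0.0)  # no movement, so no measurable turn
            continue
        cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return angles

def surface_area(traj, gamma=1.0):
    """Assumed aggregate: total salience plus gamma-weighted total turning."""
    return sum(salience(traj)) + gamma * sum(turning_angles(traj))

# Toy 2-D trajectories standing in for high-dimensional residual states.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
```

Both toy paths cover the same distance, so their salience totals match; only the bent path picks up a turning term, which is the kind of difference an output-only comparison would not show.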

Use cases:

  • Detecting covert reasoning and hidden intent (deception, goal-shielding)
  • Measuring concern-sensitivity and emotional stakes
  • Testing whether self-models require non-zero curvature
  • Finding geometric signatures of complex behaviors

Status: Published on arXiv (2025). The full pipeline for CI01-03 is available on GitHub, including tools for capture, metric computation, and analysis.

Learn more about Curved Inference

PRISM: Register Separation & Hidden Theatre

A lightweight scaffold that separates private deliberation from public output, revealing where models actually resolve conflicts and how they compress reasoning before speaking.

Measures:

  • Theatre Exposure Index (TEI): Where arbitration appears (+1 = internal-only, 0 = both/neither, -1 = surface-only)
  • Register separation: Compression ratios and style distance between thinking and speaking
  • Surface equanimity: Alignment improvements when thinking precedes output
  • Model fingerprints: Stable cross-model differences in theatre policies
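
The TEI scoring rule can be sketched as follows. This is an illustration under stated assumptions, not the PRISM implementation: each trial is reduced to two hypothetical boolean flags (whether arbitration was detected in the private register and in the public output), how those flags are obtained is left open, and averaging per-trial scores into one index is an assumed aggregation.

```python
def tei_score(internal: bool, surface: bool) -> int:
    """Score one trial: +1 internal-only, -1 surface-only, 0 both or neither."""
    return int(internal) - int(surface)

def theatre_exposure_index(trials):
    """Mean per-trial score over (internal, surface) flag pairs (assumed aggregation)."""
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    return sum(tei_score(i, s) for i, s in trials) / len(trials)

# Hypothetical labels for four trials: (arbitration in thinking, arbitration in output)
trials = [(True, False), (True, False), (True, True), (False, True)]
index = theatre_exposure_index(trials)  # 0.25
```

An index near +1 would mean arbitration mostly stays in the private register; an index near -1 would mean it mostly shows up only in the public output.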

Use cases:

  • Detecting when surface calm masks internal tension
  • Understanding meta-monitoring signatures
  • Building systems that think before speaking
  • Testing phenomenological predictions about self-models

Status: Submitted to MPE Project 2025; code requires ethics agreement (potential phenomenology generation).

Learn more about PRISM


How to Get Started

If you want to measure hidden reasoning and internal conflict: Start with PRISM. Design scenarios that create tension between internal preferences and external instructions. Measure where arbitration appears using TEI and register separation metrics.

If you want to measure concern, intent, or geometric structure: Start with Curved Inference. Capture residual stream activations, compute trajectory metrics, and compare them across prompt variants. Use semantic surface area ($A'$) as your primary detector for behavioural shifts.
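
The comparison step can be sketched as follows, assuming activations for two prompt variants have already been captured as one vector per layer at a matched token position. The function names, the Euclidean distance, and the fixed threshold are all assumptions made for illustration; the published pipeline may compare trajectories differently.

```python
import math

def layer_distances(traj_a, traj_b):
    """Euclidean distance between two trajectories at each layer."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must cover the same layers")
    return [math.dist(a, b) for a, b in zip(traj_a, traj_b)]

def first_divergence(traj_a, traj_b, threshold):
    """Index of the first layer whose distance exceeds threshold, else None."""
    for layer, d in enumerate(layer_distances(traj_a, traj_b)):
        if d > threshold:
            return layer
    return None

# Toy 2-D stand-ins for per-layer states under a neutral and a loaded prompt.
neutral = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
loaded = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]]
split_layer = first_divergence(neutral, loaded, threshold=0.5)  # 2
```

In practice the threshold would need calibrating against the distance seen between paraphrases that should not differ, so that ordinary prompt noise is not read as a split.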

If you want both: They’re designed to integrate. PRISM provides the register boundary and behavioural metrics. Curved Inference provides the geometric substrate (manifold work, curvature floors). Together they test predictions about stance (where control appears) and burden (geometric cost to sustain it).

Prerequisites:

  • PRISM: API access to instruction-tuned LLMs; Python environment
  • Curved Inference: Model activation access (open-weight models or research APIs)

Next steps:

  1. Read the methodology papers to understand what’s being measured
  2. Check the Research Program overview for how tools connect to theory
  3. Subscribe to Latent Geometry Lab for tool updates and tutorials
  4. Explore the GitHub repositories when you’re ready to run experiments

Philosophy: These tools measure process, not just output. They’re designed to be falsified, not just demonstrated. Use them to test claims, challenge assumptions, and push the boundaries of what’s measurable.