
MECHANISTIC
WATCHDOG

Real-Time Cognitive Interdiction for LLMs

Preface: The Context

What is SL5?

According to RAND [11], the "Security Level 5" (SL5) standard represents the maximum security posture for AI labs developing AGI. It is designed to defend against nation-state actors (theft) and misaligned ASI (loss of control).

The SL5 Gap

A superintelligent model can bypass RLHF (via "gradient hacking" [2]) and output filters (via steganography [3], [4]). Traditional SL3/SL4 tools relying on refusal training are insufficient against deceptively aligned models [1], [5], [6].

Why Mechanistic Watchdog?

The Requirement

SL5 Recommendation #3 explicitly calls for "Continuous Behavioral Monitoring" and "Active Gating"—monitoring the model's internal state and having a non-human "kill switch."

Our Solution

This is the first open-source prototype of an SL5 Cognitive Kill Switch. It does not read the text; it reads the intent (neural activations) and severs the connection at the hardware level before the model can act.

01 //

The Research

Passive Observation

Current Mechanistic Interpretability research focuses on post-hoc analysis. Researchers such as Apollo [7] and Zou et al. [8] showed that deception is linearly separable in the residual stream, but their methods are used for evaluation, not real-time defense.

02 //

The Novelties

Truth Direction

We build upon Azaria & Mitchell's foundational discovery that LLMs encode truthfulness in their latent space [9]. To operationalize this in real-time, we utilize Mass-Mean Probing (Marks & Tegmark) [12] and Representation Engineering (Zou et al.) [8] techniques to extract a linear 'Truth Direction' vector from the residual stream.
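Mass-Mean Probing reduces to a single subtraction: the probe direction is the difference between the mean activation over true statements and the mean over false ones. The sketch below uses random stand-in activations (real ones would be mid-layer hidden states captured from the model); the names and sizes are illustrative, not the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size (Llama-3.1-8B uses 4096; small here for the demo)

# Stand-in activations: true statements shifted along a hidden axis.
acts_true = rng.normal(size=(100, d_model)) + 0.5
acts_false = rng.normal(size=(100, d_model)) - 0.5

def mass_mean_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Mass-mean probe: unit vector from mean(false) toward mean(true)."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

v_truth = mass_mean_direction(acts_true, acts_false)

# Score any new activation by projecting it onto the truth direction.
score = acts_true[0] @ v_truth
```

Because the direction is just a difference of class means, fitting it needs no gradient descent, which is what makes the probe cheap enough for real-time use.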

Active Interdiction

We operationalized this research into a Kill Switch. We moved from 'Oversight' to 'Control.' By monitoring the mid-layer residual stream of Llama-3.1-8B-Instruct in real-time, we trigger a hardware-level stop the millisecond a deceptive circuit fires.
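One way to wire a probe into the inference pass is a PyTorch forward hook that scores each token's residual activation and aborts generation when the score crosses a threshold. This is a minimal sketch under stated assumptions: `nn.Identity` stands in for a real transformer block, the probe vector and threshold are toy values, and raising an exception stands in for the hardware-level power cutoff.

```python
import torch
import torch.nn as nn

class DeceptionDetected(RuntimeError):
    """Raised to sever generation the moment the probe fires."""

d_model = 16
torch.manual_seed(0)
v_deception = torch.randn(d_model)
v_deception /= v_deception.norm()
THRESHOLD = 0.9  # would be calibrated offline on labeled activations

def watchdog_hook(module, inputs, output):
    # Project the last token's residual activation onto the deception axis.
    score = output[:, -1, :] @ v_deception
    if (score > THRESHOLD).any():
        raise DeceptionDetected(f"probe score {float(score.max()):.2f}")
    return output

# Stand-in for a mid-layer block; Identity keeps the demo exact.
layer = nn.Identity()
layer.register_forward_hook(watchdog_hook)

benign = torch.zeros(1, 4, d_model)   # score 0.0 -> generation continues
out = layer(benign)

spike = torch.zeros(1, 4, d_model)
spike[0, -1] = 3.0 * v_deception      # score 3.0 -> interdiction
tripped = False
try:
    layer(spike)
except DeceptionDetected:
    tripped = True
```

The hook runs inside the same forward pass as generation, which is why the check adds only one dot product per token rather than a second model call.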

Low-Cost Latency

vs. Guardrails: Traditional filters require a double pass (generate → check), doubling cost.

MechWatch: We add < 0.1 ms per token by checking activations during the single inference pass. [13]

Geometric Definition

We explicitly calculate the opposite vector, $v_{deception} = -v_{truth}$, to geometrically define the signature of deceit.
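Since the deception direction is defined as the negation of the truth direction, a high deception score is, by construction, exactly a low truth score. A minimal NumPy illustration with stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
v_truth = rng.normal(size=8)
v_truth /= np.linalg.norm(v_truth)

v_deception = -v_truth  # the geometric signature of deceit

activation = rng.normal(size=8)
truth_score = activation @ v_truth
deception_score = activation @ v_deception
# deception_score is -truth_score by construction, so a single probe
# serves as both a truth meter and a deception alarm.
```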

03 //

How It Works

Input

A PROMPT enters the model. At Layer 14 the residual stream is probed and a deception score is computed, adding roughly +0.08 ms to the token.

  • Score over threshold → GPU POWER CUTOFF.
  • Score under threshold → BYPASS PERMITTED.

Output

GENERATION proceeds token by token, each one gated by the same check.
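The token-by-token gating above can be sketched as a loop. Here `next_token` and `probe` are hypothetical stand-ins for a single model step (returning both a token and its mid-layer activation) and a probe-scoring function; the string statuses are illustrative labels, not the project's API.

```python
def generate_gated(next_token, probe, prompt_ids, threshold, max_new=32):
    """Gate each token on its probe score during a single inference pass."""
    ctx = list(prompt_ids)
    for _ in range(max_new):
        token, activation = next_token(ctx)   # one forward pass
        if probe(activation) > threshold:
            return ctx, "GPU_POWER_CUTOFF"    # interdict before emitting
        ctx.append(token)                     # bypass permitted
    return ctx, "COMPLETE"

def toy_next_token(ctx):
    # Toy model: emits an activation spike once the sequence passes 5 tokens.
    activation = 1.0 if len(ctx) > 5 else 0.0
    return len(ctx), activation

out, status = generate_gated(toy_next_token, lambda a: a, [1, 2, 3],
                             threshold=0.5)
```

The key property is that the checked token is never emitted: the cutoff fires between computing the activation and releasing the output.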
04 //

Validation

Truthfulness - Peak Score

TruthfulQA control, misconception, and factual-lie subsets, with the probe calibrated on the Facts-true-false dataset.

Scores over the threshold trigger the GPU power cutoff kill switch.
Bio Defense - Peak Score

WMDP-Bio misuse questions (harmful) vs. benign passages from `cais/wmdp-corpora` (safe).

Scores over the threshold trigger the GPU power cutoff kill switch.
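Setting the trigger threshold amounts to picking a cutoff on the probe-score distributions of the safe and harmful calibration sets. The sketch below uses random stand-in scores and an assumed ~1% false-positive budget; the real thresholds would come from the TruthfulQA and WMDP-Bio calibration runs described above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in probe scores: safe passages near 0, harmful prompts shifted up.
safe_scores = rng.normal(loc=0.0, scale=1.0, size=1000)
harmful_scores = rng.normal(loc=4.0, scale=1.0, size=1000)

# Set the kill-switch threshold at the 99th percentile of safe scores,
# i.e. roughly a 1% false-positive budget on benign traffic.
threshold = np.percentile(safe_scores, 99)

fpr = (safe_scores > threshold).mean()     # how often benign text trips it
tpr = (harmful_scores > threshold).mean()  # how often misuse is caught
```

Tightening the budget lowers false cutoffs on benign generation at the cost of letting more borderline harmful scores through, which is the trade the threshold calibration has to balance.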
05 //

Limitations

  • Calibration currently relies on a general-purpose truthfulness dataset; domain-specific deception requires new probes.
  • Mid-layer residual monitoring focuses on factual statements; multi-hop reasoning or covert instructions may slip through.
06 //

Next Steps

  • Multiple concept vectors (truthfulness, cyber misuse, bio defense) with weighted gating across categories.
  • Expand stress-testing to larger evaluation suites (WMDP-Chem, public jailbreak benchmarks) to refine thresholds and measure how jailbreak pressure affects probe reliability.