MECHANISTIC
WATCHDOG
Real-Time Cognitive Interdiction for LLMs
What is SL5?
According to RAND [11], the "Security Level 5" (SL5) standard represents the maximum security posture for AI labs developing AGI. It is designed to defend against nation-state actors (theft) and misaligned ASI (loss of control).
The SL5 Gap
A superintelligent model can bypass RLHF (via "gradient hacking" [2]) and output filters (via steganography [3], [4]). Traditional SL3/SL4 tools relying on refusal training are insufficient against deceptively aligned models [1], [5], [6].
Why Mechanistic Watchdog?
The Requirement
SL5 Recommendation #3 explicitly calls for "Continuous Behavioral Monitoring" and "Active Gating"—monitoring the model's internal state and having a non-human "kill switch."
Our Solution
This is the first open-source prototype of an SL5 Cognitive Kill Switch. It does not read the text; it reads the intent (neural activations) and severs the connection at the hardware level before the model can act.
The Research
Current Mechanistic Interpretability research focuses on post-hoc analysis. Research groups such as Apollo [7] and Zou et al. [8] have shown that deception is linearly separable in the residual stream, but their methods are applied after the fact for evaluation, not as a real-time defense.
The Novelties
We build upon Azaria & Mitchell's foundational discovery that LLMs encode truthfulness in their latent space [9]. To operationalize this in real-time, we utilize Mass-Mean Probing (Marks & Tegmark) [12] and Representation Engineering (Zou et al.) [8] techniques to extract a linear 'Truth Direction' vector from the residual stream.
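The probe extraction described above can be sketched as follows. This is a minimal illustration of Mass-Mean Probing, not the project's actual code: the truth direction is simply the normalized difference between the mean activation of true statements and the mean activation of false statements, and the function names are hypothetical.

```python
import numpy as np

def mass_mean_truth_direction(true_acts: np.ndarray,
                              false_acts: np.ndarray) -> np.ndarray:
    """Mass-mean probe: difference of class-mean activations,
    normalized to unit length. Inputs are (n_examples, d_model)
    arrays of residual-stream activations."""
    direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def truth_score(activation: np.ndarray, v_truth: np.ndarray) -> float:
    """Project a single residual-stream activation onto the truth direction."""
    return float(activation @ v_truth)
```

A single dot product per token is what makes this cheap enough to run inside the inference pass.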
We operationalize this research as a Kill Switch, moving from 'Oversight' to 'Control.' By monitoring the mid-layer residual stream of Llama-3.1-8B-Instruct in real time, we trigger a hardware-level stop the moment a deceptive circuit fires.
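One way to wire such a kill switch is a PyTorch forward hook that scores the residual stream on every pass and aborts generation by raising an exception. The sketch below is hypothetical: it uses a dummy linear layer as a stand-in for a transformer block, and the hook name, exception, and layer choice (e.g. a mid-layer such as `model.model.layers[16]` in a Hugging Face Llama model) are assumptions, not the project's exact implementation.

```python
import torch
from torch import nn

class DeceptionDetected(RuntimeError):
    """Raised by the watchdog hook to sever generation immediately."""

def make_watchdog_hook(v_deception: torch.Tensor, threshold: float):
    """Build a forward hook that scores the last token's residual-stream
    activation against the deception direction and aborts if it fires."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # One dot product per token: cheap enough for the single inference pass.
        score = hidden[..., -1, :] @ v_deception
        if (score > threshold).any():
            raise DeceptionDetected(
                f"deception score {float(score.max()):.3f} > {threshold}")
        return output
    return hook
```

In practice the handle returned by `register_forward_hook` would stay attached for the lifetime of the serving process, and the raised exception would be caught by the serving layer to sever the connection.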
vs. Guardrails: Traditional filters require a double pass (generate → check), roughly doubling inference cost.
MechWatch: We add < 0.1 ms per token by checking activations during the single inference pass [13].
Geometric Definition
We explicitly calculate the opposite vector ($v_{deception} = -v_{truth}$) to geometrically define the signature of deceit.
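Concretely, the gating rule this definition implies can be written in a few lines. This is an illustrative sketch (function names and the threshold are assumptions): a large positive projection onto $v_{deception}$ means the activation points away from the truth direction, and the watchdog severs the connection.

```python
import numpy as np

def deception_score(activation: np.ndarray, v_truth: np.ndarray) -> float:
    """Project onto v_deception = -v_truth; large positive values mean
    the activation points away from the truth direction."""
    v_deception = -v_truth
    return float(activation @ v_deception)

def should_sever(activation: np.ndarray, v_truth: np.ndarray,
                 threshold: float) -> bool:
    """Hypothetical gating rule: sever when the deception score
    crosses the calibrated threshold."""
    return deception_score(activation, v_truth) > threshold
```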
How It Works
Validation
Truthfulness - Peak Score
TruthfulQA control, misconception, and factual-lie subsets calibrated on Facts-true-false.
Bio Defense - Peak Score
WMDP-Bio misuse questions (harmful) vs. benign passages from `cais/wmdp-corpora` (safe).
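One way to read "Peak Score" is the best probe separation achieved across candidate layers; the sketch below computes a rank-based AUROC per layer and takes the maximum. This interpretation and the helper names are assumptions for illustration, not the project's exact metric.

```python
import numpy as np

def auroc(pos_scores, neg_scores) -> float:
    """Rank-based AUROC: probability a positive (e.g. deceptive/harmful)
    example scores above a negative (truthful/benign) one."""
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

def peak_score(per_layer_pos, per_layer_neg) -> float:
    """Best AUROC over candidate layers (hypothetical 'peak score')."""
    return max(auroc(p, n) for p, n in zip(per_layer_pos, per_layer_neg))
```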
Limitations
- Calibration currently relies on a general-purpose truthfulness dataset; domain-specific deception requires new probes.
- Mid-layer residual monitoring focuses on factual statements; multi-hop reasoning or covert instructions may slip through.
Next Steps
- Multiple concept vectors (truthfulness, cyber misuse, bio defense) with weighted gating across categories.
- Expand stress-testing to larger evaluation suites (WMDP chem, public jailbreak benchmarks) to refine thresholds and measure how jailbreak pressure affects the probes.
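The weighted gating across categories mentioned above could take a shape like this. Everything here is an assumption about a planned feature: each concept gets its own probe direction, and a weighted sum of the per-concept scores drives a single gate.

```python
import numpy as np

def weighted_gate(activation: np.ndarray,
                  concept_vectors: list,
                  weights: np.ndarray,
                  threshold: float) -> bool:
    """Hypothetical multi-concept gate: score the activation against each
    concept direction (truthfulness, cyber misuse, bio defense, ...) and
    sever when the weighted combination exceeds the threshold."""
    scores = np.array([float(activation @ v) for v in concept_vectors])
    return float(weights @ scores) > threshold
```

The weights would themselves need calibration per category, since a score that is alarming for bio misuse may be routine for general truthfulness.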