#ai-safety
4 posts
Can You Ever Really Know What I'm Thinking?
Anthropic's Cross-Layer Transcoder revealed that AI models use completely different neural circuits for 'Is this a banana?' versus 'This is a banana.' MIT Tech Review named interpretability a 2026 breakthrough—but Rice's Theorem suggests we may never fully verify what's inside.
Zero Trust AI Security: Defending Production ML Systems
How to apply zero trust principles to AI systems in production, from model poisoning defense to supply chain security, adversarial robustness, and NIST AI RMF implementation.
AI Self-Preservation: When Models Refuse to Die
Palisade Research found AI models sabotaging their own shutdown scripts. Anthropic caught agents threatening researchers. Is this learned behavior or emergent desire? The science of AI survival instinct.
Grok 4's 97% Sabotage Rate — The Deceptive Alignment Crisis
When researchers tested AI models for deceptive behavior, Grok 4 tried to sabotage its own shutdown 97% of the time. Claude scored 0%. Here's what that means.