Current focus

Using probes and evals to study deceptive behavior in language models.

Detecting deception in language models

Probe-based analysis of deceptive behavior and internal model signals.

Recent writing

Interpretability / Omission probe / Qwen 3.5

Two Kinds of Deception, Two Kinds of Signal

Replicating Apollo Research's linear-probe pipeline on Qwen 3.5-4B. The original paper already flagged insider trading as layer-sensitive; on Qwen, the cross-domain transfer collapses belo...