Detecting deception in language models
Probe-based analysis of deceptive behavior and internal model signals.
Using probes and evals to study deceptive behavior in language models.
Replicating Apollo Research's linear-probe pipeline on Qwen 3.5-4B. The original paper already flagged insider trading as layer-sensitive; on Qwen the cross-domain transfer collapses belo...
Initial results from extending a deception-detection pipeline to Qwen 3.5-4B, including rollout quality, grading noise, and early probe performance.