(Disclaimer: I work on interpretability at Anthropic.)

I wanted to flag that this is an accessible blog post, and that there's a link to the paper ( https://transformer-circuits.pub/2025/introspection/index.ht... ) at the top. The paper explores this in more detail and with more rigor.