the token generation part isn't well understood, but with a traditional CoT model the output "chain-of-thought" used to produce the final answer can be scrutinized for correctness (although this would require model providers not to hide reasoning tokens)