
Those are LLMs with an extra modality bolted to them.

Which is good - that it works this well speaks to the generality of autoregressive transformers, and the "reasoning over image data" progress in things like Qwen3-VL is very impressive. It's a good capability to have. But it's not a separate thing from the LLM breakthrough at all.
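For a rough picture of what "an extra modality bolted to" an LLM means, here is a minimal toy sketch (hypothetical names and dims, PyTorch, not any real model's architecture, and certainly not Qwen3-VL's): a vision encoder's patch features are projected into the LLM's token embedding space and simply placed in the same sequence as the text tokens, and the transformer predicts next tokens over the combined sequence as usual.

    # Toy sketch of the "extra modality bolted on" pattern.
    # Everything here is a stand-in; real VLMs differ in every detail.
    import torch
    import torch.nn as nn

    class ToyVLM(nn.Module):
        def __init__(self, vision_dim=64, llm_dim=128, vocab_size=1000):
            super().__init__()
            self.vision_encoder = nn.Linear(vision_dim, vision_dim)  # stand-in for a ViT
            self.projector = nn.Linear(vision_dim, llm_dim)          # the "bolt": maps image features into token space
            self.token_embed = nn.Embedding(vocab_size, llm_dim)
            self.backbone = nn.TransformerEncoder(                   # stand-in for a causal decoder-only LLM
                nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True),
                num_layers=2,
            )
            self.lm_head = nn.Linear(llm_dim, vocab_size)

        def forward(self, image_patches, text_ids):
            # image_patches: (B, num_patches, vision_dim); text_ids: (B, seq_len)
            vis = self.projector(self.vision_encoder(image_patches))  # (B, P, llm_dim)
            txt = self.token_embed(text_ids)                          # (B, T, llm_dim)
            seq = torch.cat([vis, txt], dim=1)        # image "tokens" live in the same sequence as text
            return self.lm_head(self.backbone(seq))   # next-token logits, same autoregressive setup as a plain LLM

    logits = ToyVLM()(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))

The point of the sketch is that the autoregressive core is unchanged; the projector (plus multimodal training) is what grafts the new modality on.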

Even the more specialized real-time robotics AIs often have a bag of transformers backed by an actual LLM.




