I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.
For the applications I'm interested in, until we can get to 95+% accuracy, the output will require human double-checking and corrections, which seems infeasible without bounding boxes to quickly check for errors.
We're also looking to test Qwen and others for bounding box support. Simon Willison had a great demo page where he used Gemini 2.5 to draw bounding boxes, and the results were pretty impressive. It would probably be pretty easy to drop Qwen into the same UI.
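For anyone wiring a different model into that kind of UI, most of the work is coordinate handling. Here is a minimal sketch, assuming the model returns each box as [ymin, xmin, ymax, xmax] normalized to a 0-1000 grid, which is the convention Gemini documents. Qwen and other models may use a different format, so treat that as an assumption to check. The function name `to_pixel_box` is made up for illustration.

```python
def to_pixel_box(box, image_width, image_height, scale=1000):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel coordinates.

    Assumes coordinates lie on a 0..scale grid (assumption: 0-1000).
    Returns (left, top, right, bottom) in pixels.
    """
    ymin, xmin, ymax, xmax = box
    left = round(xmin / scale * image_width)
    top = round(ymin / scale * image_height)
    right = round(xmax / scale * image_width)
    bottom = round(ymax / scale * image_height)
    return left, top, right, bottom


# Example: a box on a 2000x1000 pixel page image.
pixel_box = to_pixel_box([100, 200, 300, 400], image_width=2000, image_height=1000)
print(pixel_box)
```

The resulting tuple is in the (left, top, right, bottom) order that most drawing libraries expect for rectangles.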
If you're limited to open source models, that's very true. But with larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate Markdown step) with our solution at https://doctly.ai.
I'd guess that it wouldn't be a huge effort to fine-tune them to produce bounding boxes.
I haven't done it with OCR tasks, but I have fine-tuned other models to produce them instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
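As a rough idea of what one training record could look like, here is a sketch that assumes you already have word-level boxes from a traditional OCR engine or from manual annotation. The field names, the prompt text, and the file name are all hypothetical, not the format of any particular fine-tuning API.

```python
import json


def make_record(image_path, words):
    """Build one prompt/completion training example from annotated words.

    words: list of (text, (left, top, right, bottom)) tuples in pixels.
    The record layout is illustrative only (assumption).
    """
    target = [{"text": text, "box": list(box)} for text, box in words]
    return {
        "image": image_path,
        "prompt": "Extract all text with bounding boxes as JSON.",
        "completion": json.dumps(target),
    }


# Hypothetical annotations for a single page image.
record = make_record(
    "page_001.png",
    [("Invoice", (40, 32, 180, 60)), ("#1042", (190, 32, 260, 60))],
)
print(json.dumps(record))  # one line of a JSONL dataset file
```

Writing one such line per page gives a JSONL file that can be reshaped later into whatever format the chosen training tool expects.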