I suppose none of these models can output bounding box coordinates for extracted...

michaelt · 2025-04-01T22:47:27 1743547647

qwen2.5-vl-72b-instruct seems perfectly happy outputting bounding boxes in my testing.

There's also a paper https://arxiv.org/pdf/2409.12191 where they explicitly say some of their training included bounding boxes and coordinates.

themanmaran · 2025-04-02T00:41:16 1743554476

We're also looking to test Qwen and others for bounding box support. Simon Willison had a great demo page where he used Gemini 2.5 to draw bounding boxes, and the results were pretty impressive. It would probably be pretty easy to drop Qwen into the same UI.

https://simonwillison.net/2025/Mar/25/gemini
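
One wrinkle when swapping models into a shared UI is that the coordinate conventions differ. A minimal sketch, assuming Gemini returns boxes as [ymin, xmin, ymax, xmax] on a 0-1000 scale and Qwen2.5-VL returns [x1, y1, x2, y2] in pixels of the (possibly resized) image the model saw. Both conventions are worth checking against the respective docs, and the function names here are made up for illustration:

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in original-image pixels


def gemini_to_pixels(box_2d: Sequence[float], width: int, height: int) -> Box:
    """Assumption: box_2d is [ymin, xmin, ymax, xmax], normalized to 0-1000."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


def qwen_to_pixels(
    bbox_2d: Sequence[float],
    model_size: Tuple[int, int],     # (width, height) of the image the model saw
    original_size: Tuple[int, int],  # (width, height) of the source image
) -> Box:
    """Assumption: bbox_2d is [x1, y1, x2, y2] in pixels of the resized input."""
    x1, y1, x2, y2 = bbox_2d
    scale_x = original_size[0] / model_size[0]
    scale_y = original_size[1] / model_size[1]
    return (
        round(x1 * scale_x),
        round(y1 * scale_y),
        round(x2 * scale_x),
        round(y2 * scale_y),
    )
```

With both outputs normalized to the same pixel tuple, the drawing code in the UI would not need to know which model produced the box.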

chpatrick · 2025-04-02T00:27:34 1743553654

Actually, Qwen 2.5 is trained to provide bounding boxes.

deepsquirrelnet · 2025-04-02T02:11:10 1743559870

Yep, this is true. I was poking around on their GitHub and they have examples in their “cookbooks” section. E.g.:

https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/ocr...

kapitalx · 2025-04-01T21:57:17 1743544637

If you're limited to open source models, that's very true. But for larger models, and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate Markdown step) with our solution at https://doctly.ai.

kapitalx · 2025-04-01T21:57:56 1743544676

In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(

_ea1k · 2025-04-01T21:56:40 1743544600

I'd guess that it wouldn't be a huge effort to fine-tune them to produce bounding boxes.

I haven't done it with OCR tasks, but I have fine-tuned other models to produce bounding boxes instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
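
For illustration, a rough sketch of how such a dataset could be assembled from existing OCR annotations. The record schema (`image`, `prompt`, `response`), the `bbox_2d` key, and the prompt wording are all hypothetical; a real fine-tuning framework will expect its own format:

```python
import json
from typing import Dict, Iterable, List, Tuple

# One annotation: the recognized text plus its (x1, y1, x2, y2) pixel box.
Annotation = Tuple[str, Tuple[int, int, int, int]]


def make_example(image_path: str, words: List[Annotation]) -> Dict[str, str]:
    """Build one prompt/response pair whose target is JSON with boxes."""
    target = [{"text": text, "bbox_2d": list(box)} for text, box in words]
    return {
        "image": image_path,
        "prompt": "Extract all text with bounding boxes as JSON.",  # hypothetical
        "response": json.dumps(target, ensure_ascii=False),
    }


def write_jsonl(examples: Iterable[Dict[str, str]], path: str) -> None:
    """Write one JSON record per line, a common layout for training data."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

The annotations themselves could come from any source that already pairs text with coordinates, such as the output of a conventional OCR engine.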