Florence-2 on DeepStream: One Vision Model for Detection, OCR & Captioning

Most vision systems do one thing each: one model detects objects, another reads text, another writes captions. Florence-2, Microsoft's unified vision model, handles all of them with one model. I got it running on live video with NVIDIA DeepStream, and the task can be switched on the fly while the stream keeps playing.

What one model can do

Task	What you get	FPS (T4, fp16)
Object detection	boxes + class names	15.6
OCR	text + boxes / plain text	7–9
Caption	one line → full paragraph	9–18
Dense region	boxes + short labels	18.6
Region proposal	boxes (no labels)	20.0
Grounding	a box for any phrase you type	19.3
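
For reference, the rows above line up with the task prompt strings published on the public Florence-2 model card. The lookup below is only a sketch: the prompt strings come from that model card, but which variant this pipeline uses for each row is my assumption, and `TASK_PROMPTS` and `prompt_for` are illustrative names rather than part of any library.

```python
# Task prompt strings as listed on the Florence-2 model card.
# The pairing of table rows to prompts is an assumption for illustration.
TASK_PROMPTS = {
    "object_detection": "<OD>",
    "ocr": "<OCR>",                          # plain text
    "ocr_with_boxes": "<OCR_WITH_REGION>",   # text + boxes
    "caption": "<CAPTION>",                  # one line
    "detailed_caption": "<DETAILED_CAPTION>",
    "paragraph_caption": "<MORE_DETAILED_CAPTION>",
    "dense_region": "<DENSE_REGION_CAPTION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}


def prompt_for(task: str, phrase: str = "") -> str:
    """Build the text prompt for a task; grounding appends the typed phrase."""
    prompt = TASK_PROMPTS[task]
    if task == "grounding":
        if not phrase:
            raise ValueError("grounding needs a phrase to locate")
        return prompt + phrase
    return prompt
```

For example, `prompt_for("grounding", "a red car")` produces the grounding prompt followed by the phrase, which matches how the model card's example concatenates a task prompt with extra text input.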

The interesting engineering bit: Florence-2 is an autoregressive encoder-decoder model, so it doesn't fit DeepStream's single-pass Gst-nvinfer the way a normal detector does. The pipeline is split instead: Gst-nvinfer runs the vision encoder, and the text-generation loop runs right after it on TensorRT with a KV cache. And because the task is just a prompt token, you can swap detection → OCR → captioning mid-stream with no restart.
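
To make the split concrete, here is a minimal sketch of the control flow in plain Python. It is not the actual DeepStream or TensorRT code: `encode` and `decode_step` are hypothetical stand-ins for the two engines, the toy versions exist only so the example runs, and `TaskSwitch`, `generate`, and `run_stream` are names I made up. What it shows is the shape of the idea: one encoder pass per frame, a token-by-token loop that carries a cache forward, and a task prompt that is re-read for every frame.

```python
import threading
from typing import Callable, Iterable, List, Optional, Tuple

EOS = -1  # hypothetical end-of-sequence id for this toy example

# decode_step(features, prompt, last_token, cache) -> (next_token, new_cache)
DecodeStep = Callable[[list, str, Optional[int], Optional[list]], Tuple[int, list]]


class TaskSwitch:
    """Thread-safe holder for the current task prompt."""

    def __init__(self, prompt: str) -> None:
        self._lock = threading.Lock()
        self._prompt = prompt

    def set(self, prompt: str) -> None:
        with self._lock:
            self._prompt = prompt

    def get(self) -> str:
        with self._lock:
            return self._prompt


def generate(features: list, prompt: str, decode_step: DecodeStep,
             max_new_tokens: int = 32) -> List[int]:
    """Greedy loop: one token per step, passing the cache back in each time."""
    tokens: List[int] = []
    last, cache = None, None
    for _ in range(max_new_tokens):
        last, cache = decode_step(features, prompt, last, cache)
        if last == EOS:
            break
        tokens.append(last)
    return tokens


def run_stream(frames: Iterable, encode: Callable, decode_step: DecodeStep,
               switch: TaskSwitch) -> List[Tuple[str, List[int]]]:
    results = []
    for frame in frames:
        features = encode(frame)   # stage 1: single pass per frame
        prompt = switch.get()      # task is read per frame, so a change needs no restart
        results.append((prompt, generate(features, prompt, decode_step)))
    return results


# Toy stand-ins so the sketch runs without a GPU (not real model behaviour).
def toy_encode(frame) -> list:
    return [float(frame)]


def toy_decode_step(features, prompt, last, cache):
    cache = (cache or []) + [last]          # stand-in for appending keys/values
    step = len(cache)
    limit = 2 if prompt == "<OD>" else 4    # pretend captions run longer
    return (EOS if step > limit else step), cache
```

Calling `switch.set("<CAPTION>")` from another thread between frames changes what the next frame is decoded as, while the loop itself keeps running. The lock is there because in a real pipeline the UI or control thread and the inference thread would be different threads.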