Most vision systems do one thing each: one model detects objects, another reads text, another writes captions. Florence-2, Microsoft's unified vision model, does all of them with a single model. I got it running live on video with NVIDIA DeepStream, with the ability to switch tasks on the fly while the stream keeps playing.
What one model can do
| Task | What you get | FPS (T4, fp16) |
|---|---|---|
| Object detection | boxes + class names | 15.6 |
| OCR | text + boxes / plain text | 7–9 |
| Caption | one line to full paragraph | 9–18 |
| Dense region | boxes + short labels | 18.6 |
| Region proposal | boxes (no labels) | 20.0 |
| Grounding | a box for any phrase you type | 19.3 |
The interesting engineering bit: Florence-2 is autoregressive (encoder-decoder), so it doesn't fit DeepStream's single-pass nvinfer like a normal detector. The pipeline gets split: Gst-nvinfer runs the vision encoder, and the text-generation loop runs right after it on TensorRT with a KV cache. And because the task is just a prompt token, you can swap detection → OCR → captioning mid-stream with no restart.
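To make the split concrete, here is a minimal pure-Python sketch of the control flow described above. It is not the code from the repo. `encode_image` and `decode_step` are hypothetical stand-ins for the nvinfer encoder stage and one TensorRT decoder step, and `TaskSwitch` is an illustrative helper I made up for this post. The task prompt strings are the ones listed on the Florence-2 model card; how the actual pipeline tokenizes them is not shown here.

```python
import threading

# Task prompt tokens as listed on the Florence-2 model card.
TASK_PROMPTS = {
    "detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "caption": "<CAPTION>",
    "dense_region": "<DENSE_REGION_CAPTION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

EOS = "</s>"


class TaskSwitch:
    """Thread-safe holder for the active task (hypothetical helper)."""

    def __init__(self, task="detection"):
        self._lock = threading.Lock()
        self._task = task

    def set(self, task):
        if task not in TASK_PROMPTS:
            raise ValueError(f"unknown task: {task}")
        with self._lock:
            self._task = task

    def prompt(self):
        with self._lock:
            return TASK_PROMPTS[self._task]


def encode_image(frame):
    """Stand-in for the single-pass vision encoder (the nvinfer stage)."""
    return f"features({frame})"


def decode_step(features, tokens, kv_cache):
    """Stand-in for one decoder step.

    A real step would run the decoder engine on the newest token only and
    reuse cached keys/values for everything before it. This toy version
    records the newest token in the cache, emits two dummy tokens, then EOS.
    """
    kv_cache.append(tokens[-1])
    return "tok" if len(kv_cache) < 3 else EOS


def generate(frame, switch, max_new_tokens=16):
    features = encode_image(frame)   # runs once per frame
    tokens = [switch.prompt()]       # task is read fresh for every frame
    kv_cache = []                    # per-frame cache, grows one entry per step
    for _ in range(max_new_tokens):
        next_token = decode_step(features, tokens, kv_cache)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    switch = TaskSwitch()
    print(generate("frame0", switch))
    switch.set("ocr")                # e.g. triggered from a control thread
    print(generate("frame1", switch))
```

Because the active task is only looked up at the start of each frame's generation, changing it from another thread takes effect on the next frame without touching the encoder or rebuilding anything, which is the property that makes mid-stream switching cheap.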
Read the full write-up (how it works + quick start)
I've put the complete guide (the architecture, the quick start, live task-switching, RTSP/Kafka output and all the commands) on my blog:
Florence-2 on DeepStream: the full guide
Code (MIT): github.com/Vishnu-RM-2001/Florence-2-deepstream











