Build an MCP server that exposes describe_image, ocr_document, and compare_images so any MCP host — Cursor, Claude Desktop, Hermes — can understand screenshots, receipts, and UI diffs through one protocol.
Powered by MiniCPM-V 4.6 via Ollama: 1.3B params · 1.6 GB · text + image · 256K context — the smallest vision model in the MiniCPM family, tuned for edge and phone deployment.
The moment: paste a screenshot into Cursor and ask “What changed in this UI?” The agent calls compare_images — no cloud vision API, no API key.
What you’ll learn
- Vision as MCP tools — why describe_image / ocr_document / compare_images beat one-off scripts
- One protocol, many hosts — same server in Cursor, Claude Desktop, and Hermes
- MiniCPM-V 4.6 on Ollama — pull, run, and wire the 1.6 GB multimodal model locally
- Private document OCR — extract receipt and whiteboard text without sending pixels to the cloud
- Before/after UI diffs — compare two screenshots for regression review
- Runnable Python MCP server + an end-to-end agent demo (works offline with OLLAMA_MOCK=1)
Introduction — agents without eyes
Your agent can grep code, run tests, and provision infra — but the moment someone pastes a screenshot , a receipt , or a Figma export , the loop breaks unless you bolt on a vision API. That means API keys, cloud latency, and pixels leaving your machine.
MiniCPM-V 4.6 is built for the opposite: 1.3B parameters , ~1.6 GB on disk, 256K context , and native text + image input via Ollama. Wrap it in MCP and every host discovers the same three vision tools at connect time.
Part 1 — MiniCPM-V 4.6 on Ollama
From the Ollama model page:
ollama pull minicpm-v4.6
ollama run minicpm-v4.6 "Describe this image" --image ./photo.jpg
This guide uses the same model through Ollama’s HTTP API so the MCP server can batch tool calls without spawning a CLI per request.
If you want a ready-to-use environment, TechLatest offers preconfigured Ollama and Open WebUI deployments that allow developers to run MiniCPM-V, Gemma, Qwen, Llama, and DeepSeek models locally without spending time on installation and configuration.
Link: https://techlatest.net/support/multi_llm_gpu_vm_support/
Link: https://techlatest.net/support/multi_llm_vm_support/
Part 2 — Why MCP for vision
You could write a Python script that calls Ollama and paste output into chat. But then Cursor, Claude Desktop, and Hermes each need their own glue.
MCP collapses that. You write one server ; hosts discover tools at capability exchange. Add a fourth tool later, and every host sees it on reconnect — no host-side changes.
If host/client/server isn’t second nature yet, read the MCP Visual Guide first.
Vision becomes even more powerful when connected to agent workflows. Developers can deploy TechLatest’s CrewAI Studio VM and create multi-agent systems where one agent performs OCR, another analyzes screenshots, and a third generates reports. The same MiniCPM-V MCP server can be shared across all agents through MCP, creating a reusable vision layer for complex automation workflows.
Link: https://techlatest.net/support/crewai-support/
Part 3 — Quick start
cd guides/minicpm-v-mcp-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
ollama pull minicpm-v4.6
python examples/generate_fixtures.py
python examples/agent_demo.py
generate_fixtures.py
#!/usr/bin/env python3
"""Generate sample images for MCP demos (no external assets required)."""
from __future__ import annotations
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont
FIXTURES = Path( __file__ ).resolve().parent / "fixtures"
def _font(size: int):
for name in ("DejaVuSans.ttf", "/System/Library/Fonts/Supplemental/Arial.ttf"):
try:
return ImageFont.truetype(name, size)
except OSError:
continue
return ImageFont.load_default()
def receipt() -> None:
img = Image.new("RGB", (480, 640), "#fafafa")
d = ImageDraw.Draw(img)
f, fs = _font(22), _font(16)
d.text((40, 40), "COFFEE BEAN Co.", fill="#1a1a1a", font=f)
d.text((40, 90), "123 Main St · San Francisco", fill="#555", font=fs)
lines = [
"Latte (Oat) $5.50",
"Croissant $4.25",
"Tip $1.00",
"─────────────────────────",
"TOTAL $10.75",
"Card **** 4242",
"2026-06-23 09:14 AM",
]
y = 150
for line in lines:
d.text((40, y), line, fill="#222", font=fs)
y += 36
img.save(FIXTURES / "sample_receipt.png")
def diagram_v1() -> None:
img = Image.new("RGB", (640, 400), "#0f172a")
d = ImageDraw.Draw(img)
f = _font(18)
d.rounded_rectangle((40, 80, 200, 160), radius=12, fill="#1e3a5f", outline="#38bdf8")
d.text((70, 110), "API", fill="#e2e8f0", font=f)
d.rounded_rectangle((420, 80, 580, 160), radius=12, fill="#14532d", outline="#4ade80")
d.text((440, 110), "Qdrant", fill="#e2e8f0", font=f)
d.line((200, 120, 420, 120), fill="#94a3b8", width=3)
d.text((250, 200), "v1 — sync pipeline", fill="#94a3b8", font=f)
img.save(FIXTURES / "diagram_v1.png")
def diagram_v2() -> None:
img = Image.new("RGB", (640, 400), "#0f172a")
d = ImageDraw.Draw(img)
f = _font(18)
d.rounded_rectangle((40, 80, 200, 160), radius=12, fill="#1e3a5f", outline="#38bdf8")
d.text((60, 110), "LitServe", fill="#e2e8f0", font=f)
d.rounded_rectangle((250, 60, 410, 140), radius=12, fill="#4c1d95", outline="#a78bfa")
d.text((270, 90), "CrewAI", fill="#e2e8f0", font=f)
d.rounded_rectangle((420, 80, 580, 160), radius=12, fill="#14532d", outline="#4ade80")
d.text((440, 110), "Qdrant", fill="#e2e8f0", font=f)
d.line((200, 120, 250, 100), fill="#94a3b8", width=3)
d.line((410, 100, 420, 120), fill="#94a3b8", width=3)
d.text((200, 200), "v2 — agentic pipeline", fill="#fbbf24", font=f)
img.save(FIXTURES / "diagram_v2.png")
def main() -> None:
FIXTURES.mkdir(parents=True, exist_ok=True)
receipt()
diagram_v1()
diagram_v2()
print(f"Fixtures written to {FIXTURES}")
if __name__ == " __main__":
main()
agent_demo.py
#!/usr/bin/env python3
"""End-to-end demo — MiniCPM-V MCP vision tools (works offline with OLLAMA_MOCK=1).
Simulates what Cursor / Claude Desktop sees when the agent calls vision tools.
"""
from __future__ import annotations
import json
import os
import sys
from pathlib import Path
ROOT = Path( __file__ ).resolve().parent
sys.path.insert(0, str(ROOT))
import vision_backend as vb # noqa: E402
from server import compare_images, describe_image, ocr_document # noqa: E402
FIXTURES = ROOT / "fixtures"
def _banner(title: str) -> None:
print(f"\n{'=' * 60}\n {title}\n{'=' * 60}")
def _parse(result: str) -> dict:
return json.loads(result)
def main() -> None:
os.environ.setdefault("OLLAMA_VISION_MODEL", "minicpm-v4.6")
if not FIXTURES.exists() or not list(FIXTURES.glob("*.png")):
from generate_fixtures import main as gen # noqa: E402
gen()
status = vb.health_check()
_banner("MiniCPM-V 4.6 Vision MCP — Agent Demo")
print(f"Model: {vb.VISION_MODEL} · Mode: {status.get('mode', '?')}")
if not status.get("ok") and not vb.MOCK:
print("\n⚠️ Ollama offline — re-run with OLLAMA_MOCK=1 or start Ollama.\n")
# Scenario 1 — describe screenshot
_banner("Scenario 1 — describe_image")
print('[Tool: describe_image] path=fixtures/diagram_v2.png')
r1 = _parse(describe_image(str(FIXTURES / "diagram_v2.png"), "What services are shown?"))
print("\n## Architecture summary\n")
print(r1.get("result", r1.get("error", r1)))
# Scenario 2 — OCR receipt
_banner("Scenario 2 — ocr_document")
print('[Tool: ocr_document] path=fixtures/sample_receipt.png')
r2 = _parse(ocr_document(str(FIXTURES / "sample_receipt.png")))
print("\n## Receipt OCR\n")
print(r2.get("result", r2.get("error", r2)))
# Scenario 3 — compare before/after
_banner("Scenario 3 — compare_images")
print("[Tool: compare_images] v1 → v2 pipeline diagrams")
r3 = _parse(
compare_images(
str(FIXTURES / "diagram_v1.png"),
str(FIXTURES / "diagram_v2.png"),
focus="new components and labels",
)
)
print("\n## Visual diff\n")
print(r3.get("result", r3.get("error", r3)))
_banner("Done — wire examples/server.py into Cursor MCP settings")
print("See examples/cursor_mcp.json.example")
if __name__ == " __main__":
main()
Part 4 — The vision backend
examples/vision_backend.py encodes images as base64 and POSTs to OLLAMA_HOST/api/chat:
payload = {
"model": VISION_MODEL, # minicpm-v4.6
"messages": [{"role": "user", "content": prompt, "images": images_b64}],
"stream": False,
}
vision_backend.py
"""Ollama vision backend for MiniCPM-V 4.6 — shared by MCP server and demos."""
from __future__ import annotations
import base64
import os
from pathlib import Path
import httpx
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://127.0.0.1:11434").rstrip("/")
VISION_MODEL = os.environ.get("OLLAMA_VISION_MODEL", "minicpm-v4.6")
MOCK = os.environ.get("OLLAMA_MOCK", "0") == "1"
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp"}
class VisionError(Exception):
pass
def _encode_image(path: Path) -> str:
if not path.is_file():
raise VisionError(f"Image not found: {path}")
if path.suffix.lower() not in SUPPORTED_SUFFIXES:
raise VisionError(f"Unsupported image type: {path.suffix}")
return base64.b64encode(path.read_bytes()).decode("ascii")
def _mock_response(prompt: str, image_count: int) -> str:
return (
f"[mock {VISION_MODEL}] Processed {image_count} image(s).\n"
f"Prompt preview: {prompt[:120]}…\n"
"Set OLLAMA_MOCK=0 and run `ollama pull minicpm-v4.6` for live inference."
)
def chat_vision(prompt: str, image_paths: list[Path], *, timeout: float = 120.0) -> str:
"""Send a vision chat request to Ollama."""
if MOCK:
return _mock_response(prompt, len(image_paths))
images_b64 = [_encode_image(p) for p in image_paths]
payload = {
"model": VISION_MODEL,
"messages": [{"role": "user", "content": prompt, "images": images_b64}],
"stream": False,
}
try:
with httpx.Client(timeout=timeout) as client:
resp = client.post(f"{OLLAMA_HOST}/api/chat", json=payload)
resp.raise_for_status()
data = resp.json()
except httpx.ConnectError as exc:
raise VisionError(
f"Cannot reach Ollama at {OLLAMA_HOST}. Start Ollama and run: ollama pull {VISION_MODEL}"
) from exc
except httpx.HTTPStatusError as exc:
raise VisionError(f"Ollama error {exc.response.status_code}: {exc.response.text[:300]}") from exc
message = data.get("message") or {}
content = message.get("content", "").strip()
if not content:
raise VisionError("Empty response from Ollama")
return content
def health_check() -> dict:
"""Return model + connectivity status for demos."""
if MOCK:
return {"ok": True, "mode": "mock", "model": VISION_MODEL}
try:
with httpx.Client(timeout=5.0) as client:
resp = client.get(f"{OLLAMA_HOST}/api/tags")
resp.raise_for_status()
tags = {m.get("name", "").split(":")[0] for m in resp.json().get("models", [])}
base = VISION_MODEL.split(":")[0]
return {
"ok": base in tags or VISION_MODEL in tags,
"mode": "live",
"model": VISION_MODEL,
"ollama_host": OLLAMA_HOST,
}
except Exception as exc: # noqa: BLE001 — demo helper
return {"ok": False, "mode": "offline", "model": VISION_MODEL, "error": str(exc)}
Environment variables (see .env.example):
.env.example
# Ollama host (default local)
OLLAMA_HOST=http://127.0.0.1:11434
# Vision model — 1.6 GB, text + image, 256K context
OLLAMA_VISION_MODEL=minicpm-v4.6
# Set to 1 to run agent_demo without Ollama (offline smoke test)
OLLAMA_MOCK=0
Part 5 — The three tools
examples/server.py is a FastMCP server.
server.py
#!/usr/bin/env python3
"""MiniCPM-V MCP server — vision tools for Cursor, Claude Desktop, and Hermes.
Exposes three tools over MCP:
describe_image(path, question?) → general image understanding
ocr_document(path) → structured text extraction
compare_images(path_a, path_b, focus?) → side-by-side visual diff
Powered by MiniCPM-V 4.6 via Ollama (~1.6 GB, text + image, 256K context).
Run: python examples/server.py # stdio transport (for MCP hosts)
"""
from __future__ import annotations
import json
from pathlib import Path
from mcp.server.fastmcp import FastMCP
try:
from . import vision_backend as vb
except ImportError: # pragma: no cover
import vision_backend as vb # type: ignore
mcp = FastMCP("minicpm-vision")
DESCRIBE_DEFAULT = (
"Describe this image in detail. Include objects, text visible, layout, "
"colors, and anything notable for a developer reviewing a screenshot."
)
OCR_PROMPT = (
"Extract all readable text from this document or screenshot. "
"Preserve structure with markdown headings and bullet lists where appropriate. "
"If tables are present, format them as markdown tables."
)
COMPARE_DEFAULT = (
"Compare these two images. List similarities and differences. "
"Note UI changes, text changes, and layout shifts."
)
def _resolve(path: str) -> Path:
p = Path(path).expanduser().resolve()
if not p.is_file():
raise FileNotFoundError(f"Not a file: {p}")
return p
def _tool_result(text: str, **meta) -> str:
return json.dumps({"result": text, **meta}, indent=2)
@mcp.tool()
def describe_image(path: str, question: str = "") -> str:
"""Describe or answer questions about a single image using MiniCPM-V 4.6.
Args:
path: Absolute or relative path to a PNG, JPG, WEBP, or GIF file.
question: Optional specific question about the image. Leave empty for
a general description.
Returns JSON with the model's answer and metadata.
"""
try:
img = _resolve(path)
prompt = question.strip() or DESCRIBE_DEFAULT
answer = vb.chat_vision(prompt, [img])
return _tool_result(answer, tool="describe_image", path=str(img), model=vb.VISION_MODEL)
except (FileNotFoundError, vb.VisionError) as exc:
return json.dumps({"error": str(exc)}, indent=2)
@mcp.tool()
def ocr_document(path: str) -> str:
"""OCR a document, receipt, whiteboard photo, or screenshot to markdown text.
Args:
path: Absolute or relative path to the image file.
Returns JSON with extracted text in markdown format.
"""
try:
img = _resolve(path)
answer = vb.chat_vision(OCR_PROMPT, [img])
return _tool_result(answer, tool="ocr_document", path=str(img), model=vb.VISION_MODEL)
except (FileNotFoundError, vb.VisionError) as exc:
return json.dumps({"error": str(exc)}, indent=2)
@mcp.tool()
def compare_images(path_a: str, path_b: str, focus: str = "") -> str:
"""Compare two images and report visual differences.
Args:
path_a: Path to the first image (e.g. before screenshot).
path_b: Path to the second image (e.g. after screenshot).
focus: Optional aspect to focus on (e.g. "navigation bar", "error message").
Returns JSON with a structured comparison.
"""
try:
a, b = _resolve(path_a), _resolve(path_b)
prompt = COMPARE_DEFAULT
if focus.strip():
prompt += f"\n\nFocus especially on: {focus.strip()}"
answer = vb.chat_vision(prompt, [a, b])
return _tool_result(
answer,
tool="compare_images",
path_a=str(a),
path_b=str(b),
model=vb.VISION_MODEL,
)
except (FileNotFoundError, vb.VisionError) as exc:
return json.dumps({"error": str(exc)}, indent=2)
@mcp.resource("minicpm-vision://model")
def model_info() -> str:
"""Capability hint: which vision model and host this server uses."""
status = vb.health_check()
return json.dumps(
{
"model": vb.VISION_MODEL,
"ollama_host": vb.OLLAMA_HOST,
"tools": ["describe_image", "ocr_document", "compare_images"],
"status": status,
},
indent=2,
)
if __name__ == " __main__":
mcp.run(transport="stdio")
describe_image
General-purpose image Q&A. Pass a custom question for targeted queries.
Sample input — architecture diagram the demo describes:
ocr_document
Structured OCR prompt — markdown headings, bullet lists, tables. Ideal for receipts, invoices, and whiteboard photos.
Sample input — coffee shop receipt:
compare_images
Two paths + optional focus (e.g. "navigation bar"). Returns similarities, differences, and UI change notes.
Sample inputs — before and after pipeline:
Each tool returns JSON with result, tool, paths, and model.
Part 6 — Agent demo (terminal walkthrough)
examples/agent_demo.py runs all three scenarios:
python examples/generate_fixtures.py
python examples/agent_demo.py
agent_demo.py
#!/usr/bin/env python3
"""End-to-end demo — MiniCPM-V MCP vision tools (works offline with OLLAMA_MOCK=1).
Simulates what Cursor / Claude Desktop sees when the agent calls vision tools.
"""
from __future__ import annotations
import json
import os
import sys
from pathlib import Path
ROOT = Path( __file__ ).resolve().parent
sys.path.insert(0, str(ROOT))
import vision_backend as vb # noqa: E402
from server import compare_images, describe_image, ocr_document # noqa: E402
FIXTURES = ROOT / "fixtures"
def _banner(title: str) -> None:
print(f"\n{'=' * 60}\n {title}\n{'=' * 60}")
def _parse(result: str) -> dict:
return json.loads(result)
def main() -> None:
os.environ.setdefault("OLLAMA_VISION_MODEL", "minicpm-v4.6")
if not FIXTURES.exists() or not list(FIXTURES.glob("*.png")):
from generate_fixtures import main as gen # noqa: E402
gen()
status = vb.health_check()
_banner("MiniCPM-V 4.6 Vision MCP — Agent Demo")
print(f"Model: {vb.VISION_MODEL} · Mode: {status.get('mode', '?')}")
if not status.get("ok") and not vb.MOCK:
print("\n⚠️ Ollama offline — re-run with OLLAMA_MOCK=1 or start Ollama.\n")
# Scenario 1 — describe screenshot
_banner("Scenario 1 — describe_image")
print('[Tool: describe_image] path=fixtures/diagram_v2.png')
r1 = _parse(describe_image(str(FIXTURES / "diagram_v2.png"), "What services are shown?"))
print("\n## Architecture summary\n")
print(r1.get("result", r1.get("error", r1)))
# Scenario 2 — OCR receipt
_banner("Scenario 2 — ocr_document")
print('[Tool: ocr_document] path=fixtures/sample_receipt.png')
r2 = _parse(ocr_document(str(FIXTURES / "sample_receipt.png")))
print("\n## Receipt OCR\n")
print(r2.get("result", r2.get("error", r2)))
# Scenario 3 — compare before/after
_banner("Scenario 3 — compare_images")
print("[Tool: compare_images] v1 → v2 pipeline diagrams")
r3 = _parse(
compare_images(
str(FIXTURES / "diagram_v1.png"),
str(FIXTURES / "diagram_v2.png"),
focus="new components and labels",
)
)
print("\n## Visual diff\n")
print(r3.get("result", r3.get("error", r3)))
_banner("Done — wire examples/server.py into Cursor MCP settings")
print("See examples/cursor_mcp.json.example")
if __name__ == " __main__":
main()
The terminal shows the same flow your MCP host runs:
[Tool: describe_image] path=fixtures/diagram_v2.png
[Tool: ocr_document] path=fixtures/sample_receipt.png
[Tool: compare_images] v1 → v2 pipeline diagrams
Offline smoke test (no Ollama):
OLLAMA_MOCK=1 python examples/agent_demo.py
Part 7 — Wire into Cursor
Copy examples/cursor_mcp.json.example into Cursor → Settings → MCP. Use absolute paths for cwd.
Restart Cursor — you should see describe_image, ocr_document, compare_images.
Try: “Use ocr_document on /path/to/receipt.png and summarize the total.”
{
"mcpServers": {
"minicpm-vision": {
"command": "python",
"args": ["examples/server.py"],
"cwd": "/absolute/path/to/guides/minicpm-v-mcp-server",
"env": {
"OLLAMA_VISION_MODEL": "minicpm-v4.6",
"OLLAMA_HOST": "http://127.0.0.1:11434"
}
}
}
}
Part 8 — Wire into Claude Desktop
Add the server block from examples/claude_desktop_config.json.example to ~/Library/Application Support/Claude/claude_desktop_config.json on macOS.
Restart Claude Desktop. Vision tools appear alongside your other MCP servers.
{
"mcpServers": {
"minicpm-vision": {
"command": "python",
"args": ["/absolute/path/to/guides/minicpm-v-mcp-server/examples/server.py"],
"env": {
"OLLAMA_VISION_MODEL": "minicpm-v4.6",
"OLLAMA_HOST": "http://127.0.0.1:11434"
}
}
}
}
Conclusion
Give your agent eyes — without giving away your pixels.
Most coding agents are brilliant at text and terrible at images. The usual fix is a cloud vision API: API keys in config, latency on every screenshot, and your receipts, UI mocks, and whiteboard photos leaving your machine.
MiniCPM-V 4.6 flips that. At 1.3B parameters and ~1.6 GB on Ollama, it runs comfortably on a 16 GB Mac and handles text + image input with a 256K context window. Wrap it in a small MCP server, and you get three reusable tools — describe_image, ocr_document, and compare_images — that Cursor, Claude Desktop, and any other MCP host can discover at connect time.
Thank you so much for reading
Like | Follow | Subscribe to the newsletter.
Catch us on
Website: https://www.techlatest.net/
Newsletter: https://substack.com/@parvezmohammed
Twitter: https://twitter.com/TechlatestNet
LinkedIn: https://www.linkedin.com/in/techlatest-net/
YouTube:https://www.youtube.com/@techlatest_net/
Blogs: https://medium.com/@techlatest.net
Reddit Community: https://www.reddit.com/user/techlatest_net/




















