Most AI assistants can read text, write code, and automate workflows. But the moment a user sends a photo — a receipt, a whiteboard snapshot, a product image, or a screenshot — the automation often stops unless you connect a cloud vision API.
That usually means API keys, usage costs, network latency, and private images leaving your machine.
In this guide, you’ll build a private photo assistant that runs entirely on your own infrastructure using OpenClaw and MiniCPM-V 4.6. The assistant can receive images from Telegram, WhatsApp, or a local CLI, analyze them with a lightweight multimodal model running on Ollama, and return structured answers with summaries, OCR results, and suggested replies.
At the center of the workflow is MiniCPM-V 4.6, a compact vision-language model with approximately 1.3 billion parameters and a footprint of around 1.6 GB. By exposing it through a LitServe API and connecting it to OpenClaw through a custom vision-photo skill, you create a reusable vision capability that any OpenClaw agent can invoke on demand.
By the end of this tutorial, you’ll have a fully functional photo assistant that can:
- Describe photos and screenshots
- Extract text from receipts, invoices, and documents
- Answer questions about uploaded images
- Generate structured markdown responses
- Work across Telegram, WhatsApp, and local workflows
- Keep image processing completely private and local
No cloud vision APIs. No external image uploads. Just OpenClaw, MiniCPM-V 4.6, and your own infrastructure.
What you end up with
- OpenClaw Gateway — always-on control plane
- minicpm-v4.6 — conversational + vision model (~1.6 GB)
- vision-photo skill — vision_query.sh → POST /predict on port 8002
- Structured markdown replies — summary, details, OCR text, suggested channel message
Flow
- User sends a photo on Telegram, WhatsApp, or CLI
- MiniCPM-V 4.6 plans and invokes the vision-photo skill
- Skill POSTs to LitServe http://127.0.0.1:8002/predict
- Structured answer returns to the same channel
Prerequisites
Requirement | Check
Node 22.12+ | node -v
Ollama | ollama -v
Python3.10+ | python3 --version
curl + jq | curl --version && jq --version
Part 1 — Pull MiniCPM-V 4.6
From Ollama:
ollama pull minicpm-v4.6
ollama run minicpm-v4.6 "Hello" --image ./photo.jpg
Tag | Size | Input
minicpm-v4.6:latest | 1.6 GB | Text, Image
Don’t want to spend time installing Ollama manually? TechLatest’s Open WebUI + Ollama deployment provides a pre-configured environment for running MiniCPM-V, Gemma, Qwen, Llama, and DeepSeek models locally. Simply launch the instance and start building multimodal AI workflows without additional setup.
Link: https://techlatest.net/support/multi_llm_gpu_vm_support/
Link: https://techlatest.net/support/multi_llm_vm_support/
Part 2 — Vision LitServe API
Terminal A — start the vision server:
cd openclaw-minicpm-v
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
.env
python generate_sample.py
python vision_server.py
generate_sample.py
#!/usr/bin/env python3
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont
out = Path( __file__ ).resolve().parent / "samples"
out.mkdir(exist_ok=True)
img = Image.new("RGB", (480, 640), "#fafafa")
d = ImageDraw.Draw(img)
try:
f = ImageFont.truetype("/System/Library/Fonts/Supplemental/Arial.ttf", 18)
except OSError:
f = ImageFont.load_default()
d.text((40, 40), "COFFEE BEAN Co.", fill="#111", font=f)
d.text((40, 100), "TOTAL $10.75", fill="#222", font=f)
d.text((40, 140), "2026-06-23", fill="#555", font=f)
img.save(out / "receipt.png")
print(f"Wrote {out / 'receipt.png'}")
python vision_server.py
"""LitServe vision API — MiniCPM-V 4.6 photo understanding for OpenClaw."""
from __future__ import annotations
import os
from pathlib import Path
import litserve as ls
from dotenv import load_dotenv
import vision_backend as vb
load_dotenv()
PORT = int(os.getenv("PORT", "8002"))
STRUCTURED_PROMPT = """Analyze the image and answer the user's question.
Return markdown with these sections when relevant:
## Summary
(one paragraph)
## Details
(bullet points)
## Text found
(any visible text, or "none")
## Suggested reply
(a short message suitable for Telegram/WhatsApp)
"""
class VisionPhotoAPI(ls.LitAPI):
def setup(self, device):
self.model = vb.VISION_MODEL
def decode_request(self, request):
return {
"query": (request.get("query") or "What is in this photo?").strip(),
"image_path": (request.get("image_path") or "").strip(),
}
def predict(self, inputs):
path = Path(inputs["image_path"]).expanduser().resolve()
prompt = f"{STRUCTURED_PROMPT}\n\nUser question: {inputs['query']}"
try:
answer = vb.chat_vision(prompt, [path])
return {"output": answer, "model": self.model, "image_path": str(path)}
except vb.VisionError as exc:
return {"error": str(exc), "model": self.model}
def encode_response(self, output):
return output
if __name__ == " __main__":
server = ls.LitServer(VisionPhotoAPI(), accelerator="auto", timeout=False)
print(f"Vision API on http://127.0.0.1:{PORT}/predict (model: {vb.VISION_MODEL})")
server.run(port=PORT)
Server prints: Vision API on http://127.0.0.1:8002/predict
Request shape:
{
"query": "What is the total on this receipt?",
"image_path": "/absolute/path/to/receipt.png"
}
Response:
{
"output": "## Summary\n…",
"model": "minicpm-v4.6",
"image_path": "…"
}
Sample image the API reads:
Test with client.py:
python client.py --image samples/receipt.png --query "OCR this receipt"
client.py
#!/usr/bin/env python3
"""CLI client for the vision photo API."""
from __future__ import annotations
import argparse
import json
import os
import urllib.request
DEFAULT_URL = os.environ.get("VISION_API_URL", "http://127.0.0.1:8002")
def main() -> None:
p = argparse.ArgumentParser(description="Query local MiniCPM-V vision API")
p.add_argument("--image", required=True, help="Path to image file")
p.add_argument("--query", default="Describe this photo in detail.")
p.add_argument("--url", default=f"{DEFAULT_URL.rstrip('/')}/predict")
args = p.parse_args()
body = json.dumps({"query": args.query, "image_path": args.image}).encode()
req = urllib.request.Request(args.url, data=body, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=180) as resp:
data = json.loads(resp.read().decode())
print(data.get("output") or data.get("error") or data)
if __name__ == " __main__":
main()
Expected sections in the output: Summary , Details , Text found , Suggested reply.
Part 3 — Install OpenClaw
Terminal B:
cd guides/openclaw-minicpm-v
source ./use-node22.sh
npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw models set ollama/minicpm-v4.6
#!/usr/bin/env bash
set -euo pipefail
export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
if [[-s "$NVM_DIR/nvm.sh"]]; then
. "$NVM_DIR/nvm.sh"
nvm use "$(cat "$(dirname "$0")/.nvmrc")"
else
echo "nvm not found — install Node 22+" >&2
exit 1
fi
echo "Node: $(node -v)"
In openclaw.json sets the primary model and VISION_API_URL.
// Merge into ~/.openclaw/openclaw.json
{
agents: {
defaults: {
model: { primary: "ollama/minicpm-v4.6" },
skills: ["vision-photo"],
},
},
models: {
providers: {
ollama: {
apiKey: "ollama-local",
baseUrl: "http://127.0.0.1:11434",
api: "ollama",
timeoutSeconds: 300,
models: [
{
id: "minicpm-v4.6",
name: "MiniCPM-V 4.6",
reasoning: false,
input: ["text", "image"],
contextWindow: 256000,
maxTokens: 8192,
params: { keep_alive: "15m" },
},
],
},
},
},
skills: {
entries: {
"vision-photo": {
enabled: true,
env: {
VISION_API_URL: "http://127.0.0.1:8002",
},
},
},
},
}
Looking for a faster setup? TechLatest offers a pre-configured OpenClaw environment that includes the gateway, agent runtime, and common dependencies, allowing developers to focus on building skills and automations instead of infrastructure setup.
Link: https://techlatest.net/support/openclaw-support/
Part 4 — Install vision-photo skill
chmod +x install-skill.sh skills/vision-photo/scripts/*.sh
./install-skill.sh
openclaw gateway restart
The skill tells the agent to run:
vision_query.sh "/path/to/image.jpg" "user question"
See skills/vision-photo/SKILL.md.
Part 5 — Telegram / WhatsApp
Follow OpenClaw channels docs for your platform. Keep DM pairing enabled for security.
When a user sends a photo:
- OpenClaw saves media to a local path
- Agent invokes vision-photo with path + caption
- LitServe returns structured markdown
- Agent sends suggested reply to the channel
Example channel reply from the demo receipt:
Your receipt total is _ **$10.75_**
Part 6 — Smoke test
./test-local.sh
Runs: Ollama check → sample image → API health → skill script query.
For a step-by-step walkthrough and complete implementation details, check out the full guide here.
Deploy the Complete Stack on TechLatest
You can deploy the entire private photo assistant stack using TechLatest AI infrastructure:
- Open WebUI + Ollama for local multimodal inference
- OpenClaw for agent orchestration and messaging integrations
- JupyterHub for experimentation and evaluation
- AWS, Azure, and GCP deployment options
This allows developers to build privacy-first vision assistants without spending hours configuring infrastructure and dependencies.
Conclusion
You’ve successfully built a private multimodal assistant that can see, read, and understand images directly from your messaging channels.
Using OpenClaw as the orchestration layer, MiniCPM-V 4.6 as the vision model, and LitServe as the local inference API, you’ve created a workflow where users can send a photo and receive structured, actionable insights without relying on external vision services.
The architecture is intentionally simple:
- OpenClaw handles agent orchestration and channel integrations
- MiniCPM-V 4.6 provides image understanding and OCR capabilities
- LitServe exposes a lightweight local API
- The vision-photo skill connects everything
Because the entire stack runs locally, sensitive screenshots, receipts, documents, and personal photos never leave your infrastructure. Whether you’re building customer support agents, document processing workflows, field inspection tools, or personal AI assistants, the same pattern can be extended with additional skills and automation.
From a single image to a complete conversation, your OpenClaw agent now has eyes.
Thank you so much for reading
Like | Follow | Subscribe to the newsletter.
Catch us on
Website: https://www.techlatest.net/
Newsletter: https://substack.com/@parvezmohammed
Twitter: https://twitter.com/TechlatestNet
LinkedIn: https://www.linkedin.com/in/techlatest-net/
YouTube:https://www.youtube.com/@techlatest_net/
Blogs: https://medium.com/@techlatest.net
Reddit Community: https://www.reddit.com/user/techlatest_net/



















