Build a Private Photo Assistant on Telegram with OpenClaw + MiniCPM-V 4.6

Most AI assistants can read text, write code, and automate workflows. But the moment a user sends a photo — a receipt, a whiteboard snapshot, a product image, or a screenshot — the automation often stops unless you connect a cloud vision API.

That usually means API keys, usage costs, network latency, and private images leaving your machine.

In this guide, you’ll build a private photo assistant that runs entirely on your own infrastructure using OpenClaw and MiniCPM-V 4.6. The assistant can receive images from Telegram, WhatsApp, or a local CLI, analyze them with a lightweight multimodal model running on Ollama, and return structured answers with summaries, OCR results, and suggested replies.

At the center of the workflow is MiniCPM-V 4.6, a compact vision-language model with approximately 1.3 billion parameters and a footprint of around 1.6 GB. By exposing it through a LitServe API and connecting it to OpenClaw through a custom vision-photo skill, you create a reusable vision capability that any OpenClaw agent can invoke on demand.

By the end of this tutorial, you’ll have a fully functional photo assistant that can:

Describe photos and screenshots
Extract text from receipts, invoices, and documents
Answer questions about uploaded images
Generate structured markdown responses
Work across Telegram, WhatsApp, and local workflows
Keep image processing completely private and local

No cloud vision APIs. No external image uploads. Just OpenClaw, MiniCPM-V 4.6, and your own infrastructure.

What you end up with

OpenClaw Gateway — always-on control plane
minicpm-v4.6 — conversational + vision model (~1.6 GB)
vision-photo skill — vision_query.sh → POST /predict on port 8002
Structured markdown replies — summary, details, OCR text, suggested channel message

Flow

User sends a photo on Telegram, WhatsApp, or CLI
MiniCPM-V 4.6 plans and invokes the vision-photo skill
Skill POSTs to LitServe http://127.0.0.1:8002/predict
Structured answer returns to the same channel

Prerequisites

Requirement | Check
Node 22.12+ | node -v
Ollama | ollama -v
Python3.10+ | python3 --version
curl + jq | curl --version && jq --version

Part 1 — Pull MiniCPM-V 4.6

From Ollama:

ollama pull minicpm-v4.6
ollama run minicpm-v4.6 "Hello" --image ./photo.jpg

Tag | Size | Input

minicpm-v4.6:latest | 1.6 GB | Text, Image

Don’t want to spend time installing Ollama manually? TechLatest’s Open WebUI + Ollama deployment provides a pre-configured environment for running MiniCPM-V, Gemma, Qwen, Llama, and DeepSeek models locally. Simply launch the instance and start building multimodal AI workflows without additional setup.

Link: https://techlatest.net/support/multi_llm_gpu_vm_support/

Link: https://techlatest.net/support/multi_llm_vm_support/

Part 2 — Vision LitServe API

Terminal A — start the vision server:

cd openclaw-minicpm-v
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
.env
python generate_sample.py
python vision_server.py

generate_sample.py

#!/usr/bin/env python3
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

out = Path( __file__ ).resolve().parent / "samples"
out.mkdir(exist_ok=True)
img = Image.new("RGB", (480, 640), "#fafafa")
d = ImageDraw.Draw(img)
try:
    f = ImageFont.truetype("/System/Library/Fonts/Supplemental/Arial.ttf", 18)
except OSError:
    f = ImageFont.load_default()
d.text((40, 40), "COFFEE BEAN Co.", fill="#111", font=f)
d.text((40, 100), "TOTAL $10.75", fill="#222", font=f)
d.text((40, 140), "2026-06-23", fill="#555", font=f)
img.save(out / "receipt.png")
print(f"Wrote {out / 'receipt.png'}")


python vision_server.py

"""LitServe vision API — MiniCPM-V 4.6 photo understanding for OpenClaw."""
from __future__ import annotations

import os
from pathlib import Path

import litserve as ls
from dotenv import load_dotenv

import vision_backend as vb

load_dotenv()

PORT = int(os.getenv("PORT", "8002"))

STRUCTURED_PROMPT = """Analyze the image and answer the user's question.
Return markdown with these sections when relevant:
## Summary
(one paragraph)

## Details
(bullet points)

## Text found
(any visible text, or "none")

## Suggested reply
(a short message suitable for Telegram/WhatsApp)
"""

class VisionPhotoAPI(ls.LitAPI):
    def setup(self, device):
        self.model = vb.VISION_MODEL

    def decode_request(self, request):
        return {
            "query": (request.get("query") or "What is in this photo?").strip(),
            "image_path": (request.get("image_path") or "").strip(),
        }

    def predict(self, inputs):
        path = Path(inputs["image_path"]).expanduser().resolve()
        prompt = f"{STRUCTURED_PROMPT}\n\nUser question: {inputs['query']}"
        try:
            answer = vb.chat_vision(prompt, [path])
            return {"output": answer, "model": self.model, "image_path": str(path)}
        except vb.VisionError as exc:
            return {"error": str(exc), "model": self.model}

    def encode_response(self, output):
        return output

if __name__ == " __main__":
    server = ls.LitServer(VisionPhotoAPI(), accelerator="auto", timeout=False)
    print(f"Vision API on http://127.0.0.1:{PORT}/predict (model: {vb.VISION_MODEL})")
    server.run(port=PORT)

Server prints: Vision API on http://127.0.0.1:8002/predict

Request shape:

{
  "query": "What is the total on this receipt?",
  "image_path": "/absolute/path/to/receipt.png"
}

Response:

{
  "output": "## Summary\n…",
  "model": "minicpm-v4.6",
  "image_path": "…"
}

Sample image the API reads:

Test with client.py:

python client.py --image samples/receipt.png --query "OCR this receipt"

client.py

#!/usr/bin/env python3
"""CLI client for the vision photo API."""
from __future__ import annotations

import argparse
import json
import os
import urllib.request

DEFAULT_URL = os.environ.get("VISION_API_URL", "http://127.0.0.1:8002")

def main() -> None:
    p = argparse.ArgumentParser(description="Query local MiniCPM-V vision API")
    p.add_argument("--image", required=True, help="Path to image file")
    p.add_argument("--query", default="Describe this photo in detail.")
    p.add_argument("--url", default=f"{DEFAULT_URL.rstrip('/')}/predict")
    args = p.parse_args()

    body = json.dumps({"query": args.query, "image_path": args.image}).encode()
    req = urllib.request.Request(args.url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=180) as resp:
        data = json.loads(resp.read().decode())
    print(data.get("output") or data.get("error") or data)

if __name__ == " __main__":
    main()

Expected sections in the output: Summary , Details , Text found , Suggested reply.

Part 3 — Install OpenClaw

Terminal B:

cd guides/openclaw-minicpm-v
source ./use-node22.sh
npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw models set ollama/minicpm-v4.6

#!/usr/bin/env bash
set -euo pipefail
export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
if [[-s "$NVM_DIR/nvm.sh"]]; then
  . "$NVM_DIR/nvm.sh"
  nvm use "$(cat "$(dirname "$0")/.nvmrc")"
else
  echo "nvm not found — install Node 22+" >&2
  exit 1
fi
echo "Node: $(node -v)"

In openclaw.json sets the primary model and VISION_API_URL.

// Merge into ~/.openclaw/openclaw.json
{
  agents: {
    defaults: {
      model: { primary: "ollama/minicpm-v4.6" },
      skills: ["vision-photo"],
    },
  },

  models: {
    providers: {
      ollama: {
        apiKey: "ollama-local",
        baseUrl: "http://127.0.0.1:11434",
        api: "ollama",
        timeoutSeconds: 300,
        models: [
          {
            id: "minicpm-v4.6",
            name: "MiniCPM-V 4.6",
            reasoning: false,
            input: ["text", "image"],
            contextWindow: 256000,
            maxTokens: 8192,
            params: { keep_alive: "15m" },
          },
        ],
      },
    },
  },

  skills: {
    entries: {
      "vision-photo": {
        enabled: true,
        env: {
          VISION_API_URL: "http://127.0.0.1:8002",
        },
      },
    },
  },
}

Looking for a faster setup? TechLatest offers a pre-configured OpenClaw environment that includes the gateway, agent runtime, and common dependencies, allowing developers to focus on building skills and automations instead of infrastructure setup.

Link: https://techlatest.net/support/openclaw-support/

Part 4 — Install vision-photo skill

chmod +x install-skill.sh skills/vision-photo/scripts/*.sh
./install-skill.sh
openclaw gateway restart

The skill tells the agent to run:

vision_query.sh "/path/to/image.jpg" "user question"

See skills/vision-photo/SKILL.md.

Part 5 — Telegram / WhatsApp

Follow OpenClaw channels docs for your platform. Keep DM pairing enabled for security.

When a user sends a photo:

OpenClaw saves media to a local path
Agent invokes vision-photo with path + caption
LitServe returns structured markdown
Agent sends suggested reply to the channel

Example channel reply from the demo receipt:

Your receipt total is _ **$10.75_**

Part 6 — Smoke test

./test-local.sh

Runs: Ollama check → sample image → API health → skill script query.

For a step-by-step walkthrough and complete implementation details, check out the full guide here.

Deploy the Complete Stack on TechLatest

You can deploy the entire private photo assistant stack using TechLatest AI infrastructure:

Open WebUI + Ollama for local multimodal inference
OpenClaw for agent orchestration and messaging integrations
JupyterHub for experimentation and evaluation
AWS, Azure, and GCP deployment options

This allows developers to build privacy-first vision assistants without spending hours configuring infrastructure and dependencies.

Conclusion

You’ve successfully built a private multimodal assistant that can see, read, and understand images directly from your messaging channels.

Using OpenClaw as the orchestration layer, MiniCPM-V 4.6 as the vision model, and LitServe as the local inference API, you’ve created a workflow where users can send a photo and receive structured, actionable insights without relying on external vision services.

The architecture is intentionally simple:

OpenClaw handles agent orchestration and channel integrations
MiniCPM-V 4.6 provides image understanding and OCR capabilities
LitServe exposes a lightweight local API
The vision-photo skill connects everything

Because the entire stack runs locally, sensitive screenshots, receipts, documents, and personal photos never leave your infrastructure. Whether you’re building customer support agents, document processing workflows, field inspection tools, or personal AI assistants, the same pattern can be extended with additional skills and automation.

From a single image to a complete conversation, your OpenClaw agent now has eyes.