Let me start with a question my clients ask me a lot:
"Can't we just use ChatGPT for this?"
My answer is always the same: it depends on what "this" is.
When "this" involves client intake forms for a law firm, tax documents for an accounting practice, or patient records for a medical office — the answer is no. And once I explain why, they always get it.
This post is about that explanation, and the toolchain I actually use instead.
The Part Everyone Glosses Over
When you send a prompt to ChatGPT or Claude via the API, that data leaves your network. It travels to a third-party server, gets processed, and comes back. The companies have policies about how they handle it — and you should read them — but the fundamental truth is: you handed your client's sensitive information to someone else.
For a lot of use cases, that's totally fine. Write me a landing page? Sure, use whatever.
But when the prompt contains:
- Attorney-client communications
- Personally Identifiable Information (PII)
- Financial records subject to confidentiality agreements
- Proprietary business logic a client has spent a decade refining
...you're in a different conversation. One that involves client trust, potential legal exposure, and in some industries, real regulatory obligations. HIPAA doesn't care that the AI gave a good answer.
What I Use Instead: Ollama
Ollama is the cleanest tool I've found for running large language models locally. It runs on Mac, Linux, and Windows, wraps model management into a simple CLI, and exposes a local REST API. That API is compatible with the OpenAI format — which means most integrations you'd build against ChatGPT work against Ollama with one line changed.
Getting started takes about five minutes:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model — llama3.2 is a solid general-purpose starting point
ollama pull llama3.2
# Start the server
ollama serve
Once it's running, you have a local API at http://localhost:11434. No API key. No rate limits. No bill at the end of the month. Here's a basic Python call:
import requests
def ask_local_llm(prompt: str, model: str = "llama3.2") -> str:
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False
}
)
return response.json()["response"]
# Example: summarize a client intake document
intake_text = "Client Jane Doe, referred by attorney Martinez, is seeking..."
summary = ask_local_llm(
f"Summarize this client intake in 3 concise bullet points:\n\n{intake_text}"
)
print(summary)
That's it. The model runs on the local machine. The data never leaves the building.
Real Client Use Cases
Here's where it gets concrete. These are the kinds of deployments I've built:
Law office — client intake summaries
Attorneys were drowning in intake forms and needed quick summaries before consultations. The obvious fix is AI. The blocker: those forms contain PII, case details, and sometimes confidential disclosures that flat-out cannot go to a cloud provider.
Solution: Ollama running on a local machine in their office, a Python script that reads the intake PDF, summarizes it with llama3.2, and outputs a clean brief. Setup time: half a day. Data never leaves their network.
Accounting firm — document Q&A
Staff needed to locate specific information across large financial documents and past filings quickly. Paired Ollama with a basic RAG (retrieval-augmented generation) pipeline — documents get chunked and embedded locally, queries get answered against the local vector store. The client's financial data stays on their server. As a bonus, it's actually faster than cloud solutions for this use case because there's zero round-trip latency.
Small business — proprietary process assistant
This one was less about compliance and more about competitive advantage. The client had a pricing model they'd refined over ten years. They were not interested in that logic ending up anywhere near a third-party training pipeline. Local deployment was the only acceptable option, full stop.
The Honest Trade-offs
I'm not going to oversell this. Here's what you give up going local:
Model capability — llama3.2 is impressive for its size. It is not GPT-4o. For pure reasoning tasks with no sensitivity concerns, the frontier cloud models still have an edge on harder problems.
Hardware requirements — Running a useful model locally needs real resources. I typically recommend at least 16GB of RAM and, ideally, a dedicated GPU. Clients who already have a server are usually fine. Clients on thin hardware turn into a hardware conversation first.
Setup and maintenance overhead — There's no sign-up-and-get-a-key path here. You're managing software, models, and updates. For non-technical clients, that means building something bulletproof or staying on the hook for maintenance.
For the right client, these trade-offs are absolutely worth it.
The Part I Didn't Expect
The clients who care most about local deployment aren't always the most technical. They're often the ones who've been in business long enough to be careful. When I tell them their data stays in-house — no monthly API bill that scales with usage, no third-party terms of service to worry about, they own the whole stack — that lands differently than any feature comparison I could make.
Local AI isn't for everyone. But when the fit is right, it's a genuinely different value proposition than "here's your ChatGPT wrapper with some prompt engineering on top."
If you're building for clients who handle sensitive data, have this conversation before you default to the cloud. You might be surprised how often they've already been thinking about it.
I'm stickytr33 — I build AI integrations, local LLM deployments, and IT infrastructure for small businesses. If this is relevant to what you're working on, find me on GitHub or drop a comment.













