What an AI Automation Engineer Actually Does
The title sounds vague on purpose. Half the people who use it are prompt jockeys, the other half are RPA consultants who learned to say "agent." So let me describe the job the way I actually do it: take a manual, repeatable operation, replace it with a system that runs unattended, and prove on a spreadsheet that it saved time or money. That's the whole loop. Everything else (the model choice, the framework, the cloud) is downstream of that.
The job is operations engineering with LLMs bolted on
An AI automation engineer turns recurring human work into software that runs without supervision and produces a measurable outcome. The "AI" part is just the latest tool in the kit — useful when the input is unstructured (email, PDFs, support tickets, search data, web pages) and rules-based code would be brittle.
Most of my engagements start the same way. Someone has a process — onboarding a customer, triaging tickets, writing weekly briefs, reconciling invoices, publishing content — that eats a known number of hours per week. My job is to:
- Map the process the way it actually runs (not the SOP).
- Separate the deterministic parts from the judgment-based ones.
- Replace the deterministic parts with code, the judgment parts with an LLM call, and the orchestration with a scheduler.
- Add observability so it's debuggable at 2 a.m.
- Report the hours saved per month in a number the business owner trusts.
In one 4-system ecosystem I built last year, the math came out to 73+ hours saved per month and a 192% Year-1 ROI. That number is the actual deliverable. The Lambdas and prompts are just how I got there.
A useful framing from McKinsey's 2024 State of AI report: about 65% of organizations now regularly use generative AI, but only a minority report material EBIT impact at the function level (McKinsey, 2024). The gap between "using AI" and "AI moves the P&L" is exactly where this role lives.
A normal week looks more like plumbing than prompting
People imagine the job is writing clever prompts. In practice, on a typical week I spend my time roughly like this:
| Activity | Share of week |
|---|---|
| Reading logs, fixing edge cases in production | ~30% |
| Wiring integrations (APIs, queues, webhooks, auth) | ~25% |
| Data modeling and storage (Postgres, pgvector, S3) | ~15% |
| Prompt + context engineering | ~10% |
| Evals and regression checks | ~10% |
| Talking to the human whose work I'm automating | ~10% |
The last one matters more than it looks. The person who currently does the job knows the twelve weird cases your spec doesn't mention. If you skip that conversation, you'll ship something that works on the demo and breaks on Tuesday's invoice from the one vendor who emails PDFs upside down.
Prompt engineering is real, but it's a small slice. Most production reliability comes from boring things: idempotent writes, retries with backoff, dead-letter queues, schema validation on LLM outputs, and a place to look when something goes wrong.
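To make "retries with backoff" concrete, here is a minimal sketch using only the Python standard library. `TransientError` and `retry_with_backoff` are hypothetical names chosen for illustration, and the delay values are placeholder assumptions rather than tuned numbers.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 5xx)."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying TransientError with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller route it to a DLQ
            # Upper bound doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the cap so that
            # many workers do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the function can be exercised in tests without waiting. Only errors explicitly marked as transient are retried; anything else propagates immediately.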
What the systems actually look like
Two patterns cover maybe 80% of the work I ship.
Pattern 1: scheduled worker that produces an artifact
Cron or EventBridge fires on a schedule. A small fleet of functions does research, calls an LLM, validates output, writes to a database, and publishes somewhere (CMS, Slack, email, a dashboard). No human in the loop in the happy path.
This is what BizFlowAI ContentStudio is: a content + SEO machine that researches topics, drafts, optimizes, and publishes across multiple sites on its own. The interesting engineering isn't the writing — it's the self-learning loop. Real search performance data flows back in, and the next cycle targets pages that are underperforming and drops topics that aren't ranking.
Shape:
EventBridge (daily)
-> Lambda: pull GSC + analytics data
-> Postgres: update topic scores
-> Lambda: pick next batch of briefs
-> Claude: draft + structure
-> Lambda: validate (schema, internal links, AEO checks)
-> CMS API: publish
-> Slack: report what shipped
Nothing exotic. The value is that it runs every day without me.
Pattern 2: event-driven responder
Something happens in the world — a ticket is created, an email arrives, a form is submitted, a file lands in S3 — and the system reacts inside an SLA window. The Zendesk + AWS integration I built for SLA compliance is in this family. A webhook hits API Gateway, a Lambda classifies and enriches the event, state lands in Postgres, and the right downstream action fires.
The hard part of pattern 2 isn't the AI call. It's:
- Handling webhook replays without double-acting.
- Surviving the third-party API being down for 40 minutes.
- Reconciling state when two events arrive out of order.
- Proving, with timestamps, that the SLA was met.
That last one — "prove it" — is why I keep an audit table on every system I ship. Every decision the LLM made, every input it saw, every action that fired downstream, with a timestamp and a correlation ID. Without that, when leadership asks "did this actually work last month," you're guessing.
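A minimal sketch of how replay protection and the audit table can share one row per event. It uses an in-memory SQLite database as a stand-in for Postgres (where the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`). The table layout, column names, and the `decide` and `act` callables are all assumptions made for illustration.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Callable


def open_db() -> sqlite3.Connection:
    """Create the hypothetical audit table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE audit_log (
               event_id       TEXT PRIMARY KEY,  -- provider's delivery/event ID
               correlation_id TEXT NOT NULL,
               received_at    TEXT NOT NULL,
               model_input    TEXT NOT NULL,
               model_decision TEXT,
               action         TEXT,
               acted_at       TEXT
           )"""
    )
    return conn


def handle_webhook(
    conn: sqlite3.Connection,
    event_id: str,
    payload: dict,
    decide: Callable[[dict], dict],  # stand-in for the LLM classification call
    act: Callable[[dict], str],      # stand-in for the downstream side effect
) -> bool:
    """Process an event once. Returns False if it was a replay."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO audit_log "
            "(event_id, correlation_id, received_at, model_input) "
            "VALUES (?, ?, ?, ?)",
            (event_id, str(uuid.uuid4()), now, json.dumps(payload, sort_keys=True)),
        )
    if cur.rowcount == 0:
        return False  # primary key already claimed: replay, do nothing

    decision = decide(payload)
    action = act(decision)
    with conn:
        conn.execute(
            "UPDATE audit_log SET model_decision = ?, action = ?, acted_at = ? "
            "WHERE event_id = ?",
            (json.dumps(decision), action, datetime.now(timezone.utc).isoformat(), event_id),
        )
    return True
```

Claiming the event ID before calling the model means a replay costs one cheap insert rather than a second LLM call and a second side effect. A row with a `received_at` but no `acted_at` is also a useful signal: it marks an event that was claimed but never finished.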
Where LLMs earn their keep — and where they don't
LLMs are excellent at:
- Reading unstructured input and producing structured output (JSON with a schema).
- Classification with fuzzy edges (intent, sentiment, urgency, topic).
- Drafting language that a human will lightly edit, or that a downstream system will consume.
- Routing decisions where the rule set would be 200 if-statements.
They are bad, or expensive, at:
- Math. Use code.
- Anything that needs a guaranteed exact answer from a known dataset. Use SQL or a retrieval step, then let the model phrase the answer.
- Long deterministic workflows. Decompose into steps; don't ask one giant prompt to do everything.
- Tasks where a 3% hallucination rate is unacceptable and you have no validation layer.
A practical rule I follow: never let an LLM be the last layer before a side effect. A model can draft an email, but a code-side check decides whether to send. A model can suggest a refund amount, but a rule decides whether to issue it. The model proposes; deterministic code disposes. This single discipline eliminates most of the "AI did something insane in production" stories.
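The refund case can be sketched as a plain function that sits between the model's suggestion and the side effect. Everything here is hypothetical: the `RefundProposal` shape, the `dispose` name, and the 100.00 auto-approve limit are illustrative assumptions, not values from a real system.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RefundProposal:
    """What the model suggests. It has no authority on its own."""
    order_id: str
    amount: Decimal
    reason: str


@dataclass(frozen=True)
class Verdict:
    approved: bool
    why: str


AUTO_REFUND_LIMIT = Decimal("100.00")  # hypothetical business rule


def dispose(proposal: RefundProposal, order_total: Decimal) -> Verdict:
    """Deterministic checks that decide whether the proposed refund may fire."""
    if proposal.amount <= 0:
        return Verdict(False, "non-positive amount")
    if proposal.amount > order_total:
        return Verdict(False, "exceeds order total")
    if proposal.amount > AUTO_REFUND_LIMIT:
        return Verdict(False, "above auto-approve limit; route to a human")
    return Verdict(True, "within limits")
```

The model only ever produces a `RefundProposal`. The code that actually issues money reads the `Verdict`, so a hallucinated amount fails a rule instead of reaching a customer.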
For retrieval-heavy systems, I lean on Postgres + pgvector with hybrid search (BM25 + vector, fused with reciprocal rank fusion). It's cheap, it's one database to operate, and for most B2B SaaS-sized corpora it outperforms throwing everything at a managed vector DB. A 2024 Microsoft Research note on RAG quality found hybrid retrieval with RRF consistently beat pure-vector search on enterprise QA tasks (Microsoft Research, 2024). That matches what I see in production.
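For readers who have not seen reciprocal rank fusion, here is a minimal sketch of the fusion step on its own. It assumes the keyword and vector queries have already run and each returned a ranked list of document IDs with no duplicates; the function name is illustrative, and `k = 60` is the constant commonly used since the original RRF paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # e.g. from a full-text query
vector_hits = ["b", "d", "a"]   # e.g. from a pgvector similarity query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF only looks at rank positions, not raw scores, which is why it works across two retrievers whose scores are on unrelated scales. In the example, `b` and `a` rise to the top because both retrievers returned them.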
How I size a project before I quote it
Before I write a line of code, I want three numbers:
- Hours per month the current process consumes, multiplied by a loaded hourly cost.
- Frequency and variance of the inputs — is it 50 a day or 5,000? Are they similar or wildly different?
- Cost of a wrong answer — is it "we re-send the email" or "we refunded the wrong customer $4,000"?
That third number drives architecture more than anything else. Low cost of error means you can let the model act and review samples. High cost of error means human-in-the-loop, dual validation, or a deterministic fallback path.
A rough sizing table I use as a starting point, not gospel:
| Process volume | Cost of error | What I build |
|---|---|---|
| Low (<100/day) | Low | Single Lambda + cron, no queue |
| Medium (100-10k/day) | Low | Queue + workers + retries, one Postgres |
| Any | High | Human-in-loop UI, audit trail, deterministic guardrails |
| High (>10k/day) | Low | Step Functions or batching, per-run token budgets |
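The table above can be read as a small decision function. This is a sketch of one reading of it: the function name and the short labels it returns are hypothetical, and treating exactly 10,000 events per day as "medium" is an assumption the table does not settle.

```python
def recommend_build(events_per_day: int, high_error_cost: bool) -> str:
    """Map volume and cost of error to a starting architecture label."""
    if high_error_cost:
        # Cost of error dominates: volume does not matter for this row.
        return "human_in_loop_with_guardrails"
    if events_per_day < 100:
        return "single_lambda_cron"
    if events_per_day <= 10_000:
        return "queue_workers_retries"
    return "step_functions_or_batching"
```

Checking cost of error first mirrors the point that it drives architecture more than volume does.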
If a project doesn't pencil out to at least 10x its build cost over 12 months, I tell the client to skip it. Building automation that saves $400 a month and costs $30,000 to ship is how this field got a reputation problem.
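The arithmetic behind that example, as a small sketch. The function names are illustrative, and reading the threshold as "12-month savings divided by build cost" is an assumption about how the multiple is computed.

```python
def savings_multiple(monthly_savings: float, build_cost: float, months: int = 12) -> float:
    """Total savings over the period, expressed as a multiple of build cost."""
    if build_cost <= 0:
        raise ValueError("build_cost must be positive")
    return monthly_savings * months / build_cost


def payback_months(monthly_savings: float, build_cost: float) -> float:
    """Months until cumulative savings cover the build cost."""
    if monthly_savings <= 0:
        return float("inf")  # never pays back
    return build_cost / monthly_savings


# The example from the text: $400/month saved, $30,000 to ship.
multiple = savings_multiple(400, 30_000)   # 4,800 / 30,000 = 0.16x in year one
payback = payback_months(400, 30_000)      # 30,000 / 400 = 75 months
```

At 0.16x in the first year and a payback period of over six years, the project is nowhere near the bar.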
The unglamorous parts that decide whether it works
A few things almost no one talks about, but they're the difference between a demo and a system that's still running a year later.
Schema validation on every model output. I use Zod or Pydantic, parse the JSON, and if it fails, I retry once with the error appended to the prompt. After two failures, the message goes to a DLQ and a human gets pinged. This single pattern removes a category of 3 a.m. pages.
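A minimal sketch of the validate, retry once, then dead-letter flow. To stay dependency-free it uses a hand-rolled check where a real system would more likely use Pydantic or Zod. The schema, the `call_model` callable, and the list standing in for a DLQ are all hypothetical.

```python
import json
from typing import Callable, List, Optional

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"intent": str, "urgency": int}


def validate(raw: str) -> dict:
    """Parse model output and check it; raise ValueError with a readable reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if type(data.get(field)) is not expected:
            raise ValueError(f"field '{field}' must be of type {expected.__name__}")
    return data


def call_with_validation(
    prompt: str,
    call_model: Callable[[str], str],  # stand-in for the LLM API call
    dead_letters: List[dict],          # stand-in for a real dead-letter queue
) -> Optional[dict]:
    """Try once, retry once with the error appended, then dead-letter."""
    attempt_prompt = prompt
    raw, last_error = "", ""
    for _ in range(2):  # first attempt plus one retry
        raw = call_model(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = str(exc)
            attempt_prompt = (
                f"{prompt}\n\nYour previous output was rejected: {last_error}\n"
                "Return only valid JSON matching the schema."
            )
    dead_letters.append({"prompt": prompt, "last_output": raw, "error": last_error})
    return None
```

Feeding the validation error back into the retry prompt gives the model something specific to correct. Keeping the last raw output in the dead-letter record gives the human who gets pinged something to look at.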
Cost ceilings per run. Each workflow has a max-tokens budget. If a single run blows past it, the run aborts and alerts. LLM APIs will happily let you spend $800 in an afternoon because a prompt got stuck in a loop. Ask me how I know.
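One way to sketch a per-run ceiling is a small counter that every model call reports into. The class and exception names are hypothetical, and the sketch assumes the provider's response exposes input and output token counts, which you pass in yourself.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than it was allotted."""


class TokenBudget:
    """Tracks token spend for a single workflow run."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one model call; abort the run if the ceiling is crossed."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"run used {self.used} tokens, ceiling is {self.max_tokens}"
            )
```

The check runs after the call that crossed the line, so a run can overshoot by at most one call. For a loop that is stuck, that is the difference between one extra request and an afternoon of them.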
Eval sets, even tiny ones. Twenty to fifty real examples with expected outputs, stored as JSON, run on every prompt change. You don't need a fancy eval platform. You need a script and the discipline to run it before you ship.
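The "script" really can be this small. A sketch, assuming each case is a JSON object with `input` and `expected` keys and that exact string match is a good enough comparison; both are assumptions, and `predict` stands in for whatever wraps your prompt and model call.

```python
import json
from typing import Callable, List, Tuple


def run_evals(cases_json: str, predict: Callable[[str], str]) -> Tuple[int, List[dict]]:
    """Run every case through predict; return (passed_count, failures)."""
    cases = json.loads(cases_json)
    failures = []
    for case in cases:
        actual = predict(case["input"])
        if actual != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return len(cases) - len(failures), failures
```

In CI, a non-empty `failures` list would fail the build, and printing the list shows exactly which real-world example the prompt change broke.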
One observability dashboard. I want to open one page and see: runs today, failures, average latency, average cost per run, last 10 errors. CloudWatch plus a small Grafana instance, or just a Supabase view, is fine. The point is to look at it.
A "kill switch." Every autonomous system has a flag in the database that pauses it. When something weird happens at 11 p.m., I want to stop the world from one SQL update, not redeploy.
What I'd do if I were starting today
If you're trying to break into this work — or hire someone who does it — here's the short version of what I'd focus on:
- Pick one process end-to-end. Don't build "an AI platform." Pick invoice intake, or lead qualification, or weekly reporting. Own it from input to outcome.
- Master one cloud's serverless stack. For me it's AWS — Lambda, EventBridge, SQS, API Gateway, DynamoDB or RDS. You can ship 90% of automation projects with those six services.
- Get fluent in one LLM provider's API and one local option. I default to Claude for production reasoning, and Ollama with a small model for things I want to keep on a private machine. Knowing both stops you from over-engineering.
- Learn enough Postgres to be dangerous. Indexes, JSONB, pgvector, common table expressions. A surprising number of "we need a vector DB" problems are "we need a Postgres index."
- Write the ROI memo before the code. Hours, dollars, payback period. If you can't write that page convincingly, the project shouldn't exist yet.
The engineers who get the best results from AI aren't the ones with the deepest model knowledge. They're the ones who treat models as one component in a system that has to keep running on Sunday at 4 a.m. when no one is watching.
If you're thinking through a process that should be automated and want a second pair of eyes on whether it's worth doing — and how — I'm happy to talk. You can reach me at lazar-milicevic.com/#contact, or read more notes from production over on the blog.

