File tools and validate: the agent's feedback loop

What this post covers

This post gives the agent real work. In previous posts it only did toy tasks, enough to prove that the infrastructure, monitoring, and audit logs work. Now we build a harness that lets the agent write Terraform and validate it.

To do this we give the agent a few tools:

Sandboxed File System Tools:
- list_files: list files in a folder.
- read_file: read a file.
- write_file: write a file.
- edit_file: edit a file (search and replace based).
- delete_file: delete a file.
Terraform Tools:
- terraform_init: initialize a Terraform workspace.
- terraform_validate: validate a Terraform workspace.

The sandboxed file system tools themselves are simple; most of the work is returning good error messages to the model and making sure it cannot escape the sandbox folder under /tmp. The terraform tools, on the other hand, are more complex, and they have consequences for the Lambda setup: they need to actually execute the terraform binary. For this post that meant changing the Lambda from a Python zip package to a Docker-based Lambda, which required changes to the build process and the infrastructure setup.

Typical agent flow looks like this:

Notice the new AWS Lambda span, which was not there in the previous post. I wanted to extend the instrumentation now that more happens than just the agent run. We also re-lock the Terraform dependency lock file for different architectures after the agent run, and I wanted to see timings and memory usage for the external terraform process. Because we are using AWS Lambda, memory is both a constraint and a cost factor, made worse by the fact that if you are (like me) running this in a new sandbox AWS account, you are capped at 3008 MB of memory. Raising that cap is a slow process on basic support.

The system prompt in this iteration is not optimized (we'll do that in a later post). We just tell the agent what tools it has and that it needs to run terraform_init before the first validate. list_files and read_file are not that important yet, as we do not hook up GitHub or existing Terraform code in this post. They will matter more when the agent needs to assess how to integrate its code into an existing project and follow that project's best practice guidelines when doing so.

System Prompt

You are the terraform-pr-agent. You operate on a Terraform workspace
through file tools (list_files, read_file, write_file, edit_file, delete_file)
and two terraform tools. Use the file tools to explore, write, and edit.
Run terraform_init before your first validate and again whenever you add
or change provider or module requirements. Call terraform_validate after
you write or change files to confirm the workspace still parses; treat its
output as feedback and edit until it is clean.

When you add the AWS provider, give it a version constraint such as "~> 6.0"
rather than leaving it unconstrained. terraform init records the exact resolved
version and checksums in .terraform.lock.hcl, which travels with the workspace
and is the reproducibility record, so the constraint does not need to be an exact
pin. Pin an exact version only when the user asks for one. Example:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }
}

Similarly, the user prompt is basic, with not much thought put into it. It is passed in the invoke event payload.

User Prompt

Set up a new terraform project, creating a best practice s3 bucket.

We store the end result in an S3 bucket. With no GitHub integration yet, that is a simple way to inspect the files the agent produced and validate the outcome.

Architecture

Post 2 wrapped the agent in a Lambda and stood up the dual-sink span pipeline behind it; post 3 made the model a runtime choice (an SSM model registry, a Mistral provider beside Bedrock, and EMF metrics feeding one CloudWatch dashboard). Post 4 keeps all of that as is. The Lambda itself changes from a zip package to a container image so we can ship the terraform CLI alongside the Python runtime; a new ECR repository holds the image, and terraform seeds it with a placeholder so the function can be created before the first real push, the container twin of post 2's placeholder zip. Everything else around the function (IAM, Bedrock, the model registry, Firehose, S3) carries forward unchanged. The rest of the post is software: a workspace under /tmp, file tools, the terraform_validate tool, and a caller-side retry.


The code

New to the series? Tooling, AWS access, and project setup are covered in Part 1 (linked above).

The final tree. + is new in post 4, ~ extends a post 3 file, blank carries over unchanged. The download below fast-forwards to this state if you want to walk through the post against the finished code. We also split the old handler.py in two: core.py holds the agent and an execute(prompt, model) that knows nothing about Lambda, and lambda_entry.py is the Lambda boundary that parses the event and calls it. The handler was getting out of hand as this post's functionality piled on.

terraform-pr-agent/
    agent/
      __init__.py
    + core.py
    + env.py
      handler.py
    + lambda_entry.py
    + memory.py
    + models.py
    + observability.py
    + runs.py
    + ssm.py
    + tools.py
    infra/
      placeholder/
      + Dockerfile
      + handler.py
      alerts.tf
      audit-bucket.tf
      bedrock.tf
    ~ cloudwatch.tf
    + ecr.tf
      firehose.tf
      iam.tf
      kms.tf
    ~ lambda.tf
      logfire.tf
      main.tf
      models.tf
    + runs-bucket.tf
    ~ variables.tf
    scripts/
    ~ build-lambda.sh
      chat.py
      queries.sql
    ~ traces.sql
    tests/
    ~ conftest.py
    + test_core.py
    + test_env.py
      test_handler.py
    + test_lambda_entry.py
    + test_tools.py
  + .dockerignore
    .envrc
    .envrc.local
    .gitignore
  + .scaffold-delete
    AGENTS.md
  + Dockerfile
  ~ pyproject.toml
  + README.md


Fast-forward to the final code of this post:

mkdir -p ~/projects
cd ~/projects
curl -fsSL https://andreaslang.dev/terraform-pr-agent/terraform-pr-agent-04.tar.gz | tar xz

From zip to container image

In previous posts we used a normal zip-packaged Python Lambda, but for this post we moved to a Docker-based Lambda to avoid the 250 MB (unzipped) Lambda size limit. This post would still have fit: terraform (108 MB) plus site-packages (56 MB) comes to 164 MB. But the container ceiling is 10 GB rather than 250 MB, which buys real headroom as the image grows. The container also hands us the whole image to control, including system packages and filesystem layout, not just Python deps layered onto the AWS runtime. As a bonus it is a standard Docker image, so moving to something like ECS Fargate later is straightforward.

Looking at the Docker image you will also notice we use a multi-stage build: separate builder stages each produce one artifact, and the final stage copies only those artifacts in. This is an easy way of keeping build tools out of the final image and therefore reducing its size. Our final image is 739 MB total, 169 MB above the shared 570 MB base. That sounds huge to people trying to build small images, but the base image we use is the AWS Lambda Python base image, which is heavily cached across the Lambda stack. So in practice we only pull the 169 MB of layers on top of it, which is far more reasonable.

Multi-stage build layout

builder stage        copied into the final image
-------------        ---------------------------
terraform       -->  terraform binary
python          -->  site-packages (/var/task)
insights        -->  Lambda Insights extension
build context   -->  agent/ source

base image: public.ecr.aws/lambda/python:3.13
left in the builders, never shipped: unzip, dnf cache, uv, gpg keyring, rpm metadata

We will not go through all the stages of the image, but here are the Python dependencies as an example. We start from the AWS Lambda Python base image, install uv by copying the binary out of the uv Docker image, copy pyproject.toml and uv.lock, export them to a requirements.txt, and then run uv pip install to install the dependencies into the task root.

Dockerfile

# Same uv flow as the zip build this replaces (see the uv AWS Lambda
# guide); the dependency set installs straight into the task root, where
# the final stage picks it up.
FROM public.ecr.aws/lambda/python:3.13 AS python
COPY --from=ghcr.io/astral-sh/uv:0.11.21 /uv /usr/local/bin/uv
WORKDIR /opt/build
COPY pyproject.toml uv.lock ./
RUN uv export --frozen --no-dev --no-editable -o requirements.txt \
    && uv pip install \
        --no-installer-metadata \
        --no-compile-bytecode \
        --target "${LAMBDA_TASK_ROOT}" \
        -r requirements.txt

The Python stage does not ship. The final stage below just assembles the image by copying all the previous layers into it. Here you see we use:

- The Terraform binary
- The Lambda Insights extension (/opt/extensions and /opt/cloudwatch)
- The Python dependencies
- Our agent code

Initially I considered baking the Terraform AWS provider into the image, but managing it properly turned out to be difficult without constraining the use case further than it already is. So I took the simpler route: download the providers every time the agent calls terraform init.

Dockerfile

# The shipped image: only the artifacts the function uses at runtime
# cross over from the builder stages.
FROM public.ecr.aws/lambda/python:3.13
COPY --from=terraform /usr/local/bin/terraform /usr/local/bin/terraform
COPY --from=insights /opt/extensions /opt/extensions
COPY --from=insights /opt/cloudwatch /opt/cloudwatch
COPY --from=python ${LAMBDA_TASK_ROOT} ${LAMBDA_TASK_ROOT}
COPY agent ${LAMBDA_TASK_ROOT}/agent
WORKDIR ${LAMBDA_TASK_ROOT}

The CMD instruction for a Docker-based Lambda needs to name the Python handler. Here it points at the handler function in the agent.lambda_entry module. While we are at it we also make terraform non-interactive and set the home folder to the only writable folder in a Lambda (/tmp).

Dockerfile

# terraform needs a writable HOME for incidental state; /tmp is the only
# writable path at runtime. CMD replaces the zip package's handler attribute
# and points at lambda_entry, the instrumented entry point, not the bare
# handler module.
ENV HOME=/tmp \
    TF_IN_AUTOMATION=true \
    TF_INPUT=false
CMD ["agent.lambda_entry.handler"]

ECR and the placeholder image

A consequence of switching to a Docker based Lambda is that we now need a placeholder image to avoid the terraform apply needing a three step process (first creating the ECR repo, pushing the image, then creating the Lambda). With the placeholder in place, a single apply can create the function, and scripts/build-lambda.sh pushes the real image afterward. It keeps the terraform apply self-contained, with the code ship as a separate step.

infra/ecr.tf

# Container twin of the zip flow's archive_file placeholder: the function
# resource needs a pullable image at create time, so terraform seeds a
# minimal one. Create-only (input never changes), so scripts/build-lambda.sh
# owns every push after this.
resource "terraform_data" "placeholder_image" {
  input = aws_ecr_repository.agent.repository_url

  # Needs docker and the aws cli on the machine running apply; both are
  # already prerequisites for the series.
  provisioner "local-exec" {
    command = <<-EOT
      aws ecr get-login-password --region ${data.aws_region.current.region} |
        docker login --username AWS --password-stdin ${split("/", aws_ecr_repository.agent.repository_url)[0]}
      docker buildx build --platform linux/arm64 --provenance=false \
        -t ${aws_ecr_repository.agent.repository_url}:placeholder \
        --push ${path.module}/placeholder
    EOT
  }
}

For the Lambda itself, package_type = "Image" and image_uri switch it to a Docker based Lambda. We also tell terraform to ignore changes to image_uri, because otherwise a re-apply would reset the function back to the placeholder image. Instead we want our agent Docker build to have control over the image URI. Further, we tweak timeout, memory size, and ephemeral storage to match the heavier resource needs now that terraform runs inside the function. Lambda layers, which we used in a previous post, are removed entirely; they do not work with a Docker based Lambda.

infra/lambda.tf

resource "aws_lambda_function" "agent" {
  function_name = "terraform-pr-agent"
  role          = aws_iam_role.lambda.arn
  architectures = ["arm64"]

  # The handler entry point comes from the image's CMD; the runtime,
  # handler, and layers attributes only apply to zip packages.
  package_type = "Image"
  image_uri    = "${aws_ecr_repository.agent.repository_url}:placeholder"

  # Max Memory Used overstates what this function needs. It runs to ~2 GB on a
  # heavy run, but the track_memory spans (see agent/memory.py) show the real,
  # non-reclaimable demand stays under ~1 GB: ~315 MB resident for the Python
  # runtime plus a transient ~420 MB while terraform validate loads the
  # provider schema. The rest is reclaimable page cache from the ~800 MB
  # provider download and re-lock unpacks doing file IO on /tmp, which the
  # cgroup-based billed figure counts but the kernel drops under pressure, so
  # it is not OOM risk. Memory is therefore not the binding constraint here.
  # 3008 is set for the vCPU it buys, not the RAM: above 1769 MB Lambda gives a
  # full core (3008 is ~1.7), which speeds the run. Drop it toward ~1769 if
  # latency matters less than cost; do not raise it for memory headroom.
  memory_size = 3008

  # The per-run tool budget is the runaway guard; the timeout only has to
  # accommodate several model turns with init + validate rounds in between.
  timeout = 300

  # Two things land in /tmp: terraform init downloads the AWS provider
  # (~800 MB) into the workspace, and the post-run re-lock unpacks the
  # provider for three platforms (another ~2 GB) to write a portable lock
  # file. 4 GB covers both with headroom; the default 512 MB would not.
  ephemeral_storage {
    size = 4096
  }

  tracing_config {
    mode = "Active"
  }

  environment {
    variables = merge(
      {
        MODELS_PARAMETER         = aws_ssm_parameter.models.name
        DEFAULT_MODEL            = var.default_model
        METRICS_NAMESPACE        = local.metrics_namespace
        FIREHOSE_DELIVERY_STREAM = aws_kinesis_firehose_delivery_stream.audit.name
        RUNS_BUCKET              = aws_s3_bucket.runs.bucket
      },
      local.logfire_token_wired ? {
        LOGFIRE_TOKEN_PARAMETER = local.logfire_token_parameter_name
      } : {},
      local.mistral_key_wired ? {
        MISTRAL_API_KEY_PARAMETER = aws_ssm_parameter.mistral_api_key[0].name
      } : {},
    )
  }

  # Code ships out of band: scripts/build-lambda.sh pushes a new image and
  # calls update-function-code, so terraform must not flip the function
  # back to the placeholder on the next apply.
  lifecycle {
    ignore_changes = [image_uri]
  }

  depends_on = [
    aws_iam_role_policy_attachment.lambda_basic_execution,
    aws_iam_role_policy_attachment.lambda_insights,
    aws_iam_role_policy.lambda_permissions,
    terraform_data.placeholder_image,
  ]
}
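The sizing comments in lambda.tf boil down to two bits of arithmetic, sketched here in Python. The figures are the approximations quoted in those comments, and the vCPU share assumes CPU scales linearly with the memory setting, with 1769 MB as the one-vCPU point.

```python
# Back-of-the-envelope check of the lambda.tf sizing. All inputs are the
# approximate figures from the comments above, not measurements.
MEMORY_MB = 3008
FULL_VCPU_MB = 1769           # memory setting at which Lambda gives one full vCPU
EPHEMERAL_MB = 4096
DEFAULT_EPHEMERAL_MB = 512

provider_download_mb = 800    # terraform init pulling the AWS provider
relock_unpack_mb = 2048       # post-run re-lock for three platforms ("~2 GB")

# Assumption: CPU share is proportional to the configured memory.
vcpu_share = MEMORY_MB / FULL_VCPU_MB
tmp_demand_mb = provider_download_mb + relock_unpack_mb
tmp_headroom_mb = EPHEMERAL_MB - tmp_demand_mb

print(f"vCPU share at {MEMORY_MB} MB: {vcpu_share:.2f}")
print(f"/tmp demand: {tmp_demand_mb} MB, headroom: {tmp_headroom_mb} MB")
print(f"fits in the default {DEFAULT_EPHEMERAL_MB} MB: {tmp_demand_mb <= DEFAULT_EPHEMERAL_MB}")
```

The point of writing it down is that memory_size is chosen for CPU and ephemeral_storage for disk, and the two numbers move independently.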

Watching cold starts

One common downside of a Docker-based Lambda is that cold starts are slower than with a zip-packaged Lambda. We do not particularly care about cold starts, but we do want to measure them and make sure they do not get out of hand due to dependency bloat. The AWS Lambda Insights extension is a good way to measure cold starts, but because we cannot use layers, we need to add it to the image. The code looks scary, and I wish AWS thought about the end user the way Astral/uv do, but it is taken straight from the AWS docs.

Dockerfile

# Container images cannot attach layers, so the Lambda Insights extension
# is baked into the image: pinned version, detached GPG signature checked
# against the key fingerprint published in the Lambda Insights docs, so a
# tampered rpm fails the build.
FROM public.ecr.aws/lambda/python:3.13 AS insights
ARG INSIGHTS_VERSION=1.0.660.0
ARG INSIGHTS_BASE_URL=https://lambda-insights-extension-arm64.s3-ap-northeast-1.amazonaws.com
# The downloaded key is checked against the fingerprint from the docs
# before anything trusts it; gpg runs with --batch/--no-tty/--no-autostart
# because the base image ships no gpg-agent.
RUN curl -fsSLO ${INSIGHTS_BASE_URL}/amazon_linux/lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && curl -fsSLO ${INSIGHTS_BASE_URL}/amazon_linux/lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm.sig \
    && curl -fsSLO ${INSIGHTS_BASE_URL}/lambda-insights-extension.gpg \
    && gpg --batch --no-tty --show-keys --with-colons lambda-insights-extension.gpg \
        | grep -q '^fpr:::::::::E0AFFA11FFF35BD7349EE222479C97A1848ABDC8:' \
    && gpg --batch --no-tty --no-autostart --import lambda-insights-extension.gpg \
    && gpg --batch --no-tty --no-autostart --verify \
        lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm.sig \
        lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && rpm -U lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm \
    && rm -f lambda-insights-extension-arm64.${INSIGHTS_VERSION}.rpm* lambda-insights-extension.gpg

The extension needs one IAM grant to write to its /aws/lambda-insights log group, a single managed-policy attachment that sits in the file browser above.

Below you see how we wire the metrics into the dashboard. Alongside the standard AWS/Lambda invocation, error, and duration metrics, we take the Lambda Insights metrics and plot:

- average and maximum cold start init duration in ms
- function memory used
- /tmp storage space used

infra/cloudwatch.tf

      {
        type   = "text"
        x      = 0
        y      = 0
        width  = 24
        height = 2
        properties = {
          markdown = "## Lambda\nContainer-image function health: invocations and errors, end to end duration, cold start init duration (Lambda Insights emits `init_duration` only on a cold start), and the memory and `/tmp` footprint behind the `memory_size` and `ephemeral_storage` sizing."
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 2
        width  = 12
        height = 6
        properties = {
          title  = "Lambda invocations and errors"
          region = local.cloudwatch_region
          view   = "timeSeries"
          stat   = "Sum"
          period = 60
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", local.lambda_name, { label = "${local.lambda_name} / invocations" }],
            [".", "Errors", ".", ".", { label = "${local.lambda_name} / errors" }],
            [".", "Throttles", ".", ".", { label = "${local.lambda_name} / throttles" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 2
        width  = 12
        height = 6
        properties = {
          title  = "Lambda duration (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            ["AWS/Lambda", "Duration", "FunctionName", local.lambda_name, { label = "${local.lambda_name} / avg", stat = "Average" }],
            [".", ".", ".", ".", { label = "${local.lambda_name} / p99", stat = "p99" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 8
        width  = 12
        height = 6
        properties = {
          title  = "Cold start init duration (ms)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # Insights reports init_duration only when an init phase happened,
            # so points appear only on cold starts.
            ["LambdaInsights", "init_duration", "function_name", local.lambda_name, { label = "${local.lambda_name} / init avg (ms)", stat = "Average" }],
            [".", ".", ".", ".", { label = "${local.lambda_name} / init max (ms)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 8
        width  = 12
        height = 6
        properties = {
          title  = "Memory used (MB)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # used_memory_max is the cgroup figure (Max Memory Used), which
            # counts the reclaimable /tmp page cache and so reads ~2 GB while
            # real demand is under 1 GB. Backs the memory_size comment in
            # lambda.tf.
            ["LambdaInsights", "used_memory_max", "function_name", local.lambda_name, { label = "${local.lambda_name} / memory max (MB)", stat = "Maximum" }],
          ]
        }
      },
      {
        type   = "metric"
        x      = 0
        y      = 14
        width  = 12
        height = 6
        properties = {
          title  = "/tmp used (bytes)"
          region = local.cloudwatch_region
          view   = "timeSeries"
          period = 60
          metrics = [
            # tmp_used tracks the ~800 MB provider download into /tmp, behind
            # the 4 GB ephemeral_storage sizing in lambda.tf.
            ["LambdaInsights", "tmp_used", "function_name", local.lambda_name, { label = "${local.lambda_name} / tmp used (B)", stat = "Maximum" }],
          ]
        }
      },

In CloudWatch the dashboard looks like this. Cold starts sit around 3s, acceptable for our use case, and the memory and /tmp metrics stay within the configured limits.

Workspace setup

The agent needs some context for the run. Pydantic-ai is itself stateless. The agent only has conversation history if the previous messages are passed to it as a message_history argument. We talked about this back in post 1. Here we also need to give the agent context about the workspace it is working in so that the root/project folder can be validated. Tool calls should usually just use a relative path to it, but we need to make sure everything ends up in this tmp folder, so we can pick up the end result of the agent's work. You also see files_read there; we use it to stop the agent editing files it has not read.

agent/tools.py

class WorkspaceDeps(BaseModel):
    """Per-run dependencies threaded through ``RunContext``.

    ``root`` is the directory the agent is allowed to read, write, and
    validate inside. Tools resolve every path relative to it and reject
    anything that escapes the root, so the agent cannot reach outside
    the workspace via ``..`` or absolute paths.
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)

    root: Path
    files_read: set[Path] = Field(default_factory=set)

Each time the LLM invokes any of our tools with a path we validate it against the workspace root. Luckily Python has the pathlib module that makes this easy.

agent/tools.py

def _resolve_absolute_path(ctx: RunContext[WorkspaceDeps], path: str) -> Path:
    root = ctx.deps.root.resolve()
    absolute_path = (root / Path(path)).resolve()
    if not absolute_path.is_relative_to(root):
        raise ModelRetry(
            f"Path {path} must be relative to the workspace root, "
            f"it cannot be absolute or walk up the directory tree."
        )
    return absolute_path
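To see why this single check is enough, here is a minimal standalone sketch of the same idea without pydantic-ai. The function name resolve_inside and the ValueError are stand-ins of mine for illustration; the real tool raises ModelRetry as shown above.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: Path, path: str) -> Path:
    """Resolve ``path`` against ``root`` and reject anything that lands outside it."""
    resolved_root = root.resolve()
    # Joining with an absolute path discards the root, and resolve() collapses
    # ".." segments and symlinks, so one containment check covers all cases.
    candidate = (resolved_root / path).resolve()
    if not candidate.is_relative_to(resolved_root):
        raise ValueError(f"{path} escapes the workspace root")
    return candidate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print("accepted:", resolve_inside(root, "modules/s3/main.tf"))
        for bad in ("../outside.tf", "/etc/passwd", "modules/../../x.tf"):
            try:
                resolve_inside(root, bad)
            except ValueError as exc:
                print("rejected:", exc)
```

Path.is_relative_to needs Python 3.9 or newer, which the 3.13 base image satisfies.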

File tools

We have list_files(path), read_file(path), write_file(path, contents), edit_file(path, old_string, new_string) and delete_file(path). We won't go through all of them, but show you the edit file tool. It uses _resolve_absolute_path(ctx, path) to resolve the path (always absolute in the end), which will also verify that the path is inside the workspace root. Then it validates additional constraints:

- The file exists
- The agent has previously read or created the file
- old_string (string to replace) is in the file contents
- old_string only exists once in the file contents (prevents accidental replacements)

One decision worth calling out: we raise ModelRetry instead of returning a description of the mistake the agent makes. The difference is that ModelRetry counts against the tool retry budget of the agent, while a string description telling the agent what to do next does not. It avoids long chains of corrections costing too many tokens. If that happens we would rather fail and raise an alert in a production environment, so that an SRE or the product team can investigate and fix the cause of the agent's confusion (better system prompt or tool descriptions). During initial validation it also allows you to weed out models that are not as good with tool calls, which makes them unsuitable for your task.

agent/tools.py

def edit_file(
    ctx: RunContext[WorkspaceDeps],
    path: str,
    old_string: str,
    new_string: str,
) -> None:
    """Replace ``old_string`` with ``new_string`` in the file at ``path``."""
    file = _resolve_absolute_path(ctx, path)
    if not file.exists():
        raise ModelRetry(f"File {path} does not exist.")
    if file not in ctx.deps.files_read:
        raise ModelRetry(f"File {path} was not read. If you want to edit it, then read it first.")
    with file.open("r+") as f:
        contents = f.read()
        if old_string not in contents:
            raise ModelRetry(f"String {old_string} not found in file {path}.")
        if contents.count(old_string) > 1:
            raise ModelRetry(f"String {old_string} found more than once in file {path}.")
        f.seek(0)
        f.write(contents.replace(old_string, new_string))
        f.truncate()
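The heart of edit_file is the "exactly one match" rule. Stripped of file handling and pydantic-ai, it can be sketched as a pure function; replace_once and the ValueError are hypothetical names for illustration, not part of the agent code.

```python
def replace_once(contents: str, old_string: str, new_string: str) -> str:
    """Replace ``old_string`` only when it occurs exactly once in ``contents``."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    occurrences = contents.count(old_string)
    if occurrences == 0:
        raise ValueError("old_string not found")
    if occurrences > 1:
        # An ambiguous match is an error: the caller has to add surrounding
        # context until the target is unique.
        raise ValueError(f"old_string found {occurrences} times")
    return contents.replace(old_string, new_string)


if __name__ == "__main__":
    tf = (
        'resource "aws_s3_bucket" "logs" {\n  bucket = "logs"\n}\n'
        'resource "aws_s3_bucket" "data" {\n  bucket = "data"\n}\n'
    )
    print(replace_once(tf, 'bucket = "logs"', 'bucket = "audit-logs"'))
    try:
        replace_once(tf, 'resource "aws_s3_bucket"', 'resource "aws_s3_bucket_v2"')
    except ValueError as exc:
        print("rejected:", exc)
```

Rejecting ambiguous matches forces the model to quote enough context to pin down one location, which is what prevents accidental replacements.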

The validate tool: agent-driven, any time

In this blog post we only offer terraform_init() and terraform_validate() to the agent for interacting with and validating Terraform code. terraform_validate, for example, has a very simple flow:

- Use a subprocess to run terraform validate.
- Check the result:
  - If the exit code is 0 (success), return an OK message.
  - If the exit code is non-zero (failure), raise ModelRetry with stdout and stderr.

agent/tools.py

def _validate(root: Path) -> tuple[bool, str]:
    """Run ``terraform validate`` in ``root``; return (passed, combined output).

    The agent reaches this through the tool below; the caller-side retry in
    core.py calls it directly to re-check the workspace after the run.
    """
    with track_memory("terraform_validate"):
        result = subprocess.run(
            ["terraform", "validate", "-no-color"],
            cwd=root,
            capture_output=True,
            text=True,
        )
    return result.returncode == 0, f"{result.stdout}{result.stderr}"


def terraform_validate(ctx: RunContext[WorkspaceDeps]) -> str:
    """Run ``terraform validate`` in the workspace and return its output."""
    ok, output = _validate(ctx.deps.root)
    if ok:
        return "OK: terraform validate passed."
    raise ModelRetry(f"terraform validate failed:\n{output}")
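If you want to try the exit-code mapping without terraform installed, the same pattern works with any command. In this sketch the Python interpreter stands in for the terraform binary, and run_check, as_tool_result, and the RuntimeError (in place of ModelRetry) are illustrative names of mine.

```python
import subprocess
import sys
from pathlib import Path


def run_check(command: list[str], cwd: Path) -> tuple[bool, str]:
    """Run ``command`` in ``cwd``; return (passed, combined stdout and stderr)."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    # Any non-zero exit code counts as a failure, not just 1.
    return result.returncode == 0, f"{result.stdout}{result.stderr}"


def as_tool_result(ok: bool, output: str) -> str:
    """Map the check result to what the model sees."""
    if ok:
        return "OK: check passed."
    # Stand-in for ModelRetry: the full output goes back as feedback.
    raise RuntimeError(f"check failed:\n{output}")


if __name__ == "__main__":
    passing = [sys.executable, "-c", "print('Success!')"]
    failing = [sys.executable, "-c", "import sys; sys.stderr.write('Error: bad block\\n'); sys.exit(1)"]
    print(as_tool_result(*run_check(passing, Path.cwd())))
    try:
        as_tool_result(*run_check(failing, Path.cwd()))
    except RuntimeError as exc:
        print(exc)
```

Returning stdout and stderr together matters because the model needs the whole diagnostic to decide what to edit next.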

A little bonus is the track_memory context manager, which creates a new span and attaches memory information to it. That lets us dive into the agent run in Logfire and see how much memory each step used.

[Figure: Logfire trace UI showing memory.<step> spans from track_memory, each tagged with mem.used_bytes, mem.cache_bytes, and mem.child_max_rss_bytes attributes.]
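agent/memory.py is not shown in this post, so the following is only a rough sketch of how a context manager like track_memory could capture the child-process figure. It uses the standard library resource module (Unix only) instead of Logfire spans, and track_child_memory and MemoryStats are hypothetical names.

```python
import resource
import subprocess
import sys
from collections.abc import Iterator
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class MemoryStats:
    step: str
    child_max_rss: int = 0  # raw ru_maxrss: KiB on Linux, bytes on macOS


@contextmanager
def track_child_memory(step: str) -> Iterator[MemoryStats]:
    """Record the peak RSS of waited-for child processes when the block exits."""
    stats = MemoryStats(step=step)
    try:
        yield stats
    finally:
        # RUSAGE_CHILDREN is a process-wide high-water mark over all children
        # reaped so far, so it never drops between steps.
        stats.child_max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    with track_child_memory("demo_step") as stats:
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    print(f"memory.{stats.step}: child_max_rss={stats.child_max_rss}")
```

The real helper also reports used and cache bytes; those come from somewhere other than getrusage and are left out here.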

This is the Agent definition, one instance reused across invocations. The error budget for ModelRetry can be configured in the Agent setup.

agent/core.py

# A soft reproducibility nudge following the standard Terraform pattern: a
# version constraint in the config, the exact version and checksums in the lock
# file. The tool-call spans in the trace are the ground truth for what the agent
# actually wrote.
_PROVIDER_PIN_RULE = (
    "When you add the AWS provider, give it a version constraint such as "
    '"~> 6.0" rather than leaving it unconstrained. terraform init records '
    "the exact resolved version and checksums in .terraform.lock.hcl, "
    "which travels with the workspace and is the reproducibility record, "
    "so the constraint does not need to be an exact pin. Pin an exact "
    "version only when the user asks for one. Example:\n"
    "\n"
    "terraform {\n"
    "  required_providers {\n"
    "    aws = {\n"
    '      source  = "hashicorp/aws"\n'
    '      version = "~> 6.0"\n'
    "    }\n"
    "  }\n"
    "}"
)

SYSTEM_PROMPT = (
    "You are the terraform-pr-agent. You operate on a Terraform workspace "
    "through file tools (list_files, read_file, write_file, edit_file, "
    "delete_file) and two terraform tools. Use the file tools to explore, "
    "write, and edit. Run terraform_init before your first validate and "
    "again whenever you add or change provider or module requirements. "
    "Call terraform_validate after you write or change files to confirm "
    "the workspace still parses; treat its output as feedback and edit "
    "until it is clean.\n\n" + _PROVIDER_PIN_RULE
)

# One Agent instance is reused across invocations: the tools reach the workspace
# through RunContext.deps, so each run_sync scopes them to a fresh WorkspaceDeps.
# The model carries no default; it is built from the registry at INVOKE and
# passed per run, so switching DEFAULT_MODEL needs no code change.
agent = Agent(
    deps_type=WorkspaceDeps,
    system_prompt=SYSTEM_PROMPT,
    tools=[
        list_files,
        read_file,
        write_file,
        edit_file,
        delete_file,
        terraform_init,
        terraform_validate,
    ],
    # Tools raise ModelRetry on failure; pydantic-ai ends the run once one tool
    # fails more than `retries` times in a row (a success resets the count). The
    # default of 1 would end the run on the second straight failing validate,
    # which is a normal part of the write-validate-edit loop, so the budget is
    # raised well past anything a converging run produces. The per-run turn cap
    # stays the runaway guard.
    retries=10,
)
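The retry rule described in the comment (consecutive failures, reset by a success) is easy to misread, so here is a tiny model of it. This only simulates the rule as stated above; it is not pydantic-ai's implementation, and run_ends_early is a made-up name.

```python
def run_ends_early(outcomes: list[bool], retries: int) -> bool:
    """Return True if consecutive failures ever exceed ``retries``.

    ``outcomes`` holds one entry per call to a single tool:
    True for a success, False for a call that asked for a retry.
    """
    consecutive_failures = 0
    for succeeded in outcomes:
        if succeeded:
            consecutive_failures = 0  # a success resets the count
            continue
        consecutive_failures += 1
        if consecutive_failures > retries:
            return True
    return False


if __name__ == "__main__":
    # validate fails twice, passes, fails once more, then passes.
    validate_calls = [False, False, True, False, True]
    print("retries=1 ends the run:", run_ends_early(validate_calls, retries=1))
    print("retries=10 ends the run:", run_ends_early(validate_calls, retries=10))
```

With the default of 1 this perfectly normal write-validate-edit sequence would be cut off at the second failure, which is why the budget is raised.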

Caller-side retry

Working with LLMs reminds me of the German saying:

Vertrauen ist gut, Kontrolle ist besser (Trust is good, verifying is better)

Here the agent can in principle ignore every instruction and report success while nothing actually works. Hence, we add some deterministic checks at the end and, if they fail, feed message_history back to the agent with a new prompt. At this stage this is really just making sure terraform validate does not fail.

agent/core.py

            # The agent can report done while terraform validate still fails.
            # Re-validate ourselves and feed any error back as a follow-up turn,
            # reusing the run id and message history so each retry is an
            # invoke_agent span under the one invocation trace. Give up after the
            # budget and raise so the failure is honest rather than a clean run
            # over a broken workspace.
            ok, output = _validate(root)
            attempts = 0
            while not ok and attempts < _MAX_VALIDATE_RETRIES:
                attempts += 1
                result = agent.run_sync(
                    _RETRY_PROMPT.format(output=output),
                    deps=deps,
                    conversation_id=run_id,
                    model=built,
                    message_history=result.all_messages(),
                    metadata={"model": model_name},
                )
                ok, output = _validate(root)
            if not ok:
                raise ValidateDidNotConverge(output)
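The shape of that loop is independent of the agent: check, ask for a fix, check again, and give up loudly after a fixed budget. A generic sketch with plain callables follows; verify_with_retries, DidNotConverge, and the fake workspace are illustrative assumptions rather than code from the repo.

```python
from collections.abc import Callable


class DidNotConverge(Exception):
    """Raised when the check still fails after the retry budget is spent."""


def verify_with_retries(
    check: Callable[[], tuple[bool, str]],
    fix: Callable[[str], None],
    max_retries: int,
) -> int:
    """Re-check after each fix attempt; return how many fixes were needed."""
    ok, output = check()
    attempts = 0
    while not ok and attempts < max_retries:
        attempts += 1
        fix(output)            # in the post: a follow-up agent turn with the error
        ok, output = check()   # never trust the fixer's own report
    if not ok:
        raise DidNotConverge(output)
    return attempts


if __name__ == "__main__":
    workspace = {"errors": 2}  # fake workspace with two problems

    def check() -> tuple[bool, str]:
        return workspace["errors"] == 0, f"{workspace['errors']} error(s) left"

    def fix(output: str) -> None:
        workspace["errors"] -= 1  # each attempt repairs one problem

    print("fix attempts needed:", verify_with_retries(check, fix, max_retries=3))
```

Raising instead of returning a status keeps a broken workspace from looking like a clean run.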

Parking the output in S3

In this iteration of the code we do not yet integrate GitHub, so to be able to see the code that the agent produced, we need to store it somewhere. This code stores it in an S3 bucket so we can inspect it later. We skip the .terraform folder, which is just local init scratch.

agent/runs.py

def _persist_run(run_id: str, workspace: Path, status: str, error: str | None = None) -> None:
    """Park the workspace and a minimal result marker under runs/<run_id>/.

    result.json carries only what the audit trace does not: prompt, output, and
    messages already live in the trace under the same conversation id, so copying
    them here would create a second source of truth.
    """
    bucket = require_env("RUNS_BUCKET")
    s3 = boto3.client("s3")
    for file in _workspace_files(workspace):
        key = f"runs/{run_id}/workspace/{file.relative_to(workspace)}"
        s3.put_object(Bucket=bucket, Key=key, Body=file.read_bytes())
    result = {"status": status} | ({"error": error} if error else {})
    s3.put_object(
        Bucket=bucket,
        Key=f"runs/{run_id}/result.json",
        Body=json.dumps(result).encode(),
    )


def _workspace_files(workspace: Path) -> Iterator[Path]:
    """Every file except .terraform/, which is init scratch plus the provider
    downloaded into /tmp, gigabytes of noise per run. The top-level
    .terraform.lock.hcl is the reproducibility record and stays.
    """
    for path in sorted(workspace.rglob("*")):
        if ".terraform" in path.relative_to(workspace).parts:
            continue
        if path.is_file():
            yield path
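
To make the filter concrete, the following sketch copies the generator above under the name workspace_files (so it runs on its own) and applies it to a throwaway directory. The file names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def workspace_files(workspace: Path) -> Iterator[Path]:
    """Same filter as _workspace_files above, copied so the sketch runs alone."""
    for path in sorted(workspace.rglob("*")):
        # Matches a path component that is exactly ".terraform", so the
        # top-level .terraform.lock.hcl file is not affected.
        if ".terraform" in path.relative_to(workspace).parts:
            continue
        if path.is_file():
            yield path


def demo() -> list[str]:
    """Build a fake workspace and return the relative paths that would be uploaded."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "main.tf").write_text("# placeholder\n")
        (root / ".terraform.lock.hcl").write_text("# placeholder\n")
        (root / "modules" / "net").mkdir(parents=True)
        (root / "modules" / "net" / "vpc.tf").write_text("# placeholder\n")
        (root / ".terraform" / "providers").mkdir(parents=True)
        (root / ".terraform" / "providers" / "plugin.bin").write_bytes(b"\0")
        return [p.relative_to(root).as_posix() for p in workspace_files(root)]


if __name__ == "__main__":
    for name in demo():
        print(name)
```

The check works on path components rather than on a string prefix, which is what keeps the lock file in while dropping everything below the .terraform directory at any depth.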

Tracing the whole invocation

The Lambda entry point stays thin: it requires a prompt, hands off to core.execute, and shapes the response. Its other job is the INIT wiring (configure telemetry, then wrap the handler), which has to run at import time. Keeping the execution logic separate from anything Lambda-related also means we could add other entry points later and deploy this somewhere else (e.g. ECS Fargate).

agent/lambda_entry.py

"""The Lambda boundary: parse the event, run the agent, shape the response.

This module also owns the INIT wiring. The container CMD targets
agent.lambda_entry.handler, so a unit test that imports agent.core never
configures logfire or registers the Firehose audit processor. Everything
Lambda-specific lives here, off the core, which is why no runtime-detection
check is needed to keep it out of tests.
"""

from __future__ import annotations

from typing import NotRequired, TypedDict

import logfire

from agent import observability
from agent.core import execute


class HandlerEvent(TypedDict):
    prompt: str
    model: NotRequired[str]


class HandlerResponse(TypedDict):
    status: str
    run_id: str
    model: str
    output: str


def handler(event: HandlerEvent, context: object) -> HandlerResponse:
    """Lambda entry point: require a prompt, run the agent, wrap the result.

    ``prompt`` is required; an event without one is a caller error and fails
    fast rather than running a default. ``model`` overrides DEFAULT_MODEL when
    given. A run that does not converge raises, so the Lambda reports 5xx and
    the workspace ships under status error for debugging.
    """
    prompt = event.get("prompt")
    if not prompt:
        raise ValueError("event missing required 'prompt'")
    result = execute(prompt, event.get("model"))
    return {
        "status": "ok",
        "run_id": result.run_id,
        "model": result.model,
        "output": result.output,
    }


def bootstrap() -> None:
    """Stand up telemetry, then attach the Lambda runtime adapter.

    configure() first so the tracer provider exists when the handler is wrapped.
    instrument_aws_lambda wraps the target named by _HANDLER
    (agent.lambda_entry.handler) in place, so each invocation becomes one trace.
    """
    observability.configure()
    logfire.instrument_aws_lambda(handler)


bootstrap()

Changing the span hierarchy also means the example query from a previous post for reading the audit data in S3 needs to change. We already changed the audit processor to ship if span.parent is None or span.parent.is_remote. In the query, we now pick out the agent invocations specifically by filtering on gen_ai.operation.name = 'invoke_agent'. This works with or without the Lambda OTEL data being written.

scripts/traces.sql

roots AS (
    -- One row per agent run: pydantic-ai's invoke_agent span, identified by
    -- the GenAI operation rather than by being the parentless span.
    -- instrument_aws_lambda now roots each trace at the SpanKind.SERVER
    -- invocation span, so the agent run is a child of it, not the trace root.
    -- (A caller-side retry would put several invoke_agent spans under one
    -- invocation; the trace_id join below would then cross them, so that
    -- case wants each chat tied to its enclosing run instead.)
    SELECT * FROM spans
    WHERE list_filter(attributes, x -> x.key = 'gen_ai.operation.name')[1]
        .value.stringValue = 'invoke_agent'
),
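
The ship condition quoted above can be illustrated in a few lines of Python. This is a sketch with stand-in dataclasses rather than the OpenTelemetry SDK types, and the span names are only examples. It shows why the parentless-span test no longer identifies the agent run once the Lambda invocation span sits above it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParentContext:
    """Stand-in for a parent span context; only is_remote matters here."""
    is_remote: bool


@dataclass(frozen=True)
class Span:
    """Stand-in for a finished span."""
    name: str
    parent: Optional[ParentContext] = None


def is_local_root(span: Span) -> bool:
    """True for the outermost span recorded in this process.

    Either there is no parent at all, or the parent lives in another
    process (a caller that propagated its trace context).
    """
    return span.parent is None or span.parent.is_remote


# Example hierarchy: the Lambda invocation span is the local root and the
# agent run is its child, so only the former passes the check.
lambda_span = Span("lambda invocation")
lambda_span_with_caller = Span("lambda invocation", ParentContext(is_remote=True))
agent_span = Span("invoke_agent", ParentContext(is_remote=False))
```

With this hierarchy the agent run is never a local root, which is why the query selects it by its operation name instead.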

What validate catches and what it doesn't

Currently we only run terraform validate, which checks that the configuration is syntactically valid and internally consistent (for example, that arguments match the provider schemas and references resolve). It does not catch poorly named Terraform variables, security issues, or misconfigurations that only show up during apply. In the following post we remedy some of this by giving the agent more tools to validate against and by enforcing extra checks such as security baselines.

End state

An agent that takes an English request and leaves a workspace in a validated state. File tools, validate tool, retry chain, and a structured self-report are all in place for posts 4-6 to build on.

Next: Conventions and policy: more tools, same feedback loop