Building a Deployment Platform on Self-Managed Infrastructure with k3s

At the beginning of this year, our production workloads were spread across AWS, Contabo and GoDaddy.

Most of the applications had predictable traffic patterns and fairly stable resource requirements. As more products were added, infrastructure costs became harder to ignore, particularly on AWS. The deployment workflows, infrastructure layouts and operational processes supporting those products had also become increasingly inconsistent.

Part of my responsibility became consolidating that environment, reducing unnecessary infrastructure costs and introducing a more consistent deployment workflow.

Over the last few months, that work resulted in a deployment platform built around GitHub Organizations, GitHub Actions, Ansible, k3s, SOPS and an internal application deployment operator written in Go.

This article covers the current version of that platform, the decisions behind it, and some of the problems encountered while building it.

Repository Management

Repositories were moved into a GitHub Organization and branch protection became consistent across projects.

Rulesets were configured to prevent direct pushes to main, require pull request reviews, enforce successful CI checks before merging and restrict who could bypass those requirements.

Since deployments would eventually be tied to merges into main, repository rules became part of the deployment platform itself. A failed CI check blocks a deployment. A missing review blocks a deployment. Direct pushes to main are no longer possible.

The result is that every change reaching production passes through the same validation path regardless of which repository it originated from.

Continuous Integration

The next step was standardizing CI across repositories using GitHub Actions. Most of our services are written in Go, so the same workflow could be reused with few changes:

Run formatting checks
Run linters
Execute tests
Build the application
Build and publish container images
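
The sequence above can be sketched as a small fail-fast runner. This is only an illustration under stated assumptions: the real pipeline is a GitHub Actions workflow, and the step names, the exact commands (including `go vet` standing in for a linter) and the `RunFunc` indirection are hypothetical choices made for this example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Step is one stage of the pipeline: a label plus the command that runs it.
type Step struct {
	Name string
	Cmd  []string
}

// RunFunc executes one command. Keeping it injectable allows a dry run.
type RunFunc func(name string, args ...string) error

// DefaultSteps mirrors the order listed above. The commands are assumptions;
// the real workflow may use different tools or flags.
func DefaultSteps(image string) []Step {
	return []Step{
		// gofmt -l exits 0 even when files need formatting, so test its output.
		{"format", []string{"sh", "-c", `test -z "$(gofmt -l .)"`}},
		{"lint", []string{"go", "vet", "./..."}},
		{"test", []string{"go", "test", "./..."}},
		{"build", []string{"go", "build", "./..."}},
		{"image", []string{"docker", "build", "-t", image, "."}},
		{"publish", []string{"docker", "push", image}},
	}
}

// Run executes steps in order and stops at the first failure,
// returning the names of the steps that completed.
func Run(steps []Step, run RunFunc) ([]string, error) {
	var done []string
	for _, s := range steps {
		if len(s.Cmd) == 0 {
			return done, fmt.Errorf("step %q has no command", s.Name)
		}
		if err := run(s.Cmd[0], s.Cmd[1:]...); err != nil {
			return done, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s.Name)
	}
	return done, nil
}

// ExecRun runs the command for real, streaming its output.
func ExecRun(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Dry run: print each command instead of executing it.
	// Pass ExecRun instead of dry to execute the steps.
	dry := func(name string, args ...string) error {
		fmt.Println(name, args)
		return nil
	}
	if _, err := Run(DefaultSteps("registry.example.com/app:dev"), dry); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Because the runner stops at the first failing step, an image is never built or published from code that failed formatting, linting or tests, which is the property the deployment stage relies on.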

Once CI became consistent across repositories, deployments could assume the same validation process had already taken place regardless of which application was being deployed.

Infrastructure Architecture

Most applications had predictable resource consumption and did not require the elasticity that AWS is particularly good at providing.

A comparison between AWS infrastructure costs and equivalent VPS resources on Contabo made the migration decision fairly straightforward, and several workloads were moved onto VPS infrastructure.

The resulting architecture was built around Contabo's private networking feature. Each server was connected to a private network and cluster communication was configured to use private IP addresses. Kubernetes control plane traffic, pod networking, database connections and application-to-application communication remained on the private network, while public access was limited to services that actually needed internet exposure.

The architecture ended up looking something like this:

Internet
    |
Traefik Ingress
    |
k3s Cluster
    |
+---------------+---------------+---------------+
|               |               |               |
Control Plane   Worker 1      Worker 2      Worker N
     \             |             |             /
      \____________Private Network____________/

Using private networking meant node-to-node communication never needed to traverse the public internet, and new nodes could join the cluster using private addresses rather than public endpoints.
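
One way to guard this property is to check node addresses before they are used to join the cluster. The sketch below is an assumption about how such a check might look, not part of the platform described here; the `Node` type, the node names and the sample addresses are hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Node pairs a hypothetical inventory name with the address it joins on.
type Node struct {
	Name string
	Addr string
}

// PublicNodes returns the names of nodes whose address is NOT in a private
// range (RFC 1918 for IPv4, RFC 4193 for IPv6), preserving input order.
func PublicNodes(nodes []Node) ([]string, error) {
	var public []string
	for _, n := range nodes {
		addr, err := netip.ParseAddr(n.Addr)
		if err != nil {
			return nil, fmt.Errorf("node %s: %w", n.Name, err)
		}
		if !addr.IsPrivate() {
			public = append(public, n.Name)
		}
	}
	return public, nil
}

func main() {
	nodes := []Node{
		{"control-plane", "10.0.0.1"},
		{"worker-1", "10.0.0.2"},
		{"worker-2", "203.0.113.7"}, // documentation range, not private
	}
	public, err := PublicNodes(nodes)
	if err != nil {
		panic(err)
	}
	fmt.Println("nodes with public addresses:", public)
}
```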

Infrastructure Provisioning

As the number of servers increased, infrastructure provisioning became repetitive.

Every machine required:

User creation
SSH configuration
Hostname configuration
Firewall configuration
Package installation
Cluster bootstrap steps

The first few servers were configured manually. Ansible appeared shortly afterwards.

Most of the provisioning process eventually became playbooks covering server configuration, SSH setup and k3s installation. The value of this became obvious when one of the worker nodes failed.

Replacing the node involved provisioning a replacement server, updating inventory and rerunning the playbooks rather than manually rebuilding configuration.

Deploying k3s

I evaluated kubeadm briefly before choosing k3s.

For the size of infrastructure I was managing, k3s solved most of the problems I cared about without introducing additional operational overhead.

The control plane, service discovery (CoreDNS), an ingress controller (Traefik) and a storage provisioner (local-path) were already packaged together.

I wasn't particularly interested in assembling Kubernetes components individually if a distribution already existed that solved the same problem.

Within a few hours the cluster was operational and workloads started moving over.

For a small engineering team operating a handful of products, the defaults provided by k3s have been difficult to argue against.

Deployment Automation

Around this period I started getting tired of maintaining Kubernetes manifests.

Most applications required the same collection of resources:

Deployments
Services
Ingresses
ConfigMaps
Secrets

The differences between applications were usually small, but they still resulted in maintaining large amounts of repetitive YAML.

I initially used Helm, but over time I found myself spending more effort maintaining templates and manifests than I wanted to.

Eventually I moved deployment definitions into Go using the Kubernetes client libraries and controller-runtime.

Working directly with the Kubernetes API gave me a much deeper understanding of Kubernetes resources than maintaining YAML manifests had. Deployments, Services, ConfigMaps and Secrets became resources constructed and managed directly in code.

Writing deployment definitions directly in Go became repetitive over time, so I started building an internal application deployment operator that abstracts much of the Kubernetes resource configuration behind a simpler deployment definition.
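
To make the idea concrete, here is a minimal sketch of a simpler definition expanding into Kubernetes resources. It is not the operator's actual design: the `AppSpec` type, its fields and the `Expand` function are hypothetical, and plain maps rendered as JSON stand in for the client library types so the example runs with the standard library alone.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AppSpec is a hypothetical simplified definition. The real operator's
// schema is not described in this article.
type AppSpec struct {
	Name     string
	Image    string
	Replicas int
	Port     int
}

// Manifest is an untyped Kubernetes object, used here in place of the
// typed structs from the client libraries.
type Manifest map[string]any

// Expand turns one AppSpec into a Deployment and a matching Service.
func Expand(app AppSpec) []Manifest {
	labels := map[string]any{"app": app.Name}
	meta := map[string]any{"name": app.Name, "labels": labels}

	deployment := Manifest{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"metadata":   meta,
		"spec": map[string]any{
			"replicas": app.Replicas,
			"selector": map[string]any{"matchLabels": labels},
			"template": map[string]any{
				"metadata": map[string]any{"labels": labels},
				"spec": map[string]any{
					"containers": []any{map[string]any{
						"name":  app.Name,
						"image": app.Image,
						"ports": []any{map[string]any{"containerPort": app.Port}},
					}},
				},
			},
		},
	}

	service := Manifest{
		"apiVersion": "v1",
		"kind":       "Service",
		"metadata":   meta,
		"spec": map[string]any{
			"selector": labels,
			"ports":    []any{map[string]any{"port": app.Port, "targetPort": app.Port}},
		},
	}

	return []Manifest{deployment, service}
}

func main() {
	app := AppSpec{
		Name:     "billing-api", // hypothetical application
		Image:    "registry.example.com/billing-api:1.0.0",
		Replicas: 2,
		Port:     8080,
	}
	for _, m := range Expand(app) {
		out, err := json.MarshalIndent(m, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

The point of the abstraction is that the four fields in `AppSpec` are the only parts that differ between applications; labels, selectors and port wiring are derived rather than repeated.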

The operator is still evolving, but it has become the primary way applications are deployed into the cluster.

I'll cover its design and implementation in a future article.

Secret Management

Keeping environment variables synchronized across applications and environments became increasingly difficult as the platform grew.

I looked at Vault and Infisical, but both introduced another service to operate. For the size of the team and operational requirements, SOPS with Age encryption was a better fit.

Environment files are encrypted with SOPS and committed to Git. During deployment, the files are decrypted and converted into Kubernetes ConfigMaps and Secrets.

The Age private key is stored outside the repository and is only available to the trusted deployment environments that decrypt configuration at deploy time.

The secret workflow looks roughly like this:

.env
  |
SOPS + Age Encrypt
  |
Git Repository
  |
Deployment Environment
  |
Age Private Key
  |
Decrypt
  |
Kubernetes Secrets / ConfigMaps
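
The last step of that flow, turning decrypted environment variables into ConfigMap and Secret data, might look like the sketch below. It assumes the plaintext has already been produced by `sops --decrypt`, and the rule for deciding which keys are sensitive (a suffix match on the key name) is a hypothetical convention chosen for this example, not something the platform is known to use.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"strings"
)

// sensitiveSuffixes is an assumed naming convention for this sketch.
var sensitiveSuffixes = []string{"_SECRET", "_PASSWORD", "_TOKEN", "_KEY"}

// ParseEnv reads KEY=VALUE lines, skipping blank lines and # comments.
// Quote handling is deliberately simple: surrounding double quotes are dropped.
func ParseEnv(content string) (map[string]string, error) {
	vars := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for line := 1; sc.Scan(); line++ {
		text := strings.TrimSpace(sc.Text())
		if text == "" || strings.HasPrefix(text, "#") {
			continue
		}
		key, value, ok := strings.Cut(text, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			return nil, fmt.Errorf("line %d: expected KEY=VALUE", line)
		}
		vars[key] = strings.Trim(strings.TrimSpace(value), `"`)
	}
	return vars, sc.Err()
}

// isSensitive reports whether a key matches the assumed suffix convention.
func isSensitive(key string) bool {
	for _, suffix := range sensitiveSuffixes {
		if strings.HasSuffix(key, suffix) {
			return true
		}
	}
	return false
}

// Split separates variables into ConfigMap data (plain text) and Secret
// data (base64-encoded, as the Secret "data" field requires).
func Split(vars map[string]string) (configData, secretData map[string]string) {
	configData, secretData = map[string]string{}, map[string]string{}
	for k, v := range vars {
		if isSensitive(k) {
			secretData[k] = base64.StdEncoding.EncodeToString([]byte(v))
		} else {
			configData[k] = v
		}
	}
	return configData, secretData
}

func main() {
	// Stand-in for the plaintext that decryption would produce.
	decrypted := "# sample values\nAPP_PORT=8080\nLOG_LEVEL=info\nDB_PASSWORD=\"s3cret\"\n"
	vars, err := ParseEnv(decrypted)
	if err != nil {
		panic(err)
	}
	configData, secretData := Split(vars)
	fmt.Println("ConfigMap keys:", len(configData))
	fmt.Println("Secret keys:", len(secretData))
}
```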

Kubernetes RBAC

Giving GitHub access to deploy applications introduced another problem: authentication to the cluster.

Kubernetes addresses this with service accounts for identity and RBAC for scoped permissions.
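
As an illustration of what scoped permissions can look like, the sketch below builds a namespaced Role and RoleBinding for a deployment service account. The namespace, the names and the particular verbs and resources granted are assumptions for this example, and the objects are rendered as JSON with the standard library rather than created through the client libraries.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PolicyRule mirrors the shape of a rule in an RBAC Role.
type PolicyRule struct {
	APIGroups []string `json:"apiGroups"`
	Resources []string `json:"resources"`
	Verbs     []string `json:"verbs"`
}

const rbacGroup = "rbac.authorization.k8s.io"

// DeployerRole grants only what a deployment needs, inside one namespace.
// The verb and resource selection is an assumption for this sketch.
func DeployerRole(namespace, name string) map[string]any {
	verbs := []string{"get", "list", "create", "update", "patch"}
	return map[string]any{
		"apiVersion": rbacGroup + "/v1",
		"kind":       "Role",
		"metadata":   map[string]any{"name": name, "namespace": namespace},
		"rules": []PolicyRule{
			{[]string{"apps"}, []string{"deployments"}, verbs},
			{[]string{""}, []string{"services", "configmaps", "secrets"}, verbs},
			{[]string{"networking.k8s.io"}, []string{"ingresses"}, verbs},
		},
	}
}

// DeployerBinding attaches the Role to a service account in the same namespace.
func DeployerBinding(namespace, role, serviceAccount string) map[string]any {
	return map[string]any{
		"apiVersion": rbacGroup + "/v1",
		"kind":       "RoleBinding",
		"metadata":   map[string]any{"name": role + "-binding", "namespace": namespace},
		"subjects": []map[string]string{
			{"kind": "ServiceAccount", "name": serviceAccount, "namespace": namespace},
		},
		"roleRef": map[string]string{"apiGroup": rbacGroup, "kind": "Role", "name": role},
	}
}

func main() {
	// "apps", "deployer" and "github-deployer" are hypothetical names.
	objects := []map[string]any{
		DeployerRole("apps", "deployer"),
		DeployerBinding("apps", "deployer", "github-deployer"),
	}
	for _, obj := range objects {
		out, err := json.MarshalIndent(obj, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

Using a Role rather than a ClusterRole keeps the identity confined to a single namespace, so a leaked credential cannot touch workloads elsewhere in the cluster.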

That approach worked, but managing service account tokens quickly became an operational concern.

At the moment I use a combination of restricted Kubernetes identities and trusted administrative machines for deployment operations.

I am currently evaluating Argo CD as a GitOps layer on top of the deployment platform, allowing the cluster to reconcile changes from Git rather than accepting deployments pushed directly from GitHub.

Stateful Workloads

Stateful workloads required a different approach from application workloads.

Running databases as StatefulSets works, but using local-path storage introduces an important limitation: the persistent volume is tied to the node where it was created. If that node goes down, Kubernetes cannot simply reschedule the pod elsewhere, because the data does not move with it.

For now, databases still run as StatefulSets on local-path storage, with backups and restore procedures treated as the primary recovery path. This keeps the architecture simple, but it also means node failure recovery depends on either recovering the original node or restoring the database from backup.

As operational requirements grow, the usual next steps are dedicated database infrastructure, managed database services or distributed storage solutions such as Longhorn.

Current Architecture

Today the platform runs on a k3s cluster hosted on Contabo infrastructure and connected through private networking.

GitHub Actions handles CI, Ansible provisions infrastructure, SOPS manages secrets and applications are deployed through an internal application deployment operator built on top of the Kubernetes API.