I wanted to deploy an LLM inference API without spending $1,200/month on AWS GPU instances. OCI turned out to be significantly cheaper, and the Docker workflow was identical. Here's what I set up.
Why I Looked at OCI for GPU Workloads
I've been building GPU infrastructure tools for a while now (keda-gpu-scaler, otel-gpu-receiver, GPU NUMA scheduling for Volcano), and most of my testing was on AWS. The g5.xlarge instances with A10G GPUs (AWS's variant of the A10) run about $1.01/hr, plus $73/month for the EKS control plane. It adds up fast when you're iterating.
Someone on the Volcano Slack mentioned OCI's GPU pricing and I was skeptical. But when I looked it up, the numbers were real: the same class of A10 hardware, no charge for the OKE control plane, and preemptible A10s at a fraction of AWS's on-demand rate. So I tried moving a vLLM inference workload over.
OCI GPU Pricing
Here's what OCI actually charges for GPU instances. I had to double-check these because they seemed too low:
| Shape | GPU | GPU Memory | OCPUs | RAM | Price/hr (on-demand) |
|---|---|---|---|---|---|
| VM.GPU.A10.1 | 1x A10 | 24 GB | 15 | 240 GB | ~$1.65 |
| VM.GPU.A10.2 | 2x A10 | 48 GB | 30 | 480 GB | ~$3.30 |
| BM.GPU.A100-v2.8 | 8x A100 | 640 GB | 128 | 2 TB | ~$25.00 |
| BM.GPU.H100.8 | 8x H100 | 640 GB | 112 | 2 TB | ~$38.00 |
| VM.GPU.A10.1 (preemptible) | 1x A10 | 24 GB | 15 | 240 GB | ~$0.50 |
That preemptible A10 price made me do a double-take. At $0.50/hr, an always-on instance works out to about $365/month ($0.50 × 730 hours). I was paying more than twice that on AWS for the same class of hardware.
Building the Inference Image
I used vLLM because it's what I was already running on AWS. The Dockerfile doesn't change at all between clouds, which is the whole reason I'm using containers in the first place.
# Dockerfile.inference
# Runtime-only CUDA base; the NVIDIA Container Toolkit injects the driver at run time.
# Note: nvidia/cuda tags include the patch version (there is no 12.4-runtime tag).
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Pin vLLM so rebuilds stay reproducible
RUN pip3 install --no-cache-dir \
    vllm==0.6.0 \
    fastapi \
    uvicorn

EXPOSE 8000

# vLLM's OpenAI-compatible server exposes /health once the model is loaded
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "microsoft/Phi-3-mini-4k-instruct", \
     "--max-model-len", "4096", \
     "--gpu-memory-utilization", "0.9"]
Build and test locally (you'll need an NVIDIA GPU and the NVIDIA Container Toolkit installed):
# Build
docker build -f Dockerfile.inference -t gpu-inference:v1 .

# Run with GPU access
docker run --gpus all -p 8000:8000 gpu-inference:v1

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "Explain Kubernetes in one sentence:",
    "max_tokens": 50
  }'
--gpus all is the magic flag. It tells Docker to use the NVIDIA Container Toolkit, which injects the GPU device files and driver libraries into the container at runtime. Your image only needs the CUDA runtime libraries, not the full driver stack.
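If you want to confirm the toolkit wiring before starting the full inference image, nvidia-smi from a throwaway CUDA container is the quickest check. A minimal sketch; the image tag here is just an example, any recent nvidia/cuda base works:

```bash
# This should print the same GPU table you'd see on the host.
# If it errors, the NVIDIA Container Toolkit isn't set up correctly.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```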
If You Don't Have a Local GPU
I do most of my development on a Mac, which obviously doesn't have an NVIDIA GPU. Docker Model Runner is what I use to test the LLM interaction pattern locally:
docker model pull ai/phi3-mini
docker model run ai/phi3-mini "Explain Kubernetes in one sentence"
The API is OpenAI-compatible so the client code I write against Model Runner works unchanged against vLLM in production. I've been using this for prompt template iteration and it cut my feedback loop from 20+ minutes (push to registry, wait for K8s pull, test) to about 15 seconds.
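To keep that portability honest, I drive everything through one base URL. A sketch of the pattern, using the endpoints that appear elsewhere in this post (the variable names are mine, swap in your own values):

```bash
# Local: Docker Model Runner
BASE_URL=http://localhost:12434/engines/phi3-mini
MODEL=ai/phi3-mini

# Production: vLLM behind the OKE load balancer (uncomment to switch)
# BASE_URL=http://$LB_IP
# MODEL=microsoft/Phi-3-mini-4k-instruct

# The request body stays identical either way
curl "$BASE_URL/v1/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"prompt\": \"Explain Kubernetes in one sentence:\", \"max_tokens\": 50}"
```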
Pushing to OCIR
# Login to OCIR
docker login iad.ocir.io -u '<tenancy-namespace>/oracleidentitycloudservice/<email>'
# Tag
docker tag gpu-inference:v1 iad.ocir.io/<tenancy>/gpu-inference/vllm:v1
# Scan before push
docker scout cves gpu-inference:v1 --only-severity critical,high
# Push
docker push iad.ocir.io/<tenancy>/gpu-inference/vllm:v1
Fair warning: GPU images are big. Mine was about 8GB. The first push took a while, but after that Docker's layer caching means only changed layers get uploaded. Most rebuilds push in under a minute.
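If you're curious where those 8GB actually live, docker history breaks the image down per layer. It also makes it obvious which layers get re-pushed when you change a given Dockerfile line:

```bash
# Per-layer sizes: the CUDA base and the pip install layer dominate.
# Layers above an unchanged line stay cached and aren't re-uploaded.
docker history --human gpu-inference:v1
```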
Setting Up OKE with GPU Nodes
# Create cluster (control plane is free)
oci ce cluster create \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-inference-cluster \
  --vcn-id $VCN_ID \
  --endpoint-subnet-id $API_SUBNET_ID \
  --service-lb-subnet-ids '["'$LB_SUBNET_ID'"]'

# Create GPU node pool
oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-a10-pool \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 2,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$WORKER_SUBNET_ID'"
    }]
  }' \
  --node-source-details '{
    "imageId": "'$GPU_IMAGE_ID'",
    "sourceType": "IMAGE"
  }' \
  --initial-node-labels '[{
    "key": "nvidia.com/gpu",
    "value": "present"
  }]'
One thing I liked about OKE — the GPU node pools come with the NVIDIA device plugin already installed. On EKS I had to install the device plugin myself via a DaemonSet. Here it just works, and nvidia.com/gpu shows up as a schedulable resource immediately.
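Still worth verifying before you deploy anything. A quick check that the node-pool label landed and the extended resource is advertised (standard kubectl, nothing OKE-specific):

```bash
# Nodes carrying the label set on the node pool above
kubectl get nodes -l nvidia.com/gpu=present

# Each GPU node should advertise nvidia.com/gpu as allocatable;
# dots inside the resource key are escaped with a backslash
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'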
Deploying the Inference Service
# inference-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: iad.ocir.io/<tenancy>/gpu-inference/vllm:v1
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        livenessProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 120
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 60
          periodSeconds: 10
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
      imagePullSecrets:
      - name: ocir-secret
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: inference
  annotations:
    oci.oraclecloud.com/load-balancer-type: "lb"
    service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "10"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
spec:
  type: LoadBalancer
  selector:
    app: vllm-inference
  ports:
  - port: 80
    targetPort: http
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: inference
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: oci-bv
  resources:
    requests:
      storage: 50Gi
A few things I learned the hard way while setting this up:
The nvidia.com/gpu: 1 in resource limits is how Kubernetes knows to schedule this on a GPU node. Forget it and your pod lands on a CPU node and crashes.
The PVC for model cache is important. Without it, the model downloads from HuggingFace every time the pod restarts. Phi-3-mini is a few GB — that's 5-10 minutes of startup time you don't want to repeat.
The initialDelaySeconds: 120 on the liveness probe took me a restart loop to figure out. Model loading is slow. If your liveness probe fires before the model is loaded, Kubernetes kills the pod, it restarts, starts loading again, gets killed again... you get the idea. Give it at least 2 minutes, or use a startupProbe, which I sketch below.
The OCI Load Balancer annotations tell OKE to automatically provision a load balancer. No separate Terraform resource needed.
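On the probe point: a startupProbe is the cleaner fix for slow model loads, because it holds off liveness checks until the first successful health check instead of betting on a fixed delay. A minimal sketch as a kubectl strategic-merge patch; the threshold numbers are assumptions to tune, and in practice you'd bake this into the manifest rather than patching:

```bash
# Allow up to 30 x 10s = 5 minutes for the model to load before
# the liveness probe takes over
kubectl -n inference patch deployment vllm-inference --patch '
spec:
  template:
    spec:
      containers:
      - name: vllm
        startupProbe:
          httpGet:
            path: /health
            port: http
          failureThreshold: 30
          periodSeconds: 10
'
```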
Deploy:
kubectl create namespace inference

# Create OCIR pull secret
kubectl create secret docker-registry ocir-secret \
  --namespace inference \
  --docker-server=iad.ocir.io \
  --docker-username='<tenancy>/<user>' \
  --docker-password='<auth-token>'

# Create HuggingFace token secret (from OCI Vault ideally)
kubectl create secret generic hf-token \
  --namespace inference \
  --from-literal=token=$HF_TOKEN
kubectl apply -f inference-deployment.yaml
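I watch the rollout and the model download from the logs rather than polling kubectl get pods. Nothing exotic, just the standard commands:

```bash
# Block until the deployment is fully rolled out
kubectl -n inference rollout status deployment/vllm-inference

# Watch vLLM download and load the model
kubectl -n inference logs -f deployment/vllm-inference
```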
After a few minutes (mostly model download time), the service is up and accessible via the load balancer:
LB_IP=$(kubectl get svc vllm-inference -n inference -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl http://$LB_IP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "What is Oracle Cloud Infrastructure?",
    "max_tokens": 100
  }'
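If you just want the generated text rather than the full JSON envelope, jq gets you there (assumes jq is installed):

```bash
# Extract only the completion text from the OpenAI-style response
curl -s http://$LB_IP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-3-mini-4k-instruct", "prompt": "What is OKE?", "max_tokens": 50}' \
  | jq -r '.choices[0].text'
```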
Monitoring GPU Utilization
Once the inference service was running, I wanted to see actual GPU utilization. Without this you're flying blind — you have no idea if the GPU is sitting at 10% or 95%. DCGM Exporter gives you Prometheus metrics:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set serviceMonitor.enabled=true
This gives you DCGM_FI_DEV_GPU_UTIL (GPU utilization), DCGM_FI_DEV_MEM_COPY_UTIL (memory bandwidth utilization), temperature, power draw, and more. I have a Grafana dashboard that shows all of these, and it's been useful for right-sizing.
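You can also query these from the command line instead of Grafana. A sketch assuming a Prometheus service named prometheus-server in the monitoring namespace; adjust both for your install:

```bash
# Port-forward Prometheus locally (service name depends on your setup)
kubectl -n monitoring port-forward svc/prometheus-server 9090:9090 &
sleep 2

# Average GPU utilization per GPU over the last 5 minutes
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])'
```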
I also built otel-gpu-receiver which does something similar but for OpenTelemetry. If you're already running an OTel collector, it might be a better fit than DCGM Exporter.
What I'm Actually Paying
Here's the monthly bill comparison for running Phi-3-mini on a single A10, always-on:
| Platform | Setup | Monthly Cost |
|---|---|---|
| OCI OKE + VM.GPU.A10.1 | Managed K8s + GPU node | ~$1,210 |
| OCI OKE + preemptible A10 | Same, but preemptible | ~$365 |
| AWS EKS + g5.xlarge | Managed K8s + GPU node | ~$1,100 + $73 (control plane) |
| GCP GKE + g2-standard-4 | Managed K8s + GPU node | ~$1,300 + $73 (control plane) |
| Azure AKS + NC4as_T4_v3 | Managed K8s + T4 GPU | ~$550 (but a much less powerful GPU) |
The free control plane saves $73/mo by itself compared to EKS or GKE. And for my dev/test workloads I switched to preemptible instances, which dropped the GPU cost to $365/mo. The nodes get reclaimed occasionally and the pods evicted with them, but for development that's fine.
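Preemptible capacity is configured on the node pool, not per pod. I haven't reproduced my exact command, but the shape of it is the earlier node-pool create with a preemptibleNodeConfig added to the placement config; treat the field names here as my reading of the OKE docs and verify before relying on them:

```bash
# Same as the on-demand pool, plus preemptible placement.
# Field names are an assumption; check the OKE node pool API reference.
oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --kubernetes-version v1.30.1 \
  --name gpu-a10-preemptible-pool \
  --node-shape VM.GPU.A10.1 \
  --node-config-details '{
    "size": 2,
    "placementConfigs": [{
      "availabilityDomain": "Uocm:US-ASHBURN-AD-1",
      "subnetId": "'$WORKER_SUBNET_ID'",
      "preemptibleNodeConfig": {
        "preemptionAction": {"type": "TERMINATE", "isPreserveBootVolume": false}
      }
    }]
  }'
```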
Local Dev with Docker Model Runner
I keep coming back to this because it changed how I work. Before Model Runner, testing a prompt change meant: edit prompt, rebuild image, push to OCIR, wait for OKE to pull it, test, realize it's wrong, repeat. Twenty minutes per iteration.
Now I just run the model locally:
# Pull a model
docker model pull ai/phi3-mini

# Run inference
docker model run ai/phi3-mini "Summarize: Oracle Cloud Infrastructure provides..."

# Or use the API endpoint
curl http://localhost:12434/engines/phi3-mini/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is OKE?", "max_tokens": 50}'
Same API, same prompt format. When the prompt works locally, I rebuild the production image and push. The container is what makes this portable.
Was It Worth Switching?
Honestly, yes. The Docker workflow didn't change at all — same Dockerfile, same docker build, same docker push. I just changed the registry URL and the Kubernetes annotations. The inference service runs the same. The GPU utilization is the same. The API responses are the same.
What changed is the bill. And the fact that I don't pay $73/month for a Kubernetes control plane anymore. If you're running GPU workloads on AWS or GCP and haven't priced out OCI, it's worth 30 minutes of your time.
Pavan Madduri — Oracle ACE Associate, CNCF Golden Kubestronaut. I build GPU infrastructure tools and write about Kubernetes. GitHub | LinkedIn