Why Linux Kills Your App Without Warning (The OOM Killer, Explained)
There's a specific flavor of production incident that everyone encounters eventually: your app just... disappears. No stack trace. No crash log. No application errors. The process is running, and then it isn't. If you've ever been on-call at 2am staring at systemd telling you Main process exited, code=killed, signal=KILL with zero explanation, this post is for you.
I turned this exact scenario into one of 18 debugging exercises at scenar.site - practice it interactively with an AI interviewer. Details at the end.
The Setup
Java service on an 8GB server. Runs for hours, sometimes days, then dies. Developers swear there's no bug. You check the logs - nothing. Just normal operation right up until the end, then silence.
$ systemctl status myapp
● myapp.service - My Java Application
Loaded: loaded (/etc/systemd/system/myapp.service; enabled)
Active: failed (Result: signal) since Tue 2026-01-20 14:23:15 UTC
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Main process exited, code=killed, signal=KILL
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Failed with result 'signal'.
signal=KILL. SIGKILL. The nuclear option. The signal you can't catch, can't ignore, can't clean up from. Something sent SIGKILL to your Java process.
The First Wrong Turn
Most people (past me included) start grepping application logs. You won't find anything. SIGKILL never reaches the process's own code - no signal handler runs, no shutdown hook fires, no log buffer gets flushed. The app didn't crash - it was executed.
The right first move is to ask: who could have killed this? Your options are:
- A human with sudo (ask around)
- A monitoring tool with an auto-remediation rule (check)
- The kernel itself
If it's #3, there's exactly one culprit: the OOM killer.
Finding the OOM Killer in the Act
The OOM killer writes to the kernel ring buffer. You read it with dmesg:
$ dmesg | grep -i oom
[482341.234] myapp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[482341.234] Out of memory: Killed process 8921 (java) total-vm:6291456kB, anon-rss:4194304kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0
[482341.235] oom_reaper: reaped process 8921 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
There it is. The kernel killed PID 8921 (your Java process) to free memory. anon-rss:4194304kB = your process was using 4GB of anonymous RSS when it got killed.
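On a noisy box the ring buffer is full of unrelated lines, so it helps to pull out just the killed PID and its anonymous RSS. A small awk sketch - run here against a saved copy of the line above so it's self-contained; in production you'd pipe `dmesg` (or `journalctl -k`) into the same filter:

```shell
# Extract the PID and anon-rss from an "Out of memory: Killed process" line.
oom_summary() {
  awk '/Killed process/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "process") pid = $(i + 1)
      if ($i ~ /^anon-rss:/) { sub(/^anon-rss:/, "", $i); sub(/,$/, "", $i); rss = $i }
    }
    print pid, rss
  }'
}

# Sample captured from the dmesg output above:
line='[482341.234] Out of memory: Killed process 8921 (java) total-vm:6291456kB, anon-rss:4194304kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0'
echo "$line" | oom_summary   # prints: 8921 4194304kB
```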
Why Does the Kernel Do This?
Linux overcommits memory. When you malloc(1GB), the kernel says "sure" without actually having 1GB free. It's a bet that most processes won't use all the memory they request. Usually that's fine. When it's not fine, the kernel has to pick a process to kill to free memory - because the alternative is the whole system locking up.
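You can check which overcommit policy your box is running - these are standard kernel sysctls, readable without root:

```shell
# vm.overcommit_memory: 0 = heuristic (the default), 1 = always grant,
# 2 = strict accounting (allocations fail up front instead of OOM-killing later)
cat /proc/sys/vm/overcommit_memory

# How much memory has been promised (Committed_AS) vs. the commit limit
grep -E '^Commit' /proc/meminfo
```

Setting `vm.overcommit_memory=2` trades OOM kills for malloc failures - which most server software handles badly - so it's rarely the answer.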
The selection is based on oom_score (higher = more likely to be killed). Check it live:
$ ps aux --sort=-%mem | head
USER PID %CPU %MEM RSS COMMAND
mysql 2341 15.2 45.5 3728000 /usr/sbin/mysqld
root 3456 5.1 28.3 2320000 /usr/bin/prometheus
elastic 4567 8.4 18.2 1490000 /usr/share/elasticsearch/jdk/bin/java
And the OOM scores (snapshotted before the kill - /proc/8921 disappears once the process is dead):
$ cat /proc/2341/oom_score
456
$ cat /proc/8921/oom_score
512
The Java app had the highest score, so it got picked. On modern kernels the score is essentially the process's memory footprint (RSS, swap, page tables) as a share of available memory, shifted by oom_score_adj - older heuristics like process age were removed long ago.
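To see the full leaderboard rather than cherry-picking PIDs, you can walk /proc - a read-only sketch, no root needed since oom_score is world-readable:

```shell
# Print the five processes the OOM killer would reach for first:
# score, PID, command name, highest score on top. Processes can vanish
# between the glob and the cat, so failures are silently skipped.
for p in /proc/[0-9]*; do
  score=$(cat "$p/oom_score" 2>/dev/null) || continue
  echo "$score ${p#/proc/} $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -5
```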
The Short-Term Fix
You have three options for right now:
1. Reduce the Java heap. If your systemd unit has -Xmx4g, the JVM will absolutely use 4GB. Drop it:
ExecStart=/usr/bin/java -Xmx2g -jar /opt/myapp/app.jar
2. Protect critical processes with oom_score_adj. Range is -1000 (never kill) to 1000 (kill first):
$ echo -500 > /proc/$(pidof java)/oom_score_adj
Don't set -1000 unless you're sure you want that process to be the last thing running before the kernel panics. I've seen "protected" apps keep a dying system from recovering.
3. Add swap. Buys you time but swapping kills performance. Emergency only.
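One caveat on option 2: the echo into /proc doesn't survive a service restart, let alone a reboot. systemd can set the same knob declaratively via its `OOMScoreAdjust=` directive - a unit-file sketch:

```ini
# In the service's unit file (or a drop-in): systemd writes this value
# into /proc/<pid>/oom_score_adj for you every time the service starts.
[Service]
OOMScoreAdjust=-500
```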
The Long-Term Fix
The real problem is usually that the server is overcommitted. MySQL wants 4GB, Prometheus wants 2GB, Elasticsearch wants 1.5GB, Java wants 4GB - on an 8GB box. That's 11.5GB of ask on 8GB of capacity. Something has to give eventually.
The right answers:

- Use cgroups / systemd MemoryMax= to enforce limits per service. This is the proper fix: each service gets a hard ceiling, and a service that exceeds its own cgroup limit gets OOM-killed inside that cgroup (per-cgroup killing is the default behavior) without taking the whole box down.
[Service]
ExecStart=/usr/bin/java -Xmx2g -jar /opt/myapp/app.jar
MemoryMax=2.5G
MemoryHigh=2G
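If you'd rather not edit the main unit file, the same limits work as a systemd drop-in (the path below follows the standard drop-in convention for this hypothetical service):

```ini
# /etc/systemd/system/myapp.service.d/memory.conf
# Then: systemctl daemon-reload && systemctl restart myapp
[Service]
MemoryHigh=2G
MemoryMax=2.5G
```

MemoryHigh= throttles and reclaims aggressively before the hard MemoryMax= cap, so you get a slowdown (and a PSI signal) before you get a kill.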
- Move workloads off shared hosts. Put the DB on its own box, the app servers on theirs. Stop co-locating memory-hungry services.
- Monitor memory pressure, not just memory usage. /proc/pressure/memory (PSI) tells you when processes are stalled waiting for memory, which is a much earlier signal than "out of memory" alerts.
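A sketch of pulling the avg10 figure out of the PSI file format - run here against a saved sample of the file so it's self-contained; point it at /proc/pressure/memory on a real box:

```shell
# Extract avg10 from the "some" line of a PSI file.
# "some" = at least one task stalled on memory; "full" = all tasks stalled.
psi_avg10() {
  awk '/^some/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
  }' "$1"
}

# Sample in the real /proc/pressure/memory format (values made up):
printf 'some avg10=12.34 avg60=5.67 avg300=1.89 total=123456\nfull avg10=0.00 avg60=0.00 avg300=0.00 total=98765\n' > /tmp/psi_sample
psi_avg10 /tmp/psi_sample   # prints: 12.34
```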
Prevention Checklist
Before the next OOM kill:
- Every service has a MemoryMax= in its systemd unit
- Alert on available memory < 10% for 5 minutes, not just on events after death
- Alert on memory PSI (avg10 > 10) - catches swapping and thrashing before OOM
- Java apps have -XX:+HeapDumpOnOutOfMemoryError so you get something when the JVM itself runs out of heap (different from OS OOM)
- Document which processes are "protected" (oom_score_adj < 0) and why
What Interviewers Look For
If this comes up in an SRE interview, they're testing for:
- Do you know SIGKILL can't be caught, so absence of logs is a clue, not a failure?
- Do you go to dmesg without being told to?
- Can you explain why the kernel kills processes (overcommit, not "a bug")?
- Do you talk about cgroups as the structural fix, not just tuning oom_score_adj?
- Can you distinguish OS-level OOM from JVM-level OutOfMemoryError?
The last one catches a surprising number of people - they conflate the two and assume bumping -Xmx fixes everything, when actually a bigger heap makes OS-level OOM more likely.
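Concretely, the JVM-side safety net from the checklist looks like this in the unit file (the flags are real HotSpot options; the dump path is an assumption for this hypothetical service):

```ini
[Service]
ExecStart=/usr/bin/java -Xmx2g \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/var/log/myapp/heapdump.hprof \
    -jar /opt/myapp/app.jar
```

A heap dump appears only on a JVM-level OutOfMemoryError; an OS-level OOM kill is SIGKILL and leaves nothing behind - which is exactly how you tell the two apart.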
Practice It Interactively
This scenario is one of 18 on scenar.site. You describe your debugging approach in plain English, an AI simulates the broken server with realistic command output, and plays interviewer by pushing back on your reasoning. Free tier gets you started, no credit card.