The Node.js bug that's invisible to your monitoring

Your health check is a single line. res.send('ok'). It used to take a millisecond. Then traffic ramped up one afternoon and p99 went to 400ms, and you spent the next three hours staring at dashboards that all said the same thing, which was nothing.

CPU is moderate, event loop lag is flat, memory looks healthy, and your APM is reporting that the request took 400ms while telling you nothing about why. No slow database spans, no slow downstream calls, no errors or GC pauses. The time was spent in a place your APM can't see.

The place is the libuv thread pool. Standard Node observability is built around the event loop, and the pool is a different queue with different occupants, sitting just out of reach of every dashboard you have.

What lives on the pool

Node's event loop runs your JavaScript. Certain built-in operations that would otherwise block the loop, because they're CPU-heavy or because they rely on a blocking call that has no real async kernel variant, get pushed to a separate pool of OS threads inside libuv. Your own heavy JavaScript is not among them; it still runs on the loop. That pool defaults to four threads. Four, total, for the whole process, regardless of how many cores the machine has.

What runs there is more than people expect. The obvious ones are fs.readFile, fs.writeFile, and the rest of fs, because most filesystems on Linux don't have a real async API at the syscall level and libuv emulates it with blocking calls on worker threads. Crypto goes there too, including pbkdf2, scrypt, randomBytes in its async form, and anything heavy enough that a sync version would block the loop. zlib runs gzip and deflate on the pool. And dns.lookup, which most apps use without thinking, hits the pool through getaddrinfo, which is a blocking call.

That last one trips people. There are two DNS APIs in Node. dns.lookup looks async and behaves async to your code, but the underlying getaddrinfo is a blocking C library call, so libuv runs it on the pool. dns.resolve uses c-ares directly, which is a real async resolver and doesn't touch the pool at all. Almost all code uses dns.lookup because it's the default everywhere: http.get, https.request, the standard agent, the pg and mysql drivers, all of it goes through getaddrinfo. So your outbound traffic and your database calls are quietly taking pool slots for name resolution. You didn't write any of that code, and your APM has no concept of it.
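To make the difference concrete, here is a minimal sketch of the two paths using the promise-based dns API. The helper names viaLookup and viaResolve are made up for illustration, and the sketch only covers IPv4 on the resolve side.

```javascript
const dns = require('node:dns').promises;

// dns.lookup: runs getaddrinfo on a libuv pool thread and follows the OS
// resolver configuration, including /etc/hosts.
async function viaLookup(hostname) {
  const { address, family } = await dns.lookup(hostname);
  return { address, family };
}

// dns.resolve4: sends DNS queries through c-ares using non-blocking sockets,
// so it never occupies a pool thread. It does not consult /etc/hosts.
async function viaResolve(hostname) {
  const addresses = await dns.resolve4(hostname); // array of IPv4 strings
  return addresses;
}
```

Both return promises, so from the calling code they look equally async. Only the first one stands in the pool queue.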

The starvation pattern

Four threads is enough for almost everything until it isn't. The classic trigger is bcrypt at login.

A bcrypt.hash(password, 12) call takes roughly 250ms on commodity hardware and runs entirely on the pool. If four people log in at the same time, all four slots are busy for 250ms. The fifth request hits a queue and waits until one of the four finishes before its hash even starts, so it pays 250ms of queue time before it pays its own 250ms of work, and you've turned a 250ms login into a 500ms one without doing anything extra. Scale that to a hundred concurrent logins and the last person waits something like six seconds, none of which shows up in CPU charts, none of which shows up in event loop lag, all of which is correctly attributable to a libuv pool queue that nobody is graphing.
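The arithmetic above fits in a few lines. This is a deliberately simplified model, assuming every task takes the same time and all requests arrive at once; poolTiming is a hypothetical helper, not part of any library.

```javascript
// Simplified model: every task takes taskMs, the pool runs poolSize tasks at
// once, and tasks start in arrival order. `position` is the 1-based place of
// a request within a burst that arrives all at the same moment.
function poolTiming(position, poolSize, taskMs) {
  const wavesAhead = Math.floor((position - 1) / poolSize);
  const queuedMs = wavesAhead * taskMs; // time spent parked in the queue
  return { queuedMs, finishedMs: queuedMs + taskMs };
}

console.log(poolTiming(5, 4, 250));   // { queuedMs: 250, finishedMs: 500 }
console.log(poolTiming(100, 4, 250)); // { queuedMs: 6000, finishedMs: 6250 }
```

The fifth login pays one full wave of waiting, and the hundredth pays twenty-four of them.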

The second classic is large gzip. A 50MB response body compressed with zlib.gzip takes a noticeable fraction of a second even on fast hardware, and during that time it's holding a pool slot. If you have a route that does this and it gets hit four times concurrently, every other pool consumer in the process is queued behind those compressions. Unrelated fs.readFile calls wait, bcrypts in other handlers queue behind them, DNS lookups join the back of the line. The whole pool is jammed by something that has nothing to do with the requests that are slow.

And the third, the one nobody suspects, is the DNS issue from earlier. Under heavy outbound traffic, every fresh HTTP connection your service makes is a getaddrinfo call on the pool. Most of the time those are fast, often answered from a local cache, but a DNS hiccup in your network stack, or a misconfigured TTL on an internal service, can put four resolutions in flight at once and stall the rest of the pool for hundreds of milliseconds at a stretch. Then your bcrypts and gzips sit waiting on name resolution, which nobody thinks of as competing with crypto for the same threads.

Why the dashboards lie

A request stuck on the pool queue isn't doing anything visible. It's not on CPU, it isn't running JavaScript, and it isn't in the event loop. It's parked, waiting for libuv to dequeue it, and libuv doesn't ship a counter for that queue depth.
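Since there is no built-in counter, one workaround is to measure the queue indirectly: submit a trivial pool task and time how long it takes to come back. This is a rough sketch, not a standard tool. probePoolWaitMs and startPoolProbe are hypothetical names, and it assumes the async form of crypto.randomBytes is cheap enough that almost all of the measured time is queue wait.

```javascript
const crypto = require('node:crypto');
const { performance } = require('node:perf_hooks');

// Time one trivial pool task. The work itself is tiny, so the measured time
// is mostly waiting for a free pool thread (plus any event loop delay before
// the callback runs).
function probePoolWaitMs() {
  return new Promise((resolve, reject) => {
    const start = performance.now();
    crypto.randomBytes(1, (err) => {
      if (err) return reject(err);
      resolve(performance.now() - start);
    });
  });
}

// Hypothetical sampler: calls report(ms) every intervalMs and returns a
// function that stops it. Failed probes are skipped.
function startPoolProbe(report, intervalMs = 1000) {
  const timer = setInterval(() => {
    probePoolWaitMs().then(report, () => {});
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for the probe
  return () => clearInterval(timer);
}

// Example wiring (metrics.gauge is a placeholder for whatever you use):
// const stop = startPoolProbe((ms) => metrics.gauge('pool_wait_ms', ms));
```

The probe joins the same queue as everything else, so keep the interval modest. If this number climbs while event loop lag stays flat, the pool is where the time is going.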

This is the gap. Every tool you have is measuring CPU, event loop, or specific spans you instrumented. The event loop lag metric, the one everyone watches, is genuinely useful but measures the wrong thing here, because the event loop is fine. It's processing other requests at full speed, running timers, accepting new connections. The work that's slow isn't on the loop at all. The metric was designed to catch a different class of bug, where your own JS code is heavy enough to block the loop directly, and against that class it works. Against pool starvation it returns "everything is great" while a queue piles up two function calls over.
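You can reproduce the mismatch on a laptop. The sketch below saturates the pool with crypto.pbkdf2, used here as a stand-in for bcrypt because it ships with Node, and records event loop delay with perf_hooks.monitorEventLoopDelay while the hashes run. The function names and the iteration count are arbitrary choices for illustration.

```javascript
const crypto = require('node:crypto');
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// One pool-bound hash; resolves with how long the caller waited for it.
function hashOnce() {
  return new Promise((resolve, reject) => {
    const start = performance.now();
    crypto.pbkdf2('password', 'salt', 100000, 64, 'sha512', (err) => {
      if (err) return reject(err);
      resolve(performance.now() - start);
    });
  });
}

// Fire `concurrency` hashes at once and sample event loop delay meanwhile.
async function loopDelayUnderPoolLoad(concurrency = 16) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const latencies = await Promise.all(
    Array.from({ length: concurrency }, () => hashOnce())
  );
  histogram.disable();
  return {
    fastestHashMs: Math.min(...latencies),
    slowestHashMs: Math.max(...latencies),
    // Histogram values are reported in nanoseconds.
    loopDelayP99Ms: histogram.percentile(99) / 1e6,
  };
}

loopDelayUnderPoolLoad().then(console.log);
```

With the default pool size you would typically expect the slowest hash to take several times as long as the fastest, because it waited behind earlier waves, while the loop delay figure stays close to what it reads on an idle process. Exact numbers depend on the machine.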

APMs make this worse because they create the illusion of completeness. You see request spans, DB spans, slow query traces, error rates, all of it broken down, and the natural conclusion is that if nothing in the breakdown is slow, the request must be slow somewhere outside the application, maybe the network or the load balancer. So you go look there, and there's nothing there either, and you come back and the request is still 400ms. The instinct then is to assume the monitoring is right and the slowness must be a phantom, when actually the monitoring is silent about the slice of the runtime where the time is going.

The signature

Latency without lag is the signature.

If your p99 is climbing under load and the event loop lag metric stays flat, the suspect list is short. Either the request is waiting on something external, or it's waiting on the pool. External you can usually rule out with span traces and network logs, and once you have, the pool is what's left.

The cleanest way to confirm is to bump UV_THREADPOOL_SIZE and see what happens. Set it to 64 in your environment, restart the process, run the same load profile, and watch the latency. If p99 drops noticeably, your pool was queueing and now isn't. If p99 doesn't move, your pool wasn't the bottleneck and you can look elsewhere. It's a ten-minute A/B that rules out half the possible causes, and most teams never run it because the pool isn't in their mental model of where time goes.
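The same A/B can be tried in miniature before touching a real service. This sketch runs an identical burst of pbkdf2 calls in two child processes that differ only in UV_THREADPOOL_SIZE. The variable is set in the child's environment at launch because libuv reads it when the pool is first created. The workload, the helper name measureWithPoolSize, and the sizes are assumptions made for the demo.

```javascript
const { execFileSync } = require('node:child_process');

// Script run in each child: 16 concurrent pbkdf2 calls, then print the
// wall-clock milliseconds until the last one finished.
const childScript = `
const crypto = require('node:crypto');
const start = process.hrtime.bigint();
let pending = 16;
for (let i = 0; i < 16; i++) {
  crypto.pbkdf2('password', 'salt', 50000, 64, 'sha512', (err) => {
    if (err) throw err;
    if (--pending === 0) {
      console.log(Number(process.hrtime.bigint() - start) / 1e6);
    }
  });
}
`;

// Run the burst in a fresh Node process with the given pool size.
function measureWithPoolSize(size) {
  const output = execFileSync(process.execPath, ['-e', childScript], {
    env: { ...process.env, UV_THREADPOOL_SIZE: String(size) },
    encoding: 'utf8',
  });
  return Number(output.trim());
}

const baseline = measureWithPoolSize(4);
const widened = measureWithPoolSize(64);
console.log({ baseline, widened });
```

Because this toy workload is pure CPU, the gain tops out at the number of cores on the machine. On a real service, where much of the pool traffic is DNS and fs calls that mostly wait, a wider pool can help well beyond the core count.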

The other diagnostic is to grep your code and your dependencies for the usual suspects. Any bcrypt, argon2, crypto.pbkdf2, crypto.scrypt, dns.lookup-based name resolution, large zlib operations, or heavy fs work on a hot path is a candidate. If you find one and it runs on every request, you have your answer before you even run the experiment.

The fixes

Bumping the pool is the cheapest move and usually the right first one. The default of four is a 2010-era number from when machines had two cores and SSDs were exotic, and it's almost always too small for a modern Node service. Sixteen to sixty-four is reasonable for most workloads. The pool threads are cheap when idle, so the cost of overprovisioning is small, and the cost of underprovisioning is the thing this whole post is about.

The hard ceiling is high (1024 in current Node), but you almost never want to go that far. The pool shares CPU with your event loop, and if you have hundreds of threads all wanting to run, the kernel scheduler time-slices them against the thread running your JS, so the latency you took out of the pool queue can come back as event loop lag. Pick a number that gives headroom for the burst sizes you actually see, not the worst case you can imagine.

The deeper fix is to get the worst offenders off the pool entirely. bcrypt should run in a worker thread pool you control, sized to your CPU count, not borrowed from libuv. A small fixed pool of worker_threads, two or four workers on a typical instance, handles auth load cleanly as long as each worker calls the synchronous variant of the hash. The async variant called from inside a worker still queues on the same process-wide libuv pool. Done that way, bcrypt stops competing with fs and DNS for slots. Large zlib work should use the streaming variants where you can, because createGzip doesn't sit on a single slot for the whole compression; it processes chunks and releases the slot between them.
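Here is a minimal sketch of such a worker pool. It uses crypto.scryptSync as a stand-in for bcrypt so the example runs without installing anything. With the bcrypt package, the synchronous hash call would take its place. HashPool and its methods are hypothetical names, and the sketch leaves out things a production pool needs, such as replacing crashed workers and applying backpressure.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body: runs the synchronous KDF so the work stays on the worker's
// own thread instead of being queued on the libuv pool.
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  const crypto = require('node:crypto');
  parentPort.on('message', ({ password, salt }) => {
    const hash = crypto.scryptSync(password, salt, 32).toString('hex');
    parentPort.postMessage(hash);
  });
`;

class HashPool {
  constructor(size) {
    this.queue = [];          // jobs waiting for a free worker
    this.idle = [];           // workers with nothing to do
    this.running = new Map(); // worker -> job it is currently hashing
    this.workers = Array.from({ length: size }, () => {
      const worker = new Worker(workerSource, { eval: true });
      worker.on('message', (hash) => this.finish(worker, null, hash));
      worker.on('error', (err) => this.finish(worker, err));
      return worker;
    });
    this.idle.push(...this.workers);
  }

  // Queue a hash and resolve with the hex digest once a worker has done it.
  hash(password, salt) {
    return new Promise((resolve, reject) => {
      this.queue.push({ password, salt, resolve, reject });
      this.dispatch();
    });
  }

  // Hand queued jobs to idle workers, one job per worker at a time.
  dispatch() {
    while (this.idle.length > 0 && this.queue.length > 0) {
      const worker = this.idle.pop();
      const job = this.queue.shift();
      this.running.set(worker, job);
      worker.postMessage({ password: job.password, salt: job.salt });
    }
  }

  finish(worker, err, hash) {
    const job = this.running.get(worker);
    this.running.delete(worker);
    if (job) {
      if (err) job.reject(err);
      else job.resolve(hash);
    }
    // A worker that raised an error has exited, so it is not reused.
    if (!err) this.idle.push(worker);
    this.dispatch();
  }

  close() {
    return Promise.all(this.workers.map((worker) => worker.terminate()));
  }
}
```

A login handler would create one HashPool at startup and await pool.hash(password, salt) per request. The queue still exists, but it is now your queue: you can size it, measure it, and keep it away from fs and DNS.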

For DNS, the move is to switch to dns.resolve for anything you control, or to set up a DNS cache layer like cacheable-lookup in your HTTP agent. Both bypass the getaddrinfo path and stop your outbound traffic from competing with crypto for slots. Most HTTP clients have a way to inject a lookup function, and most projects never use it.
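As an illustration of the injection point, this is a bare-bones lookup replacement built on dns.resolve4. The function name resolveLookup is made up, it handles IPv4 only, and it does no caching, so treat it as a sketch of the shape rather than something to ship. One trade-off to check first: dns.resolve does not read /etc/hosts, so names that only exist there will stop resolving.

```javascript
const dns = require('node:dns');
const net = require('node:net');

// Drop-in for the `lookup` option of http/https/net. Follows the dns.lookup
// callback contract: (err, address, family), or (err, [{ address, family }])
// when options.all is set.
function resolveLookup(hostname, options, callback) {
  if (typeof options === 'function') {
    callback = options;
    options = {};
  }

  // IP literals need no resolution at all.
  const ipVersion = net.isIP(hostname);
  if (ipVersion !== 0) {
    process.nextTick(() => {
      if (options.all) callback(null, [{ address: hostname, family: ipVersion }]);
      else callback(null, hostname, ipVersion);
    });
    return;
  }

  // c-ares path: no getaddrinfo, no pool slot.
  dns.resolve4(hostname, (err, addresses) => {
    if (err) return callback(err);
    if (options.all) {
      callback(null, addresses.map((address) => ({ address, family: 4 })));
    } else {
      callback(null, addresses[0], 4);
    }
  });
}

// Example wiring for a single request:
// http.get({ host: 'example.com', path: '/', lookup: resolveLookup }, onResponse);
```

A library like cacheable-lookup covers the parts this leaves out, including caching and IPv6.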

fs is harder to dodge because there's no real alternative on Linux outside of io_uring, and Node's support for it is still experimental. The practical answer is the same as for everything else: don't do filesystem work on a hot request path, batch where you can, prefer streams to avoid holding a slot for the duration of a large file, and keep per-request file work small.
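For the streaming advice, a minimal sketch looks like this. It reads a file in chunks, gzips it in chunks, and writes to any writable stream, which in an HTTP handler would be the response object. streamGzipped is a hypothetical helper name.

```javascript
const fs = require('node:fs');
const zlib = require('node:zlib');
const { pipeline } = require('node:stream/promises');

// Read, compress, and write chunk by chunk. Each chunk borrows a pool thread
// briefly instead of one call holding a thread for the whole file.
async function streamGzipped(sourcePath, destination) {
  await pipeline(
    fs.createReadStream(sourcePath),
    zlib.createGzip(),
    destination
  );
}

// In a request handler this might be used as:
// res.setHeader('Content-Encoding', 'gzip');
// await streamGzipped('/path/to/report.csv', res);
```

pipeline also takes care of tearing down all three streams if any of them fails, which is easy to get wrong with manual pipe calls.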

When to suspect this

The shape to watch for is latency that rises with concurrency while four things hold: CPU stays moderate, event loop lag stays flat, no slow span appears in the trace, and the slowness clusters on routes that touch auth, compression, files, or DNS. If three of those four boxes are checked, it's the pool until proven otherwise.

The reason this post exists is that nothing in standard Node monitoring will lead you here. The metric you'd want, libuv pool queue depth, isn't exposed by default. The metric everyone watches, event loop lag, will positively tell you it isn't the loop while saying nothing about the pool. You have to know the pool exists and that your code is sitting on it behind a queue, and once you do, the diagnostic and the fix are both fast. Until then, you're staring at green dashboards while users wait.