A single probe saying "down" shouldn't wake you at 3am

Most uptime monitors work the same way: one probe somewhere checks your site, and if that probe can't reach it, you get paged. I ran tools like that for years. Maybe half the "down" alerts I got at 3am were the probe's own network having a bad minute — not my site.

A check from a single location is one machine's opinion at one moment. A local routing hiccup, a flaky peering link, a momentary DNS blip on the probe's side: from one vantage point they all look exactly like "your site is down."

So when I built SonarOps, I made the confirmation cascade across regions. Checks run every 60 seconds. When one probe sees an outage, a second probe in a different region (probes run in the EU and the USA) re-checks before any alert fires. Second probe also can't reach you? It's real, and you get paged. It can? The first probe just had a bad moment, and you stay asleep.

It isn't a vote or a quorum. It's a cascade: one probe raises a hand, a second one somewhere else confirms before I bother you.
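
Here is a minimal sketch of that cascade in Python. It is an illustration, not SonarOps's actual code: the names should_page, first_probe, and confirm_probe are hypothetical, and the two stub probes stand in for real regional checks.

```python
from typing import Callable

# Hypothetical probe type: takes a URL, returns True if the site answered.
Probe = Callable[[str], bool]


def should_page(url: str, first_probe: Probe, confirm_probe: Probe) -> bool:
    """Page only when a probe in another region confirms the failure."""
    if first_probe(url):
        return False  # first probe reached the site; no confirmation needed
    # First probe failed: ask a probe in a different region before alerting.
    return not confirm_probe(url)


# Stub probes for illustration only (assumed, not real network checks).
def eu_probe_with_bad_network(url: str) -> bool:
    return False  # pretends its local network is having a bad minute


def us_probe_healthy(url: str) -> bool:
    return True  # reaches the site fine


if __name__ == "__main__":
    # EU probe fails but the US probe succeeds, so nobody gets paged.
    print(should_page("https://example.com", eu_probe_with_bad_network, us_probe_healthy))
```

The confirming probe only runs after a first failure, so in the healthy case the cost is still one check per cycle.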

The tradeoff is real: the confirmation step adds a second or two of detection delay. I'll take it. I'd rather hear about a real outage two seconds later than get woken for a phantom one that fixes itself before I've opened the laptop.

Three things I learned building this:

Retrying on the same probe isn't enough. If the probe's local network is the problem, a retry from the same place just confirms its bad minute.
Geography beats count. Two probes in one datacenter aren't independent. Two in different regions are.
Most false pages are network, not server. Once I cross-checked across regions, the 3am noise basically went away.
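
To put rough numbers on the first two points, here is a back-of-the-envelope sketch under a deliberately simple model. It assumes every false failure comes from trouble local to the probe's location, that two checks from the same location share that trouble completely, and that two regions fail independently. The function names and the probability you pass in are illustrative assumptions, not measurements.

```python
CHECKS_PER_DAY = 24 * 60  # one check every 60 seconds


def false_page_rate(p_local: float, same_location: bool) -> float:
    """Chance that a healthy site still triggers a page in one check cycle.

    p_local is the assumed probability that a location's own network
    makes a healthy site look down during a given check.
    """
    if same_location:
        # Retry or second probe shares the same fault, so it confirms the bad minute.
        return p_local
    # Independent regions: both must hit local trouble in the same cycle.
    return p_local * p_local


def expected_false_pages_per_day(p_local: float, same_location: bool) -> float:
    """Expected number of false pages per day under the model above."""
    return false_page_rate(p_local, same_location) * CHECKS_PER_DAY
```

With an assumed p_local of 0.001, this model gives about 1.44 false pages a day when the confirmation comes from the same location, and about 0.0014 a day when it comes from another region. Real networks are messier than this, but the gap shows why where the second check runs matters more than how many checks there are.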

Full version — how cross-region monitoring works and where single-probe checks break: Multi-location monitoring guide.

And if you just want to play with the numbers, I keep a few free calculators (no signup) — uptime SLA, downtime cost, error budget: sonarops.it/tools.