Your server is running. Your application is healthy. Your database is fast. But users can't reach you because your DNS stopped resolving 20 minutes ago and nobody noticed.
DNS is the most critical piece of infrastructure that most teams never think about monitoring. When it works, it's invisible. When it breaks, everything looks broken, and the symptoms rarely point to DNS as the cause.
Why DNS Failures Are Uniquely Dangerous
Most infrastructure failures degrade your service. DNS failures disconnect it entirely:
- Invisible blast radius: Users get "site not found" errors, which look like your company ceased to exist. They don't see a loading spinner or an error page. They see nothing.
- Caching masks the problem: DNS responses are cached at multiple layers (browser, OS, ISP). Some users are fine while others are completely blocked, which makes debugging a nightmare.
- Your own monitoring might miss it: If your uptime monitor has the old DNS record cached, it reports everything as healthy while new visitors can't connect.
Real Scenario
Seven DNS Failure Modes You Should Monitor
1. Record Deletion or Modification
Someone accidentally removes or changes a DNS record. This can happen through Terraform misconfiguration, a DNS provider UI mistake, or an overzealous cleanup script.
# What your DNS should return
api.yourapp.com. A 203.0.113.10
api.yourapp.com. AAAA 2001:db8::1
# What it returns after someone broke it
api.yourapp.com. -- NO ANSWER --
2. Propagation Failures
You update a DNS record but the change doesn't propagate to all resolvers. Different users see different results depending on their ISP's caching behavior and which authoritative server they hit.
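One way to spot this kind of drift is to collect the answers several resolvers return for the same name and group the resolvers by what they said. The sketch below assumes you already have those answers in hand; the resolver labels and the stale address `198.51.100.7` are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def find_propagation_drift(answers_by_resolver):
    """Group resolvers by the set of record values they returned.

    answers_by_resolver maps a resolver label to the list of values it
    returned. The result maps each distinct answer set (as a sorted tuple)
    to the resolvers that reported it.
    """
    groups = defaultdict(list)
    for resolver, values in answers_by_resolver.items():
        groups[tuple(sorted(set(values)))].append(resolver)
    return dict(groups)


def is_consistent(answers_by_resolver):
    """True when every resolver returned the same set of values."""
    return len(find_propagation_drift(answers_by_resolver)) <= 1


# Hypothetical observations: one resolver still serves an old address.
observed = {
    "resolver-a": ["203.0.113.10"],
    "resolver-b": ["203.0.113.10"],
    "resolver-c": ["198.51.100.7"],
}
drift = find_propagation_drift(observed)
```

More than one group in `drift` means at least one resolver disagrees with the rest, which is the signal worth alerting on while a change is rolling out.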
3. TTL Misconfigurations
A TTL set too high means resolvers can keep serving a stale record for hours or days after you change it. A TTL set too low means clients re-resolve constantly, which adds lookup latency whenever a cached entry has expired and puts more load on your DNS provider.
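A TTL check can be as simple as comparing the value against a policy range. The bounds below (60 seconds and one hour) are assumptions chosen for illustration, not recommendations from any provider; pick limits that match how quickly you need changes to take effect.

```python
# Assumed policy bounds in seconds; tune these to your own change cadence.
MIN_TTL = 60
MAX_TTL = 3600


def classify_ttl(ttl_seconds, min_ttl=MIN_TTL, max_ttl=MAX_TTL):
    """Classify a record's TTL against the assumed policy range."""
    if ttl_seconds < min_ttl:
        return "too_low"   # frequent re-resolution, more provider load
    if ttl_seconds > max_ttl:
        return "too_high"  # changes linger in caches for a long time
    return "ok"
```

For example, `classify_ttl(86400)` flags a one-day TTL as `"too_high"` under these bounds.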
4. Domain Expiration
Your domain registrar sends renewal emails to an inbox nobody checks. The domain expires. Your DNS stops resolving. Everything goes down, and recovery can take 24-72 hours if the domain enters redemption.
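Expiration is easy to check once you have the expiry date, for example from your registrar's dashboard or a WHOIS/RDAP lookup (not shown here). The sketch below only does the date arithmetic; the 30-day and 7-day thresholds are assumptions.

```python
from datetime import datetime, timezone


def days_until_expiry(expires_at, now=None):
    """Whole days left until the domain expires (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    return (expires_at - now).days


def expiry_severity(days_left, warn_days=30, critical_days=7):
    """Map days left to an alert level using the assumed thresholds."""
    if days_left < 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

Routing the `"warning"` level to a shared on-call channel rather than a single inbox addresses the failure described above.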
5. NS Record Issues
Your nameserver records (NS) point to a DNS provider you migrated away from months ago. As long as the old provider keeps serving records, it works. When they clean up stale zones, it doesn't.
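A check for this can compare the nameservers currently delegated for your zone against a list of providers you have retired. In the sketch below, the nameserver hostnames and the `old-dns.example` suffix are hypothetical, and fetching the observed NS records is left out.

```python
def normalize_host(name):
    """Lowercase a hostname and drop the trailing root dot."""
    return name.strip().rstrip(".").lower()


def find_stale_nameservers(observed_ns, retired_suffixes):
    """Return observed nameservers that belong to a retired provider."""
    retired = [normalize_host(suffix) for suffix in retired_suffixes]
    stale = []
    for ns in observed_ns:
        host = normalize_host(ns)
        # Match the suffix itself or any hostname underneath it.
        if any(host == s or host.endswith("." + s) for s in retired):
            stale.append(host)
    return stale


# Hypothetical delegation that still includes the old provider.
stale = find_stale_nameservers(
    ["NS1.old-dns.example.", "ns2.new-dns.example."],
    ["old-dns.example"],
)
```

Any non-empty result means the delegation still depends on a provider you no longer control.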
6. DNS Hijacking
An attacker changes your DNS records to point to their server. Users are unknowingly sending credentials and data to a malicious endpoint that looks exactly like yours. Monitoring that DNS records match expected values catches this.
7. Resolver Failures
Your DNS provider has an outage. This is rarer than application-level issues but has happened to major providers. Having monitoring from multiple regions using different resolvers catches provider-specific outages.
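To make "different resolvers" concrete, here is a minimal standard-library sketch that sends a raw A-record query to one resolver over UDP and reads only the response header. It assumes ASCII hostnames and ignores truncation, retries, and answer parsing, so treat it as an illustration rather than a replacement for a real DNS library. The function names are hypothetical.

```python
import socket
import struct

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(hostname, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A, 28 = AAAA)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN


def parse_header(response):
    """Return (rcode_name, answer_count) from a raw DNS response."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}"), ancount


def check_resolver(resolver_ip, hostname, timeout=2.0):
    """Query one resolver over UDP and return a short status string."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (resolver_ip, 53))
            response, _ = sock.recvfrom(512)
        except OSError:  # timeouts and network errors
            return "unreachable"
    if len(response) < 12:
        return "malformed"
    rcode, answers = parse_header(response)
    if rcode != "NOERROR":
        return rcode
    return "ok" if answers > 0 else "no_answer"


# Example usage (performs real network queries, so it is left commented out):
# for ip in ("8.8.8.8", "1.1.1.1", "9.9.9.9"):
#     print(ip, check_resolver(ip, "api.yourapp.com"))
```

If one resolver reports `"unreachable"` or `SERVFAIL` while the others return `"ok"`, the problem is likely specific to that resolver. If all of them fail, look at your authoritative provider.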
What to Monitor and How
Record Value Assertions
Don't just check that DNS resolves. Verify that it resolves to the expected value:
{
  "type": "dns",
  "hostname": "api.yourapp.com",
  "recordType": "A",
  "assertions": [
    { "type": "record.value", "operator": "contains", "value": "203.0.113.10" },
    { "type": "responseTime", "operator": "lt", "value": 500 }
  ]
}
Monitor All Critical Record Types
- A / AAAA records: Your servers' IP addresses
- CNAME records: Aliases pointing to CDNs or load balancers
- MX records: Email delivery (often forgotten until emails stop arriving)
- TXT records: SPF, DKIM, and domain verification records
- NS records: Ensure nameservers are correct after migrations
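Whatever the record type, the check reduces to comparing returned values against expectations. The sketch below evaluates the assertion format from the Record Value Assertions example against values you have already resolved. It is not upti.my's implementation: the meaning of `contains` (the expected value appears among the returned records) and of `responseTime` (milliseconds) are assumptions, and only those two operators are handled.

```python
def evaluate_assertion(assertion, record_values, response_time_ms):
    """Evaluate a single assertion; returns True when it passes."""
    kind = assertion["type"]
    operator = assertion["operator"]
    expected = assertion["value"]
    if kind == "record.value" and operator == "contains":
        # Assumed semantics: expected value is one of the returned records.
        return expected in record_values
    if kind == "responseTime" and operator == "lt":
        # Assumed unit: milliseconds.
        return response_time_ms < expected
    raise ValueError(f"unsupported assertion: {kind} {operator}")


def evaluate_check(check, record_values, response_time_ms):
    """Return the assertions that failed (an empty list means healthy)."""
    return [
        assertion
        for assertion in check["assertions"]
        if not evaluate_assertion(assertion, record_values, response_time_ms)
    ]


check = {
    "type": "dns",
    "hostname": "api.yourapp.com",
    "recordType": "A",
    "assertions": [
        {"type": "record.value", "operator": "contains", "value": "203.0.113.10"},
        {"type": "responseTime", "operator": "lt", "value": 500},
    ],
}
```

Returning the failed assertions rather than a single boolean lets an alert say which expectation was violated, such as a changed address versus a slow response.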
Don't Forget MX
Multi-Region Resolution
DNS can resolve differently depending on where you query from. Check from multiple geographic locations to catch propagation issues and region-specific resolver failures.
Setting This Up in upti.my
upti.my's DNS monitoring queries your domain from every continent and validates that the response matches your expectations:
- Create a new healthcheck and select the DNS type
- Enter your hostname and the record type to check (A, AAAA, CNAME, MX, TXT, NS)
- Add assertions for expected record values
- Configure check intervals (we recommend every 60 seconds for critical domains)
- Set up alerts for instant notification when records change unexpectedly
Because checks run from multiple regions, you'll catch propagation issues that a single-location check would miss.
📌 Key Takeaways
- DNS failures look like your site doesn't exist. There's no error page, just nothing.
- Caching makes DNS issues inconsistent and hard to debug. Some users work, some don't.
- Monitor record values, not just resolution. Catch hijacking and accidental changes.
- Check all critical record types, including MX (email) and NS (nameserver delegation).
- Multi-region DNS monitoring catches propagation failures that single-point checks miss.
- Domain expiration is a real threat. Monitor it separately from DNS records.
DNS monitoring isn't glamorous. Nobody writes blog posts about their great DNS monitoring setup. But the teams that monitor DNS are the teams that don't have 6-hour mystery outages that turn out to be a missing CNAME.