upti.my
DevOps · 6 min read

The Uptime Monitoring Checklist for 2026

A no-nonsense checklist for monitoring your production stack. Covers APIs, databases, DNS, SSL, cron jobs, background workers, and status pages. Copy it and ship it.

Most teams set up monitoring reactively, after the first outage. By then, you're patching gaps while firefighting. This checklist gives you the full picture upfront so you can ship comprehensive monitoring in an afternoon.

Not everything applies to every stack. Skip what you don't use. But read through the full list. You'll likely find blind spots you didn't know you had.

1. HTTP / API Endpoints

  • ☐ Health check endpoint: A dedicated /health or /readiness endpoint that validates application state (rather than just returning 200)
  • ☐ Core user-facing endpoint: Monitor an actual API endpoint that represents real functionality (e.g., product listing, user auth)
  • ☐ Response body validation: Assert that the response contains expected data, not just a status code
  • ☐ Response time threshold: Alert when p95 response time exceeds your SLA threshold (e.g., 2 seconds)
  • ☐ Authentication flow: If your API requires auth, monitor the full flow including token refresh
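
As a rough illustration of the body-validation and response-time items above, here is a minimal Python sketch. The function names, the JSON-object response shape, and the `required_keys` parameter are assumptions made for this example, not part of any particular monitoring tool. It also compares a single request against the threshold; a true p95 needs many samples collected over time.

```python
import json
import time
import urllib.error
import urllib.request


def evaluate_response(status, body, elapsed_s, required_keys, max_seconds=2.0):
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    if status != 200:
        failures.append(f"unexpected status {status}")
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        failures.append("body is not valid JSON")
    else:
        if not isinstance(payload, dict):
            failures.append("body is not a JSON object")
        else:
            missing = [key for key in required_keys if key not in payload]
            if missing:
                failures.append(f"missing keys: {', '.join(missing)}")
    if elapsed_s > max_seconds:
        failures.append(f"slow response: {elapsed_s:.2f}s > {max_seconds:.2f}s")
    return failures


def check_endpoint(url, required_keys, max_seconds=2.0, timeout=10.0):
    """Fetch url once and validate status, body, and latency together."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status and body worth reporting.
        status, body = exc.code, exc.read()
    except OSError as exc:
        # Covers URLError, timeouts, and refused connections.
        return [f"request failed: {exc}"]
    elapsed = time.monotonic() - start
    return evaluate_response(status, body, elapsed, required_keys, max_seconds)
```

Keeping the evaluation separate from the network call makes the rules easy to test against canned responses, including a 200 that carries the wrong data.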

2. SSL / TLS Certificates

  • ☐ Expiry monitoring: Alert at 30, 14, and 7 days before expiry on every certificate
  • ☐ All subdomains: Main site, API, CDN, status page, internal tools. Each may have separate certificates
  • ☐ Chain validation: Verify the full certificate chain is valid, including intermediate certificates
  • ☐ Post-renewal verification: After auto-renewal, check that the new certificate is actually being served
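
One way to approximate the expiry and post-renewal checks above with only the Python standard library is sketched below. The function names and the default thresholds tuple are illustrative assumptions. Because `ssl.create_default_context()` verifies the chain and hostname during the handshake, a broken chain surfaces as an exception from `fetch_not_after`.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the notAfter string of the certificate the server is serving now."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Convert a notAfter string such as 'Jan  1 00:00:00 2030 GMT' to days left."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def crossed_threshold(days_left, thresholds=(30, 14, 7)):
    """Return the smallest threshold that days_left is at or below, else None."""
    crossed = [t for t in thresholds if days_left <= t]
    return min(crossed) if crossed else None
```

Running `fetch_not_after` against each subdomain after a renewal shows the date of the certificate actually being served, which is a different fact from the one written to disk by the renewal job.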

3. DNS Records

  • ☐ A/AAAA records: Verify your domain resolves to the expected IP addresses
  • ☐ CNAME records: Check aliases for CDN, load balancer, and third-party services
  • ☐ MX records: Ensure email delivery configuration is correct
  • ☐ Multi-region resolution: Check from multiple locations to catch propagation issues
  • ☐ Domain expiration: Know when your domain registration expires
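
The A/AAAA item above could be sketched as follows, assuming you keep a list of expected addresses somewhere. The function names are hypothetical. This version uses the local system resolver, so it does not cover CNAME or MX records or multi-region resolution; those need a dedicated DNS library or external probes.

```python
import socket


def resolve_addresses(hostname):
    """Return the set of IPv4/IPv6 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def compare_addresses(resolved, expected):
    """Report expected addresses that are missing and resolved ones that are unexpected."""
    resolved, expected = set(resolved), set(expected)
    return {
        "missing": sorted(expected - resolved),
        "unexpected": sorted(resolved - expected),
    }
```

An alert would fire whenever either list in the result is non-empty, for example `compare_addresses(resolve_addresses("example.com"), expected_ips)`, where `expected_ips` is your own record of what the domain should point to.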

4. Databases & Data Stores

  • ☐ Connection check: TCP monitor to your database port from your application network
  • ☐ Query validation: A lightweight health query that confirms the database is accepting and processing queries
  • ☐ Replica lag: If using replicas, monitor replication delay
  • ☐ Cache connectivity: Redis, Memcached, or whatever cache layer you use

TCP Monitoring

For databases that don't expose HTTP endpoints, TCP monitoring is your best bet. A successful TCP connection to port 5432 (PostgreSQL) or 3306 (MySQL) confirms the service is accepting connections.
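
A TCP check of this kind can be as small as the following Python sketch. The function name and default timeout are assumptions for illustration. Note that it only proves the port accepts connections; it says nothing about whether queries succeed, which is what the query validation item covers.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts, and unreachable hosts all land here.
        return False
```

For example, `tcp_port_open("db.internal", 5432)` would probe a PostgreSQL instance, where `db.internal` is a placeholder hostname.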

5. Background Jobs & Cron Tasks

  • ☐ Heartbeat monitoring: Each scheduled job should ping a heartbeat endpoint after successful completion
  • ☐ Expected schedule: Alert when a job doesn't run within its expected window
  • ☐ Duration monitoring: Catch jobs that start running significantly longer than normal
  • ☐ Queue depth: Monitor background job queue sizes (growing queues mean workers aren't keeping up)
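
The heartbeat pattern above might look like this in Python. All names here are hypothetical, and the heartbeat URL is whatever your monitoring tool issues. The important property is that the ping happens only after the job finishes without raising, so a crashed or skipped job shows up as a missing heartbeat.

```python
import time
import urllib.request


def ping_heartbeat(url, timeout=10.0):
    """Send the heartbeat with a plain GET request."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(job, ping):
    """Run job(); call ping(duration_seconds) only if job() did not raise."""
    start = time.monotonic()
    result = job()  # an exception here skips the ping on purpose
    ping(time.monotonic() - start)
    return result


def is_overdue(last_ping_at, interval_s, grace_s, now):
    """Monitor-side check: has the job missed its expected window?"""
    return now - last_ping_at > interval_s + grace_s
```

A cron entry point could then call `run_with_heartbeat(nightly_export, lambda _duration: ping_heartbeat(HEARTBEAT_URL))`, where `nightly_export` and `HEARTBEAT_URL` are placeholders. Passing the duration to the ping callback leaves room for duration monitoring later.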

6. Third-Party Dependencies

  • ☐ Payment provider: Stripe, PayPal, or your payment gateway's API
  • ☐ Auth provider: If using Auth0, Firebase Auth, Clerk, etc.
  • ☐ Email service: SendGrid, Postmark, or your transactional email provider
  • ☐ Cloud storage: S3, GCS, or wherever you store user files

💡 Why Monitor Third Parties?

You can't fix their outages, but you can detect them before your customers report them. This is the difference between "we know and are working on it" and "wait, what?"

7. gRPC & Internal Services

  • ☐ gRPC health check protocol: Use the standard grpc.health.v1 service for service-level health
  • ☐ Key RPC methods: Monitor actual RPC calls that represent critical functionality
  • ☐ Service mesh health: If using a service mesh, monitor inter-service connectivity

8. Status Pages & Incident Communication

  • ☐ Public status page: Automated from your monitors so customers always see real-time status
  • ☐ Private status page: For internal teams to see the full picture including internal services
  • ☐ Incident templates: Pre-written templates for common incident types so updates go out fast
  • ☐ Subscriber notifications: Email or webhook notifications for status page subscribers

9. Alerting & Escalation

  • ☐ Multiple channels: At minimum, Slack/Discord for fast visibility and email for audit trails
  • ☐ Escalation policy: If nobody acknowledges within 15 minutes, alert the next person
  • ☐ Alert grouping: Related failures should create one incident, not 50 separate alerts
  • ☐ Maintenance windows: Suppress alerts during planned maintenance
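
To make the escalation and grouping items above concrete, here is a small Python sketch. The alert shape (a dict with `service` and `at` keys), the five-minute grouping window, and the function names are assumptions for this example; most alerting tools provide both features as configuration.

```python
def escalation_target(chain, opened_at, now, acknowledged, step_s=15 * 60):
    """Return who should be alerted now, or None once someone has acknowledged."""
    if acknowledged:
        return None
    level = int((now - opened_at) // step_s)
    # Stay on the last person in the chain once it is exhausted.
    return chain[min(level, len(chain) - 1)]


def group_alerts(alerts, window_s=300):
    """Fold alerts for the same service within window_s seconds into one incident."""
    incidents = []
    open_by_service = {}
    for alert in sorted(alerts, key=lambda a: a["at"]):
        incident = open_by_service.get(alert["service"])
        if incident is None or alert["at"] - incident["started_at"] > window_s:
            incident = {
                "service": alert["service"],
                "started_at": alert["at"],
                "alerts": [],
            }
            incidents.append(incident)
            open_by_service[alert["service"]] = incident
        incident["alerts"].append(alert)
    return incidents
```

With a chain of `["alice", "bob"]`, the first 15 minutes of an unacknowledged incident go to `alice`, and everything after that goes to `bob`.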

10. Self-Healing (Bonus Round)

  • ☐ Automated restarts: For known transient failures that are fixed by restarting
  • ☐ Guard rails: Rate limits on auto-recovery (e.g., max 3 attempts per hour) and a kill switch
  • ☐ Post-recovery verification: Automated check that the recovery actually worked
  • ☐ Audit log: Full log of what self-healed and why, reviewed weekly
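
The guard rails, verification, and audit log items above can be combined in one small wrapper. The sketch below is illustrative only: `RecoveryGuard`, its parameters, and the outcome strings are invented for this example, and `restart` and `verify` stand in for whatever your platform uses to restart a service and re-run its health check.

```python
import time
from collections import deque


class RecoveryGuard:
    """Rate-limited auto-recovery with a kill switch and an audit trail."""

    def __init__(self, max_attempts=3, window_s=3600, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.clock = clock
        self.enabled = True      # kill switch: set to False to stop all auto-recovery
        self.attempts = deque()  # timestamps of recent recovery attempts
        self.audit_log = []      # one entry per decision, including skipped ones

    def try_recover(self, reason, restart, verify):
        now = self.clock()
        # Forget attempts that have aged out of the sliding window.
        while self.attempts and now - self.attempts[0] >= self.window_s:
            self.attempts.popleft()
        if not self.enabled:
            outcome = "skipped: kill switch"
        elif len(self.attempts) >= self.max_attempts:
            outcome = "skipped: rate limit"
        else:
            self.attempts.append(now)
            restart()
            # Post-recovery verification: only report success if the check passes.
            outcome = "recovered" if verify() else "restart did not help"
        self.audit_log.append({"at": now, "reason": reason, "outcome": outcome})
        return outcome
```

Any outcome other than "recovered" is a natural point to page a human, since it means automation has either been stopped or has not fixed the problem.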

📌 Key Takeaways

  1. Monitor the full stack: APIs, SSL, DNS, databases, cron jobs, and third-party deps
  2. Validate responses, not just availability. A 200 with wrong data is still broken
  3. SSL and DNS monitoring prevent the hardest-to-debug outages
  4. Background jobs need heartbeat monitoring. They fail silently by default
  5. Set up status pages before your first incident, not during it
  6. Self-healing is the bonus level. Start with solid monitoring first

You don't need to set up everything on this list today. Start with HTTP, SSL, and DNS monitoring for your critical services. Then add heartbeat monitoring for cron jobs. Then expand from there. The key is starting with a complete picture of what needs monitoring, then systematically filling in the gaps.