Architecture · 12 min read

Designing a Heartbeat Monitoring System

Technical deep-dive into building a dead man's switch for scheduled tasks. Architecture patterns and edge cases.

Heartbeat monitoring is conceptually simple: a job pings you when it runs, and you alert if the ping doesn't arrive. But the implementation has subtle edge cases that trip up even experienced engineers.

The Core Concept

A heartbeat monitor is a "dead man's switch": it alerts when something doesn't happen. Traditional monitoring watches for events; heartbeat monitoring watches for the absence of events.

concept.txt

Traditional: Alert when X happens
Heartbeat:   Alert when X doesn't happen

Examples:
- Backup job didn't run
- Worker stopped processing
- Scheduled report wasn't generated

Architecture Overview

A heartbeat monitoring system has four components:

  1. Ping Ingestion. HTTP endpoints that receive pings.
  2. State Storage. Tracks last ping time per monitor.
  3. Scheduler. Checks for missing pings on schedule.
  4. Alerting. Notifies when pings are late.
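The first two components can be sketched in a few lines. This is a minimal in-memory illustration, not upti.my's implementation; the names `monitors` and `recordPing` are made up for the example, and a real system would persist state rather than hold it in a `Map`.

```typescript
// Minimal ping ingestion + state storage sketch (in-memory; illustrative only).
interface MonitorState {
  lastPingAt: Date | null;
  status: "healthy" | "late" | "down";
}

const monitors = new Map<string, MonitorState>();

function recordPing(token: string, now: Date = new Date()): boolean {
  const state = monitors.get(token);
  if (!state) return false; // unknown token: reject the ping
  state.lastPingAt = now;
  state.status = "healthy";
  return true;
}

// Usage: register a monitor, then simulate a ping arriving.
monitors.set("backup-job", { lastPingAt: null, status: "down" });
recordPing("backup-job");
```

The scheduler and alerting components then only need to read this state on a timer, which keeps the ingestion path cheap.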

Data Model

types.ts

// Illustrative aliases: a cron string like "0 2 * * *", or a fixed interval in seconds
type CronExpression = string;
type IntervalSeconds = number;

interface HeartbeatMonitor {
  id: string;
  name: string;
  
  // Schedule: when pings are expected
  schedule: CronExpression | IntervalSeconds;
  gracePeriod: number; // seconds to wait before alerting
  
  // State
  lastPingAt: Date | null;
  status: 'healthy' | 'late' | 'down';
  
  // Alerting
  alertChannels: string[];
  alertedAt: Date | null; // prevent duplicate alerts
}

Edge Cases

1. Grace Periods

Jobs don't run at exactly the same time every day. A 5-minute job might take 7 minutes on a busy day. Grace periods absorb natural variation.

grace-period.txt

Expected ping: 02:00:00 UTC
Grace period:  5 minutes
Alert fires:   02:05:00 UTC (if no ping)

Too short: false alarms on slow days
Too long:  late detection of real failures
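The deadline arithmetic above reduces to a few lines. A minimal sketch, assuming the expected ping time has already been computed from the schedule (`Expectation`, `alertDeadline`, and `isLate` are illustrative names):

```typescript
// A monitor is "late" only when no ping has arrived for the current
// cycle AND the grace period has fully elapsed.
interface Expectation {
  expectedAt: Date;      // when the ping was due
  gracePeriodSec: number;
}

function alertDeadline(e: Expectation): Date {
  return new Date(e.expectedAt.getTime() + e.gracePeriodSec * 1000);
}

function isLate(e: Expectation, lastPingAt: Date | null, now: Date): boolean {
  const pinged = lastPingAt !== null && lastPingAt >= e.expectedAt;
  return !pinged && now >= alertDeadline(e);
}
```

Note that a ping from a *previous* cycle doesn't count: only pings at or after `expectedAt` clear the current deadline.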

2. Timezone Handling

Cron expressions are timezone-sensitive. "Run at 2 AM" means different times depending on whether that's UTC, America/New_York, or Europe/London.

timezone.ts

// Store timezone with the monitor
schedule: "0 2 * * *",
timezone: "America/New_York"

// Convert to UTC for scheduling
nextExpectedPing = cronToUtc(schedule, timezone);

// DST edge case: 2 AM might not exist (spring forward)
// or happen twice (fall back)
3. Daylight Saving Time

DST creates two nasty edge cases:
  • Spring forward: 2:00 AM doesn't exist. A job scheduled for 2:30 AM may or may not run.
  • Fall back: 2:00 AM happens twice. The job might run twice, or once, or at an unexpected offset.

Solution: Store schedules in UTC internally, only convert for display.
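The spring-forward gap is easy to demonstrate with the standard `Intl` API: formatting two UTC instants one minute apart shows local time jumping from 01:59 straight to 03:00 in America/New_York (2024-03-10 was that year's transition date in the US).

```typescript
// Show that 2:00-2:59 AM local time does not exist on the
// spring-forward date in America/New_York.
const fmt = new Intl.DateTimeFormat("en-US", {
  timeZone: "America/New_York",
  hour: "2-digit",
  minute: "2-digit",
  hour12: false,
});

const before = new Date("2024-03-10T06:59:00Z"); // 01:59 EST (UTC-5)
const after = new Date("2024-03-10T07:00:00Z");  // one minute later: 03:00 EDT (UTC-4)

console.log(fmt.format(before), "->", fmt.format(after));
```

This is why scheduling against UTC instants, rather than local wall-clock times, sidesteps both the nonexistent-hour and repeated-hour cases.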

4. Missed Checks During Outages

If your monitoring system itself is down, you miss the window to detect a late ping. When you come back up, the job already ran (or didn't), and you don't know which.

outage-scenario.txt

Timeline:
02:00 - Job should run
02:05 - Grace period expires (would have alerted)
02:10 - Monitoring system was down during 02:00-02:10
02:11 - System comes back up

Question: Did the job run at 02:00?

Options:
A) Assume it ran (optimistic) - risky
B) Assume it didn't (pessimistic) - noisy
C) Check job logs/output - best but complex
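A middle ground between options A and B is to flag monitors whose deadline fell inside the downtime window, instead of alerting or staying silent. A hedged sketch; the "unknown" status and the helper names here are assumptions for illustration, not part of the data model above:

```typescript
// On restart, monitors whose alert deadline fell inside the downtime
// window get marked "unknown" rather than triggering an alert storm
// or being silently assumed healthy.
interface DeadlineInfo {
  id: string;
  deadline: Date; // expected ping time + grace period
}

function classifyAfterOutage(
  m: DeadlineInfo,
  downFrom: Date,
  downUntil: Date,
): "unknown" | "check-normally" {
  const missedWindow = m.deadline >= downFrom && m.deadline <= downUntil;
  return missedWindow ? "unknown" : "check-normally";
}
```

"Unknown" monitors can then be resolved lazily: cleared by the next ping, or escalated to a log/output check (option C) if one is available.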

5. Clock Skew

If the job's server clock is wrong, pings arrive at unexpected times. A 5-minute clock skew can cause false alarms or missed detection.

clock-skew.ts

// Record when the ping was sent (from the job) vs. received (by us)
const ping = {
  receivedAt: "2024-01-15T02:03:00Z",  // our clock
  sentAt: "2024-01-15T02:08:00Z",      // job's clock (5 min ahead)
};

// The job thinks it's late, but it's actually on time.
// Use receivedAt for alerting; log the skew for debugging.

6. Duplicate Pings

Network retries can cause duplicate pings. Your system should be idempotent.

dedupe.ts

// Naive: update lastPingAt on every request
// Problem: rapid retries spam your database

// Better: debounce within a window
const DEDUPE_WINDOW_MS = 30_000;
if (lastPingAt !== null &&
    now.getTime() - lastPingAt.getTime() < DEDUPE_WINDOW_MS) {
  return; // ignore duplicate
}
lastPingAt = now;

Scaling Considerations

High-Frequency Pings

If jobs ping every 30 seconds and you have 10,000 monitors, that's 20,000 writes/minute. Design for write-heavy workloads.

scaling-options.txt

Options:
1. Time-series database (InfluxDB, TimescaleDB)
2. Redis with TTL-based expiration
3. Write coalescing / batch updates
4. Separate hot/cold storage
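Option 3, write coalescing, is straightforward to sketch: buffer the newest `lastPingAt` per monitor in memory and flush them in one batch on a timer. Illustrative only; `bufferPing`, `flush`, and the `flushToDb` callback are made-up names standing in for your real bulk write.

```typescript
// Write-coalescing sketch: keep only the newest ping time per monitor,
// then flush the whole buffer as a single batch write.
const pending = new Map<string, Date>();

function bufferPing(monitorId: string, at: Date): void {
  const prev = pending.get(monitorId);
  if (!prev || at > prev) pending.set(monitorId, at); // keep newest only
}

function flush(flushToDb: (batch: Map<string, Date>) => void): number {
  const batch = new Map(pending);
  pending.clear();
  if (batch.size > 0) flushToDb(batch); // one bulk write, not N
  return batch.size;
}
```

With a 5-second flush interval, 20,000 writes/minute collapses to at most one batch write per interval, bounded by the number of distinct monitors.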

Distributed Scheduler

The scheduler that checks for late pings must be highly available. If it goes down, no alerts fire.

scheduler-options.txt

Approaches:
1. Multiple schedulers with leader election
2. Partition monitors across workers
3. Pull-based: each check runs independently
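Approach 2, partitioning, can be as simple as a stable hash: each worker deterministically owns a slice of the monitors, so every monitor is checked by exactly one worker. A minimal sketch with an illustrative hash (a production system would also need to handle workers joining and leaving):

```typescript
// Deterministic partitioning: hash the monitor ID and take it modulo
// the worker count, so assignment is stable across restarts.
function hashId(id: string): number {
  let h = 0;
  for (const ch of id) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function ownedBy(monitorId: string, worker: number, workers: number): boolean {
  return hashId(monitorId) % workers === worker;
}
```

Each worker then runs the same check loop but skips monitors it doesn't own, which avoids both duplicate alerts and the need for leader election.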

Implementation Patterns

Pattern 1: Push-Based (Recommended)

Job pings the monitoring service. Simplest to implement and debug.

push-ping.sh

# In your cron job
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/TOKEN

# Monitoring service updates lastPingAt
# Scheduler checks for overdue pings every minute

Pattern 2: Pull-Based

Monitoring service queries job status. More complex, but works for jobs that can't make outbound HTTP calls.

pull-check.sh

# Job writes status to shared location
echo "success:$(date +%s)" > /status/backup-job

# Monitoring service reads periodically
last_run=$(cut -d: -f2 /status/backup-job)
if [ $(( $(date +%s) - last_run )) -gt "$GRACE_PERIOD" ]; then
  alert  # placeholder for your alerting hook
fi

Pattern 3: Hybrid

Combine push for normal operation with pull for verification.
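The hybrid check reduces to one decision: trust the push while it arrives, and fall back to a pull only when it doesn't. A minimal sketch; `checkMonitor` and the `verifyByPull` callback are illustrative names, with the pull step standing in for reading the job's status file or API:

```typescript
// Hybrid check: the pull verification only runs when a push is
// missing, so it filters out pings lost in transit without adding
// pull overhead to the normal path.
async function checkMonitor(
  pingArrivedThisCycle: boolean,
  verifyByPull: () => Promise<boolean>,
): Promise<"healthy" | "down"> {
  if (pingArrivedThisCycle) return "healthy"; // normal push path
  return (await verifyByPull()) ? "healthy" : "down";
}
```

This also distinguishes "the ping got lost" from "the job never ran", which a pure push design cannot do.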

upti.my Implementation

upti.my's heartbeat monitoring handles all these edge cases:
  • Flexible schedules. Cron expressions with timezone support.
  • Configurable grace periods. Per-monitor customization.
  • Start/end pings. Detect hung jobs, not just missing ones.
  • Exit code tracking. Distinguish "didn't run" from "ran but failed".
  • Duration monitoring. Alert when jobs take too long.

Key Takeaways

  1. Heartbeat monitoring detects when expected events don't happen
  2. Grace periods absorb natural variation in job timing
  3. Timezone and DST handling require careful implementation
  4. Plan for what happens when the monitoring system itself is down
  5. Design for write-heavy workloads at scale

Building a robust heartbeat monitoring system is harder than it looks. Consider using a purpose-built solution rather than building your own.