Cron jobs are the most neglected part of any infrastructure. They run silently in the background, doing their thing day after day. Until they stop. And when they stop, no one notices until it's too late.
The Problem with Cron
Cron has no built-in monitoring. When a job fails, cron doesn't alert anyone. When a job never runs because the crontab was deleted, there's no notification. This silence is dangerous.
Real-World Horror Stories
- Backup job stopped 3 weeks ago. Server update cleared the crontab. Discovered during disaster recovery.
- Daily report never ran. Script permissions changed, cron sent email to /dev/null.
- Database cleanup hung. Job started but never finished, tables grew unbounded.
- Certificate renewal failed. Let's Encrypt cron job errored, site went down on expiry day.
Five Cron Failure Modes
1. Job Never Started
The most common failure: the job didn't run at all.
- Crontab deleted or modified
- Cron daemon not running
- Server rebooted and cron didn't start
- Wrong user's crontab edited
Detection: Heartbeat monitoring. The job pings an external service on every run. If no ping arrives within the expected window, the service raises an alert.
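The check a heartbeat monitor performs can be sketched in a few lines of shell. This is only an illustration of the idea, not how any particular service implements it; the function name `heartbeat_status` and its arguments are hypothetical.

```shell
#!/bin/sh
# heartbeat_status LAST_PING INTERVAL GRACE [NOW]
# All values are in seconds; LAST_PING and NOW are Unix timestamps.
# Prints "ok" while the next ping is not yet overdue, "late" otherwise.
# Hypothetical helper for illustration only.
heartbeat_status() {
    last_ping=$1
    interval=$2
    grace=$3
    now=${4:-$(date +%s)}
    # The next ping is due one interval after the last one, plus the grace period
    deadline=$((last_ping + interval + grace))
    if [ "$now" -gt "$deadline" ]; then
        echo "late"
    else
        echo "ok"
    fi
}

# A daily job (86400 s) with a 15-minute grace period (900 s):
heartbeat_status 1700000000 86400 900 1700086000   # checked before the deadline
heartbeat_status 1700000000 86400 900 1700090000   # checked after the deadline
```

The important property is that the check runs outside the server hosting the cron job, so a deleted crontab or a stopped cron daemon cannot silence it.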
2. Job Started but Failed
The job ran, but exited with an error.
- Script error or exception
- Missing dependencies
- Permission denied
- Database connection failed
Detection: Track exit codes. Ping the heartbeat only on successful completion.
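One way to tie the ping to the exit code is a small wrapper function. The sketch below is a minimal example under stated assumptions: `run_with_heartbeat` and the `PING_CMD` override are hypothetical names introduced here, and the default ping reuses the curl flags shown elsewhere in this article.

```shell
#!/bin/sh
# run_with_heartbeat PING_URL COMMAND [ARGS...]
# Runs COMMAND and pings PING_URL only when it exits with status 0.
# The exit status of COMMAND is returned either way, so the caller sees it too.
# PING_CMD is a hypothetical override hook (handy for dry runs); it defaults
# to the curl invocation used elsewhere in this article.
run_with_heartbeat() {
    ping_url=$1
    shift
    status=0
    # Capture the exit status without aborting, even under `set -e`
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        # Left unquoted on purpose so "curl -fsS -m 10" splits into words
        ${PING_CMD:-curl -fsS -m 10} "$ping_url" || echo "heartbeat ping failed" >&2
    else
        echo "job exited with status $status; heartbeat not sent" >&2
    fi
    return "$status"
}
```

A crontab entry could then source this file and call `run_with_heartbeat https://agents.upti.my/v1/heartbeat/your-token /path/to/your/job.sh`. A failed job sends nothing, so the monitor treats it the same way as a job that never ran.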
3. Job Started but Never Finished
The job started but never reports completion, either because it hangs indefinitely or because it was killed mid-run.
- Deadlock or infinite loop
- Waiting on locked resource
- Network timeout (no timeout configured)
- OOM killed mid-execution
Detection: Duration monitoring. Set maximum expected runtime and alert if exceeded.
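Besides alerting on long runs, a maximum runtime can also be enforced on the host itself. The sketch below assumes the `timeout` utility from GNU coreutils is installed; the wrapper name `run_with_limit` is hypothetical.

```shell
#!/bin/sh
# run_with_limit MAX_SECONDS COMMAND [ARGS...]
# Uses `timeout` from GNU coreutils to stop COMMAND after MAX_SECONDS.
# timeout exits with status 124 when the limit was hit, which lets the
# caller tell "hung" apart from an ordinary failure.
run_with_limit() {
    max_seconds=$1
    shift
    status=0
    timeout "$max_seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "job exceeded ${max_seconds}s and was terminated" >&2
    fi
    return "$status"
}
```

For example, `run_with_limit 3600 /path/to/your/job.sh` gives the job one hour. Enforcing the limit locally complements duration monitoring rather than replacing it: the external monitor still needs to notice that no completion ping arrived.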
4. Job Succeeded but Data is Wrong
The job completed successfully, but produced incorrect results.
- Zero records processed (but no error)
- Partial completion (some records skipped)
- Wrong environment variables
- Stale credentials
Detection: Include result data in the heartbeat ping. Validate expected outcomes.
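A simple form of outcome validation is a sanity check on the number of records processed before the heartbeat is sent. The following is a minimal sketch; `validate_count` is a hypothetical helper, and how the job obtains its record count depends on the job.

```shell
#!/bin/sh
# validate_count ACTUAL MINIMUM
# Succeeds only if ACTUAL is a non-negative integer >= MINIMUM.
# Returns 1 when too few records were processed and 2 when ACTUAL is
# not a number at all (for example, an empty string from a failed query).
validate_count() {
    actual=$1
    minimum=$2
    case "$actual" in
        ''|*[!0-9]*)
            echo "record count '$actual' is not a number" >&2
            return 2
            ;;
    esac
    if [ "$actual" -lt "$minimum" ]; then
        echo "only $actual records processed, expected at least $minimum" >&2
        return 1
    fi
    return 0
}
```

With `count` holding the number of records the job reports, the ping can be gated with `validate_count "$count" 1 && curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token`. How result data is attached to the ping itself depends on the monitoring service.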
5. Job Runs but Too Slowly
The job completes, but takes much longer than expected.
- Growing data volume
- Resource contention
- Inefficient queries
- Network degradation
Detection: Track execution duration over time. Alert on increasing trends.
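If no monitoring service tracks durations for you, a rough baseline comparison can be done locally. The sketch below assumes the job appends its runtime in seconds to a log file, one value per line; the function name `duration_trend`, the file layout, and the default threshold of 1.5 are all assumptions made for illustration.

```shell
#!/bin/sh
# duration_trend LOGFILE [FACTOR]
# LOGFILE holds one run duration in seconds per line, oldest first.
# Prints "slow" if the latest run took more than FACTOR times the mean
# of all earlier runs, otherwise "normal". FACTOR defaults to 1.5.
# The file layout and the threshold are assumptions for this sketch.
duration_trend() {
    awk -v factor="${2:-1.5}" '
        # prev_sum accumulates every value except the most recent one
        NF { prev_sum += last; last = $1; n++ }
        END {
            # With fewer than two runs there is no baseline to compare against
            if (n < 2) { print "normal"; exit }
            mean = prev_sum / (n - 1)
            if (last > factor * mean) print "slow"; else print "normal"
        }
    ' "$1"
}
```

A job could record its runtime with a line such as `echo "$DURATION" >> /var/log/myjob-durations.log` (a hypothetical path) and call `duration_trend` afterwards. Comparing against the mean of earlier runs is a deliberately crude stand-in for real trend analysis, but it catches a job that suddenly takes twice as long.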
Implementing Heartbeat Monitoring
The solution is simple: your cron job pings an external service when it runs. If the ping doesn't arrive, you get alerted.
Basic Implementation
#!/bin/bash
# backup.sh - nightly backup with heartbeat monitoring
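# Abort on the first failing command so the ping below is never reached
set -e
# Compute the dump path once so both steps use the same file name,
# even if the job runs across midnight
DUMP_FILE="/backups/mydb-$(date +%Y%m%d).sql"
# Run the backup
pg_dump mydb > "$DUMP_FILE"
# Compress
gzip "$DUMP_FILE"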
# Only ping heartbeat if everything succeeded
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token
With Duration Tracking
#!/bin/bash
START_TIME=$(date +%s)
# Signal start
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token/start
# Run the job
/path/to/your/job.sh
EXIT_CODE=$?
# Calculate duration
DURATION=$(($(date +%s) - START_TIME))
# Signal completion with status
curl -fsS -m 10 "https://agents.upti.my/v1/heartbeat/your-token?exit_code=$EXIT_CODE&duration=$DURATION"
Monitoring with upti.my
upti.my provides purpose-built heartbeat monitoring for cron jobs:
- Flexible schedules. Cron expressions, intervals, or calendar-based.
- Grace periods. Allow for natural variation in execution time.
- Exit code tracking. Distinguish between "didn't run" and "ran but failed".
- Duration monitoring. Alert when jobs take too long.
- Start/end pings. Detect hanging jobs.
Common Mistakes
- Pinging only at the start. A start ping shows that the job launched, not that it succeeded. Send the success ping at the end, after successful completion.
- No timeout on the ping. Use -m 10 to prevent the ping itself from hanging.
- Relying on cron email. Cron email goes to a local mailbox that nobody reads.
- Same grace period for all jobs. A 5-minute job needs different tolerance than a 5-hour job.
Key Takeaways
1. Cron has no built-in failure notification
2. Jobs can fail in five distinct ways; monitor all of them
3. Heartbeat monitoring is the solution: jobs ping when they complete
4. Track duration and exit codes, not just "job ran"
5. Set appropriate grace periods for natural variation
Don't wait for your backup to fail to discover it hasn't run in weeks. Set up heartbeat monitoring for every scheduled job.