Background workers are the invisible backbone of modern applications. They process emails, resize images, sync data, and handle everything users don't want to wait for. And when they stop working, nobody notices. Not until the damage is done.
The Silent Failure Problem
Unlike web servers that immediately show errors to users, background workers fail silently. There's no user to complain. No HTTP 500. No immediate symptoms.
How Workers Fail Silently
- Worker process dies. Container crash, OOM kill, or unhandled exception.
- Worker stuck on one job. Infinite loop, deadlock, or waiting on external service.
- Worker can't connect to queue. Redis/RabbitMQ credentials expired or network issue.
- Worker processing but failing every job. All jobs error out and move to dead letter queue.
- Worker too slow. Processing but can't keep up with incoming job rate.
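Several of these modes, especially a worker stuck on one job, can be made loud instead of silent by bounding each job with a timeout. Here's a minimal sketch; the helper name and the timeout value are illustrative, not part of any queue framework:

```typescript
// Wrap a job's promise with a timeout so a hung job fails loudly
// instead of silently blocking the worker forever.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`job exceeded ${ms}ms`)),
      ms
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}
```

Wrapping each handler call, e.g. `await withTimeout(handler(job), 30_000)`, turns a hang into an ordinary job failure that your error-rate metrics can see.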
Detection Patterns
Pattern 1: Heartbeat Pings
The simplest approach: your worker pings an external service at regular intervals. If the pings stop, the worker is dead.
// Worker with heartbeat
class EmailWorker {
  private heartbeatInterval?: ReturnType<typeof setInterval>;

  start() {
    // Ping every minute while processing
    this.heartbeatInterval = setInterval(() => {
      // Swallow network errors so a failed ping never crashes the worker;
      // the missed ping itself is the alert signal
      fetch('https://agents.upti.my/v1/heartbeat/email-worker').catch(() => {});
    }, 60_000);
    this.processQueue();
  }

  stop() {
    clearInterval(this.heartbeatInterval);
  }
}
Pattern 2: Queue Depth Monitoring
Monitor the queue length over time. A growing queue means workers aren't keeping up.
// Monitor queue depth
const queueDepth = await redis.llen('jobs:email');
await fetch('https://upti.my/metrics', {
  method: 'POST',
  body: JSON.stringify({
    metric: 'queue_depth',
    queue: 'email',
    value: queueDepth
  })
});
// Alert if queue depth is growing consistently
Pattern 3: Job Age Monitoring
Track how long jobs wait before being processed. Old jobs indicate worker problems.
// Check the oldest job in the queue (assumes jobs are LPUSHed, so the
// oldest sits at the tail, serialized as JSON with a createdAt timestamp)
const raw = await redis.lindex('jobs:email', -1);
if (raw) {
  const oldestJob = JSON.parse(raw);
  const jobAge = Date.now() - oldestJob.createdAt;
  if (jobAge > 5 * 60 * 1000) { // 5 minutes
    alert('Email queue backlogged');
  }
}
Pattern 4: Throughput Monitoring
Track jobs processed per minute. A drop in throughput indicates problems.
// Bucket timestamps by minute so counters can be compared over time
const getMinuteBucket = () => Math.floor(Date.now() / 60_000);

// Track successful job completion
async function processJob(job) {
  try {
    await handleJob(job);
    // Increment success counter
    await redis.incr('stats:email:success:' + getMinuteBucket());
  } catch (error) {
    await redis.incr('stats:email:failure:' + getMinuteBucket());
    throw error;
  }
}
Pattern 5: Dead Letter Queue Monitoring
Failed jobs eventually end up in a dead letter queue. Monitor its growth.
// Monitor dead letter queue
const dlqDepth = await redis.llen('jobs:email:dead');
if (dlqDepth > 100) {
  alert('Too many failed email jobs');
}
Combining Patterns
No single pattern catches all failures. Use multiple approaches:
Failure Mode Detection Matrix
| Failure Mode | Detection Pattern |
|---|---|
| Worker process died | Heartbeat stops |
| Worker stuck on job | Heartbeat stops, queue grows |
| All jobs failing | DLQ growing, throughput drops |
| Worker too slow | Queue depth growing, job age increasing |
| Partial failures | Error rate increasing, DLQ growing |
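The matrix above can be collapsed into one periodic check that evaluates every signal at once. Here's a sketch; the field names, thresholds, and alert strings are illustrative, and you'd wire the sample fields to whatever sources you already have (Redis `llen`, your metrics store, the last heartbeat time):

```typescript
// One sample of worker health, gathered from your existing signals.
interface HealthSample {
  queueDepth: number;
  dlqDepth: number;
  jobsPerMinute: number;
  secondsSinceHeartbeat: number;
}

// Compare the current sample against the previous one and return every
// failure mode that looks active. Thresholds here are starting points.
function detectProblems(current: HealthSample, previous: HealthSample): string[] {
  const problems: string[] = [];
  // Heartbeat stale: worker dead or stuck
  if (current.secondsSinceHeartbeat > 120) problems.push('heartbeat stale');
  // Depth growing across consecutive samples: worker too slow or stuck
  if (current.queueDepth > previous.queueDepth && current.queueDepth > 50)
    problems.push('queue depth growing');
  // Dead letter queue growing: jobs are failing
  if (current.dlqDepth > previous.dlqDepth) problems.push('DLQ growing');
  // Throughput collapsed while work remains queued
  if (current.jobsPerMinute === 0 && current.queueDepth > 0)
    problems.push('throughput dropped to zero');
  return problems;
}
```

Comparing consecutive samples rather than single readings is what separates a real backlog from a momentary spike.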
Implementation with upti.my
upti.my provides integrated monitoring for background workers:
- Heartbeat endpoints. Workers ping on regular intervals.
- Custom metrics. Track queue depth, throughput, and error rates.
- Threshold alerts. Alert when metrics cross boundaries.
- Grace periods. Avoid false alarms during deployments.
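As a sketch of the heartbeat side, here's a hypothetical helper that pings after each job and swallows network errors; the URL follows the earlier heartbeat example, and the injectable `fetchFn` parameter exists only to keep the helper testable:

```typescript
// Minimal fetch-like signature so the helper can be exercised offline
type FetchFn = (url: string) => Promise<unknown>;

// Ping a heartbeat endpoint and never throw: a monitoring hiccup must not
// take down the worker. Returns whether the ping went through.
async function pingHeartbeat(url: string, fetchFn: FetchFn): Promise<boolean> {
  try {
    await fetchFn(url);
    return true;
  } catch (err) {
    return false; // the missed ping itself becomes the alert signal
  }
}
```

In production you'd pass the global `fetch` and call this at the end of each processing loop iteration.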
Common Mistakes
- Only monitoring the queue server. Redis being up doesn't mean workers are processing.
- Heartbeat too frequent. Creates noise, wastes resources.
- Heartbeat from wrong place. Ping from the processing loop, not the main process.
- Ignoring error rates. Workers can be "working" while failing every job.
- No baseline. You need to know normal throughput to detect anomalies.
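The baseline point deserves a sketch. A simple, admittedly crude approach: keep a short history of jobs-per-minute readings and alert when the current value falls well below the trailing average. The 50% ratio is an arbitrary starting point, not a recommendation:

```typescript
// Flag a throughput anomaly by comparing the current reading against a
// trailing average of recent readings.
function belowBaseline(history: number[], current: number, ratio = 0.5): boolean {
  if (history.length === 0) return false; // no baseline established yet
  const avg = history.reduce((a, b) => a + b, 0) / history.length;
  return current < avg * ratio;
}
```

For example, with a recent history of roughly 30 jobs/minute, a reading of 10 would trip the check while 25 would not.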
📌 Key Takeaways
1. Background workers fail silently; no user sees the error.
2. Use heartbeats to detect dead or stuck workers.
3. Monitor queue depth and job age for capacity issues.
4. Track throughput and error rates for quality issues.
5. Combine multiple patterns because no single metric catches everything.
Background workers are critical infrastructure. Give them the monitoring they deserve.