
What Self-Healing Monitoring Looks Like in Practice

Self-healing is not science fiction. It is a monitoring check that detects a known problem and runs a known fix. Here is how to build it for real systems.

Most monitoring works like a smoke detector. It goes off when there is a problem. Then a human has to wake up, assess the situation, and fix it. Self-healing monitoring is a smoke detector connected to a sprinkler system. It detects the problem and fixes it, then tells you what happened.

This is not theoretical. Teams already do this for common, well-understood failure modes. The trick is knowing which failures to automate and which ones still need a human.

What Self-Healing Actually Means

Self-healing is not an AI that magically understands and fixes any problem. It is a simple loop:

  1. Monitoring detects a specific, known failure condition
  2. A predefined remediation action runs automatically
  3. Monitoring verifies that the fix worked
  4. The team gets notified about what happened

Each component is straightforward. The power is in connecting them so the loop runs without human intervention for problems you have already solved before.
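As a sketch, the whole loop fits in one small function. Everything here is a placeholder: `check_cmd` stands in for a real health probe (say, `curl` against a health endpoint) and `fix_cmd` for a real remediation (say, `systemctl restart myapp`); a real setup would run inside your monitoring platform, not a standalone script.

```shell
#!/bin/bash
# Sketch of the detect/fix/verify/notify loop as a single function.
# check_cmd and fix_cmd are placeholders for a real health probe and
# a real remediation command.

self_heal() {
  local check_cmd="$1" fix_cmd="$2"

  # 1. Detect: act only after 3 consecutive failed checks
  local failures=0
  for _ in 1 2 3; do
    if eval "$check_cmd"; then failures=0; else failures=$((failures + 1)); fi
  done
  if [ "$failures" -lt 3 ]; then echo "healthy"; return 0; fi

  # 2. Fix: run the predefined remediation
  eval "$fix_cmd"

  # 3. Verify: require 2 consecutive successful checks after the fix
  local successes=0
  for _ in 1 2; do
    if eval "$check_cmd"; then successes=$((successes + 1)); fi
  done

  # 4. Notify: report the outcome (stdout here; Slack or email in production)
  if [ "$successes" -ge 2 ]; then
    echo "healed"
  else
    echo "escalate"
    return 1
  fi
}
```

Note the two different outcomes: "healed" closes the loop silently, while "escalate" is the hand-off to a human.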

Failures Worth Automating

Not every failure should be self-healed. Good candidates share three traits:

  • Well-understood: You know exactly what causes it and how to fix it.
  • Repeating: It happens regularly enough that automating the fix saves real time.
  • Safe to retry: The fix has no destructive side effects if it runs when it should not.

Example: Service Process Crash

Your application process crashes due to a memory leak. It happens every few days. The fix is always the same: restart the process. Perfect candidate for automation.

self-healing-config.yaml
trigger:
  check: api-health
  condition: consecutive_failures >= 3

action:
  type: agent_command
  command: systemctl restart myapp
  timeout: 30s

verify:
  check: api-health
  condition: consecutive_successes >= 2
  timeout: 120s

notify:
  channels: [slack-ops]
  message: "Auto-restarted myapp after health check failures"

Example: Disk Space Full

Log files fill up the disk. The application stops writing. Users see errors. The fix: delete old log files and restart the app.

cleanup-logs.sh
#!/bin/bash
set -euo pipefail

# Clean up log files older than 7 days
find /var/log/myapp -name "*.log" -mtime +7 -delete

# Restart the application so it can write logs again
systemctl restart myapp

Example: SSL Certificate Renewal

Your Let's Encrypt renewal cron job failed. The certificate is about to expire. The self-healing action: run the renewal script, verify the new certificate, reload the web server.
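A sketch of that action, assuming certbot as the ACME client and nginx as the web server (both assumptions; swap in your own tooling, and the domain path is hypothetical). The side-effecting steps are guarded behind an environment variable so the script can be sourced or dry-run without touching a live server.

```shell
#!/bin/bash
# Sketch of a certificate-renewal remediation, assuming certbot and nginx.
# The certificate path is hypothetical; adjust for your domain.

CERT="/etc/letsencrypt/live/example.com/fullchain.pem"

# Succeeds if the certificate in $1 is still valid for at least $2 days.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $2 * 24 * 3600 ))" -in "$1"
}

# Guarded so the script can be dry-run without touching the server.
if [ "${RUN_REMEDIATION:-0}" = "1" ]; then
  # 1. Re-run the renewal the cron job failed to complete
  certbot renew --quiet

  # 2. Verify the renewed certificate has at least 14 days left
  cert_valid_for_days "$CERT" 14

  # 3. Reload (not restart) the web server to pick up the new certificate
  systemctl reload nginx
fi
```

Reloading rather than restarting matters here: a reload picks up the new certificate without dropping in-flight connections.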

Example: Upstream DNS Change

Your application caches DNS lookups. An upstream service changed its IP, and your app is still hitting the old one. The fix: flush the DNS cache and restart the connection pool.
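In the same config shape as the restart example earlier, this remediation might look like the following. The check name and commands are illustrative: `resolvectl` assumes systemd-resolved, and other resolvers (nscd, dnsmasq) need different flush commands.

```yaml
trigger:
  check: upstream-connectivity
  condition: consecutive_failures >= 3

action:
  type: agent_command
  # resolvectl assumes systemd-resolved; other resolvers differ
  command: resolvectl flush-caches && systemctl restart myapp
  timeout: 60s

verify:
  check: upstream-connectivity
  condition: consecutive_successes >= 2
  timeout: 120s

notify:
  channels: [slack-ops]
  message: "Flushed DNS cache and restarted myapp after upstream connectivity failures"
```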

⚠️ When not to self-heal

Do not automate fixes for problems you do not fully understand. If a database is returning errors, automatically restarting it could cause data corruption. If an API is returning 500s, automatically restarting the service might mask a code bug that needs investigation. Self-healing is for known, safe-to-retry fixes only.

Building Self-Healing Loops

1. Start With Your Runbooks

Look at your incident history. Which incidents had the same root cause and the same fix? Those are your candidates. If the fix was "SSH into the server and restart the service," that is automatable.

2. Add Safety Guards

Every self-healing action should have limits:

  • Max retry count: If the fix does not work after 2 attempts, escalate to a human. Do not loop forever.
  • Cooldown period: Do not restart a service 50 times in an hour. Wait at least 5 minutes between attempts.
  • Verification: After the fix runs, check that it actually worked. If the health check still fails, the fix did not work.
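All three guards are bookkeeping around the fix itself. A minimal sketch, where `check_cmd` and `fix_cmd` are placeholders and the cooldown would be minutes, not seconds, in production:

```shell
#!/bin/bash
# Sketch of retry, cooldown, and verification guards around a remediation.
# check_cmd and fix_cmd are placeholder commands.

guarded_fix() {
  local check_cmd="$1" fix_cmd="$2" max_retries="${3:-2}" cooldown="${4:-300}"
  local attempt=1
  while [ "$attempt" -le "$max_retries" ]; do
    eval "$fix_cmd"
    # Verification: did the fix actually work?
    if eval "$check_cmd"; then
      echo "fixed after $attempt attempt(s)"
      return 0
    fi
    # Cooldown before the next attempt (use minutes in production)
    sleep "$cooldown"
    attempt=$((attempt + 1))
  done
  # Max retries exhausted: stop looping and hand off
  echo "escalating to a human after $max_retries failed attempts"
  return 1
}
```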

3. Always Notify

Self-healing does not mean invisible healing. Every automated action should produce a notification: what triggered it, what action ran, and whether it worked. The team needs to know because:

  • If it triggers frequently, the underlying problem needs a real fix
  • If it fails, a human needs to take over
  • It provides an audit trail for postmortems
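The notification itself can be as simple as a webhook post. A sketch assuming a Slack incoming webhook in a hypothetical `SLACK_WEBHOOK_URL` variable; the post is skipped when no webhook is configured, so the same function doubles as a local audit log.

```shell
#!/bin/bash
# Sketch of the notification step: what triggered, what ran, and the outcome.
# SLACK_WEBHOOK_URL is a hypothetical Slack incoming-webhook URL.

notify() {
  local trigger="$1" action="$2" outcome="$3"
  local payload
  payload=$(printf '{"text":"self-heal: trigger=%s action=%s outcome=%s"}' \
    "$trigger" "$action" "$outcome")
  # Post only when a webhook is configured; otherwise just log locally
  if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "$payload" "$SLACK_WEBHOOK_URL" > /dev/null
  fi
  echo "$payload"
}
```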

4. Deploy Agents on Your Infrastructure

Self-healing actions need to run on your servers. An external monitoring service can detect the problem, but the fix needs to execute locally. An agent installed on your infrastructure bridges the gap: it receives instructions from your monitoring platform and executes remediation commands.

How This Works in upti.my

  1. Install the upti.my agent on your server
  2. Create a healthcheck that monitors the service
  3. Define a self-healing action: the command to run when the check fails
  4. Set safety limits: max retries, cooldown, verification check
  5. Configure notification channels for visibility

When the healthcheck fails, the agent runs your remediation command. If the check recovers, the incident closes automatically. If it does not, the system escalates to a human. Either way, you get a full record of what happened and what was tried.

The 80/20 of Self-Healing

You do not need to automate every possible failure. Start with the one incident type that wakes you up most often. Automate that fix. Then move to the next one. Most teams find that 3 to 5 automated remediations eliminate 80% of their after-hours pages.

📌 Key Takeaways

  1. Self-healing is detect, fix, verify, notify in an automated loop
  2. Only automate fixes for well-understood, repeating, safe-to-retry failures
  3. Add safety guards: max retries, cooldown periods, and verification checks
  4. Always notify the team about automated actions for audit and continuous improvement
  5. Start with the single most common incident and automate that fix first
  6. 3 to 5 automated remediations can eliminate 80% of after-hours pages