Self-Healing Recovery Actions

Configure automatic recovery actions that trigger when checks fail. Restart services, clear caches, run scripts, and more.

Overview

Self-healing recovery actions allow upti.my agents to automatically respond to check failures without waiting for human intervention. When a local check fails, the agent can execute a configured recovery action to restore the service, clean up resources, or run a diagnostic script. This reduces mean time to recovery (MTTR) and keeps your systems running even during off-hours.

Each recovery action is linked to a specific local check. When the check fails, the agent evaluates the retry and cooldown settings before executing the action. All execution results are logged and reported back to the upti.my dashboard for full visibility.

ℹ️ Recovery Action Flow

The flow is: Check fails → Agent evaluates cooldown → Agent executes recovery action → Agent re-runs the check → If still failing, retry up to max_retries → Report final status.

Common Settings

All recovery action types share the following configuration options:

Setting	Type	Default	Description
max_retries	integer	3	Maximum number of times the recovery action will be attempted before giving up.
retry_interval	integer (seconds)	60	Time to wait between retry attempts.
cooldown_period	integer (seconds)	300	Minimum time between successive triggers of the recovery action, which prevents recovery loops. Retries within a single run are paced by retry_interval.
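
The flow and settings above can be sketched as a small loop. This is a minimal illustration and not the agent's actual implementation: `RecoverySettings`, `run_recovery`, and the returned status strings are hypothetical names, and the sketch assumes the cooldown is measured from the previous recovery run.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecoverySettings:
    max_retries: int = 3        # attempts before giving up
    retry_interval: int = 60    # seconds between attempts
    cooldown_period: int = 300  # minimum seconds between recovery runs


def run_recovery(
    check: Callable[[], bool],
    action: Callable[[], None],
    settings: RecoverySettings,
    last_run: Optional[float] = None,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'healthy', 'cooldown', 'recovered', or 'failed'."""
    if check():
        return "healthy"
    # Skip recovery if the previous run happened too recently (assumed semantics).
    if last_run is not None and now() - last_run < settings.cooldown_period:
        return "cooldown"
    for attempt in range(1, settings.max_retries + 1):
        action()
        if check():
            return "recovered"
        # Wait before the next attempt, but not after the last one.
        if attempt < settings.max_retries:
            sleep(settings.retry_interval)
    return "failed"
```

Passing `now` and `sleep` as parameters keeps the loop easy to exercise in tests without real waiting.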

Recovery Action Types

1. Restart Service (systemd)

Restart a systemd-managed service on the host. The agent runs systemctl restart for the specified service name. This is the most common recovery action for Linux servers.

Field	Type	Description
service_name	string	Name of the systemd service, e.g., `nginx` or `postgresql`

Restart Service Example

{
  "type": "restart_service",
  "service_name": "nginx",
  "max_retries": 3,
  "retry_interval": 60,
  "cooldown_period": 300
}
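
To test the same restart by hand before wiring it into a recovery action, a sketch along these lines may help. `restart_command` and `restart_service` are hypothetical helper names, and the 30-second timeout is an assumption rather than a documented agent setting.

```python
import subprocess
from typing import List


def restart_command(service_name: str) -> List[str]:
    """Build the argv for restarting a systemd unit."""
    if not service_name or service_name.startswith("-"):
        raise ValueError("service_name must be a non-empty unit name")
    return ["systemctl", "restart", service_name]


def restart_service(service_name: str, timeout: float = 30.0) -> bool:
    """Run the restart and report whether systemctl exited with 0."""
    result = subprocess.run(
        restart_command(service_name),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

Passing the command as a list instead of a shell string means the service name is never interpreted by a shell.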

2. Restart Docker Container

Restart a Docker container by name. The agent issues a docker restart command for the specified container. Useful for recovering crashed or unresponsive containers.

Field	Type	Description
container_name	string	Name of the Docker container, e.g., `redis-cache`

Restart Docker Container Example

{
  "type": "restart_docker_container",
  "container_name": "redis-cache",
  "max_retries": 2,
  "retry_interval": 30,
  "cooldown_period": 180
}

3. Restart Kubernetes Pod

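Delete a Kubernetes pod so that its controller (Deployment, StatefulSet, etc.) creates a replacement. The agent uses kubectl to delete the pod in the specified namespace. Pods managed by a Deployment receive a new generated name when they are recreated, so a fixed pod_name stays valid across restarts only for stably named pods, such as those of a StatefulSet.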

Field	Type	Description
pod_name	string	Name of the Kubernetes pod to restart
namespace	string	Kubernetes namespace where the pod runs. Default: `default`
kubectl_path	string	Path to the kubectl binary. Default: `/usr/local/bin/kubectl`

Restart Kubernetes Pod Example

{
  "type": "restart_kubernetes_pod",
  "pod_name": "api-server-7d8f9b6c4-x2k9p",
  "namespace": "production",
  "kubectl_path": "/usr/local/bin/kubectl",
  "max_retries": 2,
  "retry_interval": 60,
  "cooldown_period": 300
}
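
As a rough illustration of how the fields and their defaults map onto a kubectl invocation, consider the sketch below. `delete_pod_command` is a hypothetical helper, and the exact arguments the agent passes are an assumption. `kubectl delete pod <name> --namespace <ns>` is the standard form.

```python
from typing import Dict, List


def delete_pod_command(config: Dict[str, str]) -> List[str]:
    """Build the kubectl argv for a restart_kubernetes_pod config."""
    pod_name = config["pod_name"]  # required, no default
    namespace = config.get("namespace", "default")
    kubectl_path = config.get("kubectl_path", "/usr/local/bin/kubectl")
    return [kubectl_path, "delete", "pod", pod_name, "--namespace", namespace]
```

Running the resulting command manually is a quick way to confirm that the agent's user has permission to delete pods in that namespace.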

4. Execute Shell Script

Run a custom shell script on the host. This is the most flexible recovery action, allowing you to execute any command or sequence of commands. The script runs with the agent's permissions and is subject to the configured timeout.

Field	Type	Description
script	string	Shell script content to execute
timeout	integer (seconds)	Maximum execution time for the script. Default: 30.

Execute Shell Script Example

{
  "type": "execute_script",
  "script": "#!/bin/bash\ncd /var/app && ./restart.sh && echo 'Recovery complete'",
  "timeout": 60,
  "max_retries": 2,
  "retry_interval": 30,
  "cooldown_period": 600
}
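
The effect of the timeout can be reproduced locally with a short sketch. It assumes the script is run through `bash -c`, and `run_script` is a hypothetical name. How the agent itself launches scripts is not specified here.

```python
import subprocess
from typing import Optional, Tuple


def run_script(script: str, timeout: int = 30) -> Tuple[Optional[int], str]:
    """Run script text with bash; return (exit_code, stdout), or (None, reason) on timeout."""
    try:
        result = subprocess.run(
            ["bash", "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None, "timed out"
    return result.returncode, result.stdout
```

A non-zero exit code or a timeout would both count as a failed attempt in this model, so scripts should exit with 0 only when recovery actually succeeded.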

⚠️ Script Security

Shell scripts run with the same permissions as the agent process. Avoid running the agent as root unless necessary. Review and test script content before deploying it, because a faulty script can make an outage worse. Use timeouts to prevent scripts from hanging indefinitely.

5. Send HTTP Webhook

Send an HTTP request to an external endpoint as a recovery action. This is useful for triggering external automation pipelines, notifying third-party services, or calling a custom recovery API.

Field	Type	Description
url	string	Webhook URL to call
method	string	HTTP method: GET, POST, PUT. Default: POST.
headers	object	Optional request headers
body	string	Optional request body (JSON string)
expected_status	integer	Expected response status code. Default: 200.

HTTP Webhook Example

{
  "type": "http_webhook",
  "url": "https://automation.example.com/recover",
  "method": "POST",
  "headers": { "Authorization": "Bearer token123" },
  "body": "{ \"service\": \"api\", \"action\": \"restart\" }",
  "expected_status": 200,
  "max_retries": 3,
  "cooldown_period": 300
}
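
A receiving endpoint can be checked against the same fields with a small client sketch like the following. It uses only the Python standard library. `build_request` and `call_webhook` are hypothetical names, and the 10-second network timeout is an assumption.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def build_request(
    url: str,
    method: str = "POST",
    headers: Optional[Dict[str, str]] = None,
    body: Optional[str] = None,
) -> urllib.request.Request:
    """Translate the webhook fields into a urllib Request."""
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers or {}, method=method)


def call_webhook(
    url: str,
    method: str = "POST",
    headers: Optional[Dict[str, str]] = None,
    body: Optional[str] = None,
    expected_status: int = 200,
    timeout: float = 10.0,
) -> bool:
    """Send the request and compare the response code with expected_status."""
    request = build_request(url, method, headers, body)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses raise but still carry a status
    except OSError:
        return False  # connection failure or timeout
    return status == expected_status
```

Because the body is sent as a plain string, a `Content-Type: application/json` header usually needs to be added to `headers` if the endpoint expects JSON.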

6. Clean Disk Space

Remove old files from specified directories to free up disk space. The agent deletes files older than the configured number of days. This pairs well with the Disk Usage local check.

Field	Type	Description
paths	string array	Directories to clean, e.g., `["/var/log", "/tmp"]`
older_than_days	integer	Only delete files older than this many days. Default: 7.

Clean Disk Space Example

{
  "type": "clean_disk_space",
  "paths": ["/var/log", "/tmp"],
  "older_than_days": 7,
  "max_retries": 1,
  "cooldown_period": 3600
}
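
To preview what an age-based cleanup would remove, a sketch such as the one below can be run against a test directory. It assumes that file age is judged by modification time and that subdirectories are scanned recursively. Neither detail is confirmed here, and `clean_old_files` is a hypothetical name.

```python
import os
import time
from pathlib import Path
from typing import Iterable, List, Optional


def clean_old_files(
    paths: Iterable[str],
    older_than_days: int = 7,
    now: Optional[float] = None,
) -> List[Path]:
    """Delete regular files older than the cutoff and return what was removed."""
    cutoff = (time.time() if now is None else now) - older_than_days * 86400
    removed: List[Path] = []
    for root in paths:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                file_path = Path(dirpath) / name
                try:
                    # lstat judges a symlink by its own age, not its target's.
                    if file_path.lstat().st_mtime < cutoff:
                        file_path.unlink()
                        removed.append(file_path)
                except OSError:
                    continue  # file vanished or is not removable; skip it
    return removed
```

Directories themselves are left in place, so services that expect paths such as `/var/log/myapp/` to exist keep working after the cleanup.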

7. Kill Process

Terminate a specific process by name or PID. You can choose between a graceful shutdown (SIGTERM) or a forced kill (SIGKILL). This is useful for stopping runaway processes that consume excessive resources.

Field	Type	Description
process_name	string	Name of the process to kill. Either this or `pid` is required.
pid	integer	Process ID to kill. Either this or `process_name` is required.
signal	string	Signal to send: `SIGTERM` (graceful) or `SIGKILL` (forced). Default: SIGTERM.

Kill Process Example

{
  "type": "kill_process",
  "process_name": "stuck-worker",
  "signal": "SIGTERM",
  "max_retries": 2,
  "retry_interval": 15,
  "cooldown_period": 120
}
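
The difference between the two signals can be tried out with a short sketch that targets a PID. `send_signal` is a hypothetical helper. Looking up a PID from `process_name` is left out because it depends on how the agent matches names, which is not described here.

```python
import os
import signal

# Only the two signals listed in the field table are accepted.
SIGNALS = {"SIGTERM": signal.SIGTERM, "SIGKILL": signal.SIGKILL}


def send_signal(pid: int, signal_name: str = "SIGTERM") -> bool:
    """Signal a PID; return False if no such process exists."""
    if signal_name not in SIGNALS:
        raise ValueError(f"unsupported signal: {signal_name}")
    try:
        os.kill(pid, SIGNALS[signal_name])
    except ProcessLookupError:
        return False
    return True
```

SIGTERM lets the process run its shutdown handlers, while SIGKILL cannot be caught, so starting with SIGTERM is usually the safer choice.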

8. Clear Application Cache

Clear an application's cache by deleting a cache directory or running a cache-clearing command. This is helpful when stale cache data causes application errors or performance degradation.

Field	Type	Description
cache_path	string	Path to the cache directory to clear, e.g., `/var/app/cache`
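command	string	Alternative: a shell command to clear the cache, e.g., `redis-cli FLUSHALL`. Note that `FLUSHALL` deletes every key in every database on that Redis instance, not only cache entries.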

Clear Cache Example

{
  "type": "clear_cache",
  "command": "redis-cli FLUSHALL",
  "max_retries": 1,
  "cooldown_period": 600
}

9. DNS Flush

Flush the local DNS resolver cache. This can resolve issues caused by stale DNS records, such as after a DNS failover event. The agent uses the system default flush command or a custom command you provide.

Field	Type	Description
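custom_command	string	Optional custom DNS flush command. If omitted, the agent uses the OS default. On recent systemd releases the flush command is `resolvectl flush-caches`; `systemd-resolve --flush-caches` is the older form.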

DNS Flush Example

{
  "type": "dns_flush",
  "custom_command": "systemd-resolve --flush-caches",
  "max_retries": 1,
  "cooldown_period": 300
}
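
The fallback from a custom command to a per-OS default might look roughly like this. The commands in `DEFAULT_FLUSH` are commonly used flush commands for each platform, not a confirmed list of what the agent runs, and `flush_command` is a hypothetical name.

```python
import platform
import shlex
from typing import List, Optional

# Assumed defaults; the Linux entry applies only to hosts running systemd-resolved.
DEFAULT_FLUSH = {
    "Linux": ["resolvectl", "flush-caches"],
    "Darwin": ["dscacheutil", "-flushcache"],
    "Windows": ["ipconfig", "/flushdns"],
}


def flush_command(
    custom_command: Optional[str] = None,
    system: Optional[str] = None,
) -> List[str]:
    """Return the argv to run: the custom command if given, else an OS default."""
    if custom_command:
        return shlex.split(custom_command)
    system = system or platform.system()
    try:
        return list(DEFAULT_FLUSH[system])
    except KeyError:
        raise ValueError(f"no default DNS flush command known for {system}") from None
```

Hosts that use a different resolver cache (for example nscd or dnsmasq) would need `custom_command`, since no single default covers every Linux setup.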

10. Custom Recovery

A fully custom recovery action that lets you define an arbitrary script with complete control over the recovery logic. Use this for complex recovery workflows that don't fit into the other action types.

Field	Type	Description
script	string	Full custom recovery script content

Custom Recovery Example

{
  "type": "custom_recovery",
  "script": "#!/bin/bash\necho 'Starting custom recovery...'\nsystemctl stop myapp\nrm -rf /tmp/myapp-locks\nsystemctl start myapp\necho 'Recovery complete'",
  "max_retries": 2,
  "retry_interval": 60,
  "cooldown_period": 600
}

💡 Best Practices

Start with conservative cooldown periods (300 seconds or more) to avoid recovery loops. Set max_retries to 2 or 3 for most actions. Always test recovery actions in a staging environment before deploying to production. Monitor the recovery action logs in your upti.my dashboard to verify that actions execute as expected.

Recovery Action Summary

Action Type	Key Fields	Use Case
Restart Service	service_name	Restart crashed systemd services
Restart Docker Container	container_name	Recover unresponsive containers
Restart Kubernetes Pod	pod_name, namespace, kubectl_path	Force pod restart in a cluster
Execute Shell Script	script, timeout	Run arbitrary recovery commands
Send HTTP Webhook	url, method, headers, body, expected_status	Trigger external automation
Clean Disk Space	paths, older_than_days	Free disk space by removing old files
Kill Process	process_name or pid, signal	Stop runaway processes
Clear Application Cache	cache_path or command	Clear stale cache data
DNS Flush	custom_command	Resolve stale DNS entries
Custom Recovery	script	Complex multi-step recovery workflows