Self-Healing Recovery Actions
Configure automatic recovery actions that trigger when checks fail. Restart services, clear caches, run scripts, and more.
Overview
Self-healing recovery actions allow upti.my agents to automatically respond to check failures without waiting for human intervention. When a local check fails, the agent can execute a configured recovery action to restore the service, clean up resources, or run a diagnostic script. This reduces mean time to recovery (MTTR) and keeps your systems running even during off-hours.
Each recovery action is linked to a specific local check. When the check fails, the agent evaluates the retry and cooldown settings before executing the action. All execution results are logged and reported back to the upti.my dashboard for full visibility.
ℹ️ Recovery Action Flow
The flow is: Check fails → Agent evaluates cooldown → Agent executes recovery action → Agent re-runs the check → If still failing, retry up to max_retries → Report final status.
Common Settings
All recovery action types share the following configuration options:
| Setting | Type | Default | Description |
|---|---|---|---|
| max_retries | integer | 3 | Maximum number of times the recovery action will be attempted before giving up. |
| retry_interval | integer (seconds) | 60 | Time to wait between retry attempts. |
| cooldown_period | integer (seconds) | 300 | Minimum time between recovery action executions to prevent rapid-fire retries. |
Recovery Action Types
1. Restart Service (systemd)
Restart a systemd-managed service on the host. The agent runs systemctl restart for the specified service name. This is the most common recovery action for Linux servers.
| Field | Type | Description |
|---|---|---|
| service_name | string | Name of the systemd service, e.g., nginx or postgresql |
{
"type": "restart_service",
"service_name": "nginx",
"max_retries": 3,
"retry_interval": 60,
"cooldown_period": 300
}2. Restart Docker Container
Restart a Docker container by name. The agent issues a docker restart command for the specified container. Useful for recovering crashed or unresponsive containers.
| Field | Type | Description |
|---|---|---|
| container_name | string | Name of the Docker container, e.g., redis-cache |
{
"type": "restart_docker_container",
"container_name": "redis-cache",
"max_retries": 2,
"retry_interval": 30,
"cooldown_period": 180
}3. Restart Kubernetes Pod
Delete a Kubernetes pod to trigger a restart by the controller (Deployment, StatefulSet, etc.). The agent uses kubectl to delete the pod in the specified namespace.
| Field | Type | Description |
|---|---|---|
| pod_name | string | Name of the Kubernetes pod to restart |
| namespace | string | Kubernetes namespace where the pod runs. Default: default |
| kubectl_path | string | Path to the kubectl binary. Default: /usr/local/bin/kubectl |
{
"type": "restart_kubernetes_pod",
"pod_name": "api-server-7d8f9b6c4-x2k9p",
"namespace": "production",
"kubectl_path": "/usr/local/bin/kubectl",
"max_retries": 2,
"retry_interval": 60,
"cooldown_period": 300
}4. Execute Shell Script
Run a custom shell script on the host. This is the most flexible recovery action, allowing you to execute any command or sequence of commands. The script runs with the agent's permissions and is subject to the configured timeout.
| Field | Type | Description |
|---|---|---|
| script | string | Shell script content to execute |
| timeout | integer (seconds) | Maximum execution time for the script. Default: 30. |
{
"type": "execute_script",
"script": "#!/bin/bash\ncd /var/app && ./restart.sh && echo 'Recovery complete'",
"timeout": 60,
"max_retries": 2,
"retry_interval": 30,
"cooldown_period": 600
}⚠️ Script Security
Shell scripts run with the same permissions as the agent process. Avoid running the agent as root unless necessary. Always validate script content carefully, as malformed scripts can cause additional problems. Use timeouts to prevent scripts from hanging indefinitely.
5. Send HTTP Webhook
Send an HTTP request to an external endpoint as a recovery action. This is useful for triggering external automation pipelines, notifying third-party services, or calling a custom recovery API.
| Field | Type | Description |
|---|---|---|
| url | string | Webhook URL to call |
| method | string | HTTP method: GET, POST, PUT. Default: POST. |
| headers | object | Optional request headers |
| body | string | Optional request body (JSON string) |
| expected_status | integer | Expected response status code. Default: 200. |
{
"type": "http_webhook",
"url": "https://automation.example.com/recover",
"method": "POST",
"headers": { "Authorization": "Bearer token123" },
"body": "{ \"service\": \"api\", \"action\": \"restart\" }",
"expected_status": 200,
"max_retries": 3,
"cooldown_period": 300
}6. Clean Disk Space
Remove old files from specified directories to free up disk space. The agent deletes files older than the configured number of days. This pairs well with the Disk Usage local check.
| Field | Type | Description |
|---|---|---|
| paths | string array | Directories to clean, e.g., ["/var/log", "/tmp"] |
| older_than_days | integer | Only delete files older than this many days. Default: 7. |
{
"type": "clean_disk_space",
"paths": ["/var/log", "/tmp"],
"older_than_days": 7,
"max_retries": 1,
"cooldown_period": 3600
}7. Kill Process
Terminate a specific process by name or PID. You can choose between a graceful shutdown (SIGTERM) or a forced kill (SIGKILL). This is useful for stopping runaway processes that consume excessive resources.
| Field | Type | Description |
|---|---|---|
| process_name | string | Name of the process to kill. Either this or pid is required. |
| pid | integer | Process ID to kill. Either this or process_name is required. |
| signal | string | Signal to send: SIGTERM (graceful) or SIGKILL (forced). Default: SIGTERM. |
{
"type": "kill_process",
"process_name": "stuck-worker",
"signal": "SIGTERM",
"max_retries": 2,
"retry_interval": 15,
"cooldown_period": 120
}8. Clear Application Cache
Clear an application's cache by deleting a cache directory or running a cache-clearing command. This is helpful when stale cache data causes application errors or performance degradation.
| Field | Type | Description |
|---|---|---|
| cache_path | string | Path to the cache directory to clear, e.g., /var/app/cache |
| command | string | Alternative: a shell command to clear the cache, e.g., redis-cli FLUSHALL |
{
"type": "clear_cache",
"command": "redis-cli FLUSHALL",
"max_retries": 1,
"cooldown_period": 600
}9. DNS Flush
Flush the local DNS resolver cache. This can resolve issues caused by stale DNS records, such as after a DNS failover event. The agent uses the system default flush command or a custom command you provide.
| Field | Type | Description |
|---|---|---|
| custom_command | string | Optional custom DNS flush command. If omitted, the agent uses the OS default. |
{
"type": "dns_flush",
"custom_command": "systemd-resolve --flush-caches",
"max_retries": 1,
"cooldown_period": 300
}10. Custom Recovery
A fully custom recovery action that lets you define an arbitrary script with complete control over the recovery logic. Use this for complex recovery workflows that don't fit into the other action types.
| Field | Type | Description |
|---|---|---|
| script | string | Full custom recovery script content |
{
"type": "custom_recovery",
"script": "#!/bin/bash\necho 'Starting custom recovery...'\nsystemctl stop myapp\nrm -rf /tmp/myapp-locks\nsystemctl start myapp\necho 'Recovery complete'",
"max_retries": 2,
"retry_interval": 60,
"cooldown_period": 600
}💡 Best Practices
Start with conservative cooldown periods (300 seconds or more) to avoid recovery loops. Set max_retries to 2 or 3 for most actions. Always test recovery actions in a staging environment before deploying to production. Monitor the recovery action logs in your upti.my dashboard to verify that actions execute as expected.
Recovery Action Summary
| Action Type | Key Fields | Use Case |
|---|---|---|
| Restart Service | service_name | Restart crashed systemd services |
| Restart Docker Container | container_name | Recover unresponsive containers |
| Restart Kubernetes Pod | pod_name, namespace, kubectl_path | Force pod restart in a cluster |
| Execute Shell Script | script, timeout | Run arbitrary recovery commands |
| Send HTTP Webhook | url, method, headers, body | Trigger external automation |
| Clean Disk Space | paths, older_than_days | Free disk space by removing old files |
| Kill Process | process_name or pid, signal | Stop runaway processes |
| Clear Application Cache | cache_path or command | Clear stale cache data |
| DNS Flush | custom_command | Resolve stale DNS entries |
| Custom Recovery | script | Complex multi-step recovery workflows |