upti.my

Self-Healing Recovery Actions

Configure automatic recovery actions that trigger when checks fail. Restart services, clear caches, run scripts, and more.

Overview

Self-healing recovery actions allow upti.my agents to automatically respond to check failures without waiting for human intervention. When a local check fails, the agent can execute a configured recovery action to restore the service, clean up resources, or run a diagnostic script. This reduces mean time to recovery (MTTR) and keeps your systems running even during off-hours.

Each recovery action is linked to a specific local check. When the check fails, the agent evaluates the retry and cooldown settings before executing the action. All execution results are logged and reported back to the upti.my dashboard for full visibility.

ℹ️ Recovery Action Flow

The flow is: Check fails → Agent evaluates cooldown → Agent executes recovery action → Agent re-runs the check → If still failing, retry up to max_retries → Report final status.

Common Settings

All recovery action types share the following configuration options:

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| max_retries | integer | 3 | Maximum number of times the recovery action will be attempted before giving up. |
| retry_interval | integer (seconds) | 60 | Time to wait between retry attempts. |
| cooldown_period | integer (seconds) | 300 | Minimum time between recovery action executions to prevent rapid-fire retries. |
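
The interaction between the flow and these three settings can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function name run_recovery, the injected check, action, now, and sleep callables, and the returned status strings are all hypothetical. It also assumes that the cooldown is evaluated once before the first attempt and that retry_interval applies only between attempts.

```python
import time
from typing import Callable, Optional


def run_recovery(
    check: Callable[[], bool],
    action: Callable[[], None],
    max_retries: int = 3,
    retry_interval: float = 60,
    cooldown_period: float = 300,
    last_run: Optional[float] = None,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'healthy', 'cooldown', 'recovered', or 'failed' (hypothetical labels)."""
    if check():
        return "healthy"
    # Skip recovery when the previous execution is still inside the cooldown window.
    if last_run is not None and now() - last_run < cooldown_period:
        return "cooldown"
    for attempt in range(max_retries):
        if attempt > 0:
            sleep(retry_interval)  # wait between attempts, not before the first one
        action()
        if check():  # re-run the check after every attempt
            return "recovered"
    return "failed"
```

Passing the clock and sleep function as parameters keeps the loop easy to exercise without real waiting, for example by calling run_recovery(check, action, sleep=lambda seconds: None).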

Recovery Action Types

1. Restart Service (systemd)

Restart a systemd-managed service on the host. The agent runs systemctl restart for the specified service name. This is the most common recovery action for Linux servers.

| Field | Type | Description |
| --- | --- | --- |
| service_name | string | Name of the systemd service, e.g., nginx or postgresql |

Restart Service Example
{
  "type": "restart_service",
  "service_name": "nginx",
  "max_retries": 3,
  "retry_interval": 60,
  "cooldown_period": 300
}

2. Restart Docker Container

Restart a Docker container by name. The agent issues a docker restart command for the specified container. Useful for recovering crashed or unresponsive containers.

| Field | Type | Description |
| --- | --- | --- |
| container_name | string | Name of the Docker container, e.g., redis-cache |

Restart Docker Container Example
{
  "type": "restart_docker_container",
  "container_name": "redis-cache",
  "max_retries": 2,
  "retry_interval": 30,
  "cooldown_period": 180
}

3. Restart Kubernetes Pod

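Delete a Kubernetes pod so that its controller (Deployment, StatefulSet, etc.) creates a replacement. The agent uses kubectl to delete the pod in the specified namespace. Pods managed by a Deployment receive a new generated name when they are recreated, so a fixed pod_name may stop matching after the first recovery.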

| Field | Type | Description |
| --- | --- | --- |
| pod_name | string | Name of the Kubernetes pod to restart |
| namespace | string | Kubernetes namespace where the pod runs. Default: default |
| kubectl_path | string | Path to the kubectl binary. Default: /usr/local/bin/kubectl |

Restart Kubernetes Pod Example
{
  "type": "restart_kubernetes_pod",
  "pod_name": "api-server-7d8f9b6c4-x2k9p",
  "namespace": "production",
  "kubectl_path": "/usr/local/bin/kubectl",
  "max_retries": 2,
  "retry_interval": 60,
  "cooldown_period": 300
}
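
To make the field defaults concrete, here is a minimal sketch of how such a configuration could be turned into a kubectl command line. It assumes the agent runs something equivalent to kubectl delete pod with a namespace flag; the helper name build_kubectl_delete is hypothetical and not part of upti.my.

```python
from typing import Any, List, Mapping


def build_kubectl_delete(action: Mapping[str, Any]) -> List[str]:
    """Build the argument list for deleting the configured pod."""
    # Defaults follow the field table: namespace "default", kubectl in /usr/local/bin.
    kubectl = action.get("kubectl_path", "/usr/local/bin/kubectl")
    namespace = action.get("namespace", "default")
    return [kubectl, "delete", "pod", action["pod_name"], "--namespace", namespace]


# Running it would look like: subprocess.run(build_kubectl_delete(action), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a pod or namespace name contains unexpected characters.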

4. Execute Shell Script

Run a custom shell script on the host. This is the most flexible recovery action, allowing you to execute any command or sequence of commands. The script runs with the agent's permissions and is subject to the configured timeout.

| Field | Type | Description |
| --- | --- | --- |
| script | string | Shell script content to execute |
| timeout | integer (seconds) | Maximum execution time for the script. Default: 30. |

Execute Shell Script Example
{
  "type": "execute_script",
  "script": "#!/bin/bash\ncd /var/app && ./restart.sh && echo 'Recovery complete'",
  "timeout": 60,
  "max_retries": 2,
  "retry_interval": 30,
  "cooldown_period": 600
}

⚠️ Script Security

Shell scripts run with the same permissions as the agent process. Avoid running the agent as root unless necessary. Review script content before deploying it: a malformed script can make an outage worse, for example by deleting the wrong files or leaving a service stopped. Use timeouts to prevent scripts from hanging indefinitely.
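
The effect of the timeout setting can be illustrated with a short sketch. How the agent actually launches scripts is not documented here, so this assumes the script content is handed to a shell with -c; the function name run_script, the shell parameter, and the status labels are hypothetical.

```python
import subprocess
from typing import Tuple


def run_script(script: str, timeout: float = 30, shell: str = "/bin/sh") -> Tuple[str, str]:
    """Run script content and return (status, stdout).

    status is 'ok', 'failed' (non-zero exit), or 'timeout'.
    """
    try:
        result = subprocess.run(
            [shell, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout
```

For instance, run_script("sleep 2", timeout=0.5) reports a timeout instead of blocking for the full two seconds.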

5. Send HTTP Webhook

Send an HTTP request to an external endpoint as a recovery action. This is useful for triggering external automation pipelines, notifying third-party services, or calling a custom recovery API.

| Field | Type | Description |
| --- | --- | --- |
| url | string | Webhook URL to call |
| method | string | HTTP method: GET, POST, PUT. Default: POST. |
| headers | object | Optional request headers |
| body | string | Optional request body (JSON string) |
| expected_status | integer | Expected response status code. Default: 200. |

HTTP Webhook Example
{
  "type": "http_webhook",
  "url": "https://automation.example.com/recover",
  "method": "POST",
  "headers": { "Authorization": "Bearer token123" },
  "body": "{ \"service\": \"api\", \"action\": \"restart\" }",
  "expected_status": 200,
  "max_retries": 3,
  "cooldown_period": 300
}
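
As a rough illustration of how these fields relate to an actual HTTP call, the sketch below builds the request with Python's standard library and compares the response code with expected_status. The helper names and the 10-second network timeout are assumptions made for this example; the agent's own HTTP client and timeout are not documented here.

```python
import urllib.error
import urllib.request
from typing import Any, Mapping


def build_webhook_request(action: Mapping[str, Any]) -> urllib.request.Request:
    """Translate the webhook fields into a urllib Request (method defaults to POST)."""
    body = action.get("body")
    return urllib.request.Request(
        action["url"],
        data=body.encode("utf-8") if body is not None else None,
        headers=dict(action.get("headers", {})),
        method=action.get("method", "POST"),
    )


def webhook_succeeded(status: int, action: Mapping[str, Any]) -> bool:
    """The action counts as successful only if the status matches expected_status."""
    return status == action.get("expected_status", 200)


def send_webhook(action: Mapping[str, Any]) -> bool:
    request = build_webhook_request(action)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # urllib raises on 4xx/5xx; the code is still available
    return webhook_succeeded(status, action)
```

Note that body is a JSON string rather than an object, which is why the example above escapes the inner quotes.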

6. Clean Disk Space

Remove old files from specified directories to free up disk space. The agent deletes files older than the configured number of days. This pairs well with the Disk Usage local check.

| Field | Type | Description |
| --- | --- | --- |
| paths | string array | Directories to clean, e.g., ["/var/log", "/tmp"] |
| older_than_days | integer | Only delete files older than this many days. Default: 7. |

Clean Disk Space Example
{
  "type": "clean_disk_space",
  "paths": ["/var/log", "/tmp"],
  "older_than_days": 7,
  "max_retries": 1,
  "cooldown_period": 3600
}
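
Before pointing this action at a real directory, it can help to preview what an age-based cleanup would select. The following sketch is only an approximation: it assumes age is judged by modification time and that only regular files directly inside each path are considered, neither of which is confirmed by this page. The function names and the dry_run flag are hypothetical.

```python
import time
from pathlib import Path
from typing import Iterable, List


def find_old_files(paths: Iterable[str], older_than_days: int = 7) -> List[Path]:
    """List regular files whose modification time is older than the cutoff."""
    cutoff = time.time() - older_than_days * 86400  # 86400 seconds per day
    old: List[Path] = []
    for directory in paths:
        for entry in Path(directory).iterdir():
            # Assumption: top-level regular files only; symlinks and subdirectories are skipped.
            if entry.is_symlink() or not entry.is_file():
                continue
            if entry.stat().st_mtime < cutoff:
                old.append(entry)
    return sorted(old)


def clean_disk_space(paths: Iterable[str], older_than_days: int = 7,
                     dry_run: bool = True) -> List[Path]:
    """Return the selected files; delete them only when dry_run is False."""
    targets = find_old_files(paths, older_than_days)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Defaulting to a dry run is a deliberate choice in this sketch, since directories such as /var/log often contain files that other tools still expect to find.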

7. Kill Process

Terminate a specific process by name or PID. You can choose between a graceful shutdown (SIGTERM) or a forced kill (SIGKILL). This is useful for stopping runaway processes that consume excessive resources.

| Field | Type | Description |
| --- | --- | --- |
| process_name | string | Name of the process to kill. Either this or pid is required. |
| pid | integer | Process ID to kill. Either this or process_name is required. |
| signal | string | Signal to send: SIGTERM (graceful) or SIGKILL (forced). Default: SIGTERM. |

Kill Process Example
{
  "type": "kill_process",
  "process_name": "stuck-worker",
  "signal": "SIGTERM",
  "max_retries": 2,
  "retry_interval": 15,
  "cooldown_period": 120
}
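
The field rules above (one of process_name or pid, and a signal limited to SIGTERM or SIGKILL) can be expressed as a small validation sketch. The function names are hypothetical, and looking up a PID from process_name is left out because this page does not say how the agent resolves names. The sketch targets POSIX systems.

```python
import os
import signal
from typing import Any, Mapping


def resolve_signal(name: str = "SIGTERM") -> signal.Signals:
    """Map the configured signal name to a signal number; only two values are accepted."""
    if name not in ("SIGTERM", "SIGKILL"):
        raise ValueError(f"unsupported signal: {name}")
    return signal.Signals[name]


def validate_kill_action(action: Mapping[str, Any]) -> None:
    """Reject configurations that name neither a process nor a PID."""
    if "process_name" not in action and "pid" not in action:
        raise ValueError("either process_name or pid is required")


def kill_pid(pid: int, signal_name: str = "SIGTERM") -> None:
    """Send the chosen signal to a known PID."""
    os.kill(pid, resolve_signal(signal_name))
```

SIGTERM lets the process run its shutdown handlers, whereas SIGKILL cannot be caught, so starting with SIGTERM and escalating only if the process survives is usually the gentler option.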

8. Clear Application Cache

Clear an application's cache by deleting a cache directory or running a cache-clearing command. This is helpful when stale cache data causes application errors or performance degradation.

| Field | Type | Description |
| --- | --- | --- |
| cache_path | string | Path to the cache directory to clear, e.g., /var/app/cache |
| command | string | Alternative: a shell command to clear the cache, e.g., redis-cli FLUSHALL |

Clear Cache Example
{
  "type": "clear_cache",
  "command": "redis-cli FLUSHALL",
  "max_retries": 1,
  "cooldown_period": 600
}

9. DNS Flush

Flush the local DNS resolver cache. This can resolve issues caused by stale DNS records, such as after a DNS failover event. The agent uses the system default flush command or a custom command you provide.

| Field | Type | Description |
| --- | --- | --- |
| custom_command | string | Optional custom DNS flush command. If omitted, the agent uses the OS default. |

DNS Flush Example
{
  "type": "dns_flush",
  "custom_command": "systemd-resolve --flush-caches",
  "max_retries": 1,
  "cooldown_period": 300
}

10. Custom Recovery

A fully custom recovery action that lets you define an arbitrary script with complete control over the recovery logic. Use this for complex recovery workflows that don't fit into the other action types.

| Field | Type | Description |
| --- | --- | --- |
| script | string | Full custom recovery script content |

Custom Recovery Example
{
  "type": "custom_recovery",
  "script": "#!/bin/bash\necho 'Starting custom recovery...'\nsystemctl stop myapp\nrm -rf /tmp/myapp-locks\nsystemctl start myapp\necho 'Recovery complete'",
  "max_retries": 2,
  "retry_interval": 60,
  "cooldown_period": 600
}

💡 Best Practices

Start with conservative cooldown periods (300 seconds or more) to avoid recovery loops. Set max_retries to 2 or 3 for most actions. Always test recovery actions in a staging environment before deploying to production. Monitor the recovery action logs in your upti.my dashboard to verify that actions execute as expected.

Recovery Action Summary

| Action Type | Key Fields | Use Case |
| --- | --- | --- |
| Restart Service | service_name | Restart crashed systemd services |
| Restart Docker Container | container_name | Recover unresponsive containers |
| Restart Kubernetes Pod | pod_name, namespace, kubectl_path | Force pod restart in a cluster |
| Execute Shell Script | script, timeout | Run arbitrary recovery commands |
| Send HTTP Webhook | url, method, headers, body | Trigger external automation |
| Clean Disk Space | paths, older_than_days | Free disk space by removing old files |
| Kill Process | process_name or pid, signal | Stop runaway processes |
| Clear Application Cache | cache_path or command | Clear stale cache data |
| DNS Flush | custom_command | Resolve stale DNS entries |
| Custom Recovery | script | Complex multi-step recovery workflows |