Incident Management

Track, manage, and resolve incidents with automatic detection, timeline tracking, and post-incident analysis.

Overview

upti.my provides a built-in incident management system that tracks every outage from detection to resolution. Incidents are created automatically when incident conditions are met, or manually when your team identifies an issue through other channels. Every incident captures a complete timeline, affected services, linked health checks, and resolution details.

The incident system integrates tightly with incident conditions, workflows, and status pages. When an incident is created, your workflows execute to send notifications and route alerts. As the incident progresses through its lifecycle, status updates can be published to your public or private status pages automatically.

Incident Lifecycle

Every incident moves through a four-stage lifecycle. Each stage transition is timestamped so that metrics such as MTTR can be calculated accurately.

Stage	Description
Detected	The incident has been identified, either automatically by an incident condition or manually by a team member. Workflows execute to send notifications and the timeline begins.
Acknowledged	A team member has acknowledged the incident and is aware of the issue. Acknowledgment stops the escalation timers defined in escalation conditions.
Investigating	The team is actively investigating the root cause. Status updates can be posted to keep stakeholders informed.
Resolved	The issue has been fixed and the affected services are back to normal. Resolution time is recorded and MTTR is calculated.
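
The lifecycle above can be pictured as a small state machine that stamps each transition. The following is only a sketch under stated assumptions: the Incident class, the advance method, and the forward-only rule are hypothetical and are not part of the upti.my API. It shows how per-stage timestamps yield the resolution time that feeds MTTR.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Stage names follow the lifecycle table, in order.
STAGES = ["detected", "acknowledged", "investigating", "resolved"]


@dataclass
class Incident:
    title: str
    stage: str = "detected"
    # Maps each stage name to the time the incident entered it.
    entered_at: dict = field(default_factory=dict)

    def advance(self, new_stage: str, at: datetime) -> None:
        """Move to a later stage and record when the transition happened."""
        # Assumption: stages only move forward, though one may be skipped.
        if STAGES.index(new_stage) <= STAGES.index(self.stage):
            raise ValueError(f"cannot move from {self.stage} to {new_stage}")
        self.stage = new_stage
        self.entered_at[new_stage] = at

    def seconds_to_resolve(self) -> float:
        """Time from detection to resolution, the per-incident input to MTTR."""
        delta = self.entered_at["resolved"] - self.entered_at["detected"]
        return delta.total_seconds()


def open_incident(title: str, at: datetime) -> Incident:
    """Create an incident in the Detected stage with its first timestamp."""
    return Incident(title=title, entered_at={"detected": at})


detected = datetime(2025, 6, 15, 14, 12, tzinfo=timezone.utc)
incident = open_incident("Checkout API failing", detected)
incident.advance("acknowledged", detected + timedelta(minutes=3))
incident.advance("investigating", detected + timedelta(minutes=5))
incident.advance("resolved", detected + timedelta(minutes=25))
print(incident.seconds_to_resolve())  # 1500.0
```

Whether the real system allows skipping or reopening stages is not stated here, so treat the transition rule as illustrative.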

ℹ️ Automatic Incident Creation

When an incident condition is met, upti.my automatically creates an incident unless an open incident already exists for the affected health check. In that case, the new event is linked to the existing incident instead of creating a duplicate.
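
The create-or-link behavior described above might look roughly like the following. This is a minimal sketch, not the product's implementation: the Incident class, the record_failure function, and the in-memory dictionary keyed by health check ID are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    health_check_id: str
    events: list = field(default_factory=list)
    resolved: bool = False


def record_failure(open_incidents: dict, health_check_id: str, event: str) -> Incident:
    """Link the event to the open incident for this check, or open a new one."""
    incident = open_incidents.get(health_check_id)
    if incident is None or incident.resolved:
        # No open incident for this health check, so create one.
        incident = Incident(health_check_id=health_check_id)
        open_incidents[health_check_id] = incident
    incident.events.append(event)
    return incident


store = {}
first = record_failure(store, "hc-api", "HTTP 503 from /health")
second = record_failure(store, "hc-api", "HTTP 503 from /health (retry)")
print(first is second, len(first.events))  # True 2
```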

Incident Timeline

Every incident maintains a detailed timeline that records all events from detection to resolution. Timeline entries are created automatically for system events and can be added manually by team members.

Condition triggered - Records which incident condition was met and which health check failed
Stage transitions - Timestamps for each lifecycle stage change (Detected, Acknowledged, Investigating, Resolved)
Status updates - Manual updates posted by team members with free-text descriptions
Affected services changed - When services are added to or removed from the incident
Team assignments - When responders are assigned or changed
Recovery actions executed - If self-healing actions ran, their results are logged here

Example Timeline Entry

{
  "timestamp": "2025-06-15T14:32:00Z",
  "event_type": "status_update",
  "author": "jane@example.com",
  "message": "Root cause identified: database connection pool exhaustion. Scaling up pool size from 20 to 50 connections.",
  "stage": "investigating"
}
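
If timeline entries are available as a JSON array of objects shaped like the example above, a consumer could pull out the status updates in chronological order along these lines. This is a sketch under assumptions: the array wrapper, the function names, and the "stage_transition" event type used in the sample data are hypothetical; only the field names come from the example entry.

```python
import json
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 onward,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def status_updates(timeline_json: str) -> list:
    """Return the status_update entries from a JSON array, oldest first."""
    entries = json.loads(timeline_json)
    updates = [e for e in entries if e["event_type"] == "status_update"]
    return sorted(updates, key=lambda e: parse_timestamp(e["timestamp"]))


sample = json.dumps([
    {"timestamp": "2025-06-15T14:32:00Z", "event_type": "status_update",
     "message": "Root cause identified"},
    {"timestamp": "2025-06-15T14:12:00Z", "event_type": "stage_transition",
     "message": "Detected"},
    {"timestamp": "2025-06-15T14:20:00Z", "event_type": "status_update",
     "message": "Investigating elevated error rates"},
])
for entry in status_updates(sample):
    print(entry["timestamp"], entry["message"])
```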

Affected Services

Each incident can be linked to one or more impacted services. Services are associated automatically based on the health checks that triggered the incident. You can also add or remove affected services manually as the investigation reveals the full scope of impact.

Affected services are displayed on your status pages, giving your users real-time visibility into which parts of your system are experiencing issues.

Status Page Integration

Incident updates can be automatically published to your upti.my status pages. When you post a status update to an incident, you can choose to push it to one or more status pages. This keeps your users informed without requiring manual updates to your status page.

Automatic publishing - Configure incidents to auto-publish to status pages when created
Selective updates - Choose which status updates are public-facing and which are internal-only
Service impact levels - Set the impact level per service: Operational, Degraded Performance, Partial Outage, Major Outage
Scheduled maintenance - Create planned incidents that appear on the status page before the maintenance window
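
One way to reason about per-service impact levels is to roll them up into a single headline status. The sketch below assumes a "worst level wins" rule, which is a common convention but is not confirmed for upti.my; the overall_impact function and the service names are hypothetical.

```python
# Impact levels as listed above, ordered from least to most severe.
IMPACT_LEVELS = [
    "Operational",
    "Degraded Performance",
    "Partial Outage",
    "Major Outage",
]


def overall_impact(service_impacts: dict) -> str:
    """Return the most severe impact level across all services."""
    if not service_impacts:
        return "Operational"
    # An unrecognized level raises ValueError rather than being ignored.
    return max(service_impacts.values(), key=IMPACT_LEVELS.index)


print(overall_impact({"api-eu": "Degraded Performance", "cdn-eu": "Partial Outage"}))
# Partial Outage
```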

💡 Communicate Early and Often

Post an initial status update within 5 minutes of detection, even if you do not yet know the root cause. Users appreciate knowing that you are aware of the issue. Follow up with updates every 15 to 30 minutes until the incident is resolved.

Response Team Assignment

Assign responders to incidents to track who is working on the issue. Responders are notified when they are assigned and receive all subsequent status updates for the incident. You can configure default response teams per service or per incident condition.

Feature	Description
Default responders	Configure default team members who are automatically assigned when an incident is created for a specific service
Manual assignment	Add or change responders at any time during the incident lifecycle
On-call integration	Integrate with PagerDuty or OpsGenie to automatically assign the current on-call engineer
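
Default responders configured per service could be combined for a multi-service incident as sketched below. The mapping structure and the default_responders function are assumptions for illustration only; they are not taken from the upti.my configuration format.

```python
def default_responders(affected_services: list, defaults_by_service: dict) -> list:
    """Collect default responders for the affected services, without duplicates."""
    responders = []
    for service in affected_services:
        for person in defaults_by_service.get(service, []):
            # Keep first-seen order so the result is stable across runs.
            if person not in responders:
                responders.append(person)
    return responders


defaults = {
    "api-eu": ["jane@example.com", "oncall@example.com"],
    "cdn-eu": ["oncall@example.com", "sam@example.com"],
}
print(default_responders(["api-eu", "cdn-eu"], defaults))
# ['jane@example.com', 'oncall@example.com', 'sam@example.com']
```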

Incident Metrics

upti.my calculates key incident metrics automatically, so your team can measure reliability and response performance without manual bookkeeping.

Metric	Description
MTTD (Mean Time to Detect)	Average time from when a failure starts to when it is detected by a health check or alert. Lower MTTD means faster detection.
MTTR (Mean Time to Recover)	Average time from incident detection to resolution. This is the primary measure of your team's response efficiency.
Incident Count	Total number of incidents over a given period, broken down by severity, service, and status.
Uptime Percentage	Calculated from incident duration relative to total monitored time. Displayed on status pages and dashboards.

Example Incident Metrics Response

{
  "period": "2025-06-01T00:00:00Z/2025-06-30T23:59:59Z",
  "metrics": {
    "mttd_seconds": 45,
    "mttr_seconds": 1230,
    "incident_count": 7,
    "uptime_percentage": 99.94,
    "by_severity": {
      "critical": 2,
      "warning": 3,
      "info": 2
    }
  }
}
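
To make the metric definitions concrete, the sketch below derives MTTD, MTTR, incident count, and uptime percentage from a list of incident records. It is an illustration, not the product's calculation: the IncidentRecord fields and the incident_metrics function are hypothetical, and it assumes that incident duration runs from detection to resolution and that incidents do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    started_at: datetime   # when the failure began
    detected_at: datetime  # when the incident entered Detected
    resolved_at: datetime  # when the incident entered Resolved


def mean(values: list) -> float:
    return sum(values) / len(values) if values else 0.0


def incident_metrics(records: list, period_start: datetime, period_end: datetime) -> dict:
    """Summarize incidents over a period using the definitions in the table."""
    mttd = mean([(r.detected_at - r.started_at).total_seconds() for r in records])
    mttr = mean([(r.resolved_at - r.detected_at).total_seconds() for r in records])
    # Assumption: downtime is the sum of non-overlapping incident durations.
    downtime = sum((r.resolved_at - r.detected_at).total_seconds() for r in records)
    total = (period_end - period_start).total_seconds()
    return {
        "mttd_seconds": mttd,
        "mttr_seconds": mttr,
        "incident_count": len(records),
        "uptime_percentage": round(100 * (1 - downtime / total), 2),
    }


def utc(hour: int, minute: int, second: int = 0) -> datetime:
    return datetime(2025, 6, 15, hour, minute, second, tzinfo=timezone.utc)


records = [
    IncidentRecord(utc(10, 0), utc(10, 0, 30), utc(10, 10, 30)),
    IncidentRecord(utc(12, 0), utc(12, 1), utc(12, 21)),
]
day_start = datetime(2025, 6, 15, tzinfo=timezone.utc)
day_end = datetime(2025, 6, 16, tzinfo=timezone.utc)
print(incident_metrics(records, day_start, day_end))
# {'mttd_seconds': 45.0, 'mttr_seconds': 900.0, 'incident_count': 2, 'uptime_percentage': 97.92}
```

Overlapping incidents would need their intervals merged before summing, otherwise downtime is counted twice.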

Post-Incident Analysis

After an incident is resolved, upti.my provides tools for post-incident analysis. Each incident includes a dedicated notes section where your team can document root cause, contributing factors, and action items. This helps your team learn from incidents and improve reliability over time.

Root cause documentation - Record what caused the incident and how it was identified
Contributing factors - Document environmental or systemic factors that contributed to the issue
Action items - Track follow-up tasks to prevent recurrence
Timeline review - Review the complete incident timeline to identify response bottlenecks
Metric comparison - Compare MTTD and MTTR against your team's historical averages

ℹ️ Blameless Post-Mortems

upti.my encourages blameless post-incident analysis. Focus on systemic improvements rather than individual blame. The incident timeline and metrics provide objective data points for identifying process improvements and automation opportunities.

Creating Incidents Manually

While most incidents are created automatically by incident conditions, you can also create incidents manually for issues detected through other channels. Manual incidents support all the same features: lifecycle tracking, timeline, affected services, status page publishing, and metrics.

Manual Incident Creation

{
  "title": "Elevated API latency in EU region",
  "description": "Users in the EU region are reporting slow API response times. CDN cache hit rate has dropped significantly.",
  "severity": "warning",
  "affected_services": ["api-eu", "cdn-eu"],
  "responders": ["oncall@example.com"],
  "publish_to_status_pages": ["public-status"]
}
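
A payload like the one above could be assembled and sanity-checked before submission, as in the following sketch. The build_incident_payload helper and its validation rules are assumptions; the severity set is inferred from the values that appear in the examples on this page, and the actual accepted values and submission mechanism are not specified here.

```python
import json

# Severity values seen in the examples on this page; the real set may differ.
KNOWN_SEVERITIES = {"critical", "warning", "info"}


def build_incident_payload(title, description, severity, affected_services,
                           responders=(), publish_to_status_pages=()):
    """Return a JSON string shaped like the manual incident example."""
    if not title.strip():
        raise ValueError("title must not be empty")
    if severity not in KNOWN_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    payload = {
        "title": title,
        "description": description,
        "severity": severity,
        "affected_services": list(affected_services),
        "responders": list(responders),
        "publish_to_status_pages": list(publish_to_status_pages),
    }
    return json.dumps(payload, indent=2)


body = build_incident_payload(
    title="Elevated API latency in EU region",
    description="Users in the EU region are reporting slow API response times.",
    severity="warning",
    affected_services=["api-eu", "cdn-eu"],
    responders=["oncall@example.com"],
    publish_to_status_pages=["public-status"],
)
print(body)
```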

⚠️ Duplicate Prevention

Before creating a manual incident, check the active incidents list to ensure a related incident does not already exist. If a related incident is open, add your findings as a status update to the existing incident instead of creating a new one.
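
The duplicate check can be approximated by looking for open incidents that share an affected service with the one you are about to create. This is only a sketch: the related_open_incidents function, the "id" and "status" fields, and the overlap heuristic are assumptions, not a documented upti.my feature.

```python
def related_open_incidents(active_incidents: list, affected_services: list) -> list:
    """Return unresolved incidents that share at least one affected service."""
    wanted = set(affected_services)
    return [
        incident for incident in active_incidents
        if incident["status"] != "resolved"
        and wanted & set(incident["affected_services"])
    ]


active = [
    {"id": "inc-1", "status": "investigating", "affected_services": ["api-eu"]},
    {"id": "inc-2", "status": "resolved", "affected_services": ["cdn-eu"]},
    {"id": "inc-3", "status": "detected", "affected_services": ["billing"]},
]
matches = related_open_incidents(active, ["api-eu", "cdn-eu"])
print([incident["id"] for incident in matches])  # ['inc-1']
```

A match is a prompt to review the existing incident, not proof of a duplicate; two unrelated problems can affect the same service.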