Mayur Rathi

⭐ 34.1k GitHub stars

Incident-Postmortem

Incident-Postmortem is an AI skill in the code category. It guides a team through writing a structured, blameless post-mortem after an outage, production incident, or significant service degradation, so that the write-up follows a consistent process instead of being assembled from scratch each time.

Use when an outage, production incident, or significant service degradation has occurred and the team needs to write a structured blameless post-mortem. Triggers on phrases like "write a post-mortem",

Last verified on: 2026-06-28

mkdir -p ./skills/incident-postmortem && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/incident-postmortem/SKILL.md -o ./skills/incident-postmortem/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Incident Post-Mortem

Guide a team through writing a structured, blameless post-mortem after a production incident. The output is a document that builds shared understanding, identifies root causes without blame, and produces concrete action items to prevent recurrence.

Blameless Principle

Systems fail, not people. The goal is to understand HOW the incident happened — not WHO caused it. Avoid language like "X forgot to", "Y should have known". Use "the system did not", "the process lacked", "the alert did not fire".
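As a minimal sketch of how a draft could be checked against this principle, the snippet below scans text for blame-oriented wording. The two phrases come from the guidance above; the function name `find_blame_phrases` and the pattern list are assumptions made for illustration, not part of the skill.

```python
import re

# Hypothetical pattern list: the two phrases are quoted from the guidance,
# everything else about this check is an illustrative assumption.
BLAME_PATTERNS = [
    re.compile(r"\bforgot to\b", re.IGNORECASE),
    re.compile(r"\bshould have known\b", re.IGNORECASE),
]


def find_blame_phrases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_phrase) pairs for blame-oriented wording."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAME_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((line_number, match.group(0)))
    return hits


draft = "The engineer forgot to update the runbook.\nThe alert did not fire."
print(find_blame_phrases(draft))  # [(1, 'forgot to')]
```

A check like this only catches literal phrases; rewording a flagged sentence into system terms ("the process lacked ...") still needs a human reviewer.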

When to Use

- Production outage or service degradation has been resolved

- A significant near-miss occurred (would have been an incident if caught later)

- User-facing errors, data loss, or SLA breach happened

- Team wants to capture learnings before context fades

**Not for:** Minor bugs caught in staging, planned maintenance windows, or incidents with no learning value.

Input Requirements

Gather these details before writing the post-mortem. Ask for anything missing:

Incident Metadata

- Incident title (short, descriptive)

- Date and time of detection (with timezone)

- Date and time of resolution

- Severity / impact level (P1–P4 or equivalent)

- Incident commander / on-call owner

Impact

- Affected services and systems

- User-facing impact (errors, slowness, full outage)

- Estimated number of users affected

- Data loss or corruption (yes/no, scope)

- SLA/SLO breach (yes/no, by how much)

Timeline Events

Key moments to reconstruct:

- First symptom occurred

- Alert fired (or was noticed manually)

- On-call paged / incident declared

- Investigation started

- Root cause identified

- Mitigation applied

- Full resolution confirmed

- Customer communication sent (if any)

Contributing Factors

Ask the team: "What made this worse than it needed to be?" — not "who failed". Examples:

- Alert threshold too high / alert didn't fire

- Runbook was missing or outdated

- Deploy lacked a feature flag for rollback

- Monitoring didn't cover this failure mode

- On-call handoff missed context

Process

Step 1 — Gather Metadata

If the user has not provided full incident details, ask for them section by section. Don't proceed to writing until you have: title, times, severity, affected services, and at least a rough timeline.
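The gate described in this step can be sketched as a simple completeness check. The field names below (`detected_at`, `resolved_at`, and so on) are hypothetical keys chosen for illustration, and the sample incident is made-up data.

```python
# Hypothetical field names mapping to: title, times, severity,
# affected services, and a rough timeline.
REQUIRED_FIELDS = (
    "title",
    "detected_at",
    "resolved_at",
    "severity",
    "affected_services",
    "timeline",
)


def missing_fields(incident: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not incident.get(name)]


# Made-up example input with two fields still missing.
incident = {
    "title": "API 500 errors",
    "detected_at": "14:32 UTC",
    "severity": "P1",
    "affected_services": ["api"],
}
print(missing_fields(incident))  # ['resolved_at', 'timeline']
```

An empty result would mean there is enough to start writing; anything returned is what to ask the user for next.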

Step 2 — Reconstruct Timeline

Work with the user to build a precise chronological timeline. For each event:

- Exact time (UTC preferred)

- What happened (system event or human action)

- Who observed it or took the action

- Link to log / alert / Slack message if available

Flag gaps: "We don't know what happened between 14:32 and 14:47 — worth checking logs."
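One way to flag such gaps automatically is to sort the events and compare neighbours. This is a sketch under assumptions: the 10-minute threshold, the `find_gaps` name, the helper `utc`, and the sample date are all illustrative choices rather than anything the skill prescribes.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(events, max_gap=timedelta(minutes=10)):
    """Return (start, end) pairs where consecutive events are more than max_gap apart.

    events is a list of (timestamp, description) tuples in any order.
    """
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (earlier, _), (later, _) in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


def utc(hour, minute):
    """Build a UTC timestamp on an arbitrary, made-up date."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    (utc(14, 47), "Mitigation applied"),
    (utc(14, 30), "First symptom occurred"),
    (utc(14, 32), "Alert fired"),
]
for start, end in find_gaps(events):
    print(f"Unknown between {start:%H:%M} and {end:%H:%M}: worth checking logs.")
# prints: Unknown between 14:32 and 14:47: worth checking logs.
```

Keeping timestamps timezone-aware and in UTC avoids ordering mistakes when events come from people in different timezones.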

Step 3 — Root Cause Analysis

Use the **5 Whys** iteratively:

Why did users see 500 errors?
→ The API pods were crash-looping.

Why were they crash-looping?
→ Memory limit was exceeded.

Why was the limit exceeded?
→ A new query was loading full result sets into memory.

Why wasn't this caught before deploy?
→ Load tests only covered the p50 case, not high-cardinality accounts.

Why did load tests only cover p50?
→ We had no test fixtures for large accounts.

Stop when you reach a system/process gap you can fix. The last "why" should point to an action item.
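The chain above can be captured as data so that the final answer is easy to link to an action item. This is only a sketch: the `Why` class and the `root_gap` and `render` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


def root_gap(chain: list[Why]) -> str:
    """Return the answer to the last 'why', the gap an action item should address."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


def render(chain: list[Why]) -> str:
    """Format the chain as question and answer pairs, one per two lines."""
    lines = []
    for step in chain:
        lines.append(step.question)
        lines.append(f"→ {step.answer}")
    return "\n".join(lines)


chain = [
    Why("Why did users see 500 errors?", "The API pods were crash-looping."),
    Why("Why were they crash-looping?", "Memory limit was exceeded."),
    Why("Why was the limit exceeded?", "A new query was loading full result sets into memory."),
    Why("Why wasn't this caught before deploy?", "Load tests only covered the p50 case, not high-cardinality accounts."),
    Why("Why did load tests only cover p50?", "We had no test fixtures for large accounts."),
]
print(root_gap(chain))  # We had no test fixtures for large accounts.
```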

Distinguish:

- **Root cause** — the deepest systemic gap (one or two)

- **Contributing factors** — conditions that made it worse but aren't the root cause

Step 4 — Impact Quantification

Help the user be precise:

- Duration: detection to resolution (not symptom start to resolution — separate these)

- Error rate at peak vs. normal baseline

- Percentage of traffic affected

- Revenue / business impact if known
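The distinction between the two durations is easy to blur, so a small calculation can help keep them separate. The following is a sketch with made-up timestamps and error rates; the function name `impact_summary` and the dictionary keys are assumptions for illustration.

```python
from datetime import datetime, timezone


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def impact_summary(first_symptom, detected, resolved, peak_error_rate, baseline_error_rate):
    """Report detection-to-resolution separately from symptom-to-resolution.

    Error rates are fractions between 0 and 1.
    """
    return {
        "time_to_detect_min": minutes_between(first_symptom, detected),
        "detection_to_resolution_min": minutes_between(detected, resolved),
        "symptom_to_resolution_min": minutes_between(first_symptom, resolved),
        "error_rate_increase_pct_points": round(
            (peak_error_rate - baseline_error_rate) * 100, 2
        ),
    }


# Made-up example values.
summary = impact_summary(
    first_symptom=datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc),
    detected=datetime(2024, 1, 1, 14, 32, tzinfo=timezone.utc),
    resolved=datetime(2024, 1, 1, 15, 2, tzinfo=timezone.utc),
    peak_error_rate=0.12,
    baseline_error_rate=0.01,
)
print(summary)
```

Here detection to resolution is 30 minutes while symptom start to resolution is 32 minutes; the 2-minute difference is the detection delay, which is worth reporting on its own.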

Step 5 — Action Items

For each root cause and contributing factor, generate at least one action item:

|---|--------|-------|----------|----------|

| 2 | Lower memory alert threshold from 90% to 75% | @platf
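The rule that every root cause and contributing factor gets at least one action item can be checked mechanically. This is a sketch under assumptions: the `addresses` key, the `uncovered_findings` name, and the finding labels are hypothetical and not taken from the skill's table.

```python
def uncovered_findings(findings: list[str], action_items: list[dict]) -> list[str]:
    """Return findings (root causes or contributing factors) with no action item.

    Each action item is a dict with an 'action' and the finding it 'addresses'.
    """
    covered = {item["addresses"] for item in action_items}
    return [finding for finding in findings if finding not in covered]


# Hypothetical finding labels for illustration.
findings = [
    "No test fixtures for large accounts",
    "Memory alert threshold too high",
]
action_items = [
    {
        "action": "Lower memory alert threshold from 90% to 75%",
        "addresses": "Memory alert threshold too high",
    },
]
print(uncovered_findings(findings, action_items))
# ['No test fixtures for large accounts']
```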

🎯 Best For

GitHub Copilot users
Claude users
Software engineers
Development teams
Tech leads

💡 Use Cases

Code quality improvement
Best practice enforcement

📖 How to Use This Skill

1. Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

2. Load into Your AI Assistant

Open GitHub Copilot or Claude and reference the skill. Paste the SKILL.md content or use the system prompt tab.

3. Apply Incident-Postmortem to Your Work

Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

4. Review and Refine

Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Incident-Postmortem compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Incident-Postmortem?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Incident-Postmortem?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/incident-postmortem/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.
