Incident Postmortem Template (Free & Blameless)
A free, copy-ready blameless incident postmortem template with all sections, plus guidance on who attends, when to write one, and how to run the meeting.
A blameless incident postmortem turns a painful outage into a durable learning artifact. The template below is free to copy, covers every section a rigorous postmortem needs, and the guidance that follows explains exactly how to fill it in and run the review meeting.
The complete incident postmortem template
Copy the block below into your wiki, incident management tool, or a plain markdown file. Every section is required; resist the urge to skip Impact or Lessons Learned when you are under pressure to close the ticket.
# Incident Postmortem — [Service / Component Name]
---
## Summary
**Incident ID:** INC-YYYY-NNN
**Severity:** SEV-1 / SEV-2 / SEV-3
**Status:** Resolved
**Date of incident:** YYYY-MM-DD
**Document owner:** [Name, team]
**Last updated:** YYYY-MM-DD
Write 2–4 sentences describing what broke, how customers or internal users were affected, and how long it lasted. Anyone who reads only this section should understand the scope of the event.
---
## Impact
| Dimension | Detail |
|---------------------|---------------------------------------------|
| Duration | HH:MM (start → end in UTC) |
| Users / tenants affected | e.g. 100 % of paid users, region X only |
| Error rate | e.g. 94 % of requests returned 502 |
| Revenue impact | e.g. ~$X estimated lost transactions |
| SLO burn | e.g. consumed 45 % of monthly error budget |
| Downstream systems | List any systems that cascaded |
---
## Timeline
All times in UTC. Be precise; pull from logs, the alerting tool, or on-call records.
| Time (UTC) | Event |
|--------------|------------------------------------------------------------|
| HH:MM | Triggering change or first anomaly observed in metrics |
| HH:MM | Monitoring / heartbeat alert fired (detection time) |
| HH:MM | On-call engineer acknowledged alert |
| HH:MM | Incident declared SEV-X, war room opened |
| HH:MM | Initial hypothesis formed |
| HH:MM | Root cause confirmed |
| HH:MM | Mitigation applied (rollback / hotfix / config change) |
| HH:MM | Service fully restored, error rate normal |
| HH:MM | All-clear sent to stakeholders |
| HH:MM | Postmortem doc created |
---
## Root Cause
Describe the technical root cause in plain language. Use the "5 Whys" technique:
1. **Why did users see errors?** ...
2. **Why did that component fail?** ...
3. **Why was that condition possible?** ...
4. **Why was there no safeguard?** ...
5. **Why did the process allow this?** ...
**Contributing factors** (list any secondary conditions that made the incident worse):
- ...
- ...
---
## Detection
**How was the incident first detected?**
- [ ] Automated alert (monitoring / heartbeat / uptime check)
- [ ] Customer report
- [ ] On-call engineer noticed manually
- [ ] Internal user report
**Time to detection (TTD):** HH:MM from triggering event to first alert
**Time to acknowledge (TTA):** HH:MM from alert to engineer ack
**Detection gaps identified:**
Describe any signals that existed but did not trigger an alert, or alerts that fired too late.
---
## Resolution
Describe the steps taken to restore service. Include any rollbacks, feature flags toggled, config changes, or manual interventions.
1. ...
2. ...
3. ...
**Mitigation vs. fix distinction:**
- *Mitigation (applied during incident):* ...
- *Permanent fix (post-incident work item):* ...
---
## Action Items
Each action item must have an owner and a due date. Vague items rot in backlogs.
| # | Action | Owner | Priority | Due date | Status |
|---|-------------------------------------|--------------|----------|------------|---------|
| 1 | Add heartbeat monitor for job X | @engineer | P1 | YYYY-MM-DD | Open |
| 2 | Increase alert sensitivity on Y | @engineer | P2 | YYYY-MM-DD | Open |
| 3 | Add runbook link to alert message | @on-call-lead| P2 | YYYY-MM-DD | Open |
| 4 | Review change management process | @tech-lead | P3 | YYYY-MM-DD | Open |
---
## Lessons Learned
**What went well?**
- ...
**What went poorly?**
- ...
**Where did we get lucky?**
- ...
**What would we do differently?**
- ...
---
*This document follows a blameless postmortem culture. The goal is to understand system and process failures, not to assign fault to individuals.*
What makes a postmortem blameless
Blameless means the review assumes every person involved made the best decision they could with the information available at the time. It does not mean free of accountability: action items still have owners and due dates. The distinction is that blame targets people, while a blameless review targets systems and processes. When engineers fear punishment, they stop sharing details, timelines become sanitised, and the organisation learns nothing.
If someone says "engineer X should have caught this", redirect the question: "What system, alert, or process would have caught it regardless of who was on call?" That reframe keeps the conversation productive.
When to write a postmortem
Most teams write postmortems for SEV-1 and SEV-2 incidents automatically. A lightweight version is worth doing for any event that meets one or more of these criteria:
- Customer-facing downtime or degradation lasting more than 15 minutes
- An SLO error budget burned beyond a defined threshold (commonly 10% in a single event)
- Data loss or data corruption of any magnitude
- A security incident, even if contained quickly
- A near-miss that would have been severe had a safeguard not caught it
- Any incident that repeats a pattern seen in a previous postmortem
Write the first draft within 24 hours of resolution while memory is fresh. Hold the review meeting within 48 to 72 hours. Waiting longer means reconstructed timelines and faded context.
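The criteria and deadlines above can be encoded as a small check. The following is a minimal sketch, not a standard: the `Incident` fields, the function names, and the default 10% budget threshold are illustrative assumptions drawn from the list, and your own thresholds may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    severity: int                    # 1 for SEV-1, 2 for SEV-2, and so on
    customer_impact_minutes: float   # customer-facing downtime or degradation
    error_budget_burned: float       # fraction burned by this event (0.10 == 10%)
    data_loss: bool = False
    security_incident: bool = False
    near_miss: bool = False
    repeats_known_pattern: bool = False


def needs_postmortem(incident: Incident, budget_threshold: float = 0.10) -> bool:
    """Return True if the incident meets at least one of the listed criteria."""
    return any([
        incident.severity <= 2,                          # SEV-1 / SEV-2: always
        incident.customer_impact_minutes > 15,
        incident.error_budget_burned > budget_threshold,
        incident.data_loss,
        incident.security_incident,
        incident.near_miss,
        incident.repeats_known_pattern,
    ])


def deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """First draft within 24 hours of resolution, review no later than 72 hours."""
    return {
        "draft_due": resolved_at + timedelta(hours=24),
        "review_by": resolved_at + timedelta(hours=72),
    }
```

A check like this is most useful as a prompt in the incident tool at resolution time; the decision to skip a postmortem should still be a human one.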
Who attends the review meeting
| Role | Responsibility in the meeting |
|---|---|
| Incident commander / lead | Owns the document, presents the timeline, drives discussion |
| On-call engineers | Provide technical detail, correct the timeline, explain decisions made during the incident |
| Engineering manager | Ensures action items are prioritised and resourced; does not dominate technical discussion |
| Product or customer-facing representative | Communicates customer impact accurately; translates technical facts for business stakeholders |
| Any engineer whose change triggered the incident | Invited as a valued source of context, not as the accused |
Keep the meeting under 60 minutes. If the root cause analysis is still ongoing, hold a brief sync to align on the timeline and schedule a second session once analysis is complete.
How detection time ties into the postmortem
The Detection section of the template is often the most actionable. Time to detection (TTD) is the gap between the moment something broke and the moment your team knew about it. Every minute of TTD is a minute of user-facing impact you had no chance to stop. The most common detection gap is a missing or misconfigured monitor.
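As a rough illustration, TTD and TTA can be computed directly from three timeline entries. This sketch assumes the timeline uses `HH:MM` values in UTC as in the template, and that the whole sequence spans less than 24 hours; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_utc(day: str, hhmm: str) -> datetime:
    """Combine a YYYY-MM-DD date and an HH:MM timeline entry (both UTC)."""
    return datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M")


def detection_metrics(day: str, triggered: str, alerted: str,
                      acknowledged: str) -> dict[str, timedelta]:
    """Return time to detection (TTD) and time to acknowledge (TTA)."""
    t_trigger = parse_utc(day, triggered)
    t_alert = parse_utc(day, alerted)
    t_ack = parse_utc(day, acknowledged)
    # Timeline entries carry no date, so roll forward when one crosses midnight.
    if t_alert < t_trigger:
        t_alert += timedelta(days=1)
    if t_ack < t_alert:
        t_ack += timedelta(days=1)
    return {"ttd": t_alert - t_trigger, "tta": t_ack - t_alert}


metrics = detection_metrics("2024-03-05", "14:02", "14:19", "14:23")
print(metrics["ttd"], metrics["tta"])  # prints "0:17:00 0:04:00"
```

Recording both numbers in every postmortem makes it possible to see whether detection is improving across incidents rather than judging each one in isolation.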
There are two classes of job to monitor. Jobs that a scheduler triggers as HTTP requests can be watched directly, because the scheduler sees every response. Jobs that run on your own infrastructure (cron, queues, scripts) need heartbeat monitoring: the job pings a URL on success, and if the ping does not arrive within the expected window plus a grace period, an alert fires. CronJobPro supports both patterns. For any job that should run on a schedule, setting up a heartbeat monitor (see /heartbeat-monitoring) means a missed or crashed run is detected automatically rather than discovered by a customer.
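The heartbeat pattern described above has two halves: the job pings only after it succeeds, and the monitor alerts when a ping is late. The sketch below shows both under stated assumptions. The URL is a placeholder, not a real CronJobPro endpoint, and `is_overdue` only illustrates the "expected window plus grace period" rule; it is not how any particular monitoring service is implemented.

```python
import urllib.request
from datetime import datetime, timedelta
from typing import Callable

# Placeholder URL for illustration only; use the one your monitoring tool issues.
HEARTBEAT_URL = "https://monitor.example.com/ping/nightly-export"


def send_ping(url: str = HEARTBEAT_URL) -> None:
    """Job side: tell the monitor that the run finished successfully."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_with_heartbeat(job: Callable[[], None],
                       ping: Callable[[], None] = send_ping) -> None:
    """Run the job, then ping. If the job raises, the ping is never sent."""
    job()
    ping()


def is_overdue(last_ping: datetime, expected_interval: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """Monitor side: alert once the window plus the grace period has passed."""
    return now > last_ping + expected_interval + grace
```

Pinging only on success is the important design choice: a job that crashes, hangs, or never starts looks identical to the monitor, and all three cases raise the same alert.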
A common action item after a postmortem is "add monitoring for X". Be specific: record what the monitor checks, what threshold triggers an alert, and which channel receives it (email, Slack, PagerDuty, etc.). Vague action items rarely get completed.
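One way to enforce that specificity is to give monitoring action items a fixed shape and reject any that leave a field blank. This is an illustrative sketch; the class and field names are assumptions, not part of any tool.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MonitorActionItem:
    """Hypothetical structure for an 'add monitoring for X' action item."""
    what_it_checks: str    # e.g. "nightly export job completes"
    alert_threshold: str   # e.g. "no ping within 24 h plus 10 min grace"
    alert_channel: str     # e.g. "PagerDuty", "Slack", "email"
    owner: str
    due_date: str          # YYYY-MM-DD


def missing_details(item: MonitorActionItem) -> list[str]:
    """Return the names of fields that were left blank."""
    return [name for name, value in asdict(item).items() if not value.strip()]


vague = MonitorActionItem("Add monitoring for X", "", "", "", "")
print(missing_details(vague))
# prints ['alert_threshold', 'alert_channel', 'owner', 'due_date']
```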
How to run the meeting step by step
1. Read the document before the meeting
   Require all attendees to read the draft postmortem at least 30 minutes before the meeting starts. The meeting is for discussion and corrections, not for reading the document aloud.
2. Walk the timeline, not the blame
   The facilitator reads each timeline entry and asks: is this accurate? Is anything missing? Correct the record before discussing causes.
3. Apply the 5 Whys to the root cause
   Keep asking why until you reach a process or system gap, not a person. If the fifth answer is still a person's name, keep going.
4. Review the Detection section explicitly
   Ask: what was our TTD, and how do we cut it in half for the next similar incident? This is where new monitors, tighter thresholds, and runbook links get identified.
5. Draft action items live
   Every identified gap becomes an action item with a named owner and a due date assigned before the meeting ends. Items without owners are not action items; they are wishes.
6. Publish and schedule follow-up
   Publish the final document in a location the whole engineering team can find. At the next sprint planning or weekly sync, review the status of open action items.
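The last two steps lend themselves to a simple check at the end of the meeting and again at the follow-up. The sketch below assumes action items are tracked as plain records; `ActionItem`, `wishes`, and `overdue` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]
    due: Optional[date]
    status: str = "Open"


def wishes(items: list[ActionItem]) -> list[ActionItem]:
    """Items with no owner or no due date: fix these before the meeting ends."""
    return [item for item in items if not item.owner or item.due is None]


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date: raise these at the follow-up review."""
    return [
        item for item in items
        if item.status == "Open" and item.due is not None and item.due < today
    ]


items = [
    ActionItem("Add heartbeat monitor for job X", "@engineer", date(2024, 4, 1)),
    ActionItem("Review change management process", None, None),
]
print([item.action for item in wishes(items)])
# prints ['Review change management process']
```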
Postmortem anti-patterns to avoid
- Skipping the Impact section because the outage felt minor — if it happened, measure it
- Writing action items without owners or due dates
- Conflating mitigation (what stopped the bleeding) with the permanent fix (what prevents recurrence)
- Closing the document before the permanent fix is verified in production
- Storing postmortems in a private folder where only the team involved can read them — organisational learning requires visibility
- Running the review more than a week after the incident
Closing the loop with monitoring
A postmortem with no follow-through is a document, not a practice. The highest-value action items are usually monitoring improvements. After every postmortem, check whether each gap in the Detection section now has a corresponding alert. Public status pages give customers visibility into incidents in real time and reduce support load during outages. If your jobs run on a schedule, verify each one has either a direct HTTP check or a heartbeat monitor so the next missed run surfaces as soon as its grace period expires, not hours later when a customer reports it.
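That final verification can be as simple as a set difference between the jobs you schedule and the jobs you monitor. A minimal sketch, assuming you can export both lists as job names (the function name and inputs are hypothetical):

```python
def unmonitored_jobs(scheduled: set[str], http_checks: set[str],
                     heartbeats: set[str]) -> list[str]:
    """Return scheduled jobs that have neither an HTTP check nor a heartbeat."""
    return sorted(scheduled - http_checks - heartbeats)


gaps = unmonitored_jobs(
    scheduled={"nightly-export", "invoice-run", "cache-warmup"},
    http_checks={"cache-warmup"},
    heartbeats={"nightly-export"},
)
print(gaps)  # prints ['invoice-run']
```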