On-Call Rotation Best Practices for Dev Teams
A practical guide to on-call rotation models, fair scheduling, handoffs, runbooks, and burnout prevention for small and medium engineering teams.
A well-designed on-call rotation keeps production systems healthy without quietly destroying your team. This guide covers the most common rotation models, how to make scheduling fair, what a good handoff looks like, and how to build the runbooks and tooling that make being on-call bearable.
Choosing a Rotation Model
There is no universally correct rotation model. The right choice depends on your team size, geographic distribution, and the actual volume and severity of incidents you handle. The three models below cover most small-to-medium team scenarios.
Weekly Primary/Secondary
One engineer is primary on-call for a full week; a second engineer is secondary and is only paged if the primary does not acknowledge within a defined escalation window (typically 5-10 minutes). This is the most common model for teams of 4-12 engineers. The week-long window gives each person enough context to investigate recurring issues, and the secondary role provides a safety net without doubling the interrupt load.
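The escalation rule can be sketched as a small function. This is only an illustration of the logic, not any paging tool's API: `who_to_page`, its parameters, and the 10-minute default are assumptions chosen to match the window described above.

```python
from datetime import datetime, timedelta
from typing import Optional


def who_to_page(
    fired_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
    escalation_window: timedelta = timedelta(minutes=10),  # assumed default
) -> Optional[str]:
    """Return the role to page right now, or None once the page is acknowledged.

    Hypothetical helper: real paging tools express this as an escalation policy.
    """
    if acknowledged_at is not None and acknowledged_at <= now:
        return None  # someone owns the page, so stop escalating
    if now - fired_at < escalation_window:
        return "primary"
    return "secondary"  # primary did not acknowledge inside the window


fired = datetime(2025, 1, 6, 2, 0)
print(who_to_page(fired, fired + timedelta(minutes=3)))   # primary
print(who_to_page(fired, fired + timedelta(minutes=12)))  # secondary
```

Keeping the window as a parameter makes it easy to try a shorter or longer value and see how often the secondary would have been woken.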
Follow-the-Sun
Teams distributed across multiple time zones rotate responsibility by business hours rather than calendar weeks. A Europe-based engineer handles daytime alerts for their region; a North America-based engineer picks up the next shift. This model dramatically reduces overnight pages and is worth the coordination overhead for any team with genuine geographic spread. It requires strict handoff discipline because context must transfer between engineers who may never overlap synchronously.
Daily Rotation
In a daily rotation, the primary changes every 24 hours. Compared with a weekly model, this distributes the burden more evenly and exposes every team member to incidents sooner. The tradeoff is that a single day is rarely enough time to recognise patterns or fully resolve slow-burning issues. Daily rotations work best for mature teams with thorough runbooks, low alert volume, and incidents that are usually resolved within a few hours.
Example Rotation Schedules
The table below shows a four-person weekly primary/secondary rotation. A full cycle takes four weeks, and week 5 shows it starting again. Each engineer serves once as primary and once as secondary per cycle, so overnight and weekend exposure is shared equally.
| Week | Primary | Secondary | Notes |
|---|---|---|---|
| 1 | Alice | Bob | Alice leads; Bob escalation backup |
| 2 | Bob | Carol | Bob leads; Carol escalation backup |
| 3 | Carol | Dave | Carol leads; Dave escalation backup |
| 4 | Dave | Alice | Dave leads; Alice escalation backup |
| 5 | Alice | Bob | Cycle repeats |
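A schedule like the one in the table can be generated rather than maintained by hand. The sketch below is one possible approach, assuming the pattern shown in the table where each week's secondary becomes the next week's primary; `weekly_rotation` is a hypothetical helper name.

```python
from typing import List, Tuple


def weekly_rotation(engineers: List[str], weeks: int) -> List[Tuple[int, str, str]]:
    """Return (week, primary, secondary) rows; the secondary is next week's primary."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary/secondary")
    n = len(engineers)
    return [
        (week, engineers[(week - 1) % n], engineers[week % n])
        for week in range(1, weeks + 1)
    ]


for week, primary, secondary in weekly_rotation(["Alice", "Bob", "Carol", "Dave"], 5):
    print(f"Week {week}: primary={primary}, secondary={secondary}")
```

One side effect of this ordering is that every primary has just spent a week as secondary, so they start their shift with some context on recent incidents.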
For a follow-the-sun example with two regions, the schedule might look like this:
| Shift | Hours (UTC) | Engineer Pool | Handoff Time |
|---|---|---|---|
| EMEA | 07:00 - 16:00 | Alice, Carol | 16:00 UTC daily |
| Americas | 14:00 - 23:00 | Bob, Dave | 23:00 UTC daily |
| Overnight low-traffic | 23:00 - 07:00 | Pager escalates to secondary after 15 min | Async Slack note |
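To make the shift boundaries concrete, here is a minimal lookup of which shift holds the pager at a given moment. It assumes that during the 14:00 - 16:00 UTC overlap the EMEA shift keeps the pager until its 16:00 handoff, which is one reading of the handoff column; `pager_owner` is a hypothetical name.

```python
from datetime import datetime, timezone


def pager_owner(at: datetime) -> str:
    """Return which shift holds the pager at a given instant, evaluated in UTC.

    Assumption: EMEA keeps the pager through the overlap until 16:00 UTC.
    """
    if at.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = at.astimezone(timezone.utc).hour
    if 7 <= hour < 16:
        return "EMEA"
    if 16 <= hour < 23:
        return "Americas"
    return "Overnight low-traffic"


print(pager_owner(datetime(2025, 1, 6, 15, 30, tzinfo=timezone.utc)))  # EMEA
print(pager_owner(datetime(2025, 1, 6, 23, 5, tzinfo=timezone.utc)))   # Overnight low-traffic
```

Requiring timezone-aware datetimes avoids the common mistake of comparing an engineer's local time against UTC shift boundaries.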
Fairness and Load Management
Rotation fairness is not only about equal slots in a calendar. It also means equal exposure to hard shifts. Track weekend and holiday coverage separately and ensure those are distributed across the rotation, not absorbed by the same two people each cycle.
- Count weekend and holiday shifts as weighted slots (many teams count one weekend day as equivalent to two weekday shifts).
- Never assign someone to on-call the week after they return from leave.
- Keep a public rotation calendar so everyone can see the full schedule at least four weeks in advance.
- Allow swap requests through a defined process, not informal pressure.
- Track alert volume per shift and use it to motivate alert reduction, not to rank engineers.
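The weighting rule in the first bullet is easy to compute from a list of day-level assignments. The sketch below assumes weekend days count double, as described above, and additionally assumes holidays get the same weight; the function and constant names are illustrative.

```python
from collections import defaultdict
from datetime import date
from typing import AbstractSet, Dict, Iterable, Tuple

WEEKEND_WEIGHT = 2.0  # one weekend day counts as two weekday shifts
HOLIDAY_WEIGHT = 2.0  # assumption: holidays are weighted like weekend days


def weighted_load(
    assignments: Iterable[Tuple[str, date]],
    holidays: AbstractSet[date] = frozenset(),
) -> Dict[str, float]:
    """Sum weighted on-call days per engineer."""
    totals: Dict[str, float] = defaultdict(float)
    for engineer, day in assignments:
        if day in holidays:
            totals[engineer] += HOLIDAY_WEIGHT
        elif day.weekday() >= 5:  # Saturday is 5, Sunday is 6
            totals[engineer] += WEEKEND_WEIGHT
        else:
            totals[engineer] += 1.0
    return dict(totals)


load = weighted_load([
    ("Alice", date(2025, 1, 3)),  # Friday
    ("Alice", date(2025, 1, 4)),  # Saturday
    ("Bob", date(2025, 1, 6)),    # Monday
    ("Bob", date(2025, 1, 7)),    # Tuesday
])
print(load)  # {'Alice': 3.0, 'Bob': 2.0}
```

Both engineers covered two days, but the weighted totals show that Alice carried the heavier load, which is the imbalance a plain slot count would hide.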
Alert fatigue is cumulative. If your primary on-call receives more than 4-5 actionable pages per shift on average, the rotation will burn people out regardless of how fairly you schedule it. Treat high alert volume as a reliability engineering problem, not a staffing problem.
Handoffs That Actually Work
A handoff is only useful if the incoming engineer finishes it knowing what is still broken, what is being watched, and what context is missing. Verbal handoffs without a written record are unreliable. Use a consistent template.
1. Write a shift summary. Before ending your shift, document every incident that occurred: what happened, what was done, current status, and any open follow-up tickets. Post it to a dedicated Slack channel or incident log.
2. Flag ongoing degradations. If something is not fully resolved or is being monitored for recurrence, say so explicitly. Do not assume the incoming engineer will discover it on their own.
3. Transfer ownership of open alerts. Make sure any acknowledged-but-unresolved alerts in your monitoring tool are re-acknowledged or reassigned. Alerts that fall into a silent acknowledged state are frequently missed.
4. Sync briefly if possible. A 10-minute overlap call or voice note at the start of a shift is far more effective than reading a wall of text alone. For follow-the-sun teams, a recorded Loom or short async video can substitute.
5. Update the runbook if you learned something new. If you investigated an incident and the existing runbook was wrong, incomplete, or missing, update it before handing off. This is the compounding return on every incident.
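A consistent template is easier to enforce when the summary is generated from structured notes. The following is a minimal sketch of that idea; the class names, fields, and output layout are assumptions, and the ticket ID is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNote:
    title: str
    action_taken: str
    status: str       # e.g. "resolved" or "monitoring"
    ticket: str = ""  # follow-up ticket reference, if any


@dataclass
class ShiftSummary:
    outgoing: str
    incoming: str
    incidents: List[IncidentNote] = field(default_factory=list)
    runbooks_updated: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Build a plain-text handoff note suitable for a chat channel or log."""
        lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
        open_items = [i for i in self.incidents if i.status != "resolved"]
        lines.append(f"Still open or being watched: {len(open_items)}")
        for inc in self.incidents:
            ticket = f" ({inc.ticket})" if inc.ticket else ""
            lines.append(f"- [{inc.status}] {inc.title}: {inc.action_taken}{ticket}")
        if self.runbooks_updated:
            lines.append("Runbooks updated: " + ", ".join(self.runbooks_updated))
        return "\n".join(lines)


summary = ShiftSummary(
    outgoing="Alice",
    incoming="Bob",
    incidents=[
        # "OPS-123" is a made-up ticket ID for illustration.
        IncidentNote("High job failure rate", "disabled one job type", "monitoring", "OPS-123"),
    ],
    runbooks_updated=["High job failure rate"],
)
print(summary.render())
```

Putting the count of unresolved items on the second line means the incoming engineer sees what is still broken before reading any detail.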
Writing Runbooks That Get Used
Runbooks fail in practice because they are written once, never tested, and grow stale. A runbook that engineers do not trust is worse than no runbook, because it wastes time during an incident.
- Write runbooks for specific alerts, not generic system components. A runbook titled 'database issues' is useless at 2 AM.
- Include the exact commands to run, in order, with expected output where relevant.
- State explicitly what constitutes resolution versus what constitutes 'safe to monitor until business hours'.
- Add a 'last tested' date and require review after every incident where the runbook was used.
- Link directly from your alert to the corresponding runbook — engineers should not have to search.
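The 'last tested' rule can be checked mechanically. The sketch below assumes runbooks are markdown files that carry a `**Last tested:** YYYY-MM-DD` line, as in the example runbook that follows, and it assumes a 90-day review interval, which is an arbitrary choice rather than a recommendation from this guide.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Matches a line such as "**Last tested:** 2025-04-10"
LAST_TESTED = re.compile(r"\*\*Last tested:\*\*\s*(\d{4}-\d{2}-\d{2})")


def last_tested(runbook_text: str) -> Optional[date]:
    """Extract the 'last tested' date, or None if the runbook has none."""
    match = LAST_TESTED.search(runbook_text)
    return date.fromisoformat(match.group(1)) if match else None


def is_stale(
    runbook_text: str,
    today: date,
    max_age: timedelta = timedelta(days=90),  # assumed review interval
) -> bool:
    """A runbook with no date, or one older than max_age, needs review."""
    tested = last_tested(runbook_text)
    return tested is None or today - tested > max_age


runbook = "## Alert: High job failure rate\n**Last tested:** 2025-04-10\n"
print(is_stale(runbook, today=date(2025, 5, 1)))  # False
print(is_stale(runbook, today=date(2025, 9, 1)))  # True
```

Treating a missing date as stale is deliberate: a runbook nobody has dated is one nobody has verified.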
## Alert: High job failure rate (>20% in 5 min)
**Severity:** P2
**Last tested:** 2025-04-10
### Steps
1. Check https://app.example.com/jobs for recent failures
2. Look for a common error pattern in job logs
3. If failures are in one job type: disable that job, open incident ticket
4. If widespread: check downstream API status pages and DB lag
5. Escalate to secondary if not resolved in 20 minutes
### Resolution criteria
Failure rate drops below 5% for 10 consecutive minutes.
### Escalation contact
Secondary on-call (see rotation calendar)
Monitoring and Alerting That Supports the Rotation
Your on-call rotation is only as effective as the signal it receives. Poorly tuned alerting — too noisy, too slow, or covering the wrong things — is the most common cause of on-call burnout on teams that believe they have a scheduling problem.
For cron jobs and scheduled tasks specifically, a common gap is the missing-execution problem: a job silently stops running and no alert fires because nothing failed. It simply never started. Heartbeat monitoring addresses this directly. The job pings a URL like https://cronjobpro.com/ping/<token> on completion; if the ping does not arrive within the expected period plus a grace window, an alert is sent. This inverts the model from 'wait for an error' to 'expect a signal and alert on its absence'. Routing those alerts to your on-call tool through PagerDuty, Opsgenie, or a webhook means the rotation receives actionable, targeted pages rather than a flood of symptoms.
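Both halves of the heartbeat pattern fit in a few lines. This is a sketch under stated assumptions, not the monitoring service's implementation: `send_heartbeat` is what a job might call after a successful run, and `is_overdue` illustrates the 'period plus grace window' check that a monitor performs on its side. Both function names are hypothetical.

```python
import urllib.request
from datetime import datetime, timedelta


def send_heartbeat(ping_url: str, timeout: float = 10.0) -> bool:
    """Ping the monitor after a successful run.

    Network errors are swallowed so that a monitoring outage does not fail the job.
    """
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def is_overdue(last_ping: datetime, period: timedelta, grace: timedelta, now: datetime) -> bool:
    """True when the expected ping has not arrived within period + grace."""
    return now > last_ping + period + grace


# An hourly job with a 5-minute grace window that last pinged at 02:00.
last = datetime(2025, 1, 6, 2, 0)
hourly, grace = timedelta(hours=1), timedelta(minutes=5)
print(is_overdue(last, hourly, grace, now=datetime(2025, 1, 6, 3, 4)))  # False
print(is_overdue(last, hourly, grace, now=datetime(2025, 1, 6, 3, 6)))  # True
```

The job should call `send_heartbeat` only after its work has succeeded, so that a crash partway through shows up as a missing ping.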
Compensation and Acknowledgement
Compensation norms for on-call vary widely by company size and region, but the core principle is consistent: on-call time has a cost and that cost should be recognised. Common approaches include:
- A flat per-shift stipend, typically paid regardless of whether incidents occur.
- Time-in-lieu for overnight or weekend incidents above a minimum duration threshold.
- An incident bonus per resolved page during off-hours.
- For smaller teams with no formal budget: explicit acknowledgement in performance reviews, protected recovery time the following day, and flexibility to decline non-critical meetings the day after a hard shift.
If your company cannot yet pay a stipend, the most effective alternative is protected recovery time. An engineer who handles incidents from midnight to 3 AM should not be expected in a 9 AM planning meeting with full cognitive function. Making this policy explicit and enforced is worth more than a small cash payment with no recovery time attached.
Avoiding Burnout
Burnout from on-call comes from two sources: volume and perceived futility. Engineers can tolerate a hard week if they believe it is temporary, rare, and that their feedback will be acted on. They burn out when they handle the same recurring incidents repeatedly with no reduction in underlying failures.
- Run a post-incident review after every significant page and track action items to completion.
- Measure and report weekly alert volume as a team metric, not an individual one.
- Reserve engineering time each sprint specifically for reliability work — alert reduction, runbook updates, automation.
- Set a maximum alert threshold (for example, no single on-call shift should receive more than 10 pages) and treat breaches as a team priority.
- Rotate new engineers into secondary positions before putting them on primary. Dropping someone into primary with no prior exposure is a fast way to lose them.
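The threshold and the team-level metric from the list above can be tracked with a small script. The sketch assumes page counts are already aggregated per shift and keyed by a shift label such as an ISO week, deliberately not by engineer name; the function names and sample numbers are illustrative.

```python
from typing import Dict, List

MAX_PAGES_PER_SHIFT = 10  # the example threshold from the list above


def breached_shifts(pages_per_shift: Dict[str, int], limit: int = MAX_PAGES_PER_SHIFT) -> List[str]:
    """Return the shifts whose page count exceeded the limit, in sorted order."""
    return sorted(shift for shift, pages in pages_per_shift.items() if pages > limit)


def team_average(pages_per_shift: Dict[str, int]) -> float:
    """Average pages per shift across the team; 0.0 when there is no data."""
    if not pages_per_shift:
        return 0.0
    return sum(pages_per_shift.values()) / len(pages_per_shift)


pages = {"2025-W01": 4, "2025-W02": 12, "2025-W03": 7}  # made-up sample data
print(breached_shifts(pages))         # ['2025-W02']
print(round(team_average(pages), 1))  # 7.7
```

Keying the data by shift keeps the report focused on the system's behaviour, which matches the guidance to treat alert volume as a team metric.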
Public status pages also lighten the load by cutting inbound questions from customers and internal stakeholders. When users can check for themselves whether an issue is known, on-call engineers spend less time answering Slack messages and more time resolving the underlying problem.