How many engineers do you need before setting up a formal on-call rotation?

Most teams start a formal rotation at 3-4 engineers. Below that threshold, each engineer is on call at least every other week, which leaves too little time off the pager for a rotation to provide real relief. With fewer than three people, a shared responsibility model with explicit escalation contacts is usually more practical than a rotating schedule.

What is the difference between primary and secondary on-call?

The primary on-call engineer is the first person paged when an alert fires. The secondary is only contacted if the primary does not acknowledge the alert within a defined escalation window, typically 5-15 minutes. The secondary role exists as a safety net, not as a co-responder, and should not be paged for every incident.

How do you handle on-call for cron jobs and scheduled tasks?

Scheduled tasks require a different monitoring approach than web services. Because they run silently, a failure often means the job simply stops without producing an error. Heartbeat monitoring — where the job pings a URL on completion and an alert fires if the ping does not arrive — is the standard solution. Alerts can then route to your on-call engineer via the same channels used for other infrastructure alerts.

What should a good handoff include?

A handoff should cover: any incidents that occurred during the shift and their current status, ongoing issues being monitored, any alerts that are acknowledged but not resolved, any runbook gaps discovered, and any context the incoming engineer needs that is not documented elsewhere. A written summary posted to a shared channel is the minimum; a short sync call or voice note is better.

Is it normal to compensate engineers for on-call time even if no incidents occur?

Yes, and most engineering compensation guidelines recommend it. Being on-call has a real cost even without pages: restricted movement, interrupted sleep, and background cognitive load. A flat per-shift stipend that does not depend on incident volume is the clearest way to acknowledge this. In environments where cash compensation is not possible, protected recovery time and explicit recognition in performance reviews are common alternatives.

On-Call Rotation Best Practices for Dev Teams

A practical guide to on-call rotation models, fair scheduling, handoffs, runbooks, and burnout prevention for small and medium engineering teams.

A well-designed on-call rotation keeps production systems healthy without quietly destroying your team. This guide covers the most common rotation models, how to make scheduling fair, what a good handoff looks like, and how to build the runbooks and tooling that make being on-call bearable.

Choosing a Rotation Model

There is no universally correct rotation model. The right choice depends on your team size, geographic distribution, and the actual volume and severity of incidents you handle. The three models below cover most small-to-medium team scenarios.

Weekly Primary/Secondary

One engineer is primary on-call for a full week; a second engineer is secondary and is only paged if the primary does not acknowledge within a defined escalation window (typically 5-15 minutes). This is the most common model for teams of 4-12 engineers. The week-long window gives each person enough context to investigate recurring issues, and the secondary role provides a safety net without doubling the interrupt load.
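The escalation rule described above can be sketched in a few lines. This is a minimal illustration, not how any particular paging tool implements it; the function name `who_to_page` and the 10-minute default window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed value for illustration; pick whatever window your team has agreed on.
ESCALATION_WINDOW = timedelta(minutes=10)


def who_to_page(
    fired_at: datetime,
    acked_at: Optional[datetime],
    now: datetime,
    window: timedelta = ESCALATION_WINDOW,
) -> Optional[str]:
    """Return the role to page at `now`, or None if the alert is acknowledged."""
    if acked_at is not None:
        return None  # someone has acknowledged the alert, so stop paging
    if now - fired_at < window:
        return "primary"  # still inside the escalation window
    return "secondary"  # primary did not acknowledge in time
```

The secondary is only ever returned once the window has elapsed without an acknowledgement, which is what keeps that role a safety net rather than a co-responder.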

Follow-the-Sun

Teams distributed across multiple time zones rotate responsibility by business hours rather than calendar weeks. A Europe-based engineer handles daytime alerts for their region; a North America-based engineer picks up the next shift. This model dramatically reduces overnight pages and is worth the coordination overhead for any team with genuine geographic spread. It requires strict handoff discipline because context must transfer between engineers who may never overlap synchronously.

Daily Rotation

Daily rotations distribute the burden more evenly and expose every team member to incidents faster. The tradeoff is that a single day is rarely enough time to recognise patterns or fully resolve slow-burning issues. Daily rotations work best for mature teams with thorough runbooks, low alert volume, and incidents that are usually resolved within a few hours.

Example Rotation Schedules

The table below shows a four-person weekly primary/secondary rotation across a one-month period. Engineers rotate through both roles so everyone shares overnight and weekend exposure equally.

Week	Primary	Secondary	Notes
1	Alice	Bob	Alice leads; Bob escalation backup
2	Bob	Carol	Bob leads; Carol escalation backup
3	Carol	Dave	Carol leads; Dave escalation backup
4	Dave	Alice	Dave leads; Alice escalation backup
5	Alice	Bob	Cycle repeats
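
The pattern in the table above (each week's secondary becomes the next week's primary) is simple enough to generate. The sketch below is one way to do it; `weekly_rotation` is a hypothetical helper name, not part of any scheduling tool.

```python
def weekly_rotation(engineers: list[str], weeks: int) -> list[tuple[int, str, str]]:
    """Return (week number, primary, secondary) rows for a round-robin rotation."""
    n = len(engineers)
    if n < 2:
        raise ValueError("a primary/secondary rotation needs at least two engineers")
    return [
        # This week's secondary is the next engineer in the list,
        # who then becomes primary the following week.
        (week + 1, engineers[week % n], engineers[(week + 1) % n])
        for week in range(weeks)
    ]


for week, primary, secondary in weekly_rotation(["Alice", "Bob", "Carol", "Dave"], 5):
    print(week, primary, secondary)
```

With four engineers and five weeks, this reproduces the rows shown in the table, including the repeat of the Alice/Bob pairing in week 5.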

For a follow-the-sun example with two regions, the schedule might look like this:

Shift	Hours (UTC)	Engineer Pool	Handoff Time
EMEA	07:00 - 16:00	Alice, Carol	16:00 UTC daily
Americas	14:00 - 23:00	Bob, Dave	23:00 UTC daily
Overnight low-traffic	23:00 - 07:00	Pager escalates to secondary after 15 min	Async Slack note
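
To route a page to the right pool, the schedule above can be turned into a lookup on the current UTC hour. This sketch makes one assumption the table does not state outright: during the 14:00 - 16:00 overlap, EMEA keeps the pager until the 16:00 handoff. The function name `active_shift` is illustrative.

```python
from datetime import datetime, timezone


def active_shift(now: datetime) -> str:
    """Return the shift that owns the pager at `now` (must be timezone-aware)."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    if 7 <= hour < 16:
        return "EMEA"  # assumption: EMEA owns the overlap until the 16:00 handoff
    if 16 <= hour < 23:
        return "Americas"
    return "Overnight low-traffic"  # 23:00 - 07:00 UTC
```

Converting to UTC inside the function means callers can pass local times from either region without shifting the handoff boundaries.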

Fairness and Load Management

Rotation fairness is not only about equal slots in a calendar. It also means equal exposure to hard shifts. Track weekend and holiday coverage separately and ensure those are distributed across the rotation, not absorbed by the same two people each cycle.

Count weekend and holiday shifts as weighted slots (many teams count one weekend day as equivalent to two weekday shifts).
Never schedule someone for on-call during their first week back from leave.
Keep a public rotation calendar so everyone can see the full schedule at least four weeks in advance.
Allow swap requests through a defined process, not informal pressure.
Track alert volume per shift and use it to motivate alert reduction, not to rank engineers.
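
The weighted-slot idea from the list above can be checked with a small script. This is a sketch under stated assumptions: it uses the two-to-one weekend weighting mentioned above, and it assumes holidays carry the same weight as weekend days, which the guide does not specify. The names `weighted_load`, `WEEKEND_WEIGHT`, and `HOLIDAY_WEIGHT` are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKEND_WEIGHT = 2.0  # one weekend day counts as two weekday shifts
HOLIDAY_WEIGHT = 2.0  # assumption: holidays are weighted like weekend days


def weighted_load(
    assignments: list[tuple[str, date]],
    holidays: frozenset[date] = frozenset(),
) -> dict[str, float]:
    """Sum weighted on-call days per engineer from (engineer, day) pairs."""
    load: dict[str, float] = defaultdict(float)
    for engineer, day in assignments:
        if day in holidays:
            weight = HOLIDAY_WEIGHT
        elif day.weekday() >= 5:  # Saturday is 5, Sunday is 6
            weight = WEEKEND_WEIGHT
        else:
            weight = 1.0
        load[engineer] += weight
    return dict(load)
```

Comparing the resulting totals across a quarter makes it obvious when the same people keep absorbing the hard shifts, even if the raw slot counts look equal.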

Alert fatigue is cumulative. If your primary on-call receives more than 4-5 actionable pages per shift on average, the rotation will burn people out regardless of how fairly you schedule it. Treat high alert volume as a reliability engineering problem, not a staffing problem.

Handoffs That Actually Work

A handoff is only useful if the incoming engineer finishes it knowing what is still broken, what is being watched, and what context is missing. Verbal handoffs without a written record are unreliable. Use a consistent template.

1. Write a shift summary
Before ending your shift, document every incident that occurred: what happened, what was done, current status, and any open follow-up tickets. Post it to a dedicated Slack channel or incident log.
2. Flag ongoing degradations
If something is not fully resolved or is being monitored for recurrence, say so explicitly. Do not assume the incoming engineer will discover it on their own.
3. Transfer ownership of open alerts
Make sure any acknowledged-but-unresolved alerts in your monitoring tool are re-acknowledged or reassigned. Alerts that fall into a silent acknowledged state are frequently missed.
4. Sync briefly if possible
A 10-minute overlap call or voice note at the start of a shift is far more effective than reading a wall of text alone. For follow-the-sun teams, a recorded Loom or short async video can substitute.
5. Update the runbook if you learned something new
If you investigated an incident and the existing runbook was wrong, incomplete, or missing, update it before handing off. This is the compounding return on every incident.

Writing Runbooks That Get Used

Runbooks fail in practice because they are written once, never tested, and grow stale. A runbook that engineers do not trust is worse than no runbook, because it wastes time during an incident.

Write runbooks for specific alerts, not generic system components. A runbook titled 'database issues' is useless at 2 AM.
Include the exact commands to run, in order, with expected output where relevant.
State explicitly what constitutes resolution versus what constitutes 'safe to monitor until business hours'.
Add a 'last tested' date and require review after every incident where the runbook was used.
Link directly from your alert to the corresponding runbook — engineers should not have to search.

## Alert: High job failure rate (>20% in 5 min)

**Severity:** P2  
**Last tested:** 2025-04-10

### Steps
1. Check https://app.example.com/jobs for recent failures
2. Look for a common error pattern in job logs
3. If failures are in one job type: disable that job, open incident ticket
4. If widespread: check downstream API status pages and DB lag
5. Escalate to secondary if not resolved in 20 minutes

### Resolution criteria
Failure rate drops below 5% for 10 consecutive minutes.

### Escalation contact
Secondary on-call (see rotation calendar)
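
A resolution criterion like the one in the sample runbook above ("below 5% for 10 consecutive minutes") is precise enough to check mechanically. The sketch below assumes you can obtain one failure-rate sample per minute, oldest first; the function name `is_resolved` and the input shape are illustrative, not tied to any monitoring product.

```python
def is_resolved(
    failure_rates: list[float],
    threshold: float = 0.05,
    minutes: int = 10,
) -> bool:
    """True if the last `minutes` per-minute samples are all below `threshold`."""
    if len(failure_rates) < minutes:
        return False  # not enough history to declare the incident resolved
    return all(rate < threshold for rate in failure_rates[-minutes:])
```

Writing the criterion this way forces the runbook author to say exactly what "resolved" means, which is the point of stating it explicitly.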

Monitoring and Alerting That Supports the Rotation

Your on-call rotation is only as effective as the signal it receives. Poorly tuned alerting — too noisy, too slow, or covering the wrong things — is the most common cause of on-call burnout on teams that believe they have a scheduling problem.

For cron jobs and scheduled tasks specifically, a common gap is the missing-execution problem: a job silently stops running and no alert fires because nothing failed; it simply never started. Heartbeat monitoring addresses this directly. The job pings a URL like https://cronjobpro.com/ping/<token> on completion; if the ping does not arrive within the expected period plus a grace window, an alert is sent. This inverts the model from 'wait for an error' to 'expect a signal and alert on its absence'. Routing those alerts to your on-call tool (PagerDuty, Opsgenie, or a generic webhook) means the rotation receives actionable, targeted pages rather than a flood of symptoms.
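
Both halves of the heartbeat pattern are small. The sketch below shows a job wrapper that pings only after successful completion, and the check a monitor would run to decide whether a ping is overdue. The function names are hypothetical, the ping is assumed to be a plain HTTP GET, and the real service's request format may differ.

```python
import urllib.request
from datetime import datetime, timedelta
from typing import Callable


def run_and_ping(job: Callable[[], None], ping_url: str) -> None:
    """Run the job, then ping. If job() raises, no ping is sent."""
    job()
    # Assumption: the heartbeat endpoint accepts a simple GET request.
    with urllib.request.urlopen(ping_url, timeout=10):
        pass


def heartbeat_overdue(
    last_ping: datetime,
    period: timedelta,
    grace: timedelta,
    now: datetime,
) -> bool:
    """True once the expected period plus the grace window has elapsed."""
    return now > last_ping + period + grace
```

Because the ping comes after `job()`, a crash, a hang, and a job that never started all look the same to the monitor: the signal is absent, so the alert fires.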

Compensation and Acknowledgement

Compensation norms for on-call vary widely by company size and region, but the core principle is consistent: on-call time has a cost and that cost should be recognised. Common approaches include:

A flat per-shift stipend, typically paid regardless of whether incidents occur.
Time-in-lieu for overnight or weekend incidents above a minimum duration threshold.
An incident bonus per resolved page during off-hours.
For smaller teams with no formal budget: explicit acknowledgement in performance reviews, protected recovery time the following day, and flexibility to decline non-critical meetings the day after a hard shift.

If your company cannot yet pay a stipend, the most effective alternative is protected recovery time. An engineer who handles incidents from midnight to 3 AM should not be expected in a 9 AM planning meeting with full cognitive function. Making this policy explicit and enforced is worth more than a small cash payment with no recovery time attached.

Avoiding Burnout

Burnout from on-call comes from two sources: volume and perceived futility. Engineers can tolerate a hard week if they believe it is temporary, rare, and that their feedback will be acted on. They burn out when they handle the same recurring incidents repeatedly with no reduction in underlying failures.

Run a post-incident review after every significant page and track action items to completion.
Measure and report weekly alert volume as a team metric, not an individual one.
Reserve engineering time each sprint specifically for reliability work — alert reduction, runbook updates, automation.
Set a maximum alert threshold (for example, no single on-call shift should receive more than 10 pages) and treat breaches as a team priority.
Rotate new engineers into secondary positions before putting them on primary. Dropping someone into primary with no prior exposure is a fast way to lose them.
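
A per-shift page ceiling like the one suggested in the list above is easy to enforce once pages are tagged with the shift they landed in. This is a minimal sketch; the shift identifiers, the `shifts_over_threshold` name, and the limit of 10 (taken from the example above) are illustrative.

```python
from collections import Counter

MAX_PAGES_PER_SHIFT = 10  # example ceiling; tune to your team's agreement


def shifts_over_threshold(
    page_shift_ids: list[str],
    limit: int = MAX_PAGES_PER_SHIFT,
) -> dict[str, int]:
    """Given one shift id per page received, return shifts that exceeded `limit`."""
    counts = Counter(page_shift_ids)
    return {shift: n for shift, n in counts.items() if n > limit}
```

Keying the result by shift rather than by engineer keeps the number a team metric, in line with the guidance above.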

Public status pages reduce inbound alert noise from customers and internal stakeholders. When users can self-serve to check whether an issue is known, on-call engineers spend less time answering Slack messages and more time resolving the underlying problem.
