Escalation Policies for Cron Job Alerting
Learn how escalation policies work, why you need alert tiers, and how to design one for cron job failures with PagerDuty and Opsgenie.
When a scheduled job fails silently at 3 AM, a single email alert is rarely enough. An escalation policy defines who gets notified, in what order, and what happens when no one responds — turning a reactive alert chain into a reliable incident-response process.
What Is an Escalation Policy?
An escalation policy is a structured set of rules that governs how an alert moves from its first recipient to higher-priority responders when it remains unacknowledged. Each step in the policy specifies a target (a person, a team, or a channel), a timeout window, and the condition that triggers the next step. The chain continues until someone acknowledges the alert or the incident is resolved.
In the context of cron jobs and scheduled tasks, this matters because failures are often time-sensitive. A nightly billing job that skips silently can cause downstream data inconsistencies for hours before anyone notices. An escalation policy ensures the right people know within minutes, not the next morning.
Why You Need Alert Tiers
A flat alert — one notification to one address — has two failure modes: alert fatigue (the recipient ignores it because too many fire) and missed coverage (the recipient is unavailable). Tiers solve both problems.
- Tier 1 — Primary on-call: the engineer or team most familiar with the job. They receive the alert first and carry responsibility during their shift.
- Tier 2 — Secondary on-call or backup: reached automatically if Tier 1 does not acknowledge within the timeout window.
- Tier 3 — Manager or incident commander: pulled in only if both lower tiers are unresponsive, signaling a genuine escalation that warrants broader awareness.
Keep escalation chains to three or four tiers maximum. Longer chains introduce confusion about ownership and slow down actual incident response.
How Acknowledgement Timeouts Work
Every escalation step carries a timeout: the maximum time the system waits for a response before advancing to the next tier. The timeout clock starts when the alert is delivered to the current tier, not when the underlying failure occurred.
A typical configuration looks like this: the alert fires, and the primary responder receives a page and has 10 minutes to acknowledge it. If 10 minutes pass without acknowledgement, the policy automatically pages the secondary responder, who gets another 10-minute window. If that window also expires, the manager is paged. At any point, acknowledging the alert stops the escalation clock and halts further escalation.
| Step | Target | Timeout | Condition to escalate |
|---|---|---|---|
| 1 | Primary on-call | 10 minutes | No acknowledgement |
| 2 | Secondary on-call | 10 minutes | No acknowledgement |
| 3 | Engineering manager | 15 minutes | No acknowledgement |
| 4 | Incident bridge / Slack channel | None (final step) | Does not escalate; stays open until resolved |
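The timeline in the table above can be sketched in a few lines of Python. This is only an illustration of the timeout arithmetic, not how PagerDuty or Opsgenie implement it; the `Step` class, the `POLICY` list, and the `active_step` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    target: str
    timeout_min: Optional[int]  # None marks the final step, which never escalates


# Mirrors the example table; targets and timeouts are illustrative.
POLICY = [
    Step("Primary on-call", 10),
    Step("Secondary on-call", 10),
    Step("Engineering manager", 15),
    Step("Incident bridge / Slack channel", None),
]


def active_step(policy: list, minutes_unacknowledged: float) -> Step:
    """Return the step being notified after this many unacknowledged minutes."""
    elapsed = 0.0
    for step in policy:
        if step.timeout_min is None:
            return step  # final step: nothing left to escalate to
        elapsed += step.timeout_min
        if minutes_unacknowledged < elapsed:
            return step
    return policy[-1]


for t in (0, 9, 10, 25, 40):
    print(f"{t:>2} min unacknowledged: {active_step(POLICY, t).target}")
```

With these numbers, the manager is reached after 20 unacknowledged minutes and the incident channel after 35, which is a useful sanity check against how long the job's failure can safely go unnoticed.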
Repeat Notifications
Many escalation platforms also support repeat intervals within a single tier — re-paging the same person every few minutes before escalating. This handles the case where a notification was delivered but not seen (phone on silent, push notification missed). Use repeat intervals shorter than your escalation timeout so the person gets at least two chances before the alert moves on.
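As a quick check of the "at least two chances" rule, the sketch below lists when a responder would be paged within one tier. It assumes the first page goes out at minute 0 and repeats stop once the tier's timeout is reached; real platforms may time repeats differently, and both function names are hypothetical.

```python
def notification_times(repeat_every_min: int, escalate_after_min: int) -> list:
    """Minutes after delivery at which the same responder is paged within one tier."""
    if repeat_every_min <= 0:
        raise ValueError("repeat interval must be positive")
    return list(range(0, escalate_after_min, repeat_every_min))


def gives_two_chances(repeat_every_min: int, escalate_after_min: int) -> bool:
    """True if the responder is paged at least twice before escalation."""
    return len(notification_times(repeat_every_min, escalate_after_min)) >= 2


print(notification_times(5, 10))   # repeat every 5 min, escalate after 10
print(gives_two_chances(5, 10))
print(gives_two_chances(10, 10))   # repeat interval equals the timeout
```

A repeat interval equal to the escalation timeout yields only the initial page, which is why the interval needs to be strictly shorter.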
Designing an Escalation Policy for Cron Job Failures
Scheduled jobs have properties that should inform how you design escalation: they run on a known schedule, they have a known expected duration, and a missed or failed run always carries a business impact proportional to what the job does. Your policy should reflect that impact.
1. Classify your jobs by criticality. Group jobs into tiers: critical (revenue, compliance, data integrity), important (reporting, syncs), and routine (housekeeping, logs). Critical jobs warrant immediate paging and short escalation timeouts. Routine jobs may be fine with a single email.
2. Set the initial alert channel by criticality. Critical jobs should go directly to PagerDuty or Opsgenie, which are purpose-built for on-call management. Important jobs can go to Slack or Discord with a secondary PagerDuty rule. Routine jobs can use email alone.
3. Define your timeout windows. For critical jobs, 5-10 minutes per tier is typical. For important jobs, 15-20 minutes is reasonable. Avoid timeouts shorter than 5 minutes, because false positives and transient delivery delays will cause unnecessary escalations.
4. Assign named individuals, not just team inboxes. A team channel is not an escalation target: it has no acknowledgement semantics. In PagerDuty and Opsgenie, escalation steps must point to schedules or specific users with defined on-call rotations so the timeout clock has a clear owner.
5. Test the full chain. Simulate a failure during business hours and verify that each tier fires in order. Then simulate one during off-hours. On-call rotations often behave differently at night when schedules change hands.
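One way to keep the classification from the steps above explicit is a small lookup table that lives next to your job definitions. The sketch below is a hypothetical example: the job names, the `AlertPlan` structure, and the specific timeout values chosen within the suggested ranges are assumptions, not a CronJobPro feature.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Criticality(Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    ROUTINE = "routine"


@dataclass(frozen=True)
class AlertPlan:
    channels: tuple                   # where the first alert goes
    tier_timeout_min: Optional[int]   # None: no paging, so nothing to escalate


# Values follow the guidance in the steps above; channel names are labels only.
PLANS = {
    Criticality.CRITICAL: AlertPlan(("pagerduty",), 10),
    Criticality.IMPORTANT: AlertPlan(("slack", "pagerduty"), 15),
    Criticality.ROUTINE: AlertPlan(("email",), None),
}

# Hypothetical job inventory.
JOBS = {
    "nightly-billing": Criticality.CRITICAL,
    "crm-sync": Criticality.IMPORTANT,
    "log-rotation": Criticality.ROUTINE,
}

MIN_TIMEOUT_MIN = 5  # shorter windows tend to escalate on transient delays


def plan_for(job_name: str) -> AlertPlan:
    """Look up the alert plan for a job; unclassified jobs raise KeyError."""
    plan = PLANS[JOBS[job_name]]
    if plan.tier_timeout_min is not None and plan.tier_timeout_min < MIN_TIMEOUT_MIN:
        raise ValueError(f"{job_name}: timeout below {MIN_TIMEOUT_MIN} minutes")
    return plan


print(plan_for("nightly-billing"))
```

Raising on an unknown job name is a deliberate choice in this sketch: a job that nobody has classified should surface as an error rather than silently fall back to email.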
Routing CronJobPro Alerts to PagerDuty and Opsgenie
CronJobPro supports native integrations with PagerDuty and Opsgenie as alert channels alongside email, Slack, Discord, Teams, and webhooks. When a monitored job fails — either a cron job returning a non-2xx response, or a heartbeat monitor that stops receiving pings at the expected interval — CronJobPro fires an alert to all configured channels for that job.
To route into your escalation policy, configure the PagerDuty or Opsgenie channel in CronJobPro with your integration key. The alert CronJobPro sends becomes an incident in PagerDuty or Opsgenie, where your escalation policy takes over. This clean separation of concerns means CronJobPro handles detection and initial firing, while your incident management platform handles the escalation chain, on-call schedules, and acknowledgement tracking.
PagerDuty Setup
- In PagerDuty, create a service and select the Events API v2 integration.
- Copy the integration key from the Integrations tab.
- In CronJobPro, open the job or monitor settings, add a PagerDuty alert channel, and paste the integration key.
- In PagerDuty, assign your escalation policy to that service.
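To check that the service and escalation policy are wired up before relying on a real failure, you can send a test event yourself. The sketch below builds a trigger event for the PagerDuty Events API v2 and posts it to the public enqueue endpoint. It is not a description of the payload CronJobPro sends, which this guide does not specify; the job name, the `dedup_key` format, and the helper names are assumptions for illustration.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key: str, job_name: str, detail: str) -> dict:
    """Build an Events API v2 trigger event for a failed job (illustrative fields)."""
    return {
        "routing_key": routing_key,          # the integration key copied from the service
        "event_action": "trigger",
        "dedup_key": f"cron-{job_name}",     # hypothetical format; groups repeats into one incident
        "payload": {
            "summary": f"Cron job failed: {job_name}",
            "source": job_name,
            "severity": "critical",
            "custom_details": {"detail": detail},
        },
    }


def send_event(event: dict) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Build only; call send_event(event) with a real integration key to page someone.
event = build_trigger_event("YOUR_INTEGRATION_KEY", "nightly-billing", "test alert")
print(json.dumps(event, indent=2))
```

Sending the same `dedup_key` with `event_action` set to `resolve` closes the test incident afterwards.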
Opsgenie Setup
- In Opsgenie, create an API integration under the relevant team.
- Copy the API key.
- In CronJobPro, add an Opsgenie alert channel and paste the API key.
- Ensure the Opsgenie team has an escalation policy assigned that defines your response tiers.
For heartbeat monitors — jobs that run on external infrastructure and ping CronJobPro on success — the same alert channels apply. If the ping does not arrive within the period plus grace window, CronJobPro fires the alert exactly as it would for a failing HTTP cron job. See the heartbeat monitoring guide for configuration details.
Heartbeat monitoring — how dead man's switch alerts work →
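The "period plus grace window" rule amounts to a simple deadline calculation. The sketch below assumes the deadline is measured from the last received ping; how CronJobPro anchors the period internally is not stated here, so treat this as the general dead man's switch idea rather than the product's exact logic. Function names and example values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_deadline(last_ping: datetime, period: timedelta, grace: timedelta) -> datetime:
    """Latest moment the next ping may arrive before the monitor counts as failed."""
    return last_ping + period + grace


def is_overdue(last_ping: datetime, period: timedelta, grace: timedelta, now: datetime) -> bool:
    """True once the deadline has passed without a new ping."""
    return now > heartbeat_deadline(last_ping, period, grace)


# Example: a nightly job that last pinged at 02:00 UTC, with a 15-minute grace window.
last_ping = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
period = timedelta(hours=24)
grace = timedelta(minutes=15)

print(heartbeat_deadline(last_ping, period, grace))
print(is_overdue(last_ping, period, grace, datetime(2024, 1, 2, 2, 10, tzinfo=timezone.utc)))
print(is_overdue(last_ping, period, grace, datetime(2024, 1, 2, 2, 20, tzinfo=timezone.utc)))
```

The grace window should cover the job's normal run-time variation, otherwise a slow but successful run will trigger the escalation chain.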
A Complete Example Policy
The following example shows how a critical nightly billing job might be configured end-to-end, from CronJobPro detection through to manager escalation.
```yaml
# Conceptual escalation policy, configured in PagerDuty or Opsgenie
# CronJobPro fires the trigger; this policy governs what happens after
escalation_policy:
  name: Billing Job Critical
  repeat_count: 1  # repeat the whole chain once if unresolved
  rules:
    - step: 1
      targets:
        - type: schedule
          name: Engineering On-Call Primary
      escalate_after: 10m
    - step: 2
      targets:
        - type: schedule
          name: Engineering On-Call Secondary
      escalate_after: 10m
    - step: 3
      targets:
        - type: user
          name: Engineering Manager
      escalate_after: 15m
    - step: 4
      targets:
        - type: slack_channel
          name: "#incidents"
```
Do not rely solely on Slack or email for critical job failures. These channels have no native acknowledgement or escalation semantics — a message can be seen and ignored with no record. For anything affecting revenue or data integrity, route through PagerDuty or Opsgenie.
Keeping Your Escalation Policy Maintained
An escalation policy is only as good as its data. On-call schedules drift: people leave teams, rotations change, new engineers join. Review your escalation policies quarterly or after every significant team change. Policies pointing at former employees or inactive schedules silently break your entire incident-response chain.
- Run a fire drill every quarter: deliberately trigger a test alert and verify the full chain fires correctly.
- Document which jobs map to which escalation policy in your runbook, not just in PagerDuty.
- For jobs on shared infrastructure, confirm the policy owner is a team, not a single individual.
- Pair escalation policies with status pages so stakeholders can self-serve during incidents without adding noise to the responder chain.
Public status pages — communicate incidents to stakeholders without alert noise →
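Part of the quarterly review described in this section can be automated by comparing each policy's targets against the people and schedules that are still active. The sketch below works on plain dictionaries that you would export or assemble yourself; it does not call the PagerDuty or Opsgenie APIs, and all names and data in it are hypothetical.

```python
def stale_targets(policy_targets: dict, active: set) -> dict:
    """Map each policy name to the targets that are no longer active."""
    report = {}
    for policy, targets in policy_targets.items():
        missing = [t for t in targets if t not in active]
        if missing:
            report[policy] = missing
    return report


# Hypothetical export: policy name -> users and schedules it points at.
policies = {
    "Billing Job Critical": ["Engineering On-Call Primary", "Engineering Manager"],
    "Reporting Jobs": ["Data On-Call", "former.employee"],
}
# Hypothetical list of currently active users and schedules.
active = {"Engineering On-Call Primary", "Engineering Manager", "Data On-Call"}

for policy, missing in stale_targets(policies, active).items():
    print(f"{policy}: stale targets {missing}")
```

An empty report does not replace the fire drill, since a target can exist and still fail to page anyone, but it catches the most common form of drift cheaply.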