Monitoring Resources

In-depth guides on the concepts behind reliable scheduled jobs — monitoring models, on-call practices, incident response, and ready-to-use templates.

Concepts & Comparisons

Mental models for monitoring — what to watch, and why it matters.

Heartbeat Monitoring vs Uptime Monitoring

Uptime monitoring probes your URLs from the outside. Heartbeat monitoring listens for a ping from your job. Learn when each model fits and why they are complementary.

Read guide →
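
To make the heartbeat model concrete, here is a minimal Python sketch of a job that pings a monitor only after it completes successfully. The URL, the `http_ping` helper, and the `run_with_heartbeat` wrapper are hypothetical names chosen for illustration, not part of any particular service's API; a real monitor would issue its own per-job ping endpoint.

```python
import urllib.request
from typing import Callable

# Hypothetical ping URL (assumption); substitute the endpoint your monitor issues.
PING_URL = "https://monitor.example.com/ping/nightly-backup"


def http_ping(url: str, timeout: float = 10.0) -> None:
    """Send a single GET request that acts as the heartbeat."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(job: Callable[[], None], ping: Callable[[], None]) -> bool:
    """Run the job, then ping only if the job finished without raising.

    Returns True if the heartbeat was delivered. A failed job sends
    nothing, so the monitor notices the missing ping and can alert.
    """
    job()  # an exception here propagates, and no ping is sent
    try:
        ping()
    except OSError:
        # Network errors (including urllib's URLError) should not fail the job.
        return False
    return True
```

A scheduled script might call `run_with_heartbeat(nightly_backup, lambda: http_ping(PING_URL))`, where `nightly_backup` is your own job function. Passing the ping as a callable keeps the wrapper testable without network access.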

Monitor AI Agents and LLM Pipelines

AI agent jobs often fail silently. Learn how heartbeat monitoring catches missed LLM runs, stalled embedding jobs, and n8n workflows that quietly stop producing results.

Read guide →

On-Call & Incidents

Run a humane on-call rotation and respond to failures with a plan.

On-Call Rotation Best Practices for Dev Teams

A practical guide to on-call rotation models, fair scheduling, handoffs, runbooks, and burnout prevention for small and medium engineering teams.

Read guide →

Escalation Policies for Cron Job Alerting

Learn how escalation policies work, why you need alert tiers, and how to design one for cron job failures with PagerDuty and Opsgenie.

Read guide →

Templates

Free, copy-ready templates you can adapt to your team today.

Incident Postmortem Template (Free & Blameless)

A blameless incident postmortem template with all sections, plus guidance on when to write one, who attends, and how to run the meeting.

Read guide →