What is Failure Rate?
The percentage of job executions that result in failure over a given time period.
Definition
Failure rate is the ratio of failed executions to total executions, expressed as a percentage. A job with 100 executions in a week, of which 3 failed, has a 3% failure rate. Tracking failure rate over time reveals whether a job is becoming less reliable (increasing rate), holding stable, or improving. A sudden spike usually points to a new issue, while a gradual increase often suggests a degrading dependency.
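As a minimal sketch of the calculation described above, the following Python function computes the percentage from two counts. The function name and the input validation are illustrative assumptions, not part of any particular tool.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return failed executions as a percentage of total executions."""
    if total <= 0:
        raise ValueError("total must be a positive number of executions")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return 100.0 * failed / total


# The example from the definition: 3 failures out of 100 executions.
rate = failure_rate(failed=3, total=100)
print(f"{rate:.1f}%")  # prints 3.0%
```

Rejecting a zero total avoids reporting a misleading 0% for a job that never ran in the period.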
Simple Analogy
Like a batting average in baseball, but inverted: instead of measuring hits, you are measuring strikes. A lower failure rate means better performance.
Why It Matters
Failure rate is arguably the most important reliability metric for cron jobs. A 1% failure rate might be acceptable for non-critical monitoring jobs, but unacceptable for payment processing. Tracking failure rate helps you set SLA targets, prioritize reliability improvements, and detect degradation before it impacts business operations.
How to Verify
In CronJobPro, the dashboard shows success rate per job and overall. Success rate is the complement of failure rate, so a 98% success rate means a 2% failure rate. To calculate failure rate manually, divide failed executions by total executions over a period and multiply by 100. Set up alerts for when failure rate exceeds your acceptable threshold.
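If you only have a success rate to work from, the conversion and threshold check can be sketched as below. This is an illustrative example, not the CronJobPro API: both function names and the sample values are assumptions.

```python
def failure_rate_from_success(success_rate_pct: float) -> float:
    """Convert a success rate (0-100) into the matching failure rate."""
    if not 0.0 <= success_rate_pct <= 100.0:
        raise ValueError("success rate must be between 0 and 100")
    return 100.0 - success_rate_pct


def exceeds_threshold(failure_rate_pct: float, threshold_pct: float) -> bool:
    """Return True when the failure rate is above the acceptable threshold."""
    return failure_rate_pct > threshold_pct


# Hypothetical values: a dashboard reading of 98.5% success, a 1% threshold.
observed = failure_rate_from_success(98.5)
if exceeds_threshold(observed, threshold_pct=1.0):
    print(f"Failure rate {observed:.1f}% is above the threshold")
```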
Common Mistakes
Ignoring low but consistent failure rates that indicate a systematic issue. Measuring failure rate over too short a period, giving noisy results. Not differentiating between failure types: a 500 error has different implications than a timeout.
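To avoid the last mistake, failure rate can be broken down by failure type rather than reported as a single number. The sketch below assumes each execution is recorded as a plain string label. The labels "success", "timeout", and "http_500" are hypothetical and would need to match whatever your own logs record.

```python
from collections import Counter


def failure_breakdown(outcomes: list[str]) -> dict[str, float]:
    """Map each failure type to its share of all executions, in percent."""
    total = len(outcomes)
    if total == 0:
        return {}
    # Count every outcome that is not a success, grouped by its label.
    counts = Counter(o for o in outcomes if o != "success")
    return {kind: 100.0 * n / total for kind, n in counts.items()}


# 100 executions: 97 successes, 2 timeouts, 1 HTTP 500 response.
outcomes = ["success"] * 97 + ["timeout", "timeout", "http_500"]
for kind, pct in failure_breakdown(outcomes).items():
    print(f"{kind}: {pct:.1f}%")
```

Here the overall failure rate is 3%, but the breakdown shows that timeouts account for two thirds of it, which changes where you would look first.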
Best Practices
Set failure rate targets per job based on business criticality. Measure over rolling windows (24h, 7d, 30d) for different perspectives. Alert when the rate rises to more than twice its baseline. Investigate any sustained increase in failure rate, even if the absolute rate seems low.
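The rolling-window and baseline practices can be sketched as follows. This is one possible approach under stated assumptions: executions are held in memory as (timestamp, succeeded) tuples, the 30-day rate serves as the baseline, and the helper names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def rolling_failure_rate(runs, now, window):
    """Failure rate (percent) for runs within `window` before `now`.

    `runs` is a list of (timestamp, succeeded) tuples.
    Returns None when no runs fall inside the window.
    """
    recent = [ok for ts, ok in runs if now - window <= ts <= now]
    if not recent:
        return None
    failed = sum(1 for ok in recent if not ok)
    return 100.0 * failed / len(recent)


def should_alert(current, baseline, factor=2.0):
    """Alert when the current rate is more than `factor` times the baseline."""
    if current is None or baseline is None:
        return False
    # With a baseline of 0, any failure at all triggers an alert.
    return current > factor * baseline


# Sample data: one run per hour for 30 days, with six failed runs.
now = datetime(2024, 1, 31)
failed_hours_ago = {2, 5, 100, 300, 500, 700}
runs = [(now - timedelta(hours=h), h not in failed_hours_ago)
        for h in range(1, 721)]

rate_24h = rolling_failure_rate(runs, now, timedelta(hours=24))
rate_7d = rolling_failure_rate(runs, now, timedelta(days=7))
rate_30d = rolling_failure_rate(runs, now, timedelta(days=30))

if should_alert(rate_24h, baseline=rate_30d):
    print(f"24h rate {rate_24h:.2f}% is over 2x the 30d baseline {rate_30d:.2f}%")
```

Comparing a short window against a long one is a simple way to catch spikes. A sustained gradual increase would also raise the baseline itself, so it is worth tracking the 30-day rate against a fixed target as well.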
Related Terms
Execution Status
The outcome classification of a job run: success, failure, timeout, or skipped.
Alerting
Automated notifications sent when a job fails, times out, or behaves abnormally.
Retry
Automatically re-executing a failed job to recover from transient errors.
SLA (Service Level Agreement)
A formal commitment defining guaranteed uptime, response times, and remedies for failures.