What is Failure Rate?

The percentage of job executions that result in failure over a given time period.

Definition

Failure rate is the ratio of failed executions to total executions, expressed as a percentage. A job with 100 executions in a week, of which 3 failed, has a 3% failure rate. Tracking failure rate over time reveals whether a job is becoming less reliable (increasing rate), stable, or improving. Sudden spikes in failure rate indicate new issues, while gradual increases suggest degrading dependencies.

๐Ÿ’ก

Simple Analogy

Like a batting average in baseball, but inverted โ€” instead of measuring hits, you are measuring strikes. A lower failure rate means better performance.

Why It Matters

Failure rate is the single most important reliability metric for cron jobs. A 1% failure rate might be acceptable for non-critical monitoring jobs, but unacceptable for payment processing. Tracking failure rate helps you set SLA targets, prioritize reliability improvements, and detect degradation before it impacts business operations.

How to Verify

In CronJobPro, the dashboard shows success rate (inverse of failure rate) per job and overall. Calculate manually by dividing failed executions by total executions over a period. Set up alerts for when failure rate exceeds your acceptable threshold.

โš ๏ธ

Common Mistakes

Ignoring low but consistent failure rates that indicate a systematic issue. Measuring failure rate over too short a period, giving noisy results. Not differentiating between failure types โ€” a 500 error has different implications than a timeout.

โœ…

Best Practices

Set failure rate targets per job based on business criticality. Measure over rolling windows (24h, 7d, 30d) for different perspectives. Alert when the rate exceeds baseline by more than 2x. Investigate any sustained increase in failure rate, even if the absolute rate seems low.

CronJobPro Monitoring

See monitoring features

Try it free โ†’

Frequently Asked Questions

What is Failure Rate?

Failure rate is the ratio of failed executions to total executions, expressed as a percentage. A job with 100 executions in a week, of which 3 failed, has a 3% failure rate. Tracking failure rate over time reveals whether a job is becoming less reliable (increasing rate), stable, or improving. Sudden spikes in failure rate indicate new issues, while gradual increases suggest degrading dependencies.

Why does Failure Rate matter for cron jobs?

Failure rate is the single most important reliability metric for cron jobs. A 1% failure rate might be acceptable for non-critical monitoring jobs, but unacceptable for payment processing. Tracking failure rate helps you set SLA targets, prioritize reliability improvements, and detect degradation before it impacts business operations.

What are best practices for Failure Rate?

Set failure rate targets per job based on business criticality. Measure over rolling windows (24h, 7d, 30d) for different perspectives. Alert when the rate exceeds baseline by more than 2x. Investigate any sustained increase in failure rate, even if the absolute rate seems low.

Related Terms