99.999% Uptime (Five Nines)
99.999% availability allows about 5 min 16 sec of downtime per year. Here is the full breakdown — plus what it means for scheduled jobs.
Allowed downtime at 99.999%
| Period | Allowed downtime | In seconds |
|---|---|---|
| Per day | 0.86 sec | 0.86 |
| Per week | 6 sec | 6 |
| Per month | 26 sec | 26 |
| Per quarter | 1 min 19 sec | 79 |
| Per year | 5 min 16 sec | 316 |
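The figures in this table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes an average year of 365.25 days, with a month and a quarter taken as one twelfth and one quarter of that year. The names `allowed_downtime_seconds` and `PERIODS` are made up for this example.

```python
# Assumption: an average year of 365.25 days (accounts for leap years).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Assumption: month and quarter are simple fractions of the average year.
PERIODS = {
    "day": 24 * 60 * 60,
    "week": 7 * 24 * 60 * 60,
    "month": SECONDS_PER_YEAR / 12,
    "quarter": SECONDS_PER_YEAR / 4,
    "year": SECONDS_PER_YEAR,
}


def allowed_downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Return the downtime allowance, in seconds, for one period."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_seconds


for name, seconds in PERIODS.items():
    budget = allowed_downtime_seconds(99.999, seconds)
    print(f"Per {name}: {budget:.2f} sec")
```

Under these assumptions the daily allowance comes out just under one second, and the yearly allowance rounds to 316 seconds. A different year length (for example exactly 365 days) shifts the results by a fraction of a second.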
Missed scheduled runs at 99.999%
Downtime is not just lost availability — for scheduled jobs it means runs that never happen. At 99.999% uptime, here is roughly how many executions a cron job loses per year at common frequencies:
| Cron frequency | Missed runs / year |
|---|---|
| Every minute | 5 |
| Every 5 minutes | 1 |
| Every 15 minutes | < 1 |
| Hourly | < 1 |
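These estimates can be approximated by dividing the yearly downtime allowance by the interval between runs. The sketch below assumes a 365.25-day year and treats the result as an expected value. A single real outage may miss more or fewer runs depending on when it starts and how long it lasts. The function name `missed_runs_per_year` is hypothetical.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def missed_runs_per_year(availability_pct: float, interval_minutes: float) -> float:
    """Estimate how many scheduled runs fall inside the yearly downtime allowance."""
    downtime_minutes = (1 - availability_pct / 100) * MINUTES_PER_YEAR
    return downtime_minutes / interval_minutes


# Common cron frequencies, expressed as minutes between runs.
for label, interval in [("Every minute", 1), ("Every 5 minutes", 5),
                        ("Every 15 minutes", 15), ("Hourly", 60)]:
    print(f"{label}: {missed_runs_per_year(99.999, interval):.2f} missed runs / year")
```

Values below 1 mean that, on average, fewer than one run per year is lost. They do not mean a run can never be missed.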
What uptime percentages mean
Uptime percentage expresses the fraction of time a system is operational and reachable over a given period, typically measured monthly or annually. The informal "nines" naming convention counts the leading nines in the figure: one nine is 90%, two nines is 99%, three nines is 99.9%, four nines is 99.99%, and five nines is 99.999%. A critical gotcha for architects is composite availability: three independent services each running at 99.9% uptime, when chained together so that a request needs all of them, produce a combined availability of roughly 99.7%, because their availabilities multiply (0.999 × 0.999 × 0.999 ≈ 0.997).
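The chained-services figure can be checked directly. This is a minimal sketch that assumes the services fail independently and that a request needs every service in the chain to succeed. The function name `serial_availability` is made up for this example.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Combined availability of independent services that must all be up.

    Each input is a fraction between 0 and 1 (for example 0.999 for 99.9%).
    """
    return math.prod(availabilities)


combined = serial_availability([0.999, 0.999, 0.999])
print(f"Combined availability: {combined * 100:.2f}%")
```

Adding more dependencies to the chain only lowers the combined figure, which is why a five-nines target constrains every component on the request path.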
Error budget
An error budget is the complement of your availability target: a service promising 99.9% availability has a 0.1% error budget, which works out to roughly 43 minutes of allowable downtime per 30-day month. Teams treat this budget as a finite resource. Every confirmed outage, degraded-performance window, and failed deployment draws it down, and when it is exhausted before the period ends, further risky releases are typically paused. Tracking the remaining budget in real time aligns engineering and product priorities by making the cost of unreliability concrete and visible.
SLI vs. SLO vs. SLA
A Service Level Indicator (SLI) is the raw measurement — for example, the percentage of HTTP requests that return a successful response within a defined latency threshold over a rolling window. A Service Level Objective (SLO) is the internal target your team sets for that indicator, such as keeping successful-request rate above 99.9%; breaching an SLO triggers an internal response but carries no external consequence. A Service Level Agreement (SLA) is the contractual commitment made to customers, backed by defined remedies such as service credits or refunds when the agreed threshold is not met.
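To make the distinction concrete, the sketch below computes an SLI from a batch of request records and compares it with an SLO. Everything here is an assumption for illustration: the `Request` record, the 300 ms latency threshold, and the rule that any 2xx or 3xx status counts as success. A real system would define these to match its own traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to respond, in milliseconds


def success_rate_sli(requests: list[Request],
                     latency_threshold_ms: float = 300.0) -> Optional[float]:
    """SLI: percentage of requests that succeeded within the latency threshold.

    Returns None when there is no traffic, since the rate is undefined.
    """
    if not requests:
        return None
    good = sum(
        1 for r in requests
        if 200 <= r.status < 400 and r.latency_ms <= latency_threshold_ms
    )
    return 100 * good / len(requests)


def meets_slo(sli_pct: float, slo_pct: float = 99.9) -> bool:
    """SLO check: is the measured indicator at or above the internal target?"""
    return sli_pct >= slo_pct


sample = [Request(200, 120), Request(200, 450), Request(500, 80), Request(204, 90)]
sli = success_rate_sli(sample)
print(f"SLI: {sli:.1f}%, meets 99.9% SLO: {meets_slo(sli)}")
```

The SLA does not appear in the code: it is the contract that states what happens when this measurement stays below the agreed threshold.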
How to achieve 99.999% uptime
Target architecture: Extreme multi-region.
- Deploy active-active across multiple geographic regions so a full regional outage causes automatic failover with no manual intervention.
- Design every component for graceful degradation so partial failures reduce capability rather than producing total unavailability.
- Maintain sub-30-second automated recovery paths for all critical services, validated continuously in production via synthetic traffic.
- Dedicate engineering capacity to sustained reliability work: dependency audits, redundancy gap analysis, and SLO review cycles.
- Engage dedicated reliability engineering roles and formal incident review processes to drive the error budget toward zero over successive quarters.
Monitoring your SLA — including silent failures
Traditional HTTP uptime checks confirm that a URL is reachable and returning expected responses, but they are blind to an entire class of failure: a cron job that simply never runs. If a scheduled task silently skips its execution window — due to a dead worker process, a misconfigured schedule, or a deployment that removed the job — no HTTP probe will fire an alert, yet your users are affected and the time lost counts against your error budget invisibly. The correct pattern for job-based systems is a heartbeat or dead-man's-switch check, where the job itself pings a monitoring endpoint on each successful run and an alert fires when that ping goes missing within the expected interval. CronJobPro combines both approaches — external HTTP polling and built-in heartbeat monitoring — so neither web-endpoint failures nor silent missed runs go undetected.
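The dead-man's-switch idea can be sketched from the monitoring side as follows. This is a simplified illustration, not any product's implementation: the `HeartbeatMonitor` class, its `grace_s` parameter, and the injectable `clock` are assumptions made for this example. In practice the job would send its ping over the network to a monitoring endpoint.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the last successful-run ping and reports when one is overdue."""

    def __init__(self, expected_interval_s: float, grace_s: float = 0.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.expected_interval_s = expected_interval_s
        self.grace_s = grace_s      # extra slack for slow or jittery runs
        self._clock = clock         # injectable so the logic can be tested
        self._last_ping = clock()   # treat creation time as the first ping

    def ping(self) -> None:
        """Called by the job after each successful run."""
        self._last_ping = self._clock()

    def is_overdue(self) -> bool:
        """True when no ping arrived within the expected interval plus grace."""
        elapsed = self._clock() - self._last_ping
        return elapsed > self.expected_interval_s + self.grace_s


# Usage: a job expected every 60 seconds, with 10 seconds of grace.
monitor = HeartbeatMonitor(expected_interval_s=60, grace_s=10)
monitor.ping()  # the job reports a successful run
print("Alert needed:", monitor.is_overdue())
```

The key design point is that silence triggers the alert. A job that never starts cannot report its own failure, so the monitor has to notice the missing ping.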
Frequently asked questions
How much downtime does 99.999% uptime allow?
99.999% uptime allows about 5 min 16 sec of downtime per year, which is roughly 26 sec per month and under 1 sec per day.
What does five nines uptime mean?
"Five Nines" is the informal name for 99.999% availability — it counts the leading nines in the figure. It permits about 5 min 16 sec of downtime per year.
Does planned maintenance count against my SLA?
It depends on how the SLA is written. Many commercial SLAs explicitly exclude downtime that occurs during pre-announced maintenance windows, provided the vendor gives customers adequate advance notice — commonly 48 to 72 hours. However, some enterprise agreements treat all unavailability equally regardless of cause. Always read the exclusions section of any SLA carefully, and if you are drafting your own, define the maintenance window policy explicitly to avoid disputes.
How is an error budget calculated?
An error budget starts from your SLA target: subtract the target percentage from 100% to get the permitted failure percentage, then apply it to the time window you are measuring. For example, a 99.9% monthly SLA on a 30-day month yields 0.1% of 43,200 minutes, which is approximately 43 minutes of allowable downtime. Tracking cumulative downtime against that budget throughout the month lets teams make data-driven decisions about when to slow down risky changes.
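The calculation described above can be sketched as follows, assuming a 30-day window and downtime recorded in minutes. The function names and the sample incident durations are hypothetical.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the given target over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - target_pct / 100) * window_minutes


def remaining_budget_minutes(target_pct: float, incidents_minutes: list[float],
                             window_days: int = 30) -> float:
    """Budget left after subtracting recorded downtime; negative means overspent."""
    return error_budget_minutes(target_pct, window_days) - sum(incidents_minutes)


budget = error_budget_minutes(99.9)
# Hypothetical incidents so far this month: 12 minutes and 7.5 minutes.
left = remaining_budget_minutes(99.9, [12, 7.5])
print(f"Budget: {budget:.1f} min, remaining: {left:.1f} min")
```

A negative remaining value signals that the target has already been missed for the window, which is the usual trigger for pausing risky changes.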
What is a good uptime SLA for a SaaS product?
There is no universally correct answer because the right target depends on your customers' tolerance for downtime, your infrastructure investment, and your product's criticality. Consumer-facing SaaS products commonly commit to 99.9% (around 43 minutes of monthly downtime), while business-critical or enterprise tools often target 99.95% or 99.99%. Committing to a number higher than your current measured baseline is a liability, so it is better to set a target you can reliably exceed and raise it as your infrastructure matures.
How frequently should I run uptime checks to meet my SLA?
Check frequency determines how quickly an outage is detected, which directly affects your mean time to alert and the total undetected downtime that erodes your error budget. As a rule of thumb, your check interval should be significantly shorter than the downtime allowance for a single incident. A 99.9% monthly SLA permits roughly 43 minutes total, so checking at least every 5 minutes is a reasonable minimum; a 99.99% SLA permits only about 4 minutes per month, making 1-minute or sub-minute checks necessary. Always combine polling intervals with alerting thresholds that account for transient failures before paging on-call staff.
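One way to reason about this trade-off is to estimate the worst-case time before an alert fires and compare it with the monthly budget. The sketch below assumes the alert fires only after a set number of consecutive failed checks, that the outage begins immediately after a passing check, and that the month has 30 days. Function names are made up for this example.

```python
def worst_case_detection_minutes(check_interval_min: float,
                                 failures_before_alert: int) -> float:
    """Longest time from outage start until the alert fires.

    Worst case: the outage starts right after a passing check, so the
    n-th consecutive failure is seen n full intervals later.
    """
    return check_interval_min * failures_before_alert


def budget_share(target_pct: float, detection_minutes: float,
                 window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed before anyone is alerted."""
    budget_minutes = (1 - target_pct / 100) * window_days * 24 * 60
    return detection_minutes / budget_minutes


for target, interval in [(99.9, 5), (99.99, 1)]:
    detect = worst_case_detection_minutes(interval, failures_before_alert=2)
    share = budget_share(target, detect)
    print(f"{target}% target, {interval}-min checks: "
          f"up to {detect:.0f} min undetected ({share:.0%} of monthly budget)")
```

Requiring more consecutive failures reduces false alarms but lengthens detection time, so the two settings are best tuned together.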
What is the difference between availability and reliability?
Availability measures the proportion of time a system is in a working state and reachable by users — it is expressed as a percentage and is the basis for SLA calculations. Reliability is a broader concept that encompasses whether the system produces correct results consistently over time, including under load or adverse conditions. A system can be highly available but unreliable if it responds to requests but returns wrong data; conversely, a system with planned maintenance windows has lower availability but may be highly reliable during the time it is running. Good SLA design considers both dimensions rather than treating uptime percentage as the sole quality indicator.