How to Monitor Kubernetes CronJobs
Kubernetes CronJobs sit at the intersection of the scheduler, the Job controller, and the container runtime — any of those three layers can fail without producing an alert you will ever see. Unlike a crashing Deployment, a CronJob that stops running leaves no obvious signal: the resource still exists, kubectl get cronjobs still shows a schedule, and your cluster looks healthy. External heartbeat monitoring is the only reliable way to know your scheduled work is actually completing on time.
Why Kubernetes CronJobs Fail Without Warning
A Kubernetes CronJob is not a single entity — it is a three-layer stack. The CronJob controller creates a Job object on schedule. The Job controller creates one or more Pods. The kubelet on a node pulls the image and runs the container. A failure at any layer is independent of the others, and the top layer (the CronJob itself) reports no error when a lower layer is broken. The most dangerous failure mode is the missed-schedule limit: if the controller detects that more than 100 scheduled runs have been missed, it permanently stops scheduling that CronJob and logs a single line — 'too many missed start time (> 100)' — with no Kubernetes Event, no status-condition change visible in a normal watch, and no built-in alert. This exact issue caused a production job at the Kubernetes project itself to go undetected for 24 days. Even when a run is attempted, the Job object can be created but its Pods may never reach Running state due to image pull errors or resource pressure, meaning the work is silently skipped. kubectl has no built-in mechanism to alert you when a CronJob misses a run or when its most recent Job did not complete successfully.
- Missed-schedule limit (> 100): when the controller counts more than 100 missed scheduled times, it halts the CronJob permanently with a log line only — no alert, no event, no status change visible to operators.
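- startingDeadlineSeconds misconfiguration: when this field is set, the controller counts missed schedules only within that window, and a run that cannot start within the window is skipped. A value that is too small can cause runs to be skipped indefinitely; the Kubernetes documentation warns about values under 10 seconds, because the controller checks CronJobs roughly every 10 seconds.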
- concurrencyPolicy: Forbid with a long-running job: if the previous Job is still active when the next schedule fires, the new run is silently skipped. Repeated over time this can compound into the > 100 missed-schedule failure.
- Image pull failures (ErrImagePull / ImagePullBackOff): the CronJob creates a Job and the Job creates a Pod, but the Pod never leaves Pending/Waiting state because the image tag does not exist, the registry is unreachable, or credentials are expired. The Job reports no completion and the CronJob schedule continues untouched.
- Pod unschedulable: the cluster has insufficient CPU or memory to place the Pod. The Job exists, the Pod exists, but the Pod stays in Pending with a FailedScheduling event that is easy to miss and is not surfaced by the CronJob object at all.
- Job created but never completes: without activeDeadlineSeconds, a hung process runs indefinitely. The CronJob's lastSuccessfulTime stops advancing while the cluster reports the job as active — a silent stall that can last days.
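The missed-schedule arithmetic behind the first three failure modes can be sketched in a few lines of shell. This is a rough estimate for fixed-interval schedules, not the controller's actual code; the function names missed_runs and exceeds_limit are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Rough estimate of how many scheduled start times fall inside a gap.
# An approximation for fixed-interval schedules, not the controller's code.
#   $1 interval between runs, in seconds
#   $2 seconds since the last scheduled run
#   $3 optional startingDeadlineSeconds; when given, only that window is counted
missed_runs() {
  interval=$1
  window=$2
  if [ -n "${3:-}" ] && [ "$3" -lt "$window" ]; then
    window=$3
  fi
  echo $(( window / interval ))
}

# Succeeds (exit status 0) when the estimate is above the limit of 100.
exceeds_limit() {
  [ "$(missed_runs "$@")" -gt 100 ]
}

# Every-minute schedule after a 2-hour gap, no deadline set.
echo "every minute, 2h gap: $(missed_runs 60 7200)"
# Same gap, but startingDeadlineSeconds: 300 limits the window to 5 minutes.
echo "every minute, 2h gap, 300s deadline: $(missed_runs 60 7200 300)"
# Hourly schedule with startingDeadlineSeconds: 3600 after a 5-day gap.
echo "hourly, 5d gap, 3600s deadline: $(missed_runs 3600 432000 3600)"
```

Under these assumptions, an every-minute job crosses the limit after a little over 100 minutes of missed runs, while an hourly job needs more than four days, which is one reason frequent schedules are more exposed to this failure mode.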
The Right Fix: External Heartbeat Monitoring
Internal Kubernetes tooling (kubectl, Prometheus with kube-state-metrics, cluster Events) can tell you after the fact that a Job failed, but none of it tells you out of the box that a Job was expected and never ran. This is the structural gap: the cluster has no concept of a deadline by which work must have completed.
The heartbeat (dead-man's switch) pattern inverts the check. You give your job a unique ping URL from an external monitoring service, and the job calls that URL at the end of a successful run. The monitoring service expects a ping within your schedule interval plus a configurable grace period. If the ping does not arrive in time, the service fires an alert, whether the CronJob controller stalled, the Pod never scheduled, the container crashed before reaching the curl call, or the job ran but exited with a non-zero code. This catches every silent failure mode listed above, including the > 100 missed-schedule halt, because no ping arrives in any of those cases.
With CronJobPro, each heartbeat monitor gives you a unique URL at https://cronjobpro.com/ping/<token>. You call it on success. You can also call https://cronjobpro.com/ping/<token>/fail explicitly on known error paths, or https://cronjobpro.com/ping/<token>/exitcode/<n> to forward the container's exit code directly. If the expected ping does not arrive within the period plus grace window you configure, CronJobPro sends an alert to email, Slack, Discord, Teams, PagerDuty, Opsgenie, or a webhook of your choice.
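The deadline check at the heart of the pattern is simple enough to sketch. The following is an illustration of the idea only, not CronJobPro's implementation; the function name is_overdue and the example timestamps are assumptions.

```shell
#!/bin/sh
# Dead-man's-switch check, as a monitoring service might evaluate it.
# Timestamps are Unix epoch seconds; durations are in seconds.
#   $1 time of the last successful ping
#   $2 current time
#   $3 expected period between pings (the schedule interval)
#   $4 grace period
# Succeeds (exit status 0) when the ping is overdue and an alert should fire.
is_overdue() {
  last_ping=$1
  now=$2
  period=$3
  grace=$4
  [ $(( now - last_ping )) -gt $(( period + grace )) ]
}

# Daily job (86400 s) with a one-hour grace period (3600 s).
last=1700000000

# 24.5 hours after the last ping: still inside period + grace.
if is_overdue "$last" $(( last + 86400 + 1800 )) 86400 3600; then
  echo "24.5h since last ping: alert"
else
  echo "24.5h since last ping: ok"
fi

# Just past 25 hours: the deadline has passed.
if is_overdue "$last" $(( last + 86400 + 3601 )) 86400 3600; then
  echo "just past 25h since last ping: alert"
else
  echo "just past 25h since last ping: ok"
fi
```

The check needs no knowledge of why the ping is missing, which is what lets a single rule cover a stalled controller, an unschedulable Pod, and a crashed container alike.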
Add a heartbeat to Kubernetes CronJobs
1. Create a heartbeat monitor in CronJobPro
In your CronJobPro dashboard, create a new Heartbeat monitor. Set the schedule to match your CronJob's cron expression and configure the grace period to account for typical job runtime plus a safety margin. CronJobPro will generate a unique ping URL in the form https://cronjobpro.com/ping/<token>. Copy that URL.
2. Store the ping URL as a Kubernetes Secret
Create a Secret in the same namespace as your CronJob: kubectl create secret generic cronjob-heartbeat --from-literal=PING_URL=https://cronjobpro.com/ping/<token> -n <namespace>. Storing the URL in a Secret keeps the token out of your YAML manifests and version control, and lets you rotate it without rebuilding your image.
3. Inject the Secret into your CronJob spec as an environment variable
Add an envFrom or env block to the container spec in your CronJob manifest that references the Secret. Your container command can then read the PING_URL environment variable at runtime without the value being hardcoded in the image or manifest.
4. Add the curl ping at the end of your container command
Chain your existing command with a curl call using shell && so the ping only fires on success. Use the /fail endpoint in an explicit error handler, or use /exitcode/$? immediately after your command to forward the real exit code unconditionally. Make sure curl is available in your image. If the image is minimal, install curl in your Dockerfile or use wget if it is already present. If the image is distroless and has no shell at all, send the ping from a sidecar instead.
5. Verify the integration end-to-end before relying on it
Trigger the CronJob manually with kubectl create job --from=cronjob/<name> test-run-1 -n <namespace> and confirm the ping appears in your CronJobPro monitor's history. Then test the failure path: run the job with a command that exits non-zero or call the /fail endpoint directly, and confirm an alert fires. Only after both paths are verified should you treat the monitor as production-ready.
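The explicit success and /fail handling described in step 4 can be sketched as a small wrapper. This is a minimal sketch, assuming PING_URL is set in the environment as in step 3; the function names send_ping and run_with_heartbeat are hypothetical, and the 10-second --max-time is an assumed value you would tune.

```shell
#!/bin/sh
# send_ping is a hypothetical helper; it performs one HTTP GET against the given URL.
# --max-time bounds each attempt so a slow network cannot hang the job.
send_ping() {
  curl -fsS --retry 3 --max-time 10 -o /dev/null "$1"
}

# Run the given command, report the outcome, and return the command's exit code.
# Expects PING_URL in the environment (injected from the Secret in step 3).
run_with_heartbeat() {
  "$@"
  code=$?
  if [ "$code" -eq 0 ]; then
    # Success path: plain ping.
    send_ping "$PING_URL" || echo "warning: success ping failed" >&2
  else
    # Failure path: explicit /fail ping so the alert fires immediately.
    send_ping "$PING_URL/fail" || echo "warning: failure ping failed" >&2
  fi
  # Preserve the job's own exit code so Job status stays accurate.
  return "$code"
}
```

In a container command this would be invoked as run_with_heartbeat /app/run-export.sh followed by exit $?. A failed ping only logs a warning here, so a monitoring outage never turns a successful job into a failed one.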
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-export
  namespace: production
spec:
  schedule: "0 3 * * *"
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: exporter
              image: myregistry/data-exporter:v1.4.2
              envFrom:
                - secretRef:
                    name: cronjob-heartbeat
              command:
                - /bin/sh
                - -c
                - |
                  # Run the actual job
                  /app/run-export.sh
                  EXIT_CODE=$?
                  # Forward the real exit code to CronJobPro
                  # On success (0): registers as a successful ping
                  # On failure (non-0): registers as a failed ping and triggers alert
                  curl -fsS --retry 3 \
                    "${PING_URL}/exitcode/${EXIT_CODE}" \
                    -o /dev/null
                  # Preserve original exit code so Kubernetes
                  # backoffLimit and Job status remain accurate
                  exit $EXIT_CODE
```
Frequently asked questions
Can I just use kubectl get cronjobs and check lastScheduleTime to know if my job ran?
Not reliably. lastScheduleTime tells you when the controller last attempted to create a Job object, not whether that Job completed successfully or whether the Pod ever ran. A Job can be created, spawn a Pod that sits in ImagePullBackOff for hours, and lastScheduleTime will still show a recent timestamp. You need to check the Job's completionTime and the Pod's exit code separately — and none of that triggers an alert automatically.
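As a rough manual check, you can compare the two timestamps the CronJob status exposes. The sketch below assumes both values are RFC 3339 UTC timestamps as printed by kubectl, which sort correctly as plain strings; the function name last_run_unconfirmed is hypothetical, and the kubectl lines are left as comments because they need a live cluster.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the most recent scheduled run has not (yet) succeeded.
#   $1 .status.lastScheduleTime
#   $2 .status.lastSuccessfulTime (empty if no run has ever succeeded)
last_run_unconfirmed() {
  last_schedule=$1
  last_success=$2
  # No recorded success at all counts as unconfirmed.
  [ -z "$last_success" ] && return 0
  # String comparison in the C locale; true when the success predates the schedule.
  LC_ALL=C expr "$last_success" \< "$last_schedule" >/dev/null
}

# Against a real cluster the inputs would come from the CronJob status, e.g.:
#   schedule=$(kubectl get cronjob data-export -n production \
#     -o jsonpath='{.status.lastScheduleTime}')
#   success=$(kubectl get cronjob data-export -n production \
#     -o jsonpath='{.status.lastSuccessfulTime}')

# Illustrative values: last success is from the day before the last schedule.
if last_run_unconfirmed "2024-05-02T03:00:00Z" "2024-05-01T03:04:12Z"; then
  echo "latest scheduled run has not succeeded"
fi
```

A positive result is also normal while a run is still in progress, so treat it as a prompt to inspect the Job and its Pods rather than as an alert condition on its own.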
What is the 100 missed-schedule limit and how do I avoid it?
When the CronJob controller restarts or loses sync and then tries to reconcile, it counts how many scheduled runs were missed. If that count exceeds 100, the controller stops scheduling the CronJob entirely and logs 'too many missed start time (> 100)'. It does not create an Event or change the CronJob status in a way that is easy to detect. The primary mitigation is to set startingDeadlineSeconds to a value that limits the lookback window — for a job that runs every hour, setting startingDeadlineSeconds to 3600 means the controller looks back at most one hour, capping missed-schedule counts at 1 per reconcile cycle. A heartbeat monitor will catch this regardless, because no ping will arrive during the stall.
My job shows as Active in kubectl but nothing is happening. What is wrong?
This usually means the Job was created and a Pod was created, but the Pod is stuck in Pending or Waiting state. Run kubectl get pods -n <namespace> and look for Pending status, then kubectl describe pod <pod-name> to see Events. Common causes are ImagePullBackOff (wrong tag, expired credentials, Docker Hub rate limit), Unschedulable (insufficient node resources or a taint/toleration mismatch), and PVC mount failures. The Job controller waits for the Pod rather than timing it out unless you have set activeDeadlineSeconds.
Why does my CronJob skip runs when concurrencyPolicy is set to Forbid?
With concurrencyPolicy: Forbid, Kubernetes will not start a new Job if the previous one is still active. If your job regularly takes longer than its schedule interval, the runs that come due while it is still active are silently skipped. Over time this accumulates missed schedules. Fix the root cause by reducing job runtime, widening the schedule interval, or switching to concurrencyPolicy: Replace if cancelling the still-running Job in favor of a new one is acceptable. Always set activeDeadlineSeconds so a hung job does not block future runs indefinitely.
Does adding Prometheus and kube-state-metrics replace the need for a heartbeat monitor?
Prometheus with kube-state-metrics gives you metrics like kube_job_status_failed and kube_cronjob_next_schedule_time, which are useful for dashboards and trend alerting. However, detecting that a job was expected and did not run requires you to write an alert rule that compares the expected schedule against actual completions — this is non-trivial to get right across daylight-saving transitions, startingDeadlineSeconds interactions, and the > 100 missed-schedule halt. A heartbeat monitor handles this logic for you: if the ping does not arrive, you get alerted, regardless of the internal cluster state. The two approaches are complementary, not interchangeable.