Monitor Celery Beat Periodic Tasks (Python)
Celery Beat is a single-process scheduler: if it dies, every periodic task in your application stops running, and nothing in Celery itself will tell you. Workers can also hang after a broker reconnection, consuming no tasks while appearing fully operational in logs. These silent failures make Celery Beat one of the trickiest parts of a Python background-job stack to observe correctly.
Why Celery Beat Periodic Tasks Fail Without Any Alert
Celery Beat schedules tasks by placing messages onto a broker queue at the configured interval. That single process is the entire heartbeat of your scheduled workload. When it stops (because of an OOM kill, a container restart, a network hiccup that leaves it hung, or an uncaught exception in a custom scheduler), the queue simply receives no new messages. Workers keep polling for work and find nothing. No exception is raised, no Celery log line says 'beat is dead', and Flower (the most widely used Celery monitoring UI) has no visibility into the Beat process at all. The only observable symptom is that expected side effects stop occurring: emails are not sent, reports are not generated, database rows are not updated. By the time a human notices, hours or days of runs may already have been missed.
- Beat process dies or hangs silently: the Beat process can freeze after hours of uptime with no error in logs and no crash — it simply stops scheduling. This is a well-documented production issue in django-celery-beat (GitHub issue #824 and others), triggered by database connection timeouts, scheduler lock contention, or OOM on the host.
- Worker stops consuming after broker reconnection: a known Kombu-level bug causes workers to freeze after Redis or RabbitMQ drops and re-establishes the connection. The worker reconnects but its channel state is stale, so it polls the broker and finds nothing. No error is logged after the reconnection completes — the worker looks alive but executes zero tasks.
- Task acknowledgement and re-queue race: when a worker loses its broker connection mid-execution, RabbitMQ requeues the unacknowledged message. Delivery tags are scoped to the channel that delivered the message, so after reconnecting the worker can no longer ACK the original delivery with its old delivery_tag, leaving zombie tasks stuck in the queue while the worker idles.
- Multiple Beat instances running simultaneously: if you restart a container without a clean shutdown, or if you run Beat embedded in a worker via --beat on multiple replicas, two Beat processes can write to the same schedule simultaneously. The result is duplicate task execution, missed intervals, or a corrupted schedule database.
- Task silently lost when the broker runs out of memory: if Redis reaches its maxmemory limit with an eviction policy such as allkeys-lru, it evicts keys to make room, and that can include the list holding the queue. Queued messages disappear without any exception on the Beat side. Beat logs show normal scheduling; workers never receive the task.
- celery inspect ping blind spot: the commonly recommended liveness probe, celery inspect ping, may not answer within its timeout when a worker is tied up in a long task (for example with the solo pool, where the same thread executes tasks and handles control commands). A probe that times out is indistinguishable from a dead worker, so the check raises false alarms, and a check that regularly cries wolf makes real outages easy to miss.
The Heartbeat / Dead-Man's-Switch Approach
An external heartbeat monitor inverts the detection model: instead of polling your infrastructure to ask 'is Beat running?', you make the task itself report 'I just ran successfully' by sending an HTTP ping to a URL provided by the monitoring service. CronJobPro issues you a unique URL in the form https://cronjobpro.com/ping/<token>. Your Celery task calls that URL at the end of a successful run. CronJobPro knows the expected interval (for example, every 60 minutes) and a grace window. If no ping arrives within that window, CronJobPro fires an alert to whatever channel you configured: email, Slack, Discord, Microsoft Teams, PagerDuty, Opsgenie, or a custom webhook.
This approach catches the failure modes that internal tooling misses, because each of them ends the same way: a dead Beat process never schedules the task, so the ping never fires; a hung worker accepts the task from the queue but never completes it, so the ping never fires; a dropped message means the task never runs, so the ping never fires. Flower, Prometheus worker metrics, and Kubernetes liveness probes all check whether a process is alive. None of them verifies that the task actually executed its business logic and completed. The heartbeat ping confirms end-to-end success.
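The detection rule described above can be sketched in a few lines. This is only an illustration of the dead-man's-switch idea, not CronJobPro's actual implementation; the function name `is_overdue` and the sample timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping, interval, grace, now):
    """Return True when no ping has arrived within interval + grace of the last one."""
    deadline = last_ping + interval + grace
    return now > deadline


# Example values: a task expected every 60 minutes with a 5-minute grace window.
last_ping = datetime(2024, 1, 1, 7, 0, tzinfo=timezone.utc)
interval = timedelta(minutes=60)
grace = timedelta(minutes=5)

# 08:04 is still inside the window; 08:06 is past the deadline of 08:05.
on_time = is_overdue(last_ping, interval, grace, datetime(2024, 1, 1, 8, 4, tzinfo=timezone.utc))
late = is_overdue(last_ping, interval, grace, datetime(2024, 1, 1, 8, 6, tzinfo=timezone.utc))
print(on_time, late)
```

The monitor needs no knowledge of why the ping is missing. A dead Beat process, a hung worker, and a dropped message all look identical from the outside: the deadline passes without a ping.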
Add a heartbeat to Celery
1. Create a heartbeat monitor in CronJobPro
Log in to CronJobPro and create a new Heartbeat monitor. Set the schedule to match your Celery Beat interval (for example, every 60 minutes) and configure a grace period (for example, 5 minutes). CronJobPro will display your unique ping URL in the form https://cronjobpro.com/ping/<token>. Copy this URL; you will embed it in your task.
2. Install the requests library if not already present
Your Celery task needs to make an outbound HTTP GET request on success. The standard choice is the requests library: pip install requests. If you are already using httpx or urllib3 in your project, either works fine. Keep the ping call lightweight: a simple GET with a short timeout (3–5 seconds) is sufficient.
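If you would rather not add a dependency at all, a lightweight ping can also be written with the standard library. The sketch below is one possible shape for such a helper; the name `send_ping` and the choice to return a boolean instead of raising are assumptions for illustration.

```python
import http.client
import urllib.request


def send_ping(url, timeout=5.0):
    """Send a GET to the heartbeat URL and return True on a 2xx response.

    Network and HTTP errors are swallowed and reported as False, so a
    failed ping can never raise inside the calling task.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

A task would call `send_ping("https://cronjobpro.com/ping/<token>")` as its last statement and log a warning when the result is False.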
3. Add the ping call inside your Celery task
Place the ping call at the very end of the task body, after all business logic has completed successfully. Wrap the ping in a try/except so that a transient network error while sending it does not cause the task itself to be retried. For a failed run, call the /fail endpoint instead: https://cronjobpro.com/ping/<token>/fail. If you want to report the process exit code, use https://cronjobpro.com/ping/<token>/exitcode/<n>.
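The three endpoint shapes mentioned above can be built by one small helper, which keeps the token in a single place. This is a sketch: the function name `heartbeat_url`, its parameters, and the token value `abc123` are hypothetical; only the URL patterns come from the text.

```python
PING_BASE = "https://cronjobpro.com/ping"


def heartbeat_url(token, outcome="success", exit_code=None):
    """Build the ping URL for a success, failure, or exit-code report."""
    base = f"{PING_BASE}/{token}"
    if outcome == "success":
        return base
    if outcome == "fail":
        return f"{base}/fail"
    if outcome == "exitcode":
        if exit_code is None:
            raise ValueError("exit_code is required when outcome is 'exitcode'")
        return f"{base}/exitcode/{int(exit_code)}"
    raise ValueError(f"unknown outcome: {outcome!r}")


# "abc123" is a made-up token used only for this example.
print(heartbeat_url("abc123"))
print(heartbeat_url("abc123", "fail"))
print(heartbeat_url("abc123", "exitcode", 2))
```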
4. Configure Beat and verify the schedule
Make sure only one Beat process is running (never use --beat on multiple worker replicas). Confirm that the task appears in beat_schedule (CELERYBEAT_SCHEDULE in old-style uppercase settings) or in the django-celery-beat database with the correct interval. Run celery -A yourapp beat --loglevel=info for a few cycles and verify that the ping reaches CronJobPro before switching to production.
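One way to double-check the schedule is to compare it against the list of tasks that have a heartbeat URL. The sketch below assumes a plain-dict schedule (Celery accepts a number of seconds, a timedelta, or a crontab as the "schedule" value). The task `myapp.tasks.cleanup_sessions`, the mapping `HEARTBEAT_URLS`, and the helper `unmonitored_tasks` are hypothetical names for illustration.

```python
from datetime import timedelta

# Example schedule; in a real project this is what you assign to app.conf.beat_schedule.
beat_schedule = {
    "send-daily-report": {
        "task": "myapp.tasks.send_daily_report",
        "schedule": timedelta(hours=24),
    },
    "cleanup-sessions": {
        "task": "myapp.tasks.cleanup_sessions",  # hypothetical second task
        "schedule": timedelta(minutes=60),
    },
}

# Hypothetical mapping of task name to its heartbeat URL.
HEARTBEAT_URLS = {
    "myapp.tasks.send_daily_report": "https://cronjobpro.com/ping/<your-token-here>",
}


def unmonitored_tasks(schedule, heartbeat_urls):
    """Return the scheduled task names that have no heartbeat URL configured."""
    scheduled = {entry["task"] for entry in schedule.values()}
    return sorted(scheduled - set(heartbeat_urls))


missing = unmonitored_tasks(beat_schedule, HEARTBEAT_URLS)
print(missing)
```

Run as part of a test suite, a check like this flags periodic tasks that were added to the schedule without a matching monitor.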
5. Set up alert channels in CronJobPro
In the monitor settings, configure at least one alert channel. Email is available immediately with no extra setup. For on-call escalation, connect PagerDuty or Opsgenie. For team visibility, connect Slack or Discord. CronJobPro will alert on the first missed ping and can be configured to re-alert if the outage continues.
```python
import requests
from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

CRONJOBPRO_PING_URL = "https://cronjobpro.com/ping/<your-token-here>"


@shared_task(bind=True, max_retries=3, default_retry_delay=60)
def send_daily_report(self):
    """
    Example periodic task scheduled via Celery Beat.

    Pings CronJobPro on success so an alert fires if the task
    stops running (Beat dies, worker hangs, broker drops the message, etc.).
    """
    try:
        # --- your business logic here ---
        # generate_and_email_report() is a placeholder for your own function.
        generate_and_email_report()
        logger.info("Daily report sent successfully.")

        # Ping the heartbeat monitor to confirm successful completion.
        # Use a short timeout so a slow ping never blocks the task from finishing.
        try:
            requests.get(CRONJOBPRO_PING_URL, timeout=5)
        except requests.RequestException as ping_err:
            # Log but do not re-raise: a failed ping must not retry the task.
            logger.warning("CronJobPro heartbeat ping failed: %s", ping_err)

    except Exception as exc:
        # Report failure to CronJobPro so the outage is visible immediately.
        try:
            requests.get(f"{CRONJOBPRO_PING_URL}/fail", timeout=5)
        except requests.RequestException:
            pass
        raise self.retry(exc=exc)


# In celery.py / settings.py, wire the task to Celery Beat:
#
# from celery.schedules import crontab
#
# app.conf.beat_schedule = {
#     "send-daily-report": {
#         "task": "myapp.tasks.send_daily_report",
#         # every day at 07:00 in the configured timezone (UTC by default)
#         "schedule": crontab(hour=7, minute=0),
#     },
# }
```
Frequently asked questions
Can I use Flower to detect a dead Beat process?
No. Flower monitors workers via Celery events, but Beat is not a worker and does not emit worker events. Flower has no visibility into the Beat process. If Beat dies, Flower will show all workers as healthy because they are still running — they just have no tasks to execute.
Why does my Celery worker appear online but stop executing tasks after a Redis restart?
This is a known Kombu bug (tracked in celery/celery discussions #7276 and #8030). After the broker connection drops and reconnects, the worker's internal channel state becomes stale and it stops consuming from the queue. The process is alive and the logs show no error. Workarounds include setting broker_connection_retry=True with broker_connection_max_retries and restarting workers after Redis restarts, or upgrading to a Celery/Kombu version where the fix has been backported.
What is the difference between placing the ping inside the task versus using a Celery signal?
Placing the ping inside the task body is the most reliable option because it only fires when the business logic has actually completed. Connecting a handler to the task_success signal (imported from celery.signals) is also valid, but the signal fires after every successful task, including tasks you did not intend to monitor, so filter by task name. Be careful with task_postrun: it fires after every execution, whether the task succeeded or failed, so a handler there must check the state argument before pinging. Both approaches are correct; inline is simpler and easier to audit.
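A signal-based variant might look like the sketch below. It assumes a hypothetical `MONITORED_TASKS` mapping from task name to heartbeat URL and uses the standard library for the HTTP call. The Celery import is guarded only so that the handler can be imported and unit-tested on a machine without Celery installed.

```python
import http.client
import urllib.request

try:
    from celery.signals import task_success
except ImportError:  # allows testing the handler without Celery installed
    task_success = None

# Hypothetical mapping: only tasks listed here send a heartbeat.
MONITORED_TASKS = {
    "myapp.tasks.send_daily_report": "https://cronjobpro.com/ping/<your-token-here>",
}


def heartbeat_url_for(sender):
    """Return the heartbeat URL for a task instance, or None if it is not monitored."""
    return MONITORED_TASKS.get(getattr(sender, "name", None))


def on_task_success(sender=None, **kwargs):
    """task_success handler: ping only for tasks listed in MONITORED_TASKS."""
    url = heartbeat_url_for(sender)
    if url is None:
        return
    try:
        urllib.request.urlopen(url, timeout=5).close()
    except (OSError, http.client.HTTPException):
        pass  # a failed ping must never affect the task result


if task_success is not None:
    task_success.connect(on_task_success)
```

The name filter is what keeps unrelated tasks from pinging: the signal's sender is the task instance, and its `name` attribute is the registered task name.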
What happens if two Beat instances start at the same time?
Two Beat instances will each try to schedule tasks independently. With the default file-based scheduler, instances that share the same celerybeat-schedule file can corrupt it. With django-celery-beat they contend on database row locks, which can cause duplicate task execution or missed schedules. Always run exactly one Beat process. In Kubernetes, use a Deployment with replicas: 1 for the Beat pod (the Recreate strategy stops the old pod before the new one starts, so a rollout never runs two side by side) and never pass --beat to a horizontally scaled worker Deployment.
Should I use CELERY_TASK_ACKS_LATE to avoid losing tasks when a worker dies mid-execution?
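Setting task_acks_late=True globally, or acks_late=True on an individual task, makes Celery acknowledge the message only after the task finishes rather than just before it starts executing. If the whole worker dies mid-task, the unacknowledged message is redelivered and another worker picks it up. This is safer for critical tasks, but you must ensure your task is idempotent, because it may run more than once. Note that even with acks_late enabled, the worker still acknowledges the message when the child process executing the task exits abruptly (for example, when it is killed). Set task_reject_on_worker_lost=True if you want the message requeued in that case, and be aware that a task that always crashes its process will then be redelivered in a loop.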
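Because a late-acknowledged task can be delivered more than once, the task body needs a guard that makes repeat deliveries harmless. The sketch below illustrates the idea with an in-memory set standing in for a shared store; `run_once`, the run key format, and the store are all assumptions for the example.

```python
# Related Celery settings (real setting names, shown here as comments):
#   app.conf.task_acks_late = True
#   app.conf.task_reject_on_worker_lost = True

# In-memory stand-in for a shared store such as a database table or Redis.
_completed_runs = set()


def run_once(run_key, action):
    """Execute action() unless run_key was already completed; return True if it ran."""
    if run_key in _completed_runs:
        return False
    action()
    # Record the key only after success, so a failed run can be redelivered.
    _completed_runs.add(run_key)
    return True


sent = []
first = run_once("daily-report:2024-01-01", lambda: sent.append("report"))
second = run_once("daily-report:2024-01-01", lambda: sent.append("report"))
print(first, second, sent)
```

With several workers the check and the write must be a single atomic operation in the shared store (for example a unique constraint in the database, or Redis SET with the NX option). The set above is only safe inside one process.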