The Challenge
A business running automated processes across multiple servers had no visibility into whether those processes were actually running. Scheduled tasks, data synchronisation jobs, automated reports, and background maintenance scripts all operated silently. When everything worked, nobody needed to think about them. When something failed — a job crashed, a schedule skipped, a process hung indefinitely — nobody knew until the downstream impact became visible. Usually that meant a client reporting missing data or a team member discovering that a nightly sync had not run for three days.
The gap was not in the processes themselves. They worked reliably most of the time. The problem was the failure mode: silent and invisible. There were no heartbeats, no status indicators, no alerts. The business was running critical automation with a monitoring strategy that amounted to “we will find out when something goes wrong.” Even for a process that runs every thirty seconds, a failure could go undetected for days if nobody routinely checked its immediate output.
Previous attempts at monitoring had been ad hoc: a cron job that sent an email when a script exited with an error code, a log file that someone checked periodically. These approaches created noise without clarity. The email alerts fired on transient errors that resolved themselves, desensitising the team. The log files were checked inconsistently and contained too much raw output to scan efficiently.
The Approach
We built a monitoring system designed specifically for unattended processes. Each process sends a lightweight heartbeat at regular intervals — a small payload confirming it is alive, what it is doing, and whether it has encountered any issues. The monitoring system tracks these heartbeats and raises alerts when a process goes silent, reports errors, or deviates from its expected schedule.
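To make the heartbeat idea concrete, here is a minimal sketch in Python. It is illustrative only and not the delivered system: the field names (`process_id`, `status`, `message`, `sent_at`), the `is_silent` helper, and the grace factor of two intervals are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone


@dataclass
class Heartbeat:
    """Hypothetical heartbeat payload: alive, what it is doing, any issues."""
    process_id: str
    status: str    # assumed values: "ok" or "error"
    message: str   # short description of current activity
    sent_at: str   # ISO 8601 timestamp in UTC


def build_heartbeat(process_id, status, message, now):
    """Create the payload a monitored process would send."""
    return Heartbeat(process_id, status, message, now.isoformat())


def is_silent(last_seen, now, expected_interval, grace=2.0):
    """True when no heartbeat has arrived for more than `grace` intervals.

    The grace factor is an assumption; it avoids flagging a heartbeat
    that is merely a little late.
    """
    return (now - last_seen) > expected_interval * grace


# Serialise a heartbeat as JSON, ready to send to the monitoring endpoint.
now = datetime.now(timezone.utc)
payload = json.dumps(asdict(build_heartbeat("nightly-sync", "ok", "sync finished", now)))
```

Keeping the payload this small is what lets every process send it cheaply. The monitoring side only has to remember the most recent `sent_at` per process to decide whether that process has gone silent.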
The architecture uses API key authentication scoped to the client rather than individual users. This was a deliberate choice for unattended processes. These scripts and jobs run on servers with no human present — they cannot complete a login flow or respond to a session expiry. Client-scoped API keys with automatic rotation handle authentication cleanly without manual intervention.
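One way client-scoped keys with rotation could work is sketched below. This is an assumption-laden illustration rather than the actual implementation: the `ClientKeyStore` class, the in-memory storage, and the overlap window during which the previous key stays valid are all hypothetical. The overlap is there so that an unattended script still holding the old key keeps working until it picks up the new one.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone


class ClientKeyStore:
    """Hypothetical in-memory store of hashed API keys, scoped per client."""

    def __init__(self, overlap=timedelta(hours=24)):
        self.overlap = overlap  # assumed grace period for the previous key
        self._keys = {}         # client_id -> list of (key_hash, expires_at or None)

    @staticmethod
    def _hash(key):
        # Store only a hash so a leaked store does not expose usable keys.
        return hashlib.sha256(key.encode()).hexdigest()

    def rotate(self, client_id, now):
        """Issue a new key; the current key expires after the overlap window."""
        new_key = secrets.token_urlsafe(32)
        kept = []
        for key_hash, expires_at in self._keys.get(client_id, []):
            if expires_at is None:
                kept.append((key_hash, now + self.overlap))
            elif expires_at > now:
                kept.append((key_hash, expires_at))
            # keys that have already expired are dropped
        kept.append((self._hash(new_key), None))
        self._keys[client_id] = kept
        return new_key

    def is_valid(self, client_id, key, now):
        """Check a presented key against this client's unexpired keys."""
        candidate = self._hash(key)
        for key_hash, expires_at in self._keys.get(client_id, []):
            if hmac.compare_digest(key_hash, candidate) and (
                expires_at is None or expires_at > now
            ):
                return True
        return False
```

Because the key belongs to the client rather than to a person, there is no login flow or session for a script to maintain. A scheduled task could call `rotate` and distribute the new key within the overlap window.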
Alert intelligence was the most important design investment. Rather than alerting on every error, the system distinguishes between transient failures (a single error that clears on the next retry) and genuine outages (a process that stops sending heartbeats entirely). This prevents the alert fatigue that undermined previous monitoring attempts. When an alert fires, the team can trust that it represents a real problem requiring human attention.
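The distinction between transient failures and genuine outages can be sketched as a small classification function. The thresholds here (three missed intervals, three consecutive errors) and the names `Verdict` and `classify` are assumptions for illustration. Treating a sustained run of errors as an outage is also an assumption layered on top of the silence rule described above.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    TRANSIENT = "transient"  # recorded in history, nobody is alerted
    OUTAGE = "outage"        # the only verdict that should raise an alert


def classify(statuses, last_seen, now, expected_interval,
             missed_allowed=3, error_streak=3):
    """Classify a process from its recent heartbeat statuses (oldest first)."""
    # Silence: no heartbeat for more than `missed_allowed` intervals.
    if now - last_seen > expected_interval * missed_allowed:
        return Verdict.OUTAGE

    # Count consecutive errors at the end of the history.
    streak = 0
    for status in reversed(statuses):
        if status != "error":
            break
        streak += 1
    if streak >= error_streak:
        return Verdict.OUTAGE

    # Errors that were followed by a recovery count as transient.
    if "error" in statuses:
        return Verdict.TRANSIENT
    return Verdict.HEALTHY
```

With these assumed thresholds, a history of `["ok", "error", "ok"]` from a process that reported ten seconds ago is classified as transient and stays out of the team's inbox, while a process that has been quiet for five minutes on a thirty-second schedule is an outage.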
The monitoring interface surfaces process health across all servers in a single view. Each process shows its last heartbeat, current status, recent event history, and any active alerts. This replaced the scattered approach of checking individual server logs and email inboxes.
What Was Delivered
- A centralised monitoring system tracking heartbeats from background processes across multiple servers
- Client-scoped API key authentication with automatic rotation for unattended process environments
- Intelligent alerting that distinguishes transient failures from genuine outages, reducing alert fatigue to near zero
- A unified dashboard showing process health, heartbeat status, event history, and active alerts across all monitored processes
- Lifecycle tracking for each process — started, running, completed, failed — with timestamps and metadata
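The lifecycle tracking in the last item can be pictured as a small state machine. The sketch below is a hypothetical model, not the delivered code: the four states come from the list above, while the transition table, class names, and metadata keys are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed transition rules: which state may follow which.
ALLOWED = {
    None: {"started"},
    "started": {"running", "failed"},
    "running": {"running", "completed", "failed"},
    "completed": {"started"},  # the next scheduled run begins
    "failed": {"started"},     # a retry or the next run begins
}


@dataclass
class LifecycleEvent:
    state: str
    at: datetime
    metadata: dict


@dataclass
class ProcessLifecycle:
    process_id: str
    events: list = field(default_factory=list)

    @property
    def state(self):
        """Current state, or None before the first event."""
        return self.events[-1].state if self.events else None

    def record(self, state, at, **metadata):
        """Append an event, rejecting transitions the table does not allow."""
        if state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.process_id}: cannot move from {self.state!r} to {state!r}"
            )
        self.events.append(LifecycleEvent(state, at, metadata))


# Example: one run of a nightly job, with hypothetical metadata.
job = ProcessLifecycle("nightly-sync")
job.record("started", datetime.now(timezone.utc), host="server-a")
job.record("running", datetime.now(timezone.utc), rows_synced=1200)
job.record("completed", datetime.now(timezone.utc))
```

Rejecting impossible transitions means a misbehaving client shows up as an error at record time instead of as a confusing event history on the dashboard.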
The Result
The first genuine outage detected by the system validated the entire investment. A nightly data synchronisation job had failed silently and would not have been noticed for days under the old approach. The monitoring system flagged it within minutes of the missed heartbeat, and the team had it resolved before the start of business the next morning. Under the previous regime, that failure would have surfaced as a client complaint about missing data — days late and with reputational damage already done.
Alert fatigue dropped to near zero. The previous ad-hoc email alerts had trained the team to ignore monitoring notifications because most were false positives. The new system’s intelligent alerting meant that when a notification arrived, it was worth investigating. Alert engagement went from “glance and dismiss” to “investigate immediately” because trust in the alert quality was earned early and maintained consistently.
The operational benefit extended beyond failure detection. The heartbeat data revealed patterns that were invisible before — processes that were slowly degrading in performance, jobs that occasionally skipped executions without failing outright, and schedules that drifted over time due to server clock inconsistencies. These were problems the team never knew they had because there was no data to reveal them.
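As an illustration of how skipped executions and schedule drift can be read out of heartbeat timestamps, consider the sketch below. The function name `analyse_schedule` and the rounding approach are assumptions: each gap between consecutive heartbeats is rounded to a whole number of expected intervals, anything beyond one interval counts as skipped runs, and the leftover seconds are treated as drift.

```python
from datetime import datetime, timedelta, timezone


def analyse_schedule(timestamps, expected_interval):
    """Return (skipped_runs, mean_drift_seconds) for ordered timestamps."""
    expected = expected_interval.total_seconds()
    skipped = 0
    offsets = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = (later - earlier).total_seconds()
        # How many scheduled intervals does this gap most plausibly span?
        intervals = max(1, round(gap / expected))
        skipped += intervals - 1
        # Whatever is left over is how far the schedule slipped.
        offsets.append(gap - intervals * expected)
    mean_drift = sum(offsets) / len(offsets) if offsets else 0.0
    return skipped, mean_drift
```

For an hourly job whose heartbeats arrive at 00:00, 01:00, 02:00, and 04:00, this reports one skipped run and no drift. A job that consistently reports every 61 minutes shows no skipped runs but a mean drift of 60 seconds per interval, which is the kind of slow slippage that never produces an error.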
What Made This Work
Investing in alert intelligence rather than alert volume was the critical decision. Monitoring systems that alert on everything produce the same result as monitoring systems that alert on nothing — the team ignores them. By building logic that filters transient noise and only surfaces genuine issues, the system earned trust from the first week. That trust is the difference between a monitoring tool that people check and one that runs forgotten in the background, creating a false sense of security while actual failures go unnoticed.
Running Processes With No Visibility?
If your automated processes run silently and your monitoring strategy is “wait for someone to notice,” the risk compounds with every process you add. Get in touch to discuss how structured monitoring could give you visibility into what is already running.