Who This Guide Is For
This guide is for developers and technical leads responsible for keeping production systems running. You have applications deployed to servers (cloud VMs, managed hosting, or containers) and want to know when something is wrong before your users tell you. You understand basic server administration and want a monitoring setup that provides actionable visibility without drowning you in noise.
Before You Start
You should have at least one production server running an application with real traffic. This guide covers monitoring the infrastructure your application runs on — CPU, memory, disk, network, and application logs. It does not cover application performance monitoring (APM) or business metrics, though both benefit from the same monitoring infrastructure. If you have no monitoring at all, this guide gives you a production-ready setup. If you have basic monitoring and want to improve it, the sections on alerting thresholds and log aggregation will be most useful.
Step 1: Decide What to Measure
Every monitoring setup starts with the same question: what metrics actually matter? The temptation is to collect everything and figure it out later. Resist that. Too many metrics create dashboard fatigue, and teams that are overwhelmed by data tend to ignore all of it.
The core four metrics that every server needs are CPU utilisation, memory usage, disk usage, and disk I/O. These cover the majority of infrastructure failures. A server that runs out of memory kills processes unpredictably. A full disk causes writes to fail, which cascades into application errors, database corruption, and log loss. High sustained CPU means requests queue and response times degrade.
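As a rough illustration of what an agent reads for two of the core four, the sketch below computes memory and disk usage percentages using only the Python standard library. It assumes a Linux host where /proc/meminfo exposes MemTotal and MemAvailable; the function names and the sample text are hypothetical, and a real agent such as node exporter or Telegraf does this for you.

```python
import shutil


def memory_used_percent(meminfo_text: str) -> float:
    """Return memory in use as a percentage, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # first token is the numeric value
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def disk_used_percent(path: str = "/") -> float:
    """Return the used share of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# Hypothetical sample standing in for the contents of /proc/meminfo.
SAMPLE_MEMINFO = (
    "MemTotal:        8000000 kB\n"
    "MemFree:          500000 kB\n"
    "MemAvailable:    2000000 kB\n"
)

print(memory_used_percent(SAMPLE_MEMINFO))
print(disk_used_percent("."))
```

MemAvailable is used rather than MemFree because the kernel counts reclaimable cache as free-able memory; alerting on MemFree alone tends to overstate pressure.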
Network metrics matter for applications that handle significant traffic or communicate with external services. Monitor bandwidth utilisation, connection counts, and packet error rates. A sudden spike in connections often indicates an attack or a runaway process. Elevated error rates suggest network infrastructure issues that the application layer cannot resolve.
Application-specific metrics sit between infrastructure and APM. For a web application, this means queue depth (how many jobs are waiting), worker count (how many are processing), and response time measured at the web server rather than inside the application. Nginx or Apache response times catch failures that application-level monitoring misses because the request never reaches your code.
Database metrics deserve their own attention even if the database runs on the same server. Monitor active connections, slow query count, replication lag (if applicable), and buffer pool hit ratio. A database that is technically “up” but has a 95% cache miss rate is functionally broken for your users.
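The hit ratio is simple arithmetic once you have two counters. The sketch below assumes MySQL with InnoDB, where the status variables Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that had to go to disk) are available; other databases expose equivalent counters under different names, and the function name here is hypothetical.

```python
def buffer_pool_hit_ratio(read_requests: int, disk_reads: int) -> float:
    """Fraction of logical reads served from memory rather than disk."""
    if read_requests == 0:
        return 1.0  # no reads yet, so nothing has missed
    return 1.0 - disk_reads / read_requests


# A server that is "up" but misses the cache on 95% of reads.
ratio = buffer_pool_hit_ratio(read_requests=1000, disk_reads=950)
print(f"hit ratio: {ratio:.0%}")
```

Because these counters are cumulative since server start, a monitoring system would normally compute the ratio over the difference between two successive readings rather than over the lifetime totals.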
Start with the core four plus whatever is most likely to fail in your specific architecture. You can always add metrics later. You cannot easily undo the alert fatigue caused by monitoring everything from day one.
Step 2: Set Up Metrics Collection
Metrics collection has two components: an agent that runs on each server and collects data, and a central system that stores and queries that data.
The collection agent is a lightweight process that reads system metrics at regular intervals and forwards them to your monitoring backend. Common choices include the Prometheus node exporter, Telegraf, Datadog Agent, or the monitoring agent provided by your cloud platform. The agent should start automatically with the server, restart on failure, and use minimal resources — a monitoring agent that consumes significant CPU or memory defeats the purpose.
Collection interval determines how granular your data is. Fifteen-second intervals are standard for most metrics. One-second intervals are appropriate for debugging specific performance issues but generate too much data for long-term storage. Five-minute intervals miss transient spikes that cause user-facing problems. Start at fifteen seconds and adjust per-metric if needed.
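The trade-off between the three intervals can be made concrete with a little arithmetic. This sketch shows how many samples each interval produces per metric per day, and how many samples are guaranteed to land inside a spike of a given length; the sixty-second spike is an illustrative assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds: int) -> int:
    """Data points stored per metric per day at this collection interval."""
    return SECONDS_PER_DAY // interval_seconds


def guaranteed_samples_in_spike(spike_seconds: int, interval_seconds: int) -> int:
    """Minimum number of samples that must fall inside a spike of this length."""
    return spike_seconds // interval_seconds


for interval in (1, 15, 300):
    print(
        f"{interval:>3}s interval: {samples_per_day(interval):>6} samples/day, "
        f"at least {guaranteed_samples_in_spike(60, interval)} samples in a 60s spike"
    )
```

At a five-minute interval the guaranteed count for a one-minute spike is zero, which is why the spike can pass unrecorded.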
The monitoring backend stores time-series data and provides a query interface. Self-hosted options include Prometheus with Grafana for visualisation, or the InfluxDB and Telegraf stack. Managed options include Datadog, New Relic Infrastructure, or your cloud provider’s native monitoring (CloudWatch, Azure Monitor, Google Cloud Monitoring). For teams without dedicated infrastructure engineers, managed solutions justify their cost by eliminating the maintenance burden of running monitoring infrastructure.
Tag your metrics with metadata that makes them useful: server hostname, environment (production, staging), application name, and region. Tags allow you to filter and aggregate metrics across your fleet. A CPU graph is useful. A CPU graph filtered by application and environment is actionable.
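To show why tags make a graph actionable, here is a minimal sketch of tagged samples being filtered and aggregated. The Sample class, the select helper, and the host, application, and region names are all hypothetical; your monitoring backend provides its own query language for the same operation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    metric: str
    value: float
    tags: dict


def select(samples, metric, **wanted_tags):
    """Return samples for one metric whose tags match every requested value."""
    return [
        s for s in samples
        if s.metric == metric
        and all(s.tags.get(key) == value for key, value in wanted_tags.items())
    ]


samples = [
    Sample("cpu_percent", 40.0, {"host": "web-1", "env": "production", "app": "shop", "region": "eu-west"}),
    Sample("cpu_percent", 80.0, {"host": "web-2", "env": "production", "app": "shop", "region": "eu-west"}),
    Sample("cpu_percent", 10.0, {"host": "web-3", "env": "staging", "app": "shop", "region": "eu-west"}),
]

production_shop = select(samples, "cpu_percent", env="production", app="shop")
average_cpu = mean(s.value for s in production_shop)
print(average_cpu)
```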
Verify your collection pipeline by checking that metrics appear in your backend within a few minutes of enabling the agent. Missing or delayed metrics indicate a configuration issue that must be resolved before you build anything on top of the data.
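One way to automate this check is to compare the timestamp of each host's most recent sample against the current time. The sketch below assumes you can obtain a mapping of host to last-seen Unix timestamp from your backend; the function name, the host names, and the three-minute cut-off are illustrative assumptions.

```python
def stale_hosts(last_seen: dict, now: float, max_age_seconds: float = 180.0) -> list:
    """Return hosts whose newest sample is older than max_age_seconds."""
    return sorted(
        host for host, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


# Hypothetical last-seen timestamps, in seconds since the epoch.
last_seen = {"web-1": 1000.0, "web-2": 700.0}
print(stale_hosts(last_seen, now=1010.0))
```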
Step 3: Configure Alerting Thresholds
Alerts are the entire point of monitoring. Dashboards are useful for investigation, but alerts are what wake you up when something is broken at 3am. Bad alerting — alerts that fire too often, too late, or for things that do not matter — is worse than no alerting because it trains your team to ignore alerts.
CPU alerts: alert on sustained high CPU, not momentary spikes. A server hitting 95% CPU for ten seconds during a deployment is normal. A server at 85% CPU for fifteen minutes straight indicates a problem. Set the threshold at 85% sustained for fifteen minutes as a starting point, and adjust based on your application’s normal load profile.
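The difference between a spike and a sustained condition comes down to requiring every sample in the window to exceed the threshold. The following is a minimal sketch, assuming fifteen-second samples so that fifteen minutes equals sixty samples; the function name is hypothetical, and most alerting tools offer this as a built-in "for" duration.

```python
SAMPLES_PER_15_MIN = (15 * 60) // 15  # sixty samples at a fifteen-second interval


def sustained_above(values, threshold, min_samples):
    """True only if each of the most recent min_samples values exceeds threshold."""
    if len(values) < min_samples:
        return False
    return all(value > threshold for value in values[-min_samples:])


deployment_spike = [50.0] * 59 + [95.0]   # one brief spike at the end
overloaded = [86.0] * 60                  # fifteen minutes above 85%

print(sustained_above(deployment_spike, 85.0, SAMPLES_PER_15_MIN))
print(sustained_above(overloaded, 85.0, SAMPLES_PER_15_MIN))
```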
Memory alerts: alert at 90% usage sustained for five minutes. Memory usage tends to climb gradually (a memory leak) or spike suddenly (a traffic burst or runaway process). The five-minute window catches both patterns while ignoring the brief spikes that the kernel’s memory management handles automatically.
Disk alerts: alert at 80% disk usage, which gives you time to act before the disk fills. Also alert on rapid disk growth — a disk that goes from 50% to 75% in an hour is a more urgent problem than a disk that has been at 82% for a month. Log files and temporary files are the usual culprits for rapid growth.
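The growth-rate point is easier to see as a time-to-full estimate. This sketch uses a simple linear projection between two readings, which is an assumption: real growth is rarely perfectly linear, and the function name is hypothetical.

```python
def hours_until_full(percent_then, percent_now, hours_elapsed):
    """Linear estimate of hours left before the disk reaches 100%.

    Returns None when usage is flat or shrinking.
    """
    growth_per_hour = (percent_now - percent_then) / hours_elapsed
    if growth_per_hour <= 0:
        return None
    return (100.0 - percent_now) / growth_per_hour


print(hours_until_full(50.0, 75.0, 1.0))      # rapid growth
print(hours_until_full(82.0, 82.0, 720.0))    # steady at 82% for a month
```

The disk at 75% is below the 80% threshold yet has roughly an hour left at its current rate, while the disk at 82% is above the threshold and in no immediate danger.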
Disk I/O: alert when I/O wait exceeds 20% for five minutes. High I/O wait means the CPU is idle because it is waiting for disk operations to complete. This typically indicates a database under heavy write load, insufficient RAM for the working set (causing excessive swapping), or a disk that has reached its throughput limit.
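For reference, I/O wait is derived from the cumulative counters on the "cpu" line of /proc/stat on Linux, where the first eight fields are user, nice, system, idle, iowait, irq, softirq, and steal. The sketch below computes the iowait share between two readings; the function names and the sample lines are hypothetical, and your agent normally reports this figure directly.

```python
def parse_cpu_line(line: str) -> list:
    """Extract user, nice, system, idle, iowait, irq, softirq, steal counters."""
    parts = line.split()
    return [int(value) for value in parts[1:9]]


def iowait_percent(before: list, after: list) -> float:
    """Share of CPU time spent waiting on I/O between two /proc/stat readings."""
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


# Hypothetical readings taken some seconds apart.
before = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
after = parse_cpu_line("cpu  150 0 70 1100 180 0 0 0 0 0")
print(iowait_percent(before, after))
```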
Alert routing determines who gets notified and how. Critical alerts (server unreachable, disk full, out of memory) should page the on-call engineer via a direct channel — SMS, phone call, or a dedicated alerting app. Warning alerts (approaching thresholds, elevated error rates) should go to a team channel where they will be reviewed within business hours. Informational alerts should not exist — if no one needs to act on it, it is a metric, not an alert.
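The routing rule can be expressed as a lookup with exactly two severities and no third option. This is a sketch only; the destination names are hypothetical placeholders for whatever paging and chat integrations you use.

```python
# Hypothetical destination names; substitute your paging and chat integrations.
ROUTES = {
    "critical": "page_on_call",
    "warning": "team_channel",
}


def route(severity: str) -> str:
    """Return the destination for an alert, rejecting non-actionable severities."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(
            f"{severity!r} is not an alert severity; record it as a metric instead"
        ) from None


print(route("critical"))
print(route("warning"))
```

Refusing to route an "info" severity, rather than silently sending it somewhere, keeps the rule that informational alerts should not exist enforceable in configuration.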
Alert deduplication and grouping prevent notification storms. When a server goes down, you do not want twenty separate alerts for CPU, memory, disk, application, and database. Group related alerts so a single incident produces a single notification with relevant context.
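A minimal sketch of grouping, assuming each alert carries a host and a check name: alerts are collected per host and reduced to one notification each. The field names and the message format are hypothetical, and production alert managers usually also group by time window.

```python
from collections import defaultdict


def group_by_host(alerts):
    """Collapse many alerts into one notification string per affected host."""
    checks_by_host = defaultdict(set)
    for alert in alerts:
        checks_by_host[alert["host"]].add(alert["check"])  # set removes duplicates
    return {
        host: f"{host}: {len(checks)} failing checks ({', '.join(sorted(checks))})"
        for host, checks in checks_by_host.items()
    }


alerts = [
    {"host": "db-1", "check": "cpu"},
    {"host": "db-1", "check": "memory"},
    {"host": "db-1", "check": "disk"},
    {"host": "db-1", "check": "cpu"},  # repeat firing of the same check
]
notifications = group_by_host(alerts)
print(notifications["db-1"])
```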
Step 4: Set Up Log Aggregation
Server metrics tell you that something is wrong. Logs tell you why. Without centralised log aggregation, diagnosing an issue requires SSH access to each server, searching through multiple log files, and correlating timestamps manually. This is unacceptable in production.
Centralise your logs to a single system where they can be searched, filtered, and correlated. The ELK stack (Elasticsearch, Logstash, Kibana) is the self-hosted standard. Managed alternatives include Datadog Logs, Papertrail, Logtail, or your cloud provider’s logging service. Choose based on your team’s capacity to maintain infrastructure: if you already run monitoring infrastructure, self-hosted logging is a natural addition. If monitoring is managed, logging should be too.
What to aggregate: application logs (Laravel’s log files, your web server’s access and error logs), system logs (syslog, auth log, kernel messages), and database logs (slow query log, error log). Do not aggregate debug-level application logs in production — they generate excessive volume and rarely contain information useful for incident response.
Structured logging transforms logs from text blobs into queryable data. Instead of a log line that reads “Payment failed for user 42,” a structured log emits a JSON object with fields for event type, user ID, payment amount, and error code. Every field becomes a filter in your log aggregation tool. If your application does not yet use structured logging, migrating to it is one of the highest-value improvements you can make to your observability.
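To make the payment example concrete, here is a sketch of a JSON formatter built on Python's standard logging module. Python is used purely for illustration; the same idea applies in whatever language your application uses. The logger name, field names, and values are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        payload.update(getattr(record, "fields", {}))  # structured fields, if any
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("payments")
    logger.handlers.clear()          # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()               # stands in for a log file or stdout
logger = build_logger(buffer)
logger.warning(
    "payment_failed",
    extra={"fields": {"user_id": 42, "amount": 19.99, "error_code": "card_declined"}},
)
line = buffer.getvalue().strip()
print(line)
```

With the fields separated out, a query such as "all payment_failed events with error_code card_declined" becomes a filter rather than a text search.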
Log retention balances storage cost against diagnostic value. Thirty days of full-fidelity logs covers most incident investigations. Ninety days of aggregated data (counts and patterns, not individual log lines) covers trend analysis. Compliance requirements may mandate longer retention for specific log types — check before you set automatic deletion policies.
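The two retention tiers can be written as a small policy function. This sketch hard-codes the thirty-day and ninety-day figures from above and deliberately does not model compliance exceptions, which would need per-log-type overrides; the function and action names are hypothetical.

```python
def retention_action(age_days: int) -> str:
    """Decide what to keep for log data of a given age."""
    if age_days <= 30:
        return "keep_full"        # individual log lines remain searchable
    if age_days <= 90:
        return "keep_aggregate"   # counts and patterns only
    return "delete"


for age in (7, 45, 120):
    print(age, retention_action(age))
```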
Correlate logs with metrics by using consistent timestamps (UTC everywhere) and shared identifiers. When a CPU alert fires, you should be able to switch to your log aggregation tool and search for errors on that server during the alert window. If your logs and metrics use different time zones or server identifiers, correlation becomes manual work that slows incident response.
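When timestamps are UTC and host identifiers match, correlation reduces to a filter on host and time window. The sketch below assumes log entries are dictionaries with host, level, and timezone-aware timestamp fields; all names, dates, and events are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def errors_during_alert(log_entries, host, alert_start, alert_end):
    """Return error-level entries from one host inside the alert window."""
    return [
        entry for entry in log_entries
        if entry["host"] == host
        and entry["level"] == "error"
        and alert_start <= entry["timestamp"] <= alert_end
    ]


# Hypothetical alert window and log entries, all in UTC.
alert_start = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
alert_end = alert_start + timedelta(minutes=15)
log_entries = [
    {"host": "web-1", "level": "error", "event": "upstream_timeout",
     "timestamp": alert_start + timedelta(minutes=4)},
    {"host": "web-2", "level": "error", "event": "upstream_timeout",
     "timestamp": alert_start + timedelta(minutes=5)},
    {"host": "web-1", "level": "error", "event": "disk_retry",
     "timestamp": alert_start - timedelta(hours=2)},
]

matches = errors_during_alert(log_entries, "web-1", alert_start, alert_end)
print([entry["event"] for entry in matches])
```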
Step 5: Build Dashboards for Context
Dashboards are not the primary output of monitoring — alerts are. But dashboards provide the context you need when responding to an alert. A well-structured dashboard answers the question “what is happening right now?” in under ten seconds.
The overview dashboard shows the health of your entire fleet at a glance. One panel per server or service, colour-coded by status (green, yellow, red). This dashboard is for the first five seconds of incident response: which server or service is affected?
Per-server dashboards show the core four metrics (CPU, memory, disk usage, and disk I/O) over time, plus network metrics where they matter, with the current alert thresholds drawn as reference lines. When an alert fires for a specific server, this dashboard shows whether the problem is a sudden spike or a gradual trend, which informs your response.
Application dashboards combine infrastructure metrics with application-level indicators: request rate, error rate, response time, and queue depth. These show whether an infrastructure issue is affecting users and how severely.
Avoid dashboard sprawl. Three to five well-maintained dashboards are more useful than thirty that no one remembers exist. Each dashboard should serve a specific purpose (overview, per-server detail, application health) and be reviewed regularly to remove panels that no one looks at.
Set a default time range of one hour for investigation dashboards and twenty-four hours for overview dashboards. Most incidents can be diagnosed from the last hour of data, and a longer default time range compresses recent data into a graph where short spikes are invisible.
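The compression effect can be quantified by asking how much time each horizontal pixel of a graph represents. This sketch assumes a panel 1,000 pixels wide, which is an illustrative figure; the function name is hypothetical.

```python
def seconds_per_pixel(range_hours: float, panel_width_pixels: int) -> float:
    """How many seconds of data each horizontal pixel has to represent."""
    return range_hours * 3600 / panel_width_pixels


PANEL_WIDTH = 1000  # assumed panel width in pixels

print(seconds_per_pixel(1, PANEL_WIDTH))    # one-hour investigation view
print(seconds_per_pixel(24, PANEL_WIDTH))   # twenty-four-hour overview
```

At a one-hour range each pixel covers less than a single fifteen-second sample, so every data point is visible. At twenty-four hours each pixel covers several samples, and a short spike is averaged or drawn over by its neighbours.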
Common Mistakes
- Alerting on every metric. If your team receives more than two or three alerts per week that do not require action, your alert thresholds are wrong. Every alert should require a human decision.
- No baseline before setting thresholds. Run your monitoring in observation mode for at least a week before setting alert thresholds. Your application's normal CPU usage might be 60%, which means an 80% alert threshold leaves only 20 percentage points of headroom.
- Monitoring the monitoring server from itself. If the server running your monitoring stack goes down, it cannot alert you. Use an external uptime check (Uptime Robot, Pingdom, or similar) to monitor the monitoring system itself.
- Ignoring disk I/O. Teams monitor disk space but not disk throughput. A database server with plenty of free space but saturated I/O is just as broken as a full disk.
- No log rotation. Application logs that grow without rotation will eventually fill the disk. Configure log rotation before you aggregate, not after the disk fills up.
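The baseline mistake above can be checked with a few lines once you have a week of observations. This sketch compares a candidate threshold against the typical (median) and peak values seen during observation; the sample values and function name are hypothetical.

```python
from statistics import median


def headroom_points(baseline_samples, alert_threshold):
    """Percentage points between the threshold and observed typical and peak load."""
    return {
        "typical": alert_threshold - median(baseline_samples),
        "peak": alert_threshold - max(baseline_samples),
    }


# Hypothetical daily CPU readings from one week of observation.
week_of_cpu = [55.0, 58.0, 60.0, 61.0, 60.0, 72.0, 59.0]
result = headroom_points(week_of_cpu, alert_threshold=80.0)
print(result)
```

A small peak headroom suggests the threshold will fire during ordinary busy periods, which is a prompt either to raise it or to lengthen the sustained window.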
What Good Looks Like
A well-monitored production environment has: metrics collected at fifteen-second intervals for the core four resources plus application-specific indicators, alerting thresholds calibrated to the application’s normal load profile with routing that escalates critical issues to the right person, centralised log aggregation with structured logs and thirty-day retention, and a small number of focused dashboards that provide context during incident response. The team receives fewer than five actionable alerts per week under normal conditions, and when an alert does fire, the dashboard and log infrastructure provide enough context to diagnose the issue without SSH access to the server.
Next Steps
For applications that expose APIs, How to Implement Rate Limiting covers protecting your services from traffic patterns that monitoring alone cannot solve. For the deployment pipeline that monitoring should wrap around, How to Set Up a Staging Environment covers environment parity that keeps monitoring consistent across environments.