How an effective monitoring system protects your organisation
By Sai Nikhil Chandra Kappagantula
In any organisational technology environment, whether in retail, financial services, healthcare, energy, or other industry sectors, the role of robust network and system infrastructure monitoring cannot be overstated. Monitoring is the backbone that supports operational continuity, security, and efficiency.
Effective monitoring allows for the timely detection of issues, the prevention of potential problems, and the overall improvement of operational efficiency. Taking a proactive approach to managing IT environments is critical, but to do so effectively, it is important to understand the intricacies of monitoring and alerting processes.
Why must we monitor?
Monitoring is essential for several reasons, starting with detecting problems before they impact users, which allows for intervention before issues become critical. For example, in a complex network environment, monitoring for unusual latency between key network segments can help identify issues with network equipment or potential congestion points.
Detecting an increase in latency could indicate an impending failure of a network switch or router, enabling preemptive maintenance. Likewise, in a microservices architecture, monitoring inter-service communication can detect increased error rates or timeouts, indicating issues with specific services or network paths that need to be addressed to maintain overall system health.
Monitoring also supports performance optimisation. By providing insights into system performance and helping to optimise resource usage, monitoring ensures that a network or system is running efficiently, minimising delays or downtime. This is evident in how network performance monitoring can track bandwidth usage across different links and segments. Identifying overutilised links can lead to load balancing or bandwidth upgrades to prevent network bottlenecks.
Additionally, application performance monitoring (APM) tools can track response times and throughput for various application components. By analysing this data, performance bottlenecks at specific layers (e.g., web servers, application servers, databases) can be identified and addressed promptly.
A critical application of monitoring is security, wherein the monitoring system detects unauthorised access or anomalies. In this arena, intrusion detection systems (IDS) and intrusion prevention systems (IPS) monitor network traffic for signatures of known attacks. By detecting and alerting on these signatures, potential breaches can be prevented or mitigated promptly.
In another application, implementing security information and event management (SIEM) systems to aggregate and analyse logs from various sources can help identify patterns indicating a coordinated attack, such as distributed denial-of-service (DDoS) attempts or advanced persistent threats (APT).
Monitoring for compliance helps organisations meet their regulatory requirements by logging and monitoring activities. This is often mandated, as in the payment card industry, where PCI-DSS compliance requires detailed monitoring of all systems that handle credit card information. This includes monitoring access logs, transaction logs, and network traffic to ensure that no unauthorised access or data breaches occur.
For example, government regulations like the General Data Protection Regulation (GDPR) mandate that organisations track and log access to personal data. Monitoring tools can help ensure that access controls are enforced, and that any unauthorised access attempts are logged and investigated.
Another effective use for monitoring is in capacity planning, through which companies capture valuable data and usage patterns to understand future needs and forecast future resource requirements. Monitoring traffic patterns across multiple network segments can help in planning for future network expansions.
For instance, identifying consistently high traffic on specific paths can justify investment in higher capacity links or additional routing infrastructure. In cloud environments, monitoring resource utilisation can help optimise cost by scaling resources up or down based on demand. Predictive analytics based on historical monitoring data can automate this process, ensuring cost-effective resource management.
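As a simple illustration of trend-based capacity forecasting, the sketch below fits a linear trend to historical link-utilisation samples and projects when an assumed 80% planning threshold would be reached. The figures, threshold, and metric are hypothetical; production forecasting would use richer models and real monitoring data.

```python
import numpy as np

# Hypothetical weekly peak utilisation (%) of a network link over eight weeks
weeks = np.arange(8)
utilisation = np.array([48.0, 51.5, 53.0, 56.5, 58.0, 61.5, 63.0, 66.5])

# Fit a linear trend: utilisation ~ slope * week + intercept
slope, intercept = np.polyfit(weeks, utilisation, 1)

threshold = 80.0  # assumed planning trigger for a capacity upgrade
weeks_to_threshold = (threshold - utilisation[-1]) / slope
print(f"Trend: +{slope:.1f} points/week; "
      f"~{weeks_to_threshold:.0f} weeks until the link reaches {threshold}%")
```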
A single source of truth
In the realm of network and system infrastructure monitoring, having a “single source of truth” for data is crucial. This concept ensures that all monitoring data is centralised, accurate, and consistent, which is essential for effective decision-making and issue resolution.
A single source of truth eliminates discrepancies that can arise from multiple data sources, ensuring that everyone in the organisation is working with the same information. These discrepancies often occur in large enterprises, where different teams might monitor the same systems using different tools.
Without a centralised source, the operations team might see a different set of metrics than the security team, leading to inconsistent actions and confusion. A single source of truth ensures that all teams are aligned and informed.
Centralised management consolidates data from various devices and systems, providing a unified view of the entire infrastructure. To implement this, network performance data, server health metrics, application logs, and security alerts can all be aggregated into a single dashboard.
This holistic view allows administrators to quickly identify and correlate issues across different components of the infrastructure. A centralised monitoring system can integrate data from on-premises servers, cloud infrastructure, and remote devices, providing comprehensive visibility into hybrid environments.
To ensure reliability, monitoring data should be sourced directly from the devices and systems being monitored, without intermediate transformations that could introduce errors.
Examples include SNMP (Simple Network Management Protocol) data collected directly from network switches and routers to provide real-time insight into network health and performance, or using agents installed on servers to collect performance metrics, which ensures that data such as CPU usage, memory consumption, and disk I/O are accurate and up-to-date. These agents can directly communicate with the monitoring system, avoiding any potential data loss or modification.
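As a minimal illustration of polling a device directly over SNMP, the sketch below queries sysUpTime from a network device using the pysnmp library's classic synchronous high-level API (pysnmp 4.x). The device address and community string are placeholders, and newer pysnmp releases expose an asyncio-based interface instead.

```python
# Minimal SNMP polling sketch (pysnmp 4.x synchronous hlapi); the target
# address and community string are placeholders for illustration only.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),            # SNMPv2c community string
        UdpTransportTarget(("192.0.2.1", 161)),        # placeholder device address
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),  # sysUpTime.0 OID
    )
)

if error_indication:
    print(f"Polling failed: {error_indication}")
else:
    for name, value in var_binds:
        print(f"{name} = {value}")
```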
Centralising monitoring data also accelerates incident response and resolution times. This is critical during a network outage, when having a single source of truth allows the incident response team to quickly access all relevant data, identify the root cause, and implement a fix.
Furthermore, automated correlation of alerts from different systems can highlight broader issues that might be missed when data is siloed, enabling quicker identification and resolution of complex incidents.
Monitoring vs. alerting: two distinct services
In the context of network and system infrastructure, monitoring and alerting serve complementary but distinct purposes. Monitoring involves the continuous observation and collection of data from various components within the IT infrastructure.
This data provides insights into the performance, health, and usage of the systems and networks, enabling administrators to make informed decisions and take proactive measures. Monitoring is typically conducted through three primary channels:
1. Telemetry refers to the automated collection and transmission of data from remote sources. This can include metrics such as CPU usage, memory utilisation, network throughput, and application performance.
Telemetry provides a comprehensive view of the system's current state and trends over time, such as collecting real-time data on network latency and packet loss to assess quality of service and identify potential issues before they escalate.
2. Logging involves recording detailed information about system events and transactions. Logs can capture a wide range of activities, including system errors, user actions, security events, and application behaviours, and are invaluable for troubleshooting, auditing, and forensic analysis. One example is maintaining detailed logs of firewall activity to detect and analyse potential security breaches or configuration errors (a short structured-logging sketch follows this list).
3. Visualisation tools, such as dashboards and graphs, present the collected data in an easily interpretable format. These visualisations help administrators quickly grasp the state of the infrastructure and identify patterns or anomalies. For example, a dashboard that charts server CPU and memory usage over time helps engineers identify trends and plan capacity upgrades.
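As a small illustration of the logging channel, the sketch below uses Python's standard logging module to emit records as JSON lines that a log aggregator can parse. The firewall events shown are hypothetical.

```python
import json
import logging

# Emit log records as JSON lines so downstream tools (e.g., a log aggregator)
# can parse fields rather than free-form text.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("firewall")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Hypothetical firewall events
logger.info("DENY tcp 198.51.100.7:51514 -> 10.0.0.5:22")
logger.warning("Configuration change detected on rule set 'edge-ingress'")
```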
Alerting is the process of generating notifications based on predefined conditions or thresholds in the monitoring data. Alerts are designed to bring immediate attention to issues that require prompt action, helping to prevent minor problems from becoming major incidents. This domain generally includes three types of alerts:
1. Threshold-based alerts are triggered when specific metrics exceed or fall below predefined thresholds. This approach is useful for identifying conditions that are clearly indicative of problems. For example, an alert raised when CPU usage on a critical server exceeds 90% for more than five minutes signals a potential overload (a short sketch of this logic, together with a simple anomaly check, follows this list).
2. Event-based alerts are triggered by specific events or sequences of events recorded in logs. This type of alert is often used for security and operational incidents, as when a failed login attempt occurs more than five times in a short period, suggesting a potential brute-force attack.
3. Anomaly detection alerts use advanced analytics and machine learning to identify deviations from normal behaviour. This approach can detect subtle issues that may not be captured by static thresholds. An example of this type of alert is detecting an unusual increase in network traffic from a particular device, which could indicate malware activity or data exfiltration.
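To make the first and third alert types concrete, here is a minimal Python sketch of a threshold check and a simple standard-deviation anomaly check. The sample values, window size, and limits are illustrative; a production monitoring platform would evaluate such rules internally rather than in standalone functions.

```python
from statistics import mean, stdev

def threshold_alert(cpu_samples, limit=90.0):
    """Threshold-based: fire if every sample in the window exceeds the limit
    (e.g., five one-minute samples above 90% CPU)."""
    return all(s > limit for s in cpu_samples)

def anomaly_alert(history, latest, z_limit=3.0):
    """Simple anomaly check: fire if the latest value deviates from the
    historical mean by more than z_limit standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return False
    return abs(latest - mu) / sigma > z_limit

# Hypothetical data
cpu_window = [92.1, 95.4, 91.8, 93.0, 96.2]        # last five minutes of CPU %
traffic_history = [120, 118, 130, 125, 122, 128]   # Mbps baseline for a device

print(threshold_alert(cpu_window))                 # True -> raise an alert
print(anomaly_alert(traffic_history, latest=410))  # True -> investigate device
```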
While monitoring provides the comprehensive data and insights needed to understand the overall health and performance of the infrastructure, alerting focuses on drawing immediate attention to specific issues that require prompt action. Together, these processes form the foundation of an effective IT management strategy, ensuring that potential problems are detected and addressed before they impact users or operations.
Building a robust and responsive alerting system into your overall monitoring strategy requires an understanding of alerting policies, determining what and how to monitor, setting up notifications, integrating with ticketing systems, creating runbooks for alerts, balancing alert noise with actionable insights, and determining the appropriate actions to take on alerts (Figure 1).
Figure 1: The components of an effective alerting system. (Source: Sai Nikhil Chandra Kappagantula)
What to monitor (assets)
Identifying what critical assets to monitor is the first step in defining an alerting policy. These assets typically include:
· Servers: Monitoring server performance metrics such as CPU usage, memory consumption, disk I/O, and network activity is crucial. This ensures that servers are running efficiently and can handle the required workloads. For example, setting alerts for high CPU usage, low available memory, or high disk I/O can help detect performance bottlenecks or potential failures (a brief metrics-collection sketch follows this list).
· Network Devices: Monitoring network infrastructure, including routers, switches, firewalls, and load balancers, is vital for maintaining network performance and security. Alerts for high network latency, packet loss, interface errors, or high bandwidth utilisation can help identify network congestion, hardware issues, or security breaches.
· Applications: Monitoring application performance, availability, and error rates ensures that end-users have a seamless experience. Setting alerts for high response times, increased error rates, or downtime can help detect application performance issues or outages.
· Databases: Monitoring database health and performance metrics such as query performance, connection counts, and storage usage is essential for maintaining data integrity and performance. Setting alerts for slow-running queries, high connection counts, or low available storage can help detect database performance issues or capacity problems.
· Security: Monitoring for security-related events, such as unauthorised access attempts, malware detections, and configuration changes, helps protect the infrastructure from threats. Setting alerts for multiple failed login attempts, unexpected configuration changes, or detected malware can help detect and respond to security incidents promptly.
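As a brief illustration of server-asset monitoring, the sketch below reads CPU, memory, and disk utilisation with the psutil library and compares them against illustrative thresholds. A real agent would forward these readings to the central monitoring system rather than print them.

```python
# Minimal server-metrics sketch using the psutil library; the thresholds are
# illustrative, and a real agent would ship these values to a central
# monitoring system rather than printing them.
import psutil

cpu_pct = psutil.cpu_percent(interval=1)      # CPU utilisation over 1 second
mem_pct = psutil.virtual_memory().percent     # memory in use
disk_pct = psutil.disk_usage("/").percent     # root filesystem usage

checks = {
    "cpu": (cpu_pct, 90.0),
    "memory": (mem_pct, 85.0),
    "disk": (disk_pct, 90.0),
}
for name, (value, limit) in checks.items():
    status = "ALERT" if value > limit else "ok"
    print(f"{name}: {value:.1f}% ({status}, limit {limit}%)")
```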
How to monitor (protocols)
Choosing the appropriate monitoring protocols and methods is essential for collecting accurate and timely data from the monitored assets. These might include one or more of the following:
· SNMP (Simple Network Management Protocol) is widely used for monitoring network devices and provides detailed information about device status and performance. An example of this type of monitoring is configuring SNMP agents on routers and switches to report metrics such as interface status, traffic volumes, and error rates.
· WMI (Windows Management Instrumentation) is a set of specifications from Microsoft for consolidating the management of devices and applications in a network from Windows-based systems. System engineers use WMI to monitor CPU usage, memory consumption, and disk space on Windows servers.
· API monitoring: Many modern applications and services provide application programming interfaces (APIs) for monitoring and management. API monitoring can provide detailed and customisable metrics, for example, using REST APIs provided by cloud services to monitor resource usage, application performance, and health status.
· Agent-based monitoring: Installing monitoring agents on servers and endpoints allows for detailed and customisable monitoring. Using agents such as Nagios NRPE or Prometheus exporters to collect performance metrics and system logs is an example of agent-based monitoring (a brief exporter sketch follows this list).
· Log monitoring: Collecting and analysing logs from various systems can provide insights into performance, errors, and security events. Tools such as ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk can be used to aggregate and analyse logs from servers, applications, and network devices.
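As a brief illustration of agent-based collection, the sketch below exposes host metrics with the prometheus_client library so that Prometheus (or any compatible collector) can scrape them over HTTP. The port, metric names, and the use of psutil as the data source are illustrative choices.

```python
# Minimal agent-style exporter sketch using prometheus_client; Prometheus
# (or a compatible collector) would scrape http://<host>:8000/metrics.
# The port and metric names are illustrative.
import time
import psutil
from prometheus_client import Gauge, start_http_server

cpu_gauge = Gauge("node_cpu_percent", "CPU utilisation percentage")
mem_gauge = Gauge("node_memory_percent", "Memory utilisation percentage")

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000
    while True:
        # interval=None is non-blocking; it measures usage since the last call
        cpu_gauge.set(psutil.cpu_percent(interval=None))
        mem_gauge.set(psutil.virtual_memory().percent)
        time.sleep(15)       # refresh roughly once per scrape interval
```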
How to notify on alerts
Effective notification methods ensure that the right people are informed about issues promptly and can take appropriate action. Three common notification channels are email, Slack, and PagerDuty. Email notifications are useful for detailed alerts that include comprehensive information about the issue, such as detailed logs and metrics when a server experiences high CPU usage or a database query fails.
Alternatively, Slack notifications are ideal for team-based communication and quick alerts. Many IT teams will send a Slack message to a dedicated channel when a network device goes offline, or an application experiences an increase in error rates.
Lastly, PagerDuty is a popular incident response platform that provides robust alerting and on-call management capabilities; it is typically used for critical alerts that require immediate attention, such as a server outage or a security breach.
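As a minimal illustration of the Slack channel, the sketch below posts an alert message to a Slack incoming webhook. The webhook URL and message text are placeholders to be replaced with your own workspace configuration.

```python
# Minimal Slack notification sketch using an incoming webhook; the webhook URL
# is a placeholder and would come from your Slack workspace configuration.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_slack(message: str) -> None:
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    resp.raise_for_status()

notify_slack(":rotating_light: core-switch-01 unreachable (ICMP timeouts for 3 minutes)")
```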
Ticketing integration
Integrating monitoring systems with ticketing systems ensures that issues are tracked, prioritised, and resolved efficiently. This should include the following three types of integrations:
1. Automated Ticket Creation: When an alert is triggered, a ticket can be automatically created in the ticketing system, ensuring that every issue is logged and assigned for resolution. An example is integrating with Jira or ServiceNow to automatically create tickets for alerts such as server outages or application errors (a brief sketch follows this list).
2. Priority Assignment: Tickets can be prioritised based on the severity of the alert, so that critical issues are addressed first. For example, high priority is assigned to tickets generated from alerts about security breaches or major system failures, while lower priority goes to less critical issues such as minor performance warnings.
3. Workflow Integration: Integrating the monitoring system with the ticketing system ensures a seamless workflow from issue detection to resolution. This could include setting up workflows in the ticketing system to automatically assign tickets to the appropriate teams based on the type of alert, and ensuring that notifications are sent to relevant stakeholders.
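As a brief illustration of automated ticket creation, the sketch below opens an issue through the Jira REST API (v2) when an alert fires. The instance URL, credentials, project key, issue type, and field values are all assumptions that would need to match your own Jira configuration.

```python
# Minimal automated-ticket-creation sketch against the Jira REST API (v2);
# the base URL, credentials, project key, and issue type are placeholders.
import requests

JIRA_BASE = "https://example.atlassian.net"   # placeholder instance
AUTH = ("alerts@example.com", "api-token")    # placeholder credentials

def create_ticket(summary: str, description: str, priority: str = "High") -> str:
    payload = {
        "fields": {
            "project": {"key": "OPS"},          # placeholder project key
            "issuetype": {"name": "Incident"},  # assumes this issue type exists
            "summary": summary,
            "description": description,
            "priority": {"name": priority},
        }
    }
    resp = requests.post(f"{JIRA_BASE}/rest/api/2/issue",
                         json=payload, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["key"]   # e.g., "OPS-123"

print(create_ticket("core-switch-01 unreachable",
                    "Triggered by monitoring alert at 02:14 UTC"))
```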
By defining a comprehensive alerting policy, organisations can effectively monitor critical assets, use appropriate protocols for data collection, and notify the right people promptly. Integrating with ticketing systems further enhances the ability to track and resolve issues efficiently, improving overall infrastructure reliability and performance.
Notifications to customers
When we discuss monitoring and alerting, it is essential to differentiate between service-impacting alerts and localised alerts. Understanding this distinction helps prioritise responses and manage notifications effectively, both internally and externally, including communications with customers.
Service-impacting alerts are those that indicate a widespread issue affecting a significant portion of the infrastructure or service. They are typically characterised by a broad scope that affects multiple components or services; high urgency requiring immediate attention to minimise impact; and the potential for significant downtime from extended outages if the problem is not addressed quickly. These alerts usually take priority, as they can impact many users or critical business functions.
Examples of service-impacting alerts include a failure in a core network switch that disrupts connectivity for multiple servers or services, affecting many users and potentially causing a significant business impact; or a major outage at a cloud service provider that blocks access to key applications and services hosted on that platform.
A well-defined communications strategy will help an organisation manage customer expectations and maintain trust during service-impacting incidents and localised alert resolutions. The response and notification protocol for service-impacting alerts must include defined procedures for communicating both internally and with customers (externally).
Such internal notifications should provide for immediate alerts to on-call engineers via PagerDuty or similar incident management platforms, and notifications to all relevant internal stakeholders including IT leadership, through email, Slack, or SMS, while simultaneously activating incident response procedures and coordination among different teams to address the issue.
Externally, managing customer expectations with timely and transparent communication is of critical importance. The initial notification should be written in clear and straightforward language, free of technical jargon, and delivered through established channels such as email, status pages, social media, or customer portals, informing customers about the issue and its potential impact.
Provide regular and reassuring updates to customers on the progress of the resolution, including expected timelines for restoration, and any interim measures being taken. To maintain customer satisfaction, post-resolution communication is equally important; this should summarise the incident, the root cause, the actions taken to resolve it, and steps being implemented to prevent future occurrences.
Maintaining a public status page can be an effective way to communicate the status of your services, including ongoing incidents and historical uptime data. Use the status page to provide real-time updates during an incident. The status page can also help you track incident history by maintaining a log of past incidents and resolutions, demonstrating transparency and a commitment to improving service reliability.
The role of AI in monitoring systems
Artificial intelligence (AI) is revolutionising monitoring systems by significantly enhancing the efficiency and effectiveness of issue detection, response, and resolution. Traditional monitoring systems rely heavily on predefined thresholds and manual configuration.
AI, however, brings the ability to learn from historical data, recognise patterns, and adapt to changing conditions autonomously, reducing human intervention and enabling smarter monitoring strategies. Proactive organisations are now using AI tools to sharpen anomaly detection, apply predictive analytics to capacity planning, automate incident response, analyse logs with natural language processing (NLP), and reduce alert fatigue.
The increasing complexity of IT environments demands more advanced monitoring solutions, and AI is playing a pivotal role in transforming how organisations manage their infrastructure. AI-powered monitoring systems offer smarter, faster, and more reliable ways to ensure operational continuity, security, and performance.
As technology continues to evolve, the integration of AI into monitoring strategies will become a necessity rather than a luxury, enabling organisations to stay ahead of potential issues, optimise resource usage, and ensure safe and seamless service.
About the author
Sai Nikhil Chandra Kappagantula is Senior Software Engineer for the GlobalNOC at Indiana University, where he serves as lead for the GlobalNOC database, a core product that helps manage large-scale network and system infrastructure and supports monitoring, AAA, and measurement solutions.
With more than a decade of experience in software engineering, Kappagantula is a specialist in security, distributed systems, and network engineering, providing advanced expertise across the entire IT stack, encompassing development, DevOps, infrastructure management, and disaster recovery.
Applying a deep understanding of IT systems and processes, he is highly proficient in identifying inefficiencies, implementing innovative solutions, and driving operational excellence for research, educational, and federal organisations. Kappagantula earned an M.S. degree in Computer Science from Indiana University Purdue University (IUPUI) and received his M.B.A. degree from the Indiana University Kelley School of Business.
His dual academic experience in technology and business enables Kappagantula to uniquely combine technical depth with strategic acumen for navigating complex IT challenges and delivering robust, secure, and scalable tech solutions.