4 Steps to Effectively Address Monitoring Issues

Andy Feller

Monitoring is the surest way for any organization to deliver the best experience to customers. However, it’s more than just checking whether hosts or services are up. It’s a means of understanding the performance and seasonality of systems, identifying customer-affecting anomalies and issues, and anticipating problems before it’s too late.

At our recent Triangle DevOps presentation (slides, audio), I showcased the different tools and services we leverage at Bronto but said little about how they’ve influenced our approach to monitoring. The following process outlines the steps our engineering teams take to address monitoring issues effectively using available technology.

Step 1: Identify

Whenever you notice an issue, you need to identify the underlying cause, based on symptoms. As with a doctor, you need information about what stakeholders are seeing, along with how your systems are performing, in order to make a “diagnosis” and take action.

Though tools such as Cacti and RRDtool have been used for some time, Etsy’s “Measure Anything, Measure Everything” post demonstrates how Graphite and StatsD surpassed them by allowing DevOps teams to capture arbitrary metrics, calculate useful statistics, manipulate data, render ad-hoc graphs, and leverage telemetry for monitoring. Building upon Graphite’s popularity and flexibility, DevOps teams have been able to use Grafana to quickly identify issues through dashboards, Codahale Metrics to capture JVM metrics, and Statsite to calculate larger volumes of rate and timing statistics.

Grafana dashboard showing operational data stored in Graphite about a Redis instance.
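
To make "capture arbitrary metrics" concrete, here is a minimal Python sketch of how an application might emit counters and timers in the plain-text StatsD line format (`name:value|type`) over UDP. The metric names are made up for illustration, and the host and port assume a StatsD-compatible daemon listening locally on the conventional port 8125. It is not how any particular team's setup works, just one way the idea can look in code.

```python
import socket


def format_statsd_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line of the form <name>:<value>|<type>.

    Supported types here: "c" (counter), "ms" (timer), "g" (gauge).
    """
    if metric_type not in {"c", "ms", "g"}:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_statsd_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line over UDP; delivery is fire-and-forget.

    The host and port are assumptions for a locally running daemon.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, chosen only for this example.
    send_statsd_metric(format_statsd_metric("web.requests", 1, "c"))
    send_statsd_metric(format_statsd_metric("web.response_time", 320, "ms"))
```

Because the transport is UDP, instrumented code does not block or fail when the metrics daemon is unavailable, which is a large part of why this style of instrumentation is cheap to add everywhere.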

Step 2: Communicate

As you work to understand the underlying cause of an issue, you need to share the necessary information with engineers, client support, and ultimately all stakeholders. With the scale and complexity of systems today, a team effort is necessary. No one person knows it all or can work on everything.

Chat services such as IRC or Atlassian’s free HipChat let you integrate monitoring systems so that relevant events are broadcast automatically to special-purpose chat rooms, such as operations and client support. These chat rooms become even more effective as other systems, such as continuous integration and deployment tools, post about build health and new releases, providing a central, comprehensive view of events along with discussions about impact and the process for addressing issues.

Operations HipChat chat room with monitoring and deployment notifications, and engineers addressing different aspects of operations.
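
As a rough illustration of broadcasting monitoring events to a chat room, the Python sketch below turns an event into a one-line message and wraps it in an HTTP POST for a generic JSON webhook. The `MonitoringEvent` fields, the `{"message": ...}` payload shape, and the URL are all assumptions made for this example; they do not describe the API of HipChat or any other specific chat service, so check your service's documentation for the real request format.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class MonitoringEvent:
    host: str
    check: str
    status: str  # e.g. "OK", "WARNING", "CRITICAL"
    detail: str


def format_chat_message(event: MonitoringEvent) -> str:
    """Render the event as a single line suitable for a chat room."""
    return f"[{event.status}] {event.check} on {event.host}: {event.detail}"


def build_webhook_request(url: str, event: MonitoringEvent) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a generic webhook.

    The payload shape is hypothetical, not any vendor's actual schema.
    """
    body = json.dumps({"message": format_chat_message(event)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    event = MonitoringEvent("redis01", "memory_usage", "CRITICAL", "92% used")
    # Placeholder URL; replace with the webhook your chat service provides.
    request = build_webhook_request("https://chat.example.invalid/hooks/operations", event)
    print(format_chat_message(event))
    # To actually send it: urllib.request.urlopen(request, timeout=5)
```

Keeping message formatting separate from delivery makes it easy to route the same event to more than one room, for example operations and client support.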

Step 3: Understand

Once issues have been identified and communicated, you need to understand the state of any relevant systems to get to the root cause and determine customer impact. Engineers can only get so far with telemetry; thus, they rely on a system’s logs, which contain low-level information about web requests, backend processes, and supporting services. However, searching log files can be difficult for complex systems if the logs are numerous or large, or if they have restricted access.

Several services, including Splunk, Loggly, and Sentry, transport and/or capture log events from your systems into a central tool that ingests, collates, indexes, and searches them, making it easier to search and report across all your systems. For those with greater capacity needs and smaller budgets, the ELK stack (Elasticsearch, Logstash, and Kibana) lets you operate and manage the service in-house without commercial licensing costs.

Splunk search result on an error log, giving engineers the ability to search across hosts and systems to troubleshoot issues.
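
The value of centralizing logs can be shown with a toy version of the idea: collect lines from several hosts into one time-ordered stream, then filter across all of them at once. This Python sketch assumes a simple, hypothetical line format of `<ISO-8601 timestamp> <LEVEL> <message>` and keeps everything in memory; real tools such as those named above add transport, indexing, and retention on top of this basic shape.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class LogEvent:
    timestamp: str
    host: str
    level: str
    message: str


def parse_line(host: str, line: str) -> Optional[LogEvent]:
    """Parse '<timestamp> <LEVEL> <message>'; return None if malformed."""
    parts = line.strip().split(" ", 2)
    if len(parts) != 3:
        return None
    timestamp, level, message = parts
    return LogEvent(timestamp, host, level, message)


def collect(logs_by_host: Dict[str, Iterable[str]]) -> List[LogEvent]:
    """Merge log lines from every host into one list ordered by timestamp.

    Sorting the timestamp strings works only because they are assumed to
    share one ISO-8601 format and time zone.
    """
    events = []
    for host, lines in logs_by_host.items():
        for line in lines:
            event = parse_line(host, line)
            if event is not None:
                events.append(event)
    return sorted(events, key=lambda e: e.timestamp)


def search(events: Iterable[LogEvent], level: Optional[str] = None,
           text: Optional[str] = None) -> List[LogEvent]:
    """Return events matching the given level and/or message substring."""
    return [
        e for e in events
        if (level is None or e.level == level)
        and (text is None or text.lower() in e.message.lower())
    ]


if __name__ == "__main__":
    logs = {
        "web01": ["2015-03-01T10:00:02Z ERROR timeout calling redis"],
        "worker01": ["2015-03-01T10:00:01Z ERROR redis connection refused"],
    }
    for hit in search(collect(logs), level="ERROR", text="redis"):
        print(hit.timestamp, hit.host, hit.message)
```

A single query over the merged stream shows the worker's connection failure preceding the web timeout, a correlation that is hard to spot when each host's log is read separately.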

Step 4: Enable

Finally, the mobile experience for services has become much more important with the proliferation of smartphones, laptops, and Wi-Fi/MiFi access. Tools such as Hubot and Lita allow engineers and DevOps teams to build chat-based tools to respond to events without host access. PagerDuty, OpsGenie, and VictorOps handle monitoring notifications, escalations, and on-call scheduling to communicate issues to the relevant audiences. Using wiki-based tools like Atlassian’s Confluence, DevOps teams work with engineers to create playbooks for diagnosing and resolving monitoring alerts quickly, without anyone being an island of knowledge. With mobile apps or websites, these services and tools enable companies to reduce the time to detect and resolve issues, for the best customer and employee experiences possible.

Atlassian Confluence wiki page describing the process to diagnose a monitoring check on the PagerDuty service.
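
To illustrate what "escalations" means in practice, here is a small Python sketch of an escalation policy: each level names a contact and how long an alert may stay unacknowledged before that contact is paged. The `EscalationLevel` type, the contacts, and the delays are invented for this example and do not reflect the data model of PagerDuty, OpsGenie, or VictorOps.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationLevel:
    after_minutes: int  # minutes unacknowledged before this level is paged
    contact: str


def contacts_to_page(policy: List[EscalationLevel],
                     minutes_unacknowledged: int) -> List[str]:
    """Return every contact who should have been paged by now, in order."""
    return [
        level.contact
        for level in sorted(policy, key=lambda lvl: lvl.after_minutes)
        if level.after_minutes <= minutes_unacknowledged
    ]


if __name__ == "__main__":
    # Hypothetical policy, for illustration only.
    policy = [
        EscalationLevel(0, "primary on-call"),
        EscalationLevel(15, "secondary on-call"),
        EscalationLevel(30, "engineering manager"),
    ]
    print(contacts_to_page(policy, 20))
```

Writing the policy down as data, alongside the playbook for each alert, is what keeps a response from depending on one person's memory.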

Monitoring has a broad and deep impact on both the operation and the direction of any organization. It encompasses technology, information, and process to equip you with the knowledge to make the necessary decisions today and tomorrow.