People often conflate monitoring and observability, and I can’t blame them. Marketers often use the terms interchangeably. However, monitoring and observability are two fundamentally different but related things. Understanding the differences between the two both technically and intuitively can help you become a better network troubleshooter, architect, and manager. After all, like many buzzwords before it, observability is an important concept if you can get past the fluff.

So, let’s do that. Let’s take a step back, compare what monitoring and observability fundamentally are, explain how monitoring enables observability, and explore what improving a system’s observability entails in the real world.

What’s the difference between observability and monitoring?

Monitoring is a thing you do. Observability is a characteristic that describes a system.

To begin, let’s define each of the terms:

  • “Monitoring” is the process of collecting and recording data to determine a system’s state. For example, a network monitoring tool helps IT professionals determine the health of the networks they manage. In that case, the network is the “system” and the IT professional is using software tools to perform the monitoring.
  • “Observability” is a characteristic that describes how well a system’s internal state can be determined by its outputs. In short, the more accurately you can measure outputs and know a system’s health, the higher that system’s observability.

Using our IT professional example: if the administrator only has ping responses to determine the network’s health, the system’s observability is low compared to a network that also reports on bandwidth, throughput, network flows, and CPU utilization.

It’s important to note that one absolutely affects the other. How the IT pro monitors the network, what tools they employ, and what depth of metrics they collect all determine what system outputs are available. That’s the relationship between monitoring and observability:

Monitoring enables visibility of system outputs (e.g. metrics) that help determine the observability of the system.

Is observability just a buzzword?

Why does observability matter? Is it just another marketing buzzword? After all, when you monitor a system, you can see what goes wrong. If throughput drops or CPU utilization on a firewall spikes, monitoring can tell you. Further, network monitoring, APM, and remote network management tools are what network engineers use to enable “observability” in practice.

All those are fair points, and to an extent, the industry has made observability into a vague buzzword. But, when you drill down to what observability actually is, you can see why it matters. To understand what I mean, let’s start with traditional “monitoring”.

With monitoring, you need to know in advance which data points you are going to track. You can then report and alert against those data points. This allows you to determine when a metric you thought about in advance falls out of range. You can then respond to and address the problem. So far, so good.
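
To make this concrete, here’s a minimal sketch (in Python, not from any particular monitoring product) of what traditional threshold-based monitoring boils down to: every metric and its acceptable range must be defined before anything goes wrong. The metric names and limits are illustrative assumptions.

```python
# Minimal sketch of traditional threshold monitoring: every metric and its
# acceptable range must be known and configured in advance.
THRESHOLDS = {
    "cpu_percent": (0, 85),          # alert if CPU utilization exceeds 85%
    "packet_loss_percent": (0, 2),   # alert if packet loss exceeds 2%
    "throughput_mbps": (100, None),  # alert if throughput drops below 100 Mbps
}

def check(metric: str, value: float) -> list[str]:
    """Return alert messages for any value outside its predefined range."""
    low, high = THRESHOLDS[metric]
    alerts = []
    if low is not None and value < low:
        alerts.append(f"{metric}={value} below {low}")
    if high is not None and value > high:
        alerts.append(f"{metric}={value} above {high}")
    return alerts
```

Anything not captured in that table is invisible, which is exactly the limitation discussed next.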

The challenge with complex systems—like enterprise networks—is that problems you can’t predict often arise. With traditional monitoring, you only have information related to the problems you could predict beforehand, and the root cause is often several steps removed from the original symptom. That makes debugging and root cause analysis difficult. When we talk about observability in the context of modern networks, we’re talking about addressing this problem.

As a result, improving observability with network monitoring software means going beyond simple polling of metrics. You need to be able to go beyond identifying a problem (e.g. high levels of packet loss) and drill down to a root cause (a failing switch or broadcast storm). Doing that requires going beyond simple data aggregation over time and actually correlating data from multiple endpoints.
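
As a rough illustration of what “correlating data from multiple endpoints” means in practice, here’s a hedged sketch that groups anomalies from different devices by time proximity. Real tools use far more sophisticated techniques; the device names and the naive time-window clustering here are purely illustrative assumptions.

```python
# Hypothetical anomaly records from several devices: (timestamp_seconds, device, symptom)
anomalies = [
    (100, "router-1", "high packet loss"),
    (101, "switch-3", "port flapping"),
    (102, "server-7", "high latency"),
    (500, "server-2", "disk full"),
]

def correlate(events, window=5):
    """Group anomalies that occur within `window` seconds of each other.
    A cluster spanning multiple devices hints at a shared root cause."""
    clusters, current = [], []
    for ts, device, symptom in sorted(events):
        if current and ts - current[-1][0] > window:
            clusters.append(current)
            current = []
        current.append((ts, device, symptom))
    if current:
        clusters.append(current)
    return clusters
```

Here the first three anomalies cluster together, pointing an engineer toward a single event (perhaps the flapping switch port) rather than three unrelated tickets.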

This need for correlation and insight beyond simple threshold alerts is a big part of why artificial intelligence and machine learning should be part of your network monitoring strategy in 2021. The faster you can correlate the data from multiple endpoints to get to that root cause, the less downtime and the fewer performance issues your end users will experience.

What are some benefits of increased observability?

Observability allows you to determine why problems occur in complex systems.

Focusing on system-wide observability can improve your visibility, knowledge, and debugging capabilities of a given system. In practical terms, observability can deliver these business benefits:

  • Reduced mean time to resolution (MTTR). This benefit is simple and intuitive. The faster you can determine why a system fails, the faster you can resolve the issue.
  • Lower operational costs. The faster a network engineer can identify a problem, the less time they need to spend troubleshooting and debugging. The result? Over time, the cost of network management goes down. Not to mention the additional time you, as a network engineer, will get back in your day.
  • More effective issue remediation. Suppose a web server locks up intermittently. Traditional monitoring may show you that memory utilization spikes at the same time. But a true observability solution should enable you to drill down to the root cause. For example, suppose you can correlate the memory spikes to specific HTTP requests. From there, log data may expose a bug in the web server code.
  • Better network planning. When you have visibility over your entire network, you have a better picture of data flows, bottlenecks, and where you need to optimize. This allows you to utilize existing resources more efficiently and spend your money on infrastructure that will have the most impact on performance.
  • Happier end users. At the end of the day, there is one reason observability matters: it leads to better performance, which makes end users happier and more productive.
How is observability implemented?

Monitoring and the intelligent aggregation, presentation, and interpretation of MELT data help you improve a system’s observability.

At a high level, effective monitoring is how you “implement observability”. Of course, that means going beyond simple polling and thresholds and being able to correlate discrete data points to determine root cause.

APM and network monitoring tools help enhance observability when they go beyond simple data aggregation to determine exactly why a problem occurred or understand why a system is in its current state. To truly understand observability in the context of network management, it helps to understand the data types associated with it. Generally, observability data fall under the umbrella of MELT: metrics, events, logs, and traces. Let’s take a look at each.

Metrics

Metrics are aggregated measurements captured over specific time intervals. The key components of metrics are:

  • A name
  • A timestamp
  • A (usually numeric) value

For example, if you poll bandwidth utilization every minute using SNMP, you’re capturing a metric. When you think about traditional network monitoring using protocols like IPMI, SNMP, and WMI to poll network devices, most of the telemetry involved consists of metrics.

Metrics enable time-series reporting and let you get a snapshot of overall system health. They can also be structured in a way that makes for efficient storage and processing. The catch is that you must plan ahead to gain value from metrics: if a data point you didn’t consider causes an issue, you have no visibility.
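
A metric sample really is just those three components. Here’s a small Python sketch (the metric name and values are made up for illustration) showing how metrics naturally support time-series aggregation:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str         # e.g. "if_utilization_percent" (hypothetical metric name)
    timestamp: float  # when the sample was taken (seconds)
    value: float      # a (usually numeric) value

# Three samples polled one minute apart, as an SNMP poller might record them
samples = [Metric("if_utilization_percent", t, v)
           for t, v in [(0, 40.0), (60, 55.0), (120, 70.0)]]

def average(metrics):
    """Aggregate samples into a single summary value over the interval."""
    return sum(m.value for m in metrics) / len(metrics)
```

The fixed shape of each record is what makes metrics cheap to store and fast to query compared to the richer data types below.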

Events

Events are discrete occurrences at a specific point in time. A switch port going down or up is an example of an event. Generally, event telemetry includes:

  • An event name (e.g. “port down”)
  • A timestamp
  • Data relevant to the event (e.g. device, category, specific action, user)

Events provide visibility into specific actions and enable granular reporting and ad-hoc queries. However, events generally require significantly more bandwidth and storage space than metrics.
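
An event record carries that free-form context alongside the name and timestamp, which is what enables ad-hoc queries. A minimal sketch, with hypothetical device and field names:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str                               # e.g. "port down"
    timestamp: float                        # when the event occurred
    data: dict = field(default_factory=dict)  # device, category, action, user...

evt = Event("port down", 1625000000.0,
            {"device": "switch-3", "port": "Gi0/12", "category": "link"})

def events_for(events, device):
    """Ad-hoc query: all events reported for a given device."""
    return [e for e in events if e.data.get("device") == device]
```

The variable-size `data` payload is also why events cost more bandwidth and storage than fixed-shape metrics.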

Logs

Logs are similar to events but more granular: records generated by applications, services, and network devices. On Linux systems, /var/log/messages and /var/log/syslog are examples of logs. When you’re in the weeds troubleshooting a specific problem, log data is often your best friend. Log data generally includes:

  • A timestamp
  • Descriptive text

Logs enable granular debugging. They may be structured (e.g. a syslog that follows RFC 5424 formatting) or unstructured, which can make them difficult to aggregate. Either way, storing and parsing logs at scale can be difficult and resource-intensive.
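
To show what “structured” buys you, here’s a hedged sketch that splits the header of an RFC 5424-formatted syslog line into named fields (the sample line is adapted from the RFC’s own examples). A production parser would also need to handle structured-data elements and malformed input.

```python
import re

# Sketch: extract the header fields of an RFC 5424 syslog line.
# Header layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID ...
RFC5424 = re.compile(
    r"<(?P<pri>\d+)>(?P<version>\d+) (?P<timestamp>\S+) (?P<hostname>\S+) "
    r"(?P<app>\S+) (?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)"
)

line = ("<165>1 2003-10-11T22:14:15.003Z mymachine.example.com "
        "evntslog - ID47 - An application event log entry")
fields = RFC5424.match(line).groupdict()
```

An unstructured log line offers no such handle, which is why aggregating mixed log sources at scale gets expensive.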

Traces

Traces are chains of related events. They’re unique in that they’re relative to a specific end-to-end action, which means they can be distributed across multiple systems, whereas a log is relative to a specific device, service, or application. Traces are most common for microservices and HTTP-based transactions. For example, a user authenticating to a web application may generate a trace.

Traces are useful for debugging microservices and for visualizing transactions end-to-end. Currently, there is a lack of standardization, although the W3C’s Trace Context recommendation aims to change this.
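
The W3C Trace Context recommendation standardizes how trace identity is propagated between systems via a `traceparent` HTTP header (`version-traceid-parentid-flags`, all hex). Here’s a minimal sketch of decoding one; the sample header value follows the format from the W3C recommendation:

```python
def parse_traceparent(header: str) -> dict:
    """Decode a W3C Trace Context `traceparent` header:
    2-hex version, 32-hex trace-id, 16-hex parent-id, 2-hex flags."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,          # shared by every span in the trace
        "parent_id": parent_id,        # the immediate caller's span
        "sampled": int(flags, 16) & 1 == 1,
    }

hdr = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
```

Because every service forwards the same `trace_id`, the chain of events can be stitched back together end to end even when it crosses many systems.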

Final thoughts: key takeaways on monitoring and observability

Now that we’ve gone through the relationship between monitoring and observability, the importance of both should be clear: complex systems, like networks, with high levels of observability are easier to manage and troubleshoot. Network monitoring and APM tools that aggregate MELT data and help you correlate information to determine root cause can improve network observability.


Are you looking for tools to help you be more effective and efficient? Try Auvik risk-free for 14 days, and see the difference going beyond monitoring can do for you.