Observability

Introduction

  • Observability is the practice of understanding the internal state of a system by analyzing its external outputs
  • It’s composed of three pillars: logs, metrics, and traces
  • Logs are discrete events that happen in a system
  • Metrics are numeric measurements of a system’s state at a point in time
  • Traces follow a single request through the components of a system as a series of linked events (spans)
  • Monitoring is the practice of collecting and analyzing logs, metrics, and traces to understand the state of a system
  • Alerts are notifications that are sent when a system is in a particular state
  • Dashboards are visualizations of the state of a system

Links:

  • OpenTelemetry • “OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior” • OpenTelemetry docs 📚
  • opentelemetry-js • OpenTelemetry JavaScript Client 🛠️
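
A minimal sketch of bootstrapping opentelemetry-js in a Node service, assuming an OTLP collector listening on the default local endpoint; the service name is a placeholder:

```ts
// Bootstraps OpenTelemetry in a Node service. Package names are real;
// the service name and collector endpoint are placeholders.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "checkout-service", // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // default OTLP/HTTP trace endpoint
  }),
  // Auto-instrument common libraries (http, express, pg, etc.)
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```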

General

  • The Power of Open Telemetry with Dr. Sally Wahba • Monitoring focuses on reacting to problems you anticipated; instrumenting a codebase for observability makes it possible to ask questions you haven’t thought of yet • Sally Wahba & Scott Hanselman 🎧

Logging Events

  • Why log events?
    • Able to answer more “what happened” questions
    • Can search logs to answer questions
    • Can set up alerts when certain events happen
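
As a sketch, logging discrete events as structured JSON (here with pino, one common Node logger; the event names and fields are made up) keeps them searchable and alertable later:

```ts
import pino from "pino";

const logger = pino(); // writes one JSON object per line to stdout

// Structured fields instead of bare strings, so a log backend can
// search on them and alert when certain events occur.
// "order.created" / "payment.failed" and their fields are hypothetical.
logger.info({ event: "order.created", orderId: "o-123", totalCents: 4999 }, "order created");
logger.error({ event: "payment.failed", orderId: "o-123", reason: "card_declined" }, "payment failed");
```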

Using logs to debug an error:

  • Pull the whole stack trace from Grafana and find the offending query
  • If you can find the error message in a Grafana response, you can usually narrow it down and find adjacent details, like where the query was logged and what inputs it used

Using logs to debug slow API call response times:

  • Investigate slow production query response times
  • How to measure average query response times in production using Grafana
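
One way to get per-request timings into logs in the first place, sketched as Express middleware (the field names are illustrative); Grafana can then average duration_ms over a time window:

```ts
import express from "express";

const app = express();

// Log a duration for every request so the log backend can aggregate
// average response times per route.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on("finish", () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(
      JSON.stringify({
        event: "http.request",
        method: req.method,
        route: req.path,
        status: res.statusCode,
        duration_ms: durationMs,
      })
    );
  });
  next();
});

app.get("/health", (_req, res) => {
  res.send("ok");
});
app.listen(3000);
```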

Grafana Loki:

Other logging tools:

Metrics

Prometheus:

  • Prometheus • Monitoring system and time series database 🛠️
  • PromQL: …
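
A sketch of exposing metrics for Prometheus to scrape, using the Node prom-client library (an assumption, not cited above); the PromQL in the comment is one example query over the resulting series:

```ts
import express from "express";
import client from "prom-client";

// Request-duration histogram. Once Prometheus scrapes it, average latency
// can be computed with PromQL, e.g. (hypothetical query):
//   rate(http_request_duration_seconds_sum[5m])
//     / rate(http_request_duration_seconds_count[5m])
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["route", "status"],
});

const app = express();

app.get("/work", (_req, res) => {
  const end = httpDuration.startTimer({ route: "/work" });
  res.send("done");
  end({ status: String(res.statusCode) });
});

// The endpoint Prometheus scrapes.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```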

Thanos:

  • Searching Thanos metrics with PromQL
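
Thanos Querier speaks the Prometheus-compatible HTTP API, so a PromQL search can be issued the same way as against Prometheus itself; the base URL and query below are placeholders (Node 18+ for built-in fetch):

```ts
// Query Thanos (or Prometheus) via the standard /api/v1/query endpoint.
// The base URL and PromQL expression are hypothetical.
const base = "http://thanos-querier.example.internal";
const promql = 'sum(rate(http_requests_total{job="api"}[5m]))';

const res = await fetch(`${base}/api/v1/query?query=${encodeURIComponent(promql)}`);
const body = await res.json();
console.log(body.data.result); // one entry per matching label combination
```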

Sloth:

  • Uses the Prometheus client
  • You define SLIs in a short config file
    • Sloth converts that config into lengthy Prometheus alerting rules
  • Results are displayed in Grafana

Other monitoring tools:

  • highlight.io • The open source monitoring platform 🛠️
  • pyrra • Making SLOs with Prometheus manageable, accessible, and easy to use for everyone 🛠️

Using metrics to define SLIs

Measuring Before and After Optimizing

Traces

  • Use tracing to answer questions…
  • Grafana Tempo OSS • Distributed tracing backend 🛠️
  • Distributed Tracing adds visual instrumentation for microservices • Axiom 📖
  • Zipkin
    • UI for local traces
  • Honeycomb
    • UI for production traces
    • Query:
      • “Query in” - choose app
      • “Visualize” - choose aggregation method (e.g. AVG(duration_ms) if interested in average response times for an endpoint over time)
      • “Where”
        • e.g. set name to endpoint name
        • e.g. set service.name to app/service that contains that endpoint
      • Click “Run Query”
      • Go to the “traces” tab and see a breakdown per request
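
For context, a sketch of creating a span manually with @opentelemetry/api; Honeycomb’s name, service.name, and duration_ms fields map to the span name, the SDK’s configured service name, and the time between span start and span.end(). The operation and attribute names are made up:

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service"); // hypothetical tracer name

async function handleCheckout() {
  // The span name ("POST /checkout") is what Honeycomb's "name" filter matches.
  return tracer.startActiveSpan("POST /checkout", async (span) => {
    try {
      span.setAttribute("cart.items", 3); // hypothetical custom attribute
      // ... do the actual work ...
    } finally {
      span.end(); // duration_ms is measured from span start to here
    }
  });
}
```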

Monitoring

  • Site Reliability Engineering • Google SRE Book 📕
  • Can monitor…
    • Production errors
    • Performance
      • CPU usage
      • Memory usage
      • Request times & latency
      • Page load & UX metrics
  • Can monitor via…
    • Watching for specific logs
    • Aggregating log characteristics into metrics with thresholds
    • Watching metrics provided by platforms like K8s
    • Collecting real user monitoring (RUM) metrics to measure the UX impact of waiting for things to load
    • Capturing unhandled errors
    • Using synthetic checks to make requests and exercise certain code paths
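
A sketch of a synthetic check: a script run on a schedule that exercises a code path and fails loudly when it breaks (the URL and latency budget are placeholders; Node 18+ for built-in fetch):

```ts
const url = "https://example.com/api/health"; // placeholder endpoint
const budgetMs = 500; // placeholder latency budget

const start = Date.now();
try {
  const res = await fetch(url);
  const elapsedMs = Date.now() - start;
  if (!res.ok) throw new Error(`status ${res.status}`);
  if (elapsedMs > budgetMs) throw new Error(`slow: ${elapsedMs}ms > ${budgetMs}ms`);
  console.log(JSON.stringify({ event: "synthetic.ok", url, elapsed_ms: elapsedMs }));
} catch (err) {
  console.error(JSON.stringify({ event: "synthetic.failed", url, error: String(err) }));
  process.exit(1); // a non-zero exit lets the scheduler fire an alert
}
```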

How to monitor production errors:

  • How to make it easy to detect and fix production bugs?
  • Why I Don’t Unit Test • Theo 📺
    • Don’t write tests; build safety nets
    • Unit tests slow developers down
    • It’s far more important to be able to quickly detect production issues and roll back or fix them
    • Optimize the pipeline, not the tests
    • Tests are guardrails, not safety nets
    • 80% of unit testing is solved by TS validating inputs and outputs
    • The remaining bugs are better handled by preparing to detect them and deploy fixes quickly when they occur

Monitoring tools:

  • LogRocket • Logging and Session Replay for JavaScript Apps 🛠️
  • Sentry • JavaScript Error Tracking and Performance Monitoring 🛠️
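
A minimal sketch of capturing a production error with @sentry/node (the DSN is a placeholder and riskyOperation is a hypothetical function):

```ts
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: "https://examplePublicKey@o0.ingest.sentry.io/0" }); // placeholder DSN

function riskyOperation(): void {
  throw new Error("boom"); // stand-in for a real failure
}

try {
  riskyOperation();
} catch (err) {
  Sentry.captureException(err); // reported with stack trace and context
  throw err;
}
```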

Alerts

  • When to use log-based vs metric-based alerts?
    • Monitor your logs • Google Cloud docs 📚
      • “When you want to monitor recurring events in your logs over time, use log-based metrics…Log-based metrics are suitable when you want to do any of the following:
        • Count the occurrences of a message, like a warning or error, in your logs and receive a notification when the number of occurrences crosses a threshold.
        • Observe trends in your data, like latency values in your logs, and receive a notification if the values change in an unacceptable way.
        • Create charts to display the numeric data extracted from your logs.”
      • “When you want to be notified anytime a specific message occurs in a log, use log-based alerts…Log-based alerts are well suited for events that you expect to be both rare and important. You don’t want to know about a trend or pattern; you want to know that something occurred.”
  • What’s worth alerting about?
    • Errors logged in production
    • An SLI showing that a metric we committed to keep within a certain range (per an SLO) has moved outside that range

Dashboards

Inbox