Observability

Introduction

  • Observability is the practice of understanding the internal state of a system by analyzing its external outputs
  • It’s composed of three pillars: logs, metrics, and traces
  • Logs are discrete events that happen in a system
  • Metrics are measurements of the state of a system at a point in time
  • Traces are a series of related events (spans) that follow a single request as it moves through a system
  • Monitoring is the practice of collecting and analyzing logs, metrics, and traces to understand the state of a system
  • Alerts are notifications that are sent when a system is in a particular state
  • Dashboards are visualizations of the state of a system
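
To make the three pillars concrete, here is a minimal sketch of what each signal might look like as data. The interface and field names (`LogEvent`, `MetricSample`, `Span`, `groupByTrace`) are assumptions for illustration, not the OpenTelemetry data model.

```typescript
// Hypothetical shapes for the three signal types; all names are illustrative.

/** A log: one discrete event with a timestamp and some context. */
interface LogEvent {
  timestamp: string; // ISO 8601
  level: "debug" | "info" | "warn" | "error";
  message: string;
  attributes: Record<string, string | number | boolean>;
}

/** A metric sample: a numeric measurement of system state at a point in time. */
interface MetricSample {
  metricName: string;
  value: number;
  timestamp: string;
  labels: Record<string, string>;
}

/** A span: one timed step of a request. Spans sharing a traceId form a trace. */
interface Span {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  endMs: number;
}

/** Groups spans by trace and orders each trace by start time. */
function groupByTrace(spans: Span[]): Record<string, Span[]> {
  const traces: Record<string, Span[]> = {};
  for (const span of spans) {
    if (!traces[span.traceId]) traces[span.traceId] = [];
    traces[span.traceId].push(span);
  }
  for (const traceId of Object.keys(traces)) {
    traces[traceId].sort((a, b) => a.startMs - b.startMs);
  }
  return traces;
}
```

Grouping by `traceId` is what turns individual events into "a series of events": the spans only become a trace once they are read together in order.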

Links:

  • OpenTelemetry • OpenTelemetry docs 📚

    OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

  • opentelemetry-js • OpenTelemetry JavaScript Client 🛠️

Logging Events

  • Why log events?
    • Able to answer more “what happened” questions
    • Can search logs to answer questions
    • Can set up alerts when certain events happen
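
The three benefits above can be sketched with a tiny in-memory event log. This is only an illustration: `EventLog`, `alertWhen`, and the event names are hypothetical, and a real setup would ship entries to a log backend instead of keeping them in an array.

```typescript
type Level = "info" | "warn" | "error";
type Fields = Record<string, string | number | boolean>;

interface LogEntry {
  timestamp: string;
  level: Level;
  event: string;
  fields: Fields;
}

interface AlertRule {
  matches: (entry: LogEntry) => boolean;
  notify: (entry: LogEntry) => void;
}

class EventLog {
  private readonly entries: LogEntry[] = [];
  private readonly rules: AlertRule[] = [];

  /** Records one discrete event and notifies every alert rule it matches. */
  log(level: Level, event: string, fields: Fields = {}): LogEntry {
    const entry: LogEntry = { timestamp: new Date().toISOString(), level, event, fields };
    this.entries.push(entry);
    for (const rule of this.rules) {
      if (rule.matches(entry)) rule.notify(entry);
    }
    return entry;
  }

  /** Answers "what happened" questions by filtering recorded events. */
  search(matches: (entry: LogEntry) => boolean): LogEntry[] {
    return this.entries.filter(matches);
  }

  /** Registers a rule that fires when a matching event is logged later. */
  alertWhen(matches: AlertRule["matches"], notify: AlertRule["notify"]): void {
    this.rules.push({ matches, notify });
  }
}

// Usage with made-up event names.
const eventLog = new EventLog();
const firedAlerts: string[] = [];
eventLog.alertWhen(
  (entry) => entry.level === "error",
  (entry) => { firedAlerts.push(entry.event); },
);
eventLog.log("info", "checkout.started", { userId: 42 });
eventLog.log("error", "payment.failed", { userId: 42, reason: "card_declined" });

// "What happened for user 42?" becomes a search over structured fields.
const eventsForUser = eventLog.search((entry) => entry.fields["userId"] === 42);
```

Logging structured fields rather than free text is what makes the search step cheap: the question is a filter on a field, not a regular expression over messages.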

Using logs to debug an error:

  • Pull the whole stack trace from Grafana and find the offending query
  • If you can find the error message in a Grafana response, you can usually narrow it down and find the adjacent bits, like where it logged the query, what it tried to use, etc.
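
The "find the error, then read the adjacent lines" step can be sketched locally once the log lines have been exported. The helper `linesAround` and the sample lines are hypothetical; inside Grafana the same thing would be done with its own search and context features.

```typescript
/**
 * Returns the lines surrounding the first line that contains `needle`,
 * or an empty array when nothing matches.
 */
function linesAround(lines: string[], needle: string, before = 5, after = 2): string[] {
  let index = -1;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].indexOf(needle) !== -1) {
      index = i;
      break;
    }
  }
  if (index === -1) return [];
  return lines.slice(Math.max(0, index - before), index + after + 1);
}

// Made-up log lines: the query is logged just before the error it caused.
const exportedLines = [
  "10:00:01 info GET /api/orders start requestId=abc",
  "10:00:01 debug query=SELECT * FROM orders WHERE user_id = $1 params=[42]",
  "10:00:02 error relation \"orders\" does not exist requestId=abc",
  "10:00:02 info GET /api/orders 500 requestId=abc",
  "10:00:05 info GET /health 200",
];

// One line before the error is enough here to reveal the offending query.
const errorContext = linesAround(exportedLines, "does not exist", 1, 1);
```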

Using logs to debug slow API call response times:

  • Investigate slow prod query response times
  • How to measure average query response times in production using Grafana
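
As a rough sketch of the arithmetic behind "average response time", the snippet below summarizes request durations per route. The `RequestLog` shape and route names are assumptions; in practice the dashboard query would compute this, and a percentile is included because an average alone can hide slow outliers.

```typescript
interface RequestLog {
  route: string;
  durationMs: number;
}

interface RouteSummary {
  count: number;
  avgMs: number;
  p95Ms: number;
}

/** Arithmetic mean; NaN for an empty input. */
function average(values: number[]): number {
  if (values.length === 0) return NaN;
  let sum = 0;
  for (const value of values) sum += value;
  return sum / values.length;
}

/** Nearest-rank percentile for p in (0, 100]; NaN for an empty input. */
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

/** Groups durations by route and summarizes each group. */
function summarizeByRoute(logs: RequestLog[]): Record<string, RouteSummary> {
  const durations: Record<string, number[]> = {};
  for (const entry of logs) {
    if (!durations[entry.route]) durations[entry.route] = [];
    durations[entry.route].push(entry.durationMs);
  }
  const summary: Record<string, RouteSummary> = {};
  for (const route of Object.keys(durations)) {
    const values = durations[route];
    summary[route] = {
      count: values.length,
      avgMs: average(values),
      p95Ms: percentile(values, 95),
    };
  }
  return summary;
}
```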

Grafana Loki:

Other logging tools:

  • Axiom 🛠️
  • LogRocket • Logging and Session Replay for JavaScript Apps 🛠️

Metrics

Define metrics using SLAs/SLOs/SLIs:

Prometheus:

  • Prometheus • Monitoring system and time series database 🛠️

Sloth:

  • uses Prometheus client
  • define SLIs in a short config file
    • Sloth converts that config into lengthy Prometheus alerting rules
  • display results in Grafana
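
SLO alerting rules of this kind are commonly built on error budgets and burn rates. The sketch below shows only that arithmetic; it is not Sloth's config format or its generated rules, and the names `SliWindow`, `errorBudget`, and `burnRate` are assumptions for illustration.

```typescript
/** Counts of good and total events observed over some time window. */
interface SliWindow {
  goodEvents: number;
  totalEvents: number;
}

/** SLI as a ratio of good events; treated as 1 when there was no traffic. */
function sliRatio(window: SliWindow): number {
  return window.totalEvents === 0 ? 1 : window.goodEvents / window.totalEvents;
}

/** Error budget: the fraction of events allowed to fail under the objective. */
function errorBudget(objective: number): number {
  if (objective <= 0 || objective >= 1) {
    throw new Error("objective must be a ratio strictly between 0 and 1");
  }
  return 1 - objective;
}

/**
 * Burn rate: how fast the budget is being spent.
 * 1 means exactly on budget; 2 means the budget runs out in half the period.
 */
function burnRate(window: SliWindow, objective: number): number {
  return (1 - sliRatio(window)) / errorBudget(objective);
}
```

For example, with a 99% objective, a window with 980 good events out of 1000 has a 2% error ratio against a 1% budget, so the burn rate is about 2.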

Other monitoring tools:

  • highlight.io • The open source monitoring platform 🛠️
  • pyrra • Making SLOs with Prometheus manageable, accessible, and easy to use for everyone 🛠️

Traces

Monitoring

How to monitor production errors:

  • How to make it easy to detect and fix production bugs?
  • Why I Don’t Unit Test • Theo 📺
    • Don’t write tests; build safety nets
    • Unit tests slow developers down
    • It’s far more important to be able to quickly detect production issues and roll back or fix them
    • Optimize the pipeline, not the tests
    • Tests are guardrails, not safety nets
    • 80% of unit testing is solved by TS validating inputs and outputs
    • For the rest, it’s better to be prepared to fix bugs quickly when they occur, so that detecting them and deploying fixes isn’t slow

Monitoring tools:

  • LogRocket • Logging and Session Replay for JavaScript Apps 🛠️
  • Sentry • JavaScript Error Tracking and Performance Monitoring 🛠️

Alerts

Issues that might be worth alerting about:

  • Errors logged in production
  • Metrics that we committed to keeping within a certain range (according to an SLO) but that are now outside that range (according to an SLI)
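
The two conditions above can be sketched as a single check. This is a simplified illustration: `evaluateAlerts` and its input shape are hypothetical, and the SLO is assumed to be a lower bound on a ratio SLI (such as availability) rather than a general range.

```typescript
interface AlertInput {
  errorLogCount: number; // errors logged in production during the window
  sli: number;           // measured value, e.g. 0.992 for 99.2% success
  sloTarget: number;     // committed minimum, e.g. 0.995
}

interface PendingAlert {
  reason: "errors_logged" | "sli_outside_slo";
  detail: string;
}

/** Returns one pending alert per condition that currently holds. */
function evaluateAlerts(input: AlertInput): PendingAlert[] {
  const pending: PendingAlert[] = [];
  if (input.errorLogCount > 0) {
    pending.push({
      reason: "errors_logged",
      detail: `${input.errorLogCount} error(s) logged in production`,
    });
  }
  if (input.sli < input.sloTarget) {
    pending.push({
      reason: "sli_outside_slo",
      detail: `SLI ${input.sli} is below SLO target ${input.sloTarget}`,
    });
  }
  return pending;
}
```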

Dashboards

Inbox