Observability • Michael Uloth

Google Cloud • how to implement observability in a GC project

Introduction

Observability is a practice of monitoring and understanding the internal state of a system by analyzing its external outputs
It’s composed of three pillars: logs, metrics, and traces
Logs are discrete events that happen in a system
Metrics are measurements of the state of a system at a point in time
Traces are a series of events that happen in a system
Monitoring is the practice of collecting and analyzing logs, metrics, and traces to understand the state of a system
Alerts are notifications that are sent when a system is in a particular state
Dashboards are visualizations of the state of a system

Links:

OpenTelemetry • “OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior” • OpenTelemetry docs 📚
opentelemetry-js • OpenTelemetry JavaScript Client 🛠️

General

The Power of Open Telemetry with Dr. Sally Wahba • Monitoring is proactive and focused on reacting to anticipated problems; instrumenting a codebase for observability is focused on making it possible to ask questions you haven’t thought of yet • Sally Wahba & Scott Hanselman 🎧

Logging Events

Why log events?
- Able to answer more “what happened” questions
- Can search logs to answer questions
- Can set up alerts when certain events happen

Using logs to debug an error:

Pull the whole stack trace from Grafana and find the offending query
If you can find the error message in a Grafana response, you can usually you narrow it down and find the adjacent bits like where it logged the query, what it tried to use, etc.

Using logs to debug slow API call response times:

investigate slow prod query response times
How to measure average query response times in production using Grafana

Grafana Loki:

Grafana Loki OSS • Log aggregation system inspired by Prometheus 🛠️
Promtail • Grafana Docs 📚
Using the Loki Query Language…
- Log queries
- Metric queries
- Query examples
- Most LokiQL line filters (e.g. |=, !=, |~, !~) are case sensitive by default, but you can ignore case by using the regex line filters (|~, !~) with the (?i) flag
  - So, use {app="foo"} |~ "(?i)error" instead of {app="foo"} |= "error" + {app="foo"} |= "Error", + {app="foo"} |= "ERROR", etc.
  - How to search logs in Loki without worrying about the case | Grafana Labs
  - Loki Regex Syntax: Syntax · google/re2 Wiki
Searching Loki Logs with LokiQL
- …

Other logging tools:

Axiom 🛠️
- Axiom plugin for Grafana | Grafana Labs
LogRocket • Logging and Session Replay for JavaScript Apps 🛠️

Metrics

Prometheus:

Prometheus • Monitoring system and time series database 🛠️
PromQL: …

Thanos:

Searching Thanos metrics with PromQL
- …

Sloth:

uses Prometheus client
define SLIs in a short config file
- Sloth converts that config into lengthy Prometheus alerting rules
display results in Grafana

Other monitoring tools:

highlight.io • The open source monitoring platform 🛠️
pyrra • Making SLOs with Prometheus manageable, accessible, and easy to use for everyone 🛠️

Using metrics to define SLIs

Service Level Agreements (SLAs):
- what we’ve committed to
Service Level Objectives (SLOs):
- what we intend
- very customized per service
- Chapter 2: Implementing SLOs • Google SRE Book 📕
- Appendix A: Example SLO Document • Google SRE Book 📕
- Appendix B: Example Error Budget Policy • Google SRE Book 📕
Service Level Indicators (SLIs):
- what metrics we monitor to determine if we’re achieving our SLOs

Traces

Use tracing to answer questions…
Grafana Tempo OSS • Distributed tracing backend 🛠️
Distributed Tracing adds visual instrumentation for microservices • Axiom 📖
Zipkin
- UI for local traces
Honeycomb
- UI for production traces
- Query:
  - “Query in” - choose app
  - “Visualize” - choose aggregation method (e.g. AVG(duration_ms) if interested in average response times for an endpoint over time)
  - “Where”
    - e.g. set name to endpoint name
    - e.g. set service.name to app/service that contains that endpoint
  - Click “Run Query”
  - Go to the “traces” tab and see a breakdown per request

Monitoring

Site Reliability Engineering • Google SRE Book 📕
Can monitor…
- Production errors
- Performance
  - CPU usage
  - Memory usage
  - Request times & latency
  - Page load & UX metrics
Can monitor via…
- Watching for specific logS
- Aggregating log characteristics into metrics with thresholds
- Watching metrics provided by platforms like K8s
- Collecting real user metrics (RUM) to measure UX impact of waiting for things to load
- Capturing unhandled errors
- Using synthetic checks to make requests and exercise certain code paths

How to monitor production errors:

How to make it easy to detect and fix production bugs?
Why I Don’t Unit Test • Theo 📺
- Don’t write tests; build safety nets
- unit tests slow developers down
- It’s far more important to be able to quickly detect production issues and roll back or fix them
- Optimize the pipeline, not the tests
- Tests are guardrails, not safety nets
- 80% of unit testing is solved by TS validating inputs and outputs
- The rest are for bugs that it’s better to prepare to fix quickly when they occur so detecting and deploying fixes isn’t slow
- How to make it easy to detect and fix production bugs?

Monitoring tools:

LogRocket • Logging and Session Replay for JavaScript Apps 🛠️
Sentry • JavaScript Error Tracking and Performance Monitoring 🛠️

Alerts

When to use log-based vs metric-based alerts?
- Monitor your logs • Google Cloud docs 📚
  - “When you want to monitor recurring events in your logs over time, use log-based metrics…Log-based metrics are suitable when you want to do any of the following:
    - Count the occurrences of a message, like a warning or error, in your logs and receive a notification when the number of occurrences crosses a threshold.
    - Observe trends in your data, like latency values in your logs, and receive a notification if the values change in an unacceptable way.
    - Create charts to display the numeric data extracted from your logs.”
  - “When you want to be notified anytime a specific message occurs in a log, use log-based alerts…Log-based alerts are well suited for events that you expect to be both rare and important. You don’t want to know about a trend or pattern; you want to know that something occurred.”
What’s worth alerting about?
- Errors logged in production
- Metrics we committed to keeping within a certain range (according to an SLO) now outside that range (according to an SLI)

Dashboards

Grafana • The open observability platform 🛠️
Annotating visualizations: Annotate visualizations • Grafana Docs 📚

Inbox

Set up and observe a Spring Boot application with Grafana Cloud, Prometheus, and OpenTelemetry | Grafana Labs • A step-by-step guide to setting up a Spring Boot app and correlating your metrics, logs, and traces in Grafana Cloud.
Free selfhosted lab monitoring with Google Cloud Platform - Academy @ PointToSource
- Followed up with this simpler option (though it doesn’t practice the GCP/terraform workflow) - Free Remote Status Monitoring for your Server
Logging in Kubernetes: EFK vs PLG Stack - covers Promtail, Loki, Grafana (PLG) stack
grafana: alerting: notifications: Create mute timings | Grafana documentation
Alerting | Grafana documentation - intro to Grafana Alerting
Grafana Monitoring on a Raspberry Pi | Alex Hyett - instructions for self-hosted docker compose setup
grafana: 6 easy ways to improve your log dashboards with Grafana and Grafana Loki | Grafana Labs
thanos: Thanos (Multi Cluster Prometheus) Tutorial: Global View - Long Term Storage - Kubernetes
Cheat Sheet - Loki - Seb’s IT blog
Cheat Sheet - Prometheus - Seb’s IT blog
How to Setup Alerting With Loki - Ruan Bekker’s Blog
Detecting Unexpected Errors in Production
- crash/exception detection
- Custom log-based and metric-based alerts only catch anticipated issues
- Same goes for tests
- It’s useful to also add automatic reporting of unanticipated errors in production (recurring patterns can tell you what additional tests, metrics and alerts may be worth adding)
- Tools:
  - Sentry vs Logging • Sentry does smart stuff with error data to make bugs easier to find and fix. Logs keep complete, auditable history. Both are complementary practices.
  - Rollbar docs • Rollbar 📚
Grafana: switch from “Loki” to “Prometheus” in dropdown in nav bar when querying metrics (instead of logs)
Grafana: can copy any query (after lcicking “edit” on any dashboard) into grafana explore and adjust it (need to change any $ variables except $__interval to constants)
What is Thanos? (metrics-related)
- Thanos = longer term storage of prometheus; choose “Thanos” from Grafana dropdown to reach back farther in time; Thanos also stores other non-prometheus logs
Introduction to PromQL, the Prometheus query language | Grafana Labs
Thanos - Highly available Prometheus setup with long term storage capabilities
pagerduty: Maintenance Windows - use maintenance windows to temporarily disable incident notifications during a scheduled time (e.g. a holiday break)
grafana: prometheus: promql: Query functions | Prometheus
The Memory Leak Solution You Wish You Knew Sooner - how to profile and analyze why your applications memory usage keeps rising and rising (a.k.a. a memory leak) - DevOps Toolbox
Benchmarking: This Lesson Taught Me How To Do Better Benchmarks • Improving benchmark accuracy by measuring different things in a cleaner environment instead of timing function runs on your laptop • The Primeagen 📺
What Is Observability? Key Components and Best Practices • Honeycomb 📖
GitHub - hatoo/oha: Ohayou(おはよう), HTTP load generator, inspired by rakyll/hey with tui animation. • Use instead of current custom script to get avg response times? Or is this intended for testing challenging numbers of simultaneous requests?

Related