Introduction
- Observability is a practice of monitoring and understanding the internal state of a system by analyzing its external outputs
- It’s composed of three pillars: logs, metrics, and traces
- Logs are discrete events that happen in a system
- Metrics are measurements of the state of a system at a point in time
- Traces are a series of related events that follow a single request as it moves through a system
- Monitoring is the practice of collecting and analyzing logs, metrics, and traces to understand the state of a system
- Alerts are notifications that are sent when a system is in a particular state
- Dashboards are visualizations of the state of a system
Links:
- OpenTelemetry • “OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior” • OpenTelemetry docs 📚
- opentelemetry-js • OpenTelemetry JavaScript Client 🛠️
General
- The Power of Open Telemetry with Dr. Sally Wahba • Monitoring is proactive and focused on reacting to anticipated problems; instrumenting a codebase for observability is focused on making it possible to ask questions you haven’t thought of yet • Sally Wahba & Scott Hanselman 🎧
Logging Events
- Why log events?
- Able to answer more “what happened” questions
- Can search logs to answer questions
- Can set up alerts when certain events happen
Using logs to debug an error:
- Pull the whole stack trace from Grafana and find the offending query
- If you can find the error message in a Grafana response, you can usually narrow it down and find the adjacent bits like where it logged the query, what it tried to use, etc.
Using logs to debug slow API call response times:
- investigate slow prod query response times
- How to measure average query response times in production using Grafana
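One way to sanity-check response times outside Grafana is to compute them from exported log lines. This is a minimal sketch, assuming each request log line carries a `duration_ms=<number>` field; that format, the sample lines, and the helper names are hypothetical and not taken from any particular logger.

```python
import math
import re
from statistics import mean

# Hypothetical log format: each request line contains "duration_ms=<number>".
DURATION_RE = re.compile(r"duration_ms=(\d+(?:\.\d+)?)")


def extract_durations(lines):
    """Return the duration (in ms) from every line that has one."""
    durations = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations


def percentile(values, pct):
    """Nearest-rank percentile; pct is between 0 and 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample lines, as they might look after exporting from a log viewer.
sample_lines = [
    "level=info path=/api/items duration_ms=120",
    "level=info path=/api/items duration_ms=80",
    'level=error path=/api/items msg="query failed"',
    "level=info path=/api/items duration_ms=400",
]

durations = extract_durations(sample_lines)
average_ms = mean(durations)
p95_ms = percentile(durations, 95)
```

An average alone can hide a few very slow requests, which is why the sketch also computes a percentile.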
Grafana Loki:
- Grafana Loki OSS • Log aggregation system inspired by Prometheus 🛠️
- Promtail • Grafana Docs 📚
- Using the Loki Query Language…
- Log queries
- Metric queries
- Query examples
- Most LogQL line filters (e.g. |=, !=, |~, !~) are case sensitive by default, but you can ignore case by using the regex line filters (|~, !~) with the (?i) flag
- So, use {app="foo"} |~ "(?i)error" instead of {app="foo"} |= "error" + {app="foo"} |= "Error" + {app="foo"} |= "ERROR", etc.
- How to search logs in Loki without worrying about the case | Grafana Labs
- Loki Regex Syntax: Syntax · google/re2 Wiki
- Searching Loki Logs with LokiQL
- …
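The difference between a case-sensitive line filter and a regex filter with the (?i) flag can be tried out locally. This is a rough Python sketch, not Loki itself: Python's `re` module is not RE2, but the inline `(?i)` flag behaves the same way for a simple pattern like this one. The sample log lines are made up.

```python
import re

log_lines = [
    "ERROR: connection refused",
    "Error while parsing payload",
    "request failed with error code 500",
    "all good",
]

# Rough equivalent of |= "error": case-sensitive substring match.
case_sensitive = [line for line in log_lines if "error" in line]

# Rough equivalent of |~ "(?i)error": regex match with the ignore-case flag.
pattern = re.compile(r"(?i)error")
case_insensitive = [line for line in log_lines if pattern.search(line)]
```

Here the case-sensitive filter keeps only the lowercase line, while the `(?i)` version keeps all three error lines with a single pattern.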
Other logging tools:
Metrics
Prometheus:
- Prometheus • Monitoring system and time series database 🛠️
- PromQL: …
Thanos:
- Searching Thanos metrics with PromQL
- …
Sloth:
- uses Prometheus client
- define SLIs in a short config file
- Sloth converts that config into lengthy Prometheus alerting rules
- display results in Grafana
Other monitoring tools:
- highlight.io • The open source monitoring platform 🛠️
- pyrra • Making SLOs with Prometheus manageable, accessible, and easy to use for everyone 🛠️
Using metrics to define SLIs
- Service Level Agreements (SLAs):
- what we’ve committed to
- Service Level Objectives (SLOs):
- what we intend
- very customized per service
- Chapter 2: Implementing SLOs • Google SRE Book 📕
- Appendix A: Example SLO Document • Google SRE Book 📕
- Appendix B: Example Error Budget Policy • Google SRE Book 📕
- Service Level Indicators (SLIs):
- what metrics we monitor to determine if we’re achieving our SLOs
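The relationship between an SLI and an SLO can be sketched with a little arithmetic. This is a minimal illustration, assuming an availability-style SLI (good requests divided by total requests); the class name, function names, and numbers are hypothetical and are not how Sloth or pyrra compute things internally.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySlo:
    """A hypothetical availability SLO: target fraction of good requests."""

    target: float  # e.g. 0.99 means 99% of requests should succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """The measured indicator: fraction of requests that were good."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo: AvailabilitySlo, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


slo = AvailabilitySlo(target=0.99)
sli = availability_sli(good_requests=9_950, total_requests=10_000)
meeting_slo = sli >= slo.target
budget_left = error_budget_remaining(slo, good_requests=9_950, total_requests=10_000)
```

With these made-up numbers, 50 of the 100 allowed failures have been used, so roughly half of the error budget remains.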
Measuring Before and After Optimizing
- What to Expect When You’re Optimizing • Tim Kadlec 📖
Traces
- Use tracing to answer questions…
- Grafana Tempo OSS • Distributed tracing backend 🛠️
- Distributed Tracing adds visual instrumentation for microservices • Axiom 📖
- Zipkin
- UI for local traces
- Honeycomb
- UI for production traces
- Query:
- “Query in” - choose app
- “Visualize” - choose aggregation method (e.g. AVG(duration_ms) if interested in average response times for an endpoint over time)
- “Where”
- e.g. set name to endpoint name
- e.g. set service.name to app/service that contains that endpoint
- Click “Run Query”
- Go to the “traces” tab and see a breakdown per request
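To make the query steps above concrete, here is a small Python sketch of what the aggregation amounts to: filter spans by `service.name` and `name`, then average `duration_ms`. The span records and the service and endpoint names are invented for illustration, and this is not Honeycomb's API.

```python
from statistics import mean

# Hypothetical span records; field names mirror the ones used in the query.
spans = [
    {"service.name": "shop-api", "name": "GET /items", "duration_ms": 120.0},
    {"service.name": "shop-api", "name": "GET /items", "duration_ms": 80.0},
    {"service.name": "shop-api", "name": "POST /orders", "duration_ms": 300.0},
    {"service.name": "billing", "name": "GET /items", "duration_ms": 999.0},
]


def avg_duration_ms(spans, service_name, endpoint_name):
    """AVG(duration_ms) WHERE service.name = ... AND name = ..."""
    matching = [
        span["duration_ms"]
        for span in spans
        if span["service.name"] == service_name and span["name"] == endpoint_name
    ]
    return mean(matching) if matching else None


result = avg_duration_ms(spans, "shop-api", "GET /items")
```

Filtering on both fields matters because the same endpoint name can appear in more than one service.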
Monitoring
- Site Reliability Engineering • Google SRE Book 📕
- Can monitor…
- Production errors
- Performance
- CPU usage
- Memory usage
- Request times & latency
- Page load & UX metrics
- Can monitor via…
- Watching for specific logs
- Aggregating log characteristics into metrics with thresholds
- Watching metrics provided by platforms like K8s
- Collecting real user metrics (RUM) to measure UX impact of waiting for things to load
- Capturing unhandled errors
- Using synthetic checks to make requests and exercise certain code paths
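A synthetic check from the list above could look roughly like this. It is a sketch under assumptions: the `fetch_status` callable is a stand-in for whatever HTTP client call exercises the code path, and the pass criteria (2xx status within a time limit) are just one reasonable choice.

```python
import time
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    ok: bool
    status: int
    elapsed_ms: float


def run_synthetic_check(
    fetch_status: Callable[[], int],
    max_elapsed_ms: float = 1000.0,
) -> CheckResult:
    """Call the endpoint once; pass if it returns 2xx within the time limit."""
    started = time.perf_counter()
    try:
        status = fetch_status()
    except Exception:
        status = 0  # treat network failures as a failed check
    elapsed_ms = (time.perf_counter() - started) * 1000
    ok = 200 <= status < 300 and elapsed_ms <= max_elapsed_ms
    return CheckResult(ok, status, elapsed_ms)


def unreachable() -> int:
    """Fake fetcher that simulates a network failure."""
    raise ConnectionError("host unreachable")


# Fake fetchers stand in for real HTTP requests in this sketch.
healthy = run_synthetic_check(lambda: 200)
failing = run_synthetic_check(lambda: 503)
broken = run_synthetic_check(unreachable)
```

Passing the request in as a function keeps the check easy to test and lets the same logic run against any endpoint on a schedule.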
How to monitor production errors:
- How to make it easy to detect and fix production bugs?
- Why I Don’t Unit Test • Theo 📺
- Don’t write tests; build safety nets
- unit tests slow developers down
- It’s far more important to be able to quickly detect production issues and roll back or fix them
- Optimize the pipeline, not the tests
- Tests are guardrails, not safety nets
- 80% of unit testing is solved by TS validating inputs and outputs
- The rest are for bugs that it’s better to prepare to fix quickly when they occur so detecting and deploying fixes isn’t slow
Monitoring tools:
- LogRocket • Logging and Session Replay for JavaScript Apps 🛠️
- Sentry • JavaScript Error Tracking and Performance Monitoring 🛠️
Alerts
- When to use log-based vs metric-based alerts?
- Monitor your logs • Google Cloud docs 📚
- “When you want to monitor recurring events in your logs over time, use log-based metrics…Log-based metrics are suitable when you want to do any of the following:
- Count the occurrences of a message, like a warning or error, in your logs and receive a notification when the number of occurrences crosses a threshold.
- Observe trends in your data, like latency values in your logs, and receive a notification if the values change in an unacceptable way.
- Create charts to display the numeric data extracted from your logs.”
- “When you want to be notified anytime a specific message occurs in a log, use log-based alerts…Log-based alerts are well suited for events that you expect to be both rare and important. You don’t want to know about a trend or pattern; you want to know that something occurred.”
- What’s worth alerting about?
- Errors logged in production
- Metrics we committed to keeping within a certain range (according to an SLO) that are now outside that range (according to an SLI)
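The distinction quoted above between log-based metrics and log-based alerts can be sketched in a few lines. This is an illustration of the idea only, not Google Cloud's API; the function names, sample log lines, and threshold are all hypothetical.

```python
def log_based_metric_alert(lines, needle, threshold):
    """Log-based metric: count occurrences and notify once the count reaches a threshold."""
    count = sum(1 for line in lines if needle in line)
    return count >= threshold


def log_based_alert(lines, needle):
    """Log-based alert: notify as soon as the message appears at all."""
    return any(needle in line for line in lines)


# Made-up log lines from one evaluation window.
window = [
    "WARN retrying request",
    "WARN retrying request",
    "INFO ok",
    "FATAL data corruption detected",
]

# Recurring event: only worth a notification once it becomes frequent.
too_many_retries = log_based_metric_alert(window, "WARN retrying", threshold=5)

# Rare and important event: a single occurrence is enough.
corruption_seen = log_based_alert(window, "data corruption")
```

Two retries in the window stay below the threshold of five, while a single corruption message triggers immediately.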
Dashboards
- Grafana • The open observability platform 🛠️
- Annotating visualizations: Annotate visualizations • Grafana Docs 📚
Inbox
- Set up and observe a Spring Boot application with Grafana Cloud, Prometheus, and OpenTelemetry | Grafana Labs • A step-by-step guide to setting up a Spring Boot app and correlating your metrics, logs, and traces in Grafana Cloud.
- Free selfhosted lab monitoring with Google Cloud Platform - Academy @ PointToSource
- Followed up with this simpler option (though it doesn’t practice the GCP/terraform workflow) - Free Remote Status Monitoring for your Server
- Logging in Kubernetes: EFK vs PLG Stack - covers Promtail, Loki, Grafana (PLG) stack
- grafana: alerting: notifications: Create mute timings | Grafana documentation
- Alerting | Grafana documentation - intro to Grafana Alerting
- Grafana Monitoring on a Raspberry Pi | Alex Hyett - instructions for self-hosted docker compose setup
- grafana: 6 easy ways to improve your log dashboards with Grafana and Grafana Loki | Grafana Labs
- thanos: Thanos (Multi Cluster Prometheus) Tutorial: Global View - Long Term Storage - Kubernetes
- Detecting Unexpected Errors in Production
- crash/exception detection
- Custom log-based and metric-based alerts only catch anticipated issues
- Same goes for tests
- It’s useful to also add automatic reporting of unanticipated errors in production (recurring patterns can tell you what additional tests, metrics and alerts may be worth adding)
- Tools:
- Sentry vs Logging • Sentry does smart stuff with error data to make bugs easier to find and fix. Logs keep complete, auditable history. Both are complementary practices.
- Rollbar docs • Rollbar 📚
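Automatic reporting of unanticipated errors usually hooks into the runtime's last-resort exception handler. The following is a minimal Python sketch of that idea using `sys.excepthook`; the `reported` list is a hypothetical stand-in for an error-tracking service, and real tools such as Sentry or Rollbar ship their own SDKs rather than working this way line for line.

```python
import sys
import traceback

reported = []  # hypothetical stand-in for an error-tracking service


def report_error(exc_type, exc_value, exc_traceback):
    """Record enough detail to find and group the error later."""
    reported.append(
        {
            "type": exc_type.__name__,
            "message": str(exc_value),
            "stack": "".join(
                traceback.format_exception(exc_type, exc_value, exc_traceback)
            ),
        }
    )


def install_unhandled_error_reporting():
    """Wrap the existing hook so unhandled exceptions are reported first."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_traceback):
        report_error(exc_type, exc_value, exc_traceback)
        previous_hook(exc_type, exc_value, exc_traceback)  # keep default output

    sys.excepthook = hook


install_unhandled_error_reporting()
```

Because the hook only fires for exceptions nothing else caught, it complements the custom alerts above instead of replacing them.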
- Grafana: switch from “Loki” to “Prometheus” in dropdown in nav bar when querying metrics (instead of logs)
- Grafana: can copy any query (after clicking “edit” on any dashboard) into Grafana Explore and adjust it (need to change any $ variables except $__interval to constants)
- What is Thanos? (metrics-related)
- Thanos = longer term storage of Prometheus; choose “Thanos” from Grafana dropdown to reach back farther in time; Thanos also stores other non-Prometheus logs
- Introduction to PromQL, the Prometheus query language | Grafana Labs
- Thanos - Highly available Prometheus setup with long term storage capabilities
- pagerduty: Maintenance Windows - use maintenance windows to temporarily disable incident notifications during a scheduled time (e.g. a holiday break)
- grafana: prometheus: promql: Query functions | Prometheus
- The Memory Leak Solution You Wish You Knew Sooner - how to profile and analyze why your application’s memory usage keeps rising and rising (a.k.a. a memory leak) - DevOps Toolbox
- Benchmarking: This Lesson Taught Me How To Do Better Benchmarks • Improving benchmark accuracy by measuring different things in a cleaner environment instead of timing function runs on your laptop • The Primeagen 📺
- What Is Observability? Key Components and Best Practices • Honeycomb 📖
- GitHub - hatoo/oha: Ohayou(おはよう), HTTP load generator, inspired by rakyll/hey with tui animation. • Use instead of current custom script to get avg response times? Or is this intended for testing challenging numbers of simultaneous requests?