Practical Challenges and Solutions for Modern Architecture
Logs describe discrete events and transactions within a system. They consist of messages that your application generates at precise points in time and that, read together, tell you a story about what’s happening.
Metrics consist of time-series data that describes a measurement of resource utilization or behavior. They are useful because they provide insights into the behavior and health of a system, especially when aggregated.
Traces use unique IDs to track down individual requests as they hop from one service to another. They can show you how a request travels from one end to the other.
Suddenly, multiple alerts fire off to notify you of failures. You now know that requests are failing.
Next, you triage the alerts to learn which failures are most urgent, identify which teams you need to coordinate with, and determine whether there is any customer impact. You scale up the infrastructure serving those requests and remediate the issue.
Later on, you and your team perform a postmortem investigation of the issue. You learn that one of the components in the payments processor system is scanning multiple users and causing CPU cycles to increase tenfold—far more than necessary. You determine that this increase was the root cause of the incident. You and the team proceed to fix the component permanently.
1 Richard Cook, “How Complex Systems Fail,” Cognitive Technologies Laboratory, 2000, https://oreil.ly/zw73j.
2 Cindy Sridharan, Distributed Systems Observability (O’Reilly Media, 2018), https://oreil.ly/v8PUu.
3 Sridharan, Distributed Systems Observability.
4 Rob Skillington, “SREcon21—Taking Control of Metrics Growth and Cardinality: Tips for Maximizing Your Observability,” USENIX, October 14, 2021, YouTube video, 27:21, https://oreil.ly/gvAq7.
5 Adapted from an image in Rachel Dines, “Explain It Like I’m 5: The Three Phases of Observability,” Chronosphere, August 10, 2021, https://chronosphere.io/learn/explain-it-like-im-5-the-three-phases-of-observability.
Counters are cumulative metrics that can only increase. In the preceding example, myapp_request_count_total is a counter: its value either increases or stays the same. As the name implies, counter metrics are best used for counting things like HTTP requests, RPC calls, or even business metrics like the number of sales.
Gauges are single numerical values that can either increase or decrease. Examples of metrics measured with gauges include temperature, speed, and memory usage. Gauges are also a good fit for values that regularly fall as well as rise, such as the number of concurrent HTTP requests.
Histograms and summaries both sample observations, such as request durations, request sizes, or rankings. The two are similar, but histograms are generally the better choice because they can be aggregated, and quantiles can then be computed over the aggregate. For more information about the difference between histograms and summaries, see the Prometheus histogram documentation.
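Because histogram buckets are plain counters, they can be summed across instances before a quantile is computed. As a quick sketch (assuming a histogram named http_request_duration_seconds, whose _sum and _count series also appear below), the following query computes a 95th-percentile latency from the aggregated buckets:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

A summary’s client-side quantiles cannot be meaningfully combined this way, which is why histograms are the safer default.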
sum(metric_name)
max(metric_name)
count(metric_name)
sum(rate(myapp_request_count_total[1m]))
max(rate(myapp_request_count_total[1m]))
rate(myapp_request_count_total[1m])
increase(business_sales[1h])
rate(business_question_set_completed[1h])
rate(http_request_duration_seconds_sum[5m])
rate(http_request_duration_seconds_count[5m])
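For instance, dividing the first expression by the second yields the average request duration over the five-minute window:

rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])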
sum by (status) (metric_name)
sum without (status) (metric_name)
max(rate(myapp_request_count_total{status="200"}[1m]))
rate(myapp_request_count_total{status="503"}[1m])>0
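To page someone on a query like this, you would typically wrap it in a Prometheus alerting rule. The following is a minimal sketch; the group name, alert name, duration, and labels are illustrative choices, not part of the original example:

groups:
  - name: myapp-alerts
    rules:
      - alert: MyAppServerErrors          # hypothetical alert name
        expr: rate(myapp_request_count_total{status="503"}[1m]) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "myapp is returning 503 responses"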
myapp_request_count_total{endpoint="/test"} 226
myapp_request_count_total{endpoint="/test"} 226
myapp_request_count_created{endpoint="/test"} 163
myapp_request_bounce_total{endpoint="/test"} 440
api_http_requests_total{method="POST",handler="/messages"} 60
api_http_requests_total{method="POST",handler="/messages",status="200"} 30
api_http_requests_total{method="POST",handler="/messages",status="503"} 30
job_data_processed{database="bronze",pod="pod1"} 1000
job_data_processed{database="bronze",pod="pod1"} 10
job_data_processed{database="bronze",pod="pod2"} 10
...
job_data_processed{database="bronze",pod="pod100"} 10
job_data_processed{database="bronze",pod="pod1",type="json"} 5
job_data_processed{database="bronze",pod="pod2",type="json"} 5
...
job_data_processed{database="bronze",pod="pod100",type="json"} 5
job_data_processed{database="bronze",pod="pod1",type="csv"} 5
job_data_processed{database="bronze",pod="pod2",type="csv"} 5
...
job_data_processed{database="bronze",pod="pod100",type="csv"} 5
api_http_requests_total{method="POST",handler="/messages",pod="pod2"} 600
api_http_requests_total{method="POST",handler="/messages",pod="pod2",from="frontendservice"} 300
api_http_requests_total{method="POST",handler="/messages",pod="pod2",from="backendservice"} 300
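If a consumer only needs the original total, the new dimension can be aggregated away at query time using the operators shown earlier, for example:

sum without (from) (api_http_requests_total{method="POST",handler="/messages",pod="pod2"})

Note that this recovers the original view in queries but does not reduce the number of series you store, so the cardinality cost remains.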
1 Sridharan, Distributed Systems Observability.
2 Rob Ewaschuk, “Monitoring Distributed Systems,” chap. 6 in Site Reliability Engineering, ed. Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly Media, 2016), https://oreil.ly/3xCYp.
3 Rob Skillington, “What Is High Cardinality,” Chronosphere, February 24, 2022, https://chronosphere.io/learn/what-is-high-cardinality.
4 Joel Bastos and Pedro Araujo, “Cardinality,” in Hands-On Infrastructure Monitoring with Prometheus (Packt, 2019), https://oreil.ly/vmk4Z.
5 Bastos and Araujo, “Cardinality.”
6 Lydia Parziale et al., chap. 10 in Getting Started with z/OS Container Extensions and Docker (Redbooks, 2021), https://oreil.ly/e21Wc.
7 Eric Carter, 2018 Docker Usage Report (Sysdig, May 29, 2018), https://oreil.ly/ftZum.
Prometheus uses a dimensional metric data model that allows flexibility when labeling metric data. You can use these dimensions to query metrics with PromQL, its query language.
Prometheus can use service discovery native to the system it is monitoring. For example, it can discover pod endpoints automatically using Kubernetes’s own service discovery APIs.
Prometheus ships with an Alertmanager component that can push notifications to paging and messaging systems like PagerDuty and Slack. Alerting rules and thresholds are written in PromQL, and Alertmanager routes and delivers the resulting alerts.
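As a rough illustration of that last point, an Alertmanager routing configuration might look like the following sketch; the receiver names, Slack channel, and key placeholders are assumptions for the example, not values from this chapter:

route:
  receiver: slack-default
  routes:
    - match:
        severity: page
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: <slack-webhook-url>    # hypothetical webhook
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: <pagerduty-integration-key>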
The use case for Prometheus is too generic: it isn’t built for any one type of application, so you have to configure it for your specific system, including creating metadata labels for each metric type. The relabeling configuration becomes complex as you collect more metrics.
The more dimensions your metrics have, the more complicated it gets to configure Prometheus scraping. You can ease this problem by using tools like PromLens and by adding labels to metrics only when absolutely necessary.
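One low-effort way to keep labels in check is to drop, at scrape time, a label you never query. This is a minimal sketch of a metric_relabel_configs entry; the job, target, and the pod_template_hash label are illustrative choices:

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
    metric_relabel_configs:
      - action: labeldrop
        regex: pod_template_hash    # drop a label we never query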
Prometheus is hard to operate reliably. It runs as a single binary, which makes it easy to stand up but harder to keep running when unexpected errors occur. Running Prometheus in production means continual tweaking and fine-tuning to keep it reliable, time you could (and should!) be spending on your core business applications instead. The only exception to this rule is if your business runs Prometheus full time, as with the fully managed options we will discuss later in this chapter.
The biggest disadvantage of Prometheus is that its server scales only vertically.
If there is an outage in the Prometheus server’s region or data center, that server becomes a single point of failure. Compare that with cloud native systems, which are built on the assumption that networks are unreliable and compute is ephemeral.
The Prometheus server tends to get overloaded with tasks. As more applications go online, it scrapes metric data more slowly.
Vertical scalability means that the Prometheus server must be treated like a pet, not cattle, as the famous analogy goes. Patches and configurations cannot be automated easily—and should not be—in case of any version incompatibility.3
In the cloud, there is a high ceiling on the compute types available for machines, but it is by no means infinite. At some point, vertically scaling Prometheus is no longer an option.
Prometheus conformance measures what it takes to lift and shift metrics from your old agent-based system to Prometheus. The more closely your option conforms to Prometheus, the more flexibility you have to move between vendors and open source. You also get backward compatibility with any version of Prometheus in the open source ecosystem, so you can access the benefits of both open source and proprietary tools.
It is much easier to implement a system that has native integration with your existing system; otherwise, you have to build translation APIs. Having a good level of integration means you can easily get up and running and start to realize a return on your investment.
Feature sets let you use existing functionalities in your fully managed option rather than building things on top of that option. This saves time and allows you to focus on building observability tailored to your organization’s needs.
In our view, the most important heuristic of any observability system is reliability. After all, if you are outsourcing your monitoring to a fully managed system, the last thing you want is unreliable monitoring in the middle of an incident. The reliability heuristic means that your monitoring solution should allow you to monitor your systems consistently, without fail, even during massive internet outages, force majeure events, and cybersecurity incidents, and with all the scalability you require to understand your system.
1 Ian Malpass, “Measure Anything, Measure Everything,” Etsy, February 15, 2011, https://oreil.ly/GQpxT.
2 Anne McCrory, “Ubiquitous? Pervasive? Sorry, They Don’t Compute,” Computerworld, March 20, 2000, https://oreil.ly/juHHV.
3 “What [would] happen if several of your servers went offline right now?” Viktor Farcic asks. “If they are pets, such a situation will cause significant disruption for your users. If they are cattle, such an outcome will go unnoticed. Since you are running multiple instances of [a] service distributed across multiple nodes, failure of a single server (or a couple of them) would not result in a failure of all replicas. The only immediate effect would be that some services would run fewer instances and would have a higher load.” See Farcic, The DevOps 2.1 Toolkit: Docker Swarm (Packt, 2017).
4 “Architecture,” Thanos, accessed March 16, 2022, https://oreil.ly/XPHpF.
5 Adapted from an image in Tom Wilkie, “Grafana Labs at KubeCon: The Latest on Cortex,” Grafana Labs, May 21, 2019, https://oreil.ly/QMYZ3.
6 Jonah Kowall, “Why We Chose the M3DB Data Store for Logz.io Prometheus-as-a-Service,” Logz.io, March 24, 2021, https://oreil.ly/JqtLQ.
7 Tom Wilkie, “How We Responded to a 2-Hour Outage in Our Grafana Cloud-Hosted Prometheus Service,” Grafana Labs, March 26, 2021, https://oreil.ly/TX0hk; “Grafana Cloud Intermittent Loadbalancer Errors,” Grafana Labs, accessed March 16, 2022, https://oreil.ly/d4HiG.
These are the dimensions we need to measure to understand our systems, and they are always (or at least often) preserved when consuming metrics in alerts and dashboards.
The value of these dimensions is more questionable. They may be an unintentional by-product of how you collect metrics instead of dimensions that you purposefully collected.
Collecting useless or harmful dimensions is essentially an antipattern, to be avoided at all costs. Including such dimensions can explode the amount of data you collect, resulting in serious consequences for your metric system’s health and significant problems querying metrics.
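One way to spot such dimensions is to check which metric names contribute the most series. A commonly used cardinality query, shown here as a sketch to run ad hoc rather than on a schedule, is:

topk(10, count by (__name__) ({__name__=~".+"}))

The metrics that top this list are the first candidates for reviewing which of their labels are genuinely useful.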
- job_name: nginx_ingress
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
downsample:
  rules:
    mappingRules:
      - name: "mysql metrics"
        filter: "app:mysql*"
        aggregations: ["Last"]
        storagePolicies:
          - resolution: 1m
            retention: 48h
      - name: "nginx metrics"
        filter: "app:nginx*"
        aggregations: ["Last"]
        storagePolicies:
          - resolution: 30s
            retention: 24h
          - resolution: 1m
            retention: 48h
groups:
  - name: node
    rules:
      - record: job:process_cpu_seconds:rate5m
        expr: >
          sum without (instance) (rate(process_cpu_seconds_total{job="node"}[5m]))
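Once Prometheus evaluates this rule, dashboards and alerts can query the precomputed series directly instead of re-running the underlying expression, for example:

job:process_cpu_seconds:rate5m{job="node"}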
downsample:
  rules:
    mappingRules:
      - name: "http_request latency by route and git_sha drop raw"
        filter: "__name__:http_request_bucket k8s_pod:* le:* git_sha:* route:*"
        drop: True
    rollupRules:
      - name: "http_request latency by route and git_sha without pod"
        filter: "__name__:http_request_bucket k8s_pod:* le:* git_sha:* route:*"
        transforms:
          - transform:
              type: "Increase"
          - rollup:
              metricName: "http_request_bucket" # metric name doesn't change
              groupBy: ["le", "git_sha", "route", "status_code", "region"]
              aggregations: ["Sum"]
          - transform:
              type: "Add"
        storagePolicies:
          - resolution: 30s
            retention: 720h
1 Rachel Dines, “New ESG Study Uncovers Top Observability Concerns in 2022,” Chronosphere, February 22, 2022, https://chronosphere.io/learn/new-study-uncovers-top-observability-concerns-in-2022.
2 Adapted from an image by Chronosphere.
3 John Potocny, “Classifying Types of Metric Cardinality,” Chronosphere, February 15, 2022, https://chronosphere.io/learn/classifying-types-of-metric-cardinality.
4 “Mapping Rules,” M3, accessed March 16, 2022, https://oreil.ly/xJ5wT.
5 “Rollup Rules,” M3, accessed March 16, 2022, https://oreil.ly/Oz5eT.
1 Stavros Foteinopoulos, “How We Use Sloth to Do SLO Monitoring and Alerting with Prometheus,” Mattermost, October 26, 2021, https://oreil.ly/e35u8.
2 Steven Thurgood, “Example Error Budget Policy,” in The Site Reliability Workbook (O’Reilly Media, 2018), https://oreil.ly/yEg2b.