Prometheus is one of the standard-bearing open-source solutions for monitoring and observability. From its humble origins at SoundCloud in 2012, Prometheus quickly garnered widespread adoption and later became one of the first CNCF projects and just the second to graduate (after Kubernetes). It’s used by tons of forward-thinking companies in production, including heavyweights like DigitalOcean, Fastly, and Weaveworks, and has its own dedicated yearly conference, PromCon.
Prometheus: powerful but intentionally limited
Prometheus has succeeded in part because the core Prometheus server and its various complements, such as Alertmanager, Grafana, and the exporter ecosystem, form a compelling end-to-end solution to a crucial but difficult problem. Prometheus does not, however, provide some of the capabilities that you’d expect from a full-fledged “as-a-Service” platform, such as multi-tenancy, authentication and authorization, and built-in long-term storage.
Cortex, which joined the CNCF in September as a sandbox project, is an open-source Prometheus-as-a-Service platform that seeks to fill those gaps and to thereby provide a complete, secure, multi-tenant Prometheus experience. I’ll say a lot about Cortex further down; first, let’s take a brief excursion into the more familiar world of Prometheus to get our bearings.
As a CNCF developer advocate, I’ve had the opportunity to become closely acquainted both with the Prometheus community and with Prometheus as a tool (mostly working on docs and the Prometheus Playground). Its great success is really no surprise to me for a variety of reasons:
Prometheus instances are easy to deploy and manage. I’m especially fond of near-instant configuration reloading and of the fact that all Prometheus components are available as static binaries.
Prometheus offers a simple and easily adoptable metrics exposition format that makes it easy to write your own metrics exporters. This format is even being turned into an open standard via the OpenMetrics project (which also recently joined the CNCF sandbox).
Prometheus offers a simple but powerful label-based querying language, PromQL, for working with time series data. I find PromQL to be highly intuitive
Early on, Prometheus’ core engineers made the wise decision to keep Prometheus lean and composable. From the get-go, Prometheus was designed to do a small set of things very well and to work seamlessly in conjunction with other, optional components (rather than overburdening Prometheus with an ever-growing array of hard-coded features and integrations). Here are some things that Prometheus was not meant to provide:
Long-term storage — Individual Prometheus instances provide durable storage of time series data, but they do not act as a distributed data storage system with features like cross-node replication and automatic repair. This means that durability guarantees are restricted to that of a single machine. Fortunately, Prometheus offers a remote write API that can be used to pipe time series data to other systems.
A global view of data — As described in the bullet point above, Prometheus instances act as isolated data storage units. Prometheus instances can be federated but that adds a lot of complexity to a Prometheus setup and again, Prometheus simply wasn’t built as a distributed database. This means that there’s no simple path to achieving a single, consistent, “global” view of your time series data.
Multi-tenancy — Prometheus by itself has no built-in concept of a tenant. This means that it can’t provide any sort of fine-grained control over things like tenant-specific data access and resource usage quotas.
As a Prometheus-as-a-Service platform, Cortex fills in all of these crucial gaps with aplomb and thus provides a complete out-of-the-box solution for even the most demanding monitoring and observability use cases.
It supports four long-term storage systems out of the box: AWS DynamoDB, AWS S3, Apache Cassandra, and Google Cloud Bigtable.
It offers a global view of Prometheus time series data that includes data in long-term storage, greatly expanding the usefulness of PromQL for analytical purposes.
It has multi-tenancy built into its very core. All Prometheus metrics that pass through Cortex are associated with a specific tenant.
The architecture of Cortex
Cortex has a fundamentally service-based design, with its essential functions split up into single-purpose components that can be independently scaled:
Distributor — Handles time series data written to Cortex by Prometheus instances using Prometheus’ remote write API. Incoming data is automatically replicated and sharded, and sent to multiple Cortex ingesters in parallel.
Ingester — Receives time series data from distributor nodes and then writes that data to long-term storage backends, compressing data into Prometheus chunks for efficiency.
Ruler — Executes rules and generates alerts, sending them to Alertmanager (Cortex installations include Alertmanager).
Querier — Handles PromQL queries from clients (including Grafana dashboards), abstracting over both ephemeral time series data and samples in long-term storage.