CacheKit Docs

High-performance cache policies and supporting data structures.

Metrics

Status: design rationale for the metrics infrastructure under src/metrics/, gated by the metrics Cargo feature. Companion to design.md §6.

cachekit’s metrics surface is bigger than “two counters behind a feature flag.” It mirrors the cache trait hierarchy — recorder / snapshot / exporter — so each concern lives in the smallest trait that captures it, and policy code stays free of monitoring plumbing. This document explains the three-trait separation, the &self-vs-&mut self split, the MetricsCell interior-mutability escape hatch, the Prometheus exporter contract, and what guarantees counters do and do not provide.

Goals and non-goals

The metrics module is shaped for:

It is not shaped for:

Three-trait separation

                                ┌─────────────────────────────┐
                                │     CoreMetricsRecorder     │
                                │  record_get_hit, _miss,     │
                                │  _insert_*, _evict_*,       │
                                │  _clear                     │
                                └──────────────┬──────────────┘
                                               │ extends
        ┌──────────┬───────────┬───────────────┼───────────┬────────────┐
        ▼          ▼           ▼               ▼           ▼            ▼
   FifoRec    LruRec       LfuRec          ArcRec      ClockRec    S3FifoRec
                │                                                       …
                ▼
            LruKRec
            (further extends LruRec)

   Consumption (decoupled from recording):
   ┌──────────────────────────────┐    ┌──────────────────────────────┐
   │ MetricsSnapshotProvider<S>   │    │ MetricsExporter<S>           │
   │ + MetricsReset               │    │ PrometheusTextExporter       │
   │ (bench / test)               │    │ (production monitoring)      │
   └──────────────────────────────┘    └──────────────────────────────┘
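The diagram can be read as a handful of small traits. The following is a minimal sketch of that shape, not the crate's actual definitions: the method sets are abbreviated, and `reset_metrics`, the `&mut self` receiver on `export`, `ToySnapshot`, `ToyLruMetrics` and `LineExporter` are all assumptions made for illustration.

```rust
/// Recording: counters shared by every policy (method set abbreviated here).
pub trait CoreMetricsRecorder {
    fn record_get_hit(&mut self);
    fn record_get_miss(&mut self);
}

/// Recording: LRU-specific counters layered on the core set.
pub trait LruMetricsRecorder: CoreMetricsRecorder {
    fn record_pop_lru_call(&mut self);
}

/// Consumption (bench / test): copy the current counters out as `S`.
pub trait MetricsSnapshotProvider<S> {
    fn snapshot(&self) -> S;
}

/// Consumption (bench / test): zero the counters between runs.
pub trait MetricsReset {
    fn reset_metrics(&mut self); // hypothetical method name
}

/// Consumption (monitoring): publish a snapshot somewhere.
pub trait MetricsExporter<S> {
    fn export(&mut self, snapshot: &S); // receiver type is an assumption
}

/// Hypothetical snapshot type: a plain copy of the counters.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ToySnapshot {
    pub get_hits: u64,
    pub get_misses: u64,
    pub pop_lru_calls: u64,
}

/// Hypothetical metrics struct that a policy would own.
#[derive(Debug, Default)]
pub struct ToyLruMetrics {
    get_hits: u64,
    get_misses: u64,
    pop_lru_calls: u64,
}

impl CoreMetricsRecorder for ToyLruMetrics {
    fn record_get_hit(&mut self) { self.get_hits += 1; }
    fn record_get_miss(&mut self) { self.get_misses += 1; }
}

impl LruMetricsRecorder for ToyLruMetrics {
    fn record_pop_lru_call(&mut self) { self.pop_lru_calls += 1; }
}

impl MetricsSnapshotProvider<ToySnapshot> for ToyLruMetrics {
    fn snapshot(&self) -> ToySnapshot {
        ToySnapshot {
            get_hits: self.get_hits,
            get_misses: self.get_misses,
            pop_lru_calls: self.pop_lru_calls,
        }
    }
}

impl MetricsReset for ToyLruMetrics {
    fn reset_metrics(&mut self) { *self = Self::default(); }
}

/// Toy exporter that collects one formatted line per exported snapshot.
#[derive(Debug, Default)]
pub struct LineExporter {
    pub lines: Vec<String>,
}

impl MetricsExporter<ToySnapshot> for LineExporter {
    fn export(&mut self, s: &ToySnapshot) {
        self.lines.push(format!(
            "hits={} misses={} pop_lru={}",
            s.get_hits, s.get_misses, s.pop_lru_calls
        ));
    }
}
```

In this sketch the policy only ever sees the recorder traits, a bench only needs `snapshot()` and `reset_metrics()`, and the exporter only needs a `ToySnapshot` value, so none of the three has to know about the others.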

Three responsibilities, three trait families:

Splitting these three lets:

Per-policy recorder traits

Every policy gets its own recorder trait extending CoreMetricsRecorder. The shipped set:

Trait                    Adds counters for
FifoMetricsRecorder      scan steps, stale skips, pop_oldest calls
LruMetricsRecorder       pop_lru, peek_lru, touch, recency_rank
LruKMetricsRecorder      extends LruMetricsRecorder + K-distance counters
LfuMetricsRecorder       pop_lfu, peek_lfu, frequency reads / mutates
MfuMetricsRecorder       mirrors LFU for most-frequent eviction
ArcMetricsRecorder       T1→T2 promotions, B1/B2 ghost hits, p movement
CarMetricsRecorder       recent→frequent, ghost hits, hand sweeps
ClockMetricsRecorder     hand advances, ref-bit resets
ClockProMetricsRecorder  cold↔hot transitions, test entries
NruMetricsRecorder       sweep steps, ref-bit resets
SlruMetricsRecorder      probationary→protected, protected evictions
TwoQMetricsRecorder      A1in→Am promotions, A1out ghost hits
S3FifoMetricsRecorder    promotions, main reinserts, ghost hits

Two design principles drive the granularity:

The trade is API surface: 14 recorder traits with roughly 5 to 10 methods each. The mitigation is that users do not implement them: they use the shipped *Metrics structs through inherent methods on each policy, and they read snapshots, not recorders.

The &self-vs-&mut self split

Several Cache<K, V> methods take &self (trait-hierarchy.md explains why). The metrics system has to honour this: a &self read path cannot call a &mut self recorder. The shipped solution is a parallel *MetricsReadRecorder family for each policy whose read paths increment counters:

Mutable trait        Read-only counterpart
FifoMetricsRecorder  FifoMetricsReadRecorder
LruMetricsRecorder   LruMetricsReadRecorder
LruKMetricsRecorder  LruKMetricsReadRecorder
LfuMetricsRecorder   LfuMetricsReadRecorder
MfuMetricsRecorder   MfuMetricsReadRecorder

The read-only traits take &self on every method. They are implemented through interior mutability on the concrete metrics struct — specifically MetricsCell, the internal type that wraps Cell<u64> with an unsafe impl Sync (covered below).
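As a rough illustration of the split, the sketch below pairs a `&mut self` recorder with a `&self` read recorder on one metrics struct. It uses a plain `Cell<u64>` in place of MetricsCell, so it is single-threaded only; the method names and the `ToyLru` / `ToyLruMetrics` types are hypothetical.

```rust
use std::cell::Cell;

/// Mutable-path recorder: called from `&mut self` cache methods.
pub trait LruMetricsRecorder {
    fn record_pop_lru_call(&mut self);
}

/// Read-path counterpart: every method takes `&self`.
pub trait LruMetricsReadRecorder {
    fn record_peek_lru_call(&self);
}

#[derive(Debug, Default)]
pub struct ToyLruMetrics {
    pop_lru_calls: u64,        // only updated through `&mut self`
    peek_lru_calls: Cell<u64>, // updated through `&self`
}

impl LruMetricsRecorder for ToyLruMetrics {
    fn record_pop_lru_call(&mut self) {
        self.pop_lru_calls += 1;
    }
}

impl LruMetricsReadRecorder for ToyLruMetrics {
    fn record_peek_lru_call(&self) {
        // Interior mutability: read, add one, write back.
        self.peek_lru_calls.set(self.peek_lru_calls.get() + 1);
    }
}

/// Hypothetical cache with the least recently used item at index 0.
pub struct ToyLru {
    items: Vec<u32>,
    metrics: ToyLruMetrics,
}

impl ToyLru {
    pub fn new(items: Vec<u32>) -> Self {
        ToyLru { items, metrics: ToyLruMetrics::default() }
    }

    /// `&self` read path that still counts the call.
    pub fn peek_lru(&self) -> Option<&u32> {
        self.metrics.record_peek_lru_call();
        self.items.first()
    }

    /// `&mut self` path that uses the mutable recorder.
    pub fn pop_lru(&mut self) -> Option<u32> {
        self.metrics.record_pop_lru_call();
        if self.items.is_empty() { None } else { Some(self.items.remove(0)) }
    }

    /// Returns `(pop_lru_calls, peek_lru_calls)`.
    pub fn counts(&self) -> (u64, u64) {
        (self.metrics.pop_lru_calls, self.metrics.peek_lru_calls.get())
    }
}
```

The point of the sketch is that `peek_lru` keeps its `&self` signature while the counter still advances.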

Two questions this design avoided:

MetricsCell: interior mutability under external lock

#[repr(transparent)]
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub(crate) struct MetricsCell(Cell<u64>);

unsafe impl Sync for MetricsCell {}
unsafe impl Send for MetricsCell {}

This is the only unsafe impl Sync in the metrics surface, so its contract must be narrow:

The alternatives considered and rejected:

MetricsCell is the smallest tool for single-threaded or exclusively locked metric counters. Any policy or wrapper that records metrics from a read-locked path must not rely on MetricsCell for soundness.
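The "exclusively locked" case can be sketched as follows. This is an illustration under assumptions, not the crate's code: it uses a plain `Cell<u64>` (which is `Send`, so it may live inside a `Mutex` without any unsafe impl), and `ToyCache` and `count_under_mutex` are hypothetical names.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical cache whose `&self` read path bumps a non-atomic counter.
#[derive(Debug, Default)]
pub struct ToyCache {
    peeks: Cell<u64>,
}

impl ToyCache {
    pub fn peek(&self) {
        self.peeks.set(self.peeks.get() + 1);
    }

    pub fn peeks(&self) -> u64 {
        self.peeks.get()
    }
}

/// Every access goes through the mutex, so only one thread at a time
/// can reach the `Cell`, even though `peek` takes `&self`.
pub fn count_under_mutex(threads: usize, per_thread: u64) -> u64 {
    let cache = Arc::new(Mutex::new(ToyCache::default()));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    cache.lock().unwrap().peek();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = cache.lock().unwrap().peeks();
    total
}
```

A read lock would not give the same guarantee: several readers may hold it at once, which is why a read-locked path needs a counter that is safe to update concurrently.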

Snapshots: cheap, copyable, optionally serializable

Every snapshot struct in src/metrics/snapshot.rs follows the same shape:

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct LruMetricsSnapshot {
    pub get_calls: u64,
    pub get_hits: u64,
    pub get_misses: u64,
    pub insert_calls: u64,
    pub insert_updates: u64,
    pub insert_new: u64,
    pub evict_calls: u64,
    pub evicted_entries: u64,
    pub pop_lru_calls: u64,
    pub pop_lru_found: u64,
    pub peek_lru_calls: u64,
    pub peek_lru_found: u64,
    pub touch_calls: u64,
    pub touch_found: u64,
    pub recency_rank_calls: u64,
    pub recency_rank_found: u64,
    pub recency_rank_scan_steps: u64,
    pub cache_len: usize,
    pub insertion_order_len: usize,
    pub capacity: usize,
}

Five intentional properties:

Gauges (cache_len, insertion_order_len, capacity) live alongside the counters and are captured in the same snapshot. The Prometheus exporter writes the matching # TYPE line (counter or gauge) for each field, which matters for the scraper.
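Because a snapshot is a plain `Copy` value, derived figures can be computed outside the cache. The sketch below assumes a cut-down `MiniSnapshot` with a subset of the fields shown above; `hit_ratio` and `delta_since` are hypothetical helpers, not part of the shipped API.

```rust
/// Cut-down stand-in for a snapshot struct (subset of fields).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct MiniSnapshot {
    pub get_calls: u64,
    pub get_hits: u64,
    pub get_misses: u64,
    pub cache_len: usize, // gauge
}

impl MiniSnapshot {
    /// Hit ratio in [0, 1], or `None` when no lookups were recorded.
    pub fn hit_ratio(&self) -> Option<f64> {
        if self.get_calls == 0 {
            None
        } else {
            Some(self.get_hits as f64 / self.get_calls as f64)
        }
    }

    /// Counter growth since `earlier`. Gauges are not subtracted:
    /// the later value is kept as-is.
    pub fn delta_since(&self, earlier: &MiniSnapshot) -> MiniSnapshot {
        MiniSnapshot {
            get_calls: self.get_calls.saturating_sub(earlier.get_calls),
            get_hits: self.get_hits.saturating_sub(earlier.get_hits),
            get_misses: self.get_misses.saturating_sub(earlier.get_misses),
            cache_len: self.cache_len,
        }
    }
}
```

A bench could take one snapshot before an iteration and one after, then report `after.delta_since(&before).hit_ratio()` without touching the cache again.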

Recording is push, consumption is pull

Two operating models coexist:

Specifically, the policy does not push to the exporter. There is no observer-pattern hook from the recorder to the exporter, no synchronous flush on every increment, and no async channel between them. The pull model lets benches consume at known checkpoints (once per iteration), and lets production scrapers poll on their own cadence (every 10 s, every minute, etc.).

The cost of the pull model is that an exporter cannot react to a specific event (e.g. “evictions spiked above N”). cachekit users who need event-driven reactions instrument at the application layer, not the metrics layer.
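One way an application might approximate such a reaction while staying on the pull side is to compare successive polls. The sketch below is an assumption about how that could look: `EvictionSpikeWatcher` is a hypothetical application-layer type that is handed the `evicted_entries` total from each polled snapshot.

```rust
/// Hypothetical application-layer watcher, fed one eviction total per poll.
#[derive(Debug)]
pub struct EvictionSpikeWatcher {
    threshold: u64,
    last_total: Option<u64>,
}

impl EvictionSpikeWatcher {
    pub fn new(threshold: u64) -> Self {
        EvictionSpikeWatcher { threshold, last_total: None }
    }

    /// Returns `Some(delta)` when the evictions since the previous poll
    /// exceed the threshold. The first poll only establishes a baseline.
    pub fn observe(&mut self, evicted_entries: u64) -> Option<u64> {
        // Store the new total and take the previous one in a single step.
        let previous = self.last_total.replace(evicted_entries)?;
        // Saturate so that a counter reset between polls reads as zero growth.
        let delta = evicted_entries.saturating_sub(previous);
        if delta > self.threshold { Some(delta) } else { None }
    }
}
```

The resolution of this approach is bounded by the polling cadence: a spike shorter than one interval is only visible as its contribution to that interval's total.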

Prometheus text exporter

The shipped exporter (PrometheusTextExporter in src/metrics/exporter.rs) writes the Prometheus text exposition format to any W: Write + Send:

let exporter = PrometheusTextExporter::new("myapp_cache", io::stdout());
let snapshot = lru_cache.snapshot();
exporter.export(&snapshot);

Three design choices worth naming:

Other exporters (StatsD, OpenTelemetry, custom) plug in by implementing MetricsExporter<S> for each snapshot type they care about. No changes elsewhere in the crate are required.
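For orientation, the Prometheus text exposition format is line-based: a `# TYPE <name> <counter|gauge>` line followed by a `<name> <value>` sample line. The sketch below writes that shape to any `Write`. It is not the shipped exporter: `MiniPrometheusWriter`, `MiniSnapshot`, `render` and the metric names (including the `_total` suffix) are assumptions.

```rust
use std::io::{self, Write};

/// Cut-down stand-in for a snapshot: one counter and one gauge.
pub struct MiniSnapshot {
    pub get_hits: u64,
    pub cache_len: usize,
}

/// Hypothetical writer that prefixes every metric name.
pub struct MiniPrometheusWriter<W: Write> {
    prefix: String,
    out: W,
}

impl<W: Write> MiniPrometheusWriter<W> {
    pub fn new(prefix: &str, out: W) -> Self {
        MiniPrometheusWriter { prefix: prefix.to_string(), out }
    }

    /// Writes the `# TYPE` line followed by the sample line.
    fn write_metric(&mut self, name: &str, kind: &str, value: u64) -> io::Result<()> {
        writeln!(self.out, "# TYPE {}_{} {}", self.prefix, name, kind)?;
        writeln!(self.out, "{}_{} {}", self.prefix, name, value)
    }

    pub fn export(&mut self, s: &MiniSnapshot) -> io::Result<()> {
        self.write_metric("get_hits_total", "counter", s.get_hits)?;
        self.write_metric("cache_len", "gauge", s.cache_len as u64)
    }

    pub fn into_inner(self) -> W {
        self.out
    }
}

/// Renders a snapshot into a `String` by exporting to an in-memory buffer.
pub fn render(prefix: &str, s: &MiniSnapshot) -> io::Result<String> {
    let mut writer = MiniPrometheusWriter::new(prefix, Vec::new());
    writer.export(s)?;
    Ok(String::from_utf8(writer.into_inner()).expect("writer only emits UTF-8"))
}
```

Writing to a generic `Write` keeps the same code usable for stdout, a socket, or an in-memory buffer in tests.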

Feature gating: all-or-nothing at compile time

The entire metrics subsystem is gated on the metrics Cargo feature:

// src/lib.rs
#[cfg(feature = "metrics")]
pub mod metrics;

Inside each policy, recorder calls are wrapped:

#[cfg(feature = "metrics")]
self.metrics.record_get_hit();

When metrics is off:

When metrics is on:

The trade-off is deliberate. No “low-cardinality always-on, detailed-on-demand” two-tier scheme exists — every counter is either always present (feature on) or absent (feature off). The discipline that keeps “always present” cheap is the recorder contract: methods do no work beyond incrementing a counter.
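A compact sketch of the all-or-nothing gating, assuming a feature named `metrics` as above. `ToyCache`, `ToyMetrics` and `recorded_hits` are hypothetical; the field, the recorder call and the accessor are each gated, so a build without the feature contains none of them.

```rust
/// Hypothetical metrics struct; only compiled when the feature is on.
#[cfg(feature = "metrics")]
#[derive(Debug, Default)]
pub struct ToyMetrics {
    pub get_hits: u64,
}

#[cfg(feature = "metrics")]
impl ToyMetrics {
    pub fn record_get_hit(&mut self) {
        self.get_hits += 1;
    }
}

/// Hypothetical single-slot cache.
#[derive(Debug, Default)]
pub struct ToyCache {
    value: Option<u32>,
    // The field itself disappears when the feature is off.
    #[cfg(feature = "metrics")]
    metrics: ToyMetrics,
}

impl ToyCache {
    pub fn with_value(value: u32) -> Self {
        ToyCache { value: Some(value), ..Self::default() }
    }

    pub fn get(&mut self) -> Option<u32> {
        let value = self.value;
        if value.is_some() {
            // Compiled out entirely when the feature is off.
            #[cfg(feature = "metrics")]
            self.metrics.record_get_hit();
        }
        value
    }

    /// `Some(count)` when metrics are compiled in.
    #[cfg(feature = "metrics")]
    pub fn recorded_hits(&self) -> Option<u64> {
        Some(self.metrics.get_hits)
    }

    /// `None` when metrics are compiled out.
    #[cfg(not(feature = "metrics"))]
    pub fn recorded_hits(&self) -> Option<u64> {
        None
    }
}
```

The cache behaves identically in both configurations; only the presence of the counter differs.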

What about StoreMetrics?

StoreMetrics (src/store/traits.rs) is a separate, simpler structure that ships unconditionally (not behind metrics). It carries the universal counters every store-layer implementation tracks:

#[non_exhaustive]
pub struct StoreMetrics {
    pub hits: u64,
    pub misses: u64,
    pub inserts: u64,
    pub updates: u64,
    pub removes: u64,
    pub evictions: u64,
    pub expirations: u64,
}

The two systems coexist:

A store typically backs StoreMetrics with AtomicU64 counters (see StoreCounters in src/store/weight.rs), because stores are often behind concurrent wrappers and the increment paths can be &self. The split mirrors the sequential-vs-concurrent split at the trait level (concurrency.md).
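The atomic-backed pattern can be sketched as below. This is an illustration only: `ToyStoreCounters` and `ToyStoreMetrics` are hypothetical stand-ins with two of the seven fields, and the choice of `Ordering::Relaxed` is an assumption that fits independent event counters.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Plain-value view of the counters (two fields of the full set).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ToyStoreMetrics {
    pub hits: u64,
    pub misses: u64,
}

/// Atomic backing store: every increment path takes `&self`.
#[derive(Debug, Default)]
pub struct ToyStoreCounters {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl ToyStoreCounters {
    pub fn record_hit(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_miss(&self) {
        self.misses.fetch_add(1, Ordering::Relaxed);
    }

    /// Copies the current values out. Each field is loaded separately,
    /// so fields read during concurrent updates may not be mutually consistent.
    pub fn snapshot(&self) -> ToyStoreMetrics {
        ToyStoreMetrics {
            hits: self.hits.load(Ordering::Relaxed),
            misses: self.misses.load(Ordering::Relaxed),
        }
    }
}

/// Increments from several threads through a shared reference, no lock.
pub fn count_concurrently(threads: usize, per_thread: u64) -> ToyStoreMetrics {
    let counters = ToyStoreCounters::default();
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for i in 0..per_thread {
                    if i % 2 == 0 {
                        counters.record_hit();
                    } else {
                        counters.record_miss();
                    }
                }
            });
        }
    });
    counters.snapshot()
}
```

The snapshot taken after the scope ends is exact, because all spawned threads have been joined by then.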

Counter discipline

Three rules every recorder method follows:

  1. No allocation. Counter increments are O(1) and allocation-free.
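  2. No fallible operations. A counter must never be in a position where it can fail: an increment returns nothing and has no error path. Overflow is not a practical concern for u64 (at one billion increments per second it takes roughly 585 years to reach u64::MAX), so either wrapping or saturating behaviour is acceptable.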
  3. No conditional logic beyond the counter itself. A recorder method that branches on cache state belongs in the policy, not in metrics.

The corollary: a policy that wants a derived counter (“number of evictions where the victim’s recency rank was > 10”) computes the condition itself and calls one of two existing methods accordingly. Putting the branching inside the recorder would couple metrics to policy state.
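The corollary might look like this in practice. The sketch assumes two hypothetical recorder methods, `record_evict_deep` and `record_evict_shallow`, and a hypothetical `ToyPolicy`; the branch on recency rank lives in the policy, and each recorder method remains a bare increment.

```rust
#[derive(Debug, Default)]
pub struct ToyMetrics {
    pub evict_deep: u64,
    pub evict_shallow: u64,
}

impl ToyMetrics {
    // Recorder methods: one increment each, no knowledge of cache state.
    pub fn record_evict_deep(&mut self) {
        self.evict_deep += 1;
    }

    pub fn record_evict_shallow(&mut self) {
        self.evict_shallow += 1;
    }
}

#[derive(Debug, Default)]
pub struct ToyPolicy {
    pub metrics: ToyMetrics,
}

impl ToyPolicy {
    /// The policy owns the condition and picks which counter to bump.
    pub fn on_evict(&mut self, victim_rank: usize) {
        if victim_rank > 10 {
            self.metrics.record_evict_deep();
        } else {
            self.metrics.record_evict_shallow();
        }
    }
}
```

If the threshold of 10 later changes, only the policy changes; the recorder and its snapshot fields stay as they are.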

Adding a new metric

Checklist for adding a per-policy counter:

  1. Add the field. Plain u64 if it’s updated on &mut self paths; MetricsCell if it’s updated on &self paths. Place it in the corresponding *Metrics struct under src/metrics/metrics_impl.rs.
  2. Add the recorder method. On the relevant *MetricsRecorder trait (or its *MetricsReadRecorder counterpart for &self).
  3. Implement on the policy’s metrics struct. One-line += 1 body.
  4. Wire the call site in the policy. Wrap with #[cfg(feature = "metrics")].
  5. Add the field to the snapshot. In src/metrics/snapshot.rs. The snapshot’s From<&*Metrics> (or equivalent) needs the new field.
  6. Update the exporter. Add a write_counter / write_gauge call in PrometheusTextExporter::export for the new field.
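The six steps can be walked through on a toy example. Everything below is hypothetical (the `touch_missing` counter, the `Toy*` types and `export_text`); it only shows where each step would land, compressed into one file.

```rust
// Step 1: the field. Plain u64, because it is updated on a `&mut self` path.
#[derive(Debug, Default)]
pub struct ToyLruMetrics {
    touch_calls: u64,
    touch_missing: u64, // the new counter
}

// Step 2: the recorder method.
pub trait ToyLruMetricsRecorder {
    fn record_touch_call(&mut self);
    fn record_touch_missing(&mut self); // new
}

// Step 3: the one-line implementation.
impl ToyLruMetricsRecorder for ToyLruMetrics {
    fn record_touch_call(&mut self) { self.touch_calls += 1; }
    fn record_touch_missing(&mut self) { self.touch_missing += 1; }
}

// Step 5: the snapshot field and its conversion.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ToyLruSnapshot {
    pub touch_calls: u64,
    pub touch_missing: u64, // new
}

impl From<&ToyLruMetrics> for ToyLruSnapshot {
    fn from(m: &ToyLruMetrics) -> Self {
        ToyLruSnapshot { touch_calls: m.touch_calls, touch_missing: m.touch_missing }
    }
}

pub struct ToyLru {
    keys: Vec<u32>,
    metrics: ToyLruMetrics,
}

impl ToyLru {
    pub fn new(keys: Vec<u32>) -> Self {
        ToyLru { keys, metrics: ToyLruMetrics::default() }
    }

    // Step 4: the call site. In the real crate these recorder calls
    // would sit behind #[cfg(feature = "metrics")].
    pub fn touch(&mut self, key: u32) -> bool {
        self.metrics.record_touch_call();
        let found = self.keys.contains(&key);
        if !found {
            self.metrics.record_touch_missing();
        }
        found
    }

    pub fn snapshot(&self) -> ToyLruSnapshot {
        ToyLruSnapshot::from(&self.metrics)
    }
}

// Step 6: the exporter line for the new field.
pub fn export_text(prefix: &str, s: &ToyLruSnapshot) -> String {
    let mut out = String::new();
    for (name, value) in [("touch_calls", s.touch_calls), ("touch_missing", s.touch_missing)] {
        out.push_str(&format!("# TYPE {}_{} counter\n", prefix, name));
        out.push_str(&format!("{}_{} {}\n", prefix, name, value));
    }
    out
}
```

Note that the branch on `found` sits in the policy's `touch`, in line with the counter discipline above; the recorder method itself is still a single increment.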

Six locations is a lot of friction for a new counter. The friction is intentional — adding a counter is rarely the right answer to a debugging question, and the friction encourages reuse of existing counters where possible.

Adding a new metric type (gauge vs counter, histogram)

Histograms and sliding windows are deliberately out of scope. Adding either is a wider design change:

If histograms become needed (the most likely use case is latency distribution per policy), the design has space: introduce a HistogramRecorder trait alongside CoreMetricsRecorder and a matching HistogramSnapshot. The existing exporter stays counter-and-gauge-only; a new PrometheusHistogramExporter handles the new shape. The current omission is a coverage decision, not a foundation problem.
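Purely as a thought experiment, such an extension might look like the sketch below. None of it exists in the crate: the trait method, the bucket bounds, the field layout and `ToyHistogram` are all invented here. It uses fixed-size arrays so that recording stays allocation-free.

```rust
/// Hypothetical inclusive upper bounds in nanoseconds.
/// One extra bucket past the last bound collects everything larger.
const BOUNDS_NS: [u64; 3] = [1_000, 10_000, 100_000];

/// Hypothetical recorder trait, separate from the counter recorders.
pub trait HistogramRecorder {
    fn record_latency_ns(&mut self, ns: u64);
}

/// Hypothetical snapshot: per-bucket counts (not cumulative), sum and count.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct HistogramSnapshot {
    pub buckets: [u64; 4],
    pub sum_ns: u64,
    pub count: u64,
}

#[derive(Debug, Default)]
pub struct ToyHistogram {
    buckets: [u64; 4],
    sum_ns: u64,
    count: u64,
}

impl HistogramRecorder for ToyHistogram {
    fn record_latency_ns(&mut self, ns: u64) {
        // First bucket whose bound covers the value, else the overflow bucket.
        let idx = BOUNDS_NS
            .iter()
            .position(|&bound| ns <= bound)
            .unwrap_or(BOUNDS_NS.len());
        self.buckets[idx] += 1;
        self.sum_ns += ns;
        self.count += 1;
    }
}

impl ToyHistogram {
    pub fn snapshot(&self) -> HistogramSnapshot {
        HistogramSnapshot { buckets: self.buckets, sum_ns: self.sum_ns, count: self.count }
    }
}
```

Prometheus histograms report cumulative bucket counts, so an exporter for this shape would need to accumulate the per-bucket values when writing them out.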

Guarantees and non-guarantees

What the metrics system guarantees:

What it does not guarantee:

See also