High-performance cache policies and supporting data structures.
Status: design rationale for the metrics infrastructure under
`src/metrics/`, gated by the `metrics` Cargo feature. Companion to `design.md` §6.
cachekit’s metrics surface is bigger than “two counters behind a
feature flag.” It mirrors the cache trait hierarchy — recorder /
snapshot / exporter — so each concern lives in the smallest trait
that captures it, and policy code stays free of monitoring plumbing.
This document explains the three-trait separation, the
&self-vs-&mut self split, the MetricsCell interior-mutability
escape hatch, the Prometheus exporter contract, and what guarantees
counters do and do not provide.
The metrics module is shaped for:

- The `metrics` module is gated by `#[cfg(feature = "metrics")]` and
  compiles away when the feature is off.
- A Clock’s `hand_advance` count matters; a FIFO’s
  `pop_oldest_empty_or_stale` count matters. The trait surface
  preserves these signals rather than flattening to one shape.

It is not shaped for:

- Counters use `Relaxed` atomics (`src/store/weight.rs`) and wrap on
  overflow in release. Best-effort observability, not a financial
ledger.

```
                           ┌─────────────────────────────┐
                           │     CoreMetricsRecorder     │
                           │   record_get_hit, _miss,    │
                           │   _insert_*, _evict_*,      │
                           │   _clear                    │
                           └──────────────┬──────────────┘
                                          │ extends
   ┌──────────┬───────────┬───────────────┼───────────┬────────────┐
   ▼          ▼           ▼               ▼           ▼            ▼
FifoRec     LruRec      LfuRec          ArcRec     ClockRec    S3FifoRec
              │                                                    …
              ▼
           LruKRec
  (further extends LruRec)

Consumption (decoupled from recording):

┌──────────────────────────────┐    ┌──────────────────────────────┐
│  MetricsSnapshotProvider<S>  │    │  MetricsExporter<S>          │
│  + MetricsReset              │    │  PrometheusTextExporter      │
│  (bench / test)              │    │  (production monitoring)     │
└──────────────────────────────┘    └──────────────────────────────┘
```

Three responsibilities, three trait families:

- `*MetricsRecorder` traits live in `src/metrics/traits.rs`. Every
  policy-specific recorder extends `CoreMetricsRecorder` and adds
  policy-specific methods (`record_hand_advance` for Clock,
  `record_b1_ghost_hit` for ARC, etc.). The policy itself calls
  these methods on its hot path.
- `MetricsSnapshotProvider<S>` returns a `Copy` `*MetricsSnapshot`
  struct (`src/metrics/snapshot.rs`): a point-in-time scalar copy of
  every counter. Snapshots are `#[non_exhaustive]` for SemVer
  headroom, and their serde support is feature-gated, for
  cross-process transport.
- `MetricsExporter<S>` consumes a snapshot and pushes it to an
  external system. The shipped implementation,
  `PrometheusTextExporter` (`src/metrics/exporter.rs`), writes
  Prometheus exposition format to any `W: Write + Send`.

Splitting these three lets:

- benches and tests use `MetricsSnapshotProvider` + `MetricsReset`
  and never touch `MetricsExporter`. Production code does the inverse.
- new exporters arrive as an `impl MetricsExporter<S>` for the
  snapshot types, with no policy changes.

Every policy gets its own recorder trait extending
`CoreMetricsRecorder`. The shipped set:

| Trait | Adds counters for |
|---|---|
| `FifoMetricsRecorder` | scan steps, stale skips, `pop_oldest` calls |
| `LruMetricsRecorder` | `pop_lru`, `peek_lru`, `touch`, `recency_rank` |
| `LruKMetricsRecorder` | extends `LruMetricsRecorder` + K-distance counters |
| `LfuMetricsRecorder` | `pop_lfu`, `peek_lfu`, frequency reads / mutates |
| `MfuMetricsRecorder` | mirrors LFU for most-frequent eviction |
| `ArcMetricsRecorder` | T1→T2 promotions, B1/B2 ghost hits, `p` movement |
| `CarMetricsRecorder` | recent→frequent, ghost hits, hand sweeps |
| `ClockMetricsRecorder` | hand advances, ref-bit resets |
| `ClockProMetricsRecorder` | cold↔hot transitions, test entries |
| `NruMetricsRecorder` | sweep steps, ref-bit resets |
| `SlruMetricsRecorder` | probationary→protected, protected evictions |
| `TwoQMetricsRecorder` | A1in→Am promotions, A1out ghost hits |
| `S3FifoMetricsRecorder` | promotions, main reinserts, ghost hits |

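As a minimal sketch of how a policy-specific recorder can extend the core trait, assume heavily trimmed method sets. The trait names come from the table; the `ClockMetrics` struct, its fields, `record_ref_bit_reset`, and `simulate_sweep` are illustrative assumptions, not the crate's actual definitions.

```rust
/// Core counters every policy records (illustrative subset).
pub trait CoreMetricsRecorder {
    fn record_get_hit(&mut self);
    fn record_get_miss(&mut self);
}

/// Clock-specific signals layered on top of the core set.
pub trait ClockMetricsRecorder: CoreMetricsRecorder {
    fn record_hand_advance(&mut self);
    fn record_ref_bit_reset(&mut self); // hypothetical method name
}

/// Hypothetical concrete metrics struct: plain u64 fields, since every
/// method takes &mut self and the caller already has exclusive access.
#[derive(Debug, Default)]
pub struct ClockMetrics {
    pub get_hits: u64,
    pub get_misses: u64,
    pub hand_advances: u64,
    pub ref_bit_resets: u64,
}

impl CoreMetricsRecorder for ClockMetrics {
    fn record_get_hit(&mut self) { self.get_hits += 1; }
    fn record_get_miss(&mut self) { self.get_misses += 1; }
}

impl ClockMetricsRecorder for ClockMetrics {
    fn record_hand_advance(&mut self) { self.hand_advances += 1; }
    fn record_ref_bit_reset(&mut self) { self.ref_bit_resets += 1; }
}

/// Simulates one eviction sweep: the hand passes `skipped` referenced
/// slots (clearing each ref bit) before landing on a victim.
pub fn simulate_sweep<R: ClockMetricsRecorder>(rec: &mut R, skipped: u32) {
    for _ in 0..skipped {
        rec.record_hand_advance();
        rec.record_ref_bit_reset();
    }
    rec.record_hand_advance(); // final advance onto the victim
}
```

Code that is generic over `ClockMetricsRecorder` can call both the core and the Clock-specific methods, while a policy with no clock hand never sees `record_hand_advance` at all.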
Two design principles drive the granularity:

- `evictions: u64` cannot answer either.
- `record_a1in_to_am_promotion` belongs to 2Q because A1in/Am are 2Q
  concepts. Putting it on `CoreMetricsRecorder` would force every
  other policy to either implement a meaningless method or document
  a no-op.

The trade is API surface: 14 recorder traits with ~5-10 methods
each. The mitigation is that users do not implement them: the shipped
`*Metrics` structs do, reached through inherent methods on each
policy, and users read snapshots, not recorders.

The `&self`-vs-`&mut self` split

Several `Cache<K, V>` methods take `&self`; `trait-hierarchy.md`
explains why. The metrics system has to honour this — a &self
read path cannot call a &mut self recorder. The shipped solution
is a parallel *MetricsReadRecorder family for each policy whose
read paths increment counters:
| Mutable trait | Read-only counterpart |
|---|---|
| `FifoMetricsRecorder` | `FifoMetricsReadRecorder` |
| `LruMetricsRecorder` | `LruMetricsReadRecorder` |
| `LruKMetricsRecorder` | `LruKMetricsReadRecorder` |
| `LfuMetricsRecorder` | `LfuMetricsReadRecorder` |
| `MfuMetricsRecorder` | `MfuMetricsReadRecorder` |

The read-only traits take &self on every method. They are
implemented through interior mutability on the concrete metrics
struct — specifically MetricsCell, the internal type that wraps
Cell<u64> with an unsafe impl Sync (covered below).
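
A minimal sketch of a read-only recorder follows. It assumes a plain `std::cell::Cell<u64>` in place of `MetricsCell`, so there is no `unsafe impl Sync` and the type stays single-threaded. The method names, the `TinyLru` type, and `peek_counts` are hypothetical, chosen to mirror the `peek_lru_calls` / `peek_lru_found` snapshot fields.

```rust
use std::cell::Cell;

/// Read-only recorder: every method takes &self (illustrative subset).
pub trait LruMetricsReadRecorder {
    fn record_peek_lru_call(&self);
    fn record_peek_lru_found(&self);
}

/// Stand-in for the crate's metrics struct. A plain Cell<u64> is used
/// here instead of MetricsCell, so this type is deliberately !Sync.
#[derive(Debug, Default)]
pub struct LruMetrics {
    peek_lru_calls: Cell<u64>,
    peek_lru_found: Cell<u64>,
}

impl LruMetricsReadRecorder for LruMetrics {
    fn record_peek_lru_call(&self) {
        self.peek_lru_calls.set(self.peek_lru_calls.get() + 1);
    }
    fn record_peek_lru_found(&self) {
        self.peek_lru_found.set(self.peek_lru_found.get() + 1);
    }
}

/// Toy cache whose read path takes &self yet still records metrics.
pub struct TinyLru {
    entries: Vec<(u32, &'static str)>, // front = least recently used
    metrics: LruMetrics,
}

impl TinyLru {
    pub fn new(entries: Vec<(u32, &'static str)>) -> Self {
        Self { entries, metrics: LruMetrics::default() }
    }

    /// &self read path: no &mut access is needed to count the call.
    pub fn peek_lru(&self) -> Option<&(u32, &'static str)> {
        self.metrics.record_peek_lru_call();
        let found = self.entries.first();
        if found.is_some() {
            self.metrics.record_peek_lru_found();
        }
        found
    }

    /// Returns (peek_lru_calls, peek_lru_found).
    pub fn peek_counts(&self) -> (u64, u64) {
        (self.metrics.peek_lru_calls.get(), self.metrics.peek_lru_found.get())
    }
}
```

Because `Cell<u64>` is `!Sync`, this sketch cannot be shared across threads at all; the crate's wrapper trades that compiler check for an explicit external-synchronisation contract.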
Two questions this design avoided:

- “Why not `Cell<u64>` directly on the metrics struct?” Because
  `Cell<u64>` is `!Sync`, which propagates and prevents every policy
  struct that embeds metrics from being `Sync`. The thin `MetricsCell`
  wrapper makes the synchronisation discipline explicit at one site
  instead of N.
- “Why not `AtomicU64` for everything?” Because counters on
  `&mut self` paths (the majority: insert, get, evict) do not need
  atomic semantics; the policy already holds exclusive access.
  However, `MetricsCell` is only sound when `&self` metric increments
  are protected by exclusive synchronization or are known to be
  single-threaded. It is not a substitute for atomics under shared
  `RwLock::read` access.

`MetricsCell`: interior mutability under external lock

```rust
use std::cell::Cell;

#[repr(transparent)]
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub(crate) struct MetricsCell(Cell<u64>);

// Sound only while callers serialize all access externally
// (single thread, write lock, or mutex).
unsafe impl Sync for MetricsCell {}
unsafe impl Send for MetricsCell {}
```

This is the only `unsafe impl Sync` in the metrics surface, so its
contract must be narrow:

- A `RwLock::read` guard does not serialize readers, so it is not
  sufficient protection for `Cell<u64>`. `MetricsCell` may be used
  on single-threaded policy paths, or behind a write lock / mutex,
  but not for counters mutated concurrently through read-locked
  `&self` methods.
- `Cell` mutation. Concurrent read-path counters must use `AtomicU64`,
  take an exclusive lock, or be disabled for that path.
- `pub(crate)`. The type does not escape the crate. Downstream code
  can read counters through the snapshot API but cannot construct
  `MetricsCell` itself, which prevents misuse from outside the
  codebase.

The alternatives considered and rejected:

- `Mutex<u64>`: cost dominates the counter increment.
- `AtomicU64`: the correct choice for counters that can be
  incremented concurrently through shared references; unnecessary
  for single-threaded or exclusively locked counters.
- `RefCell<u64>`: runtime borrow checking with panic on contention;
  not desirable on a metrics increment path.

`MetricsCell` is the smallest tool for single-threaded or exclusively
locked metric counters. Any policy or wrapper that records metrics
from a read-locked path must not rely on `MetricsCell` for soundness.

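
To illustrate the read-locked case, here is a sketch of a wrapper that counts hits with an `AtomicU64` while several threads hold `RwLock::read` guards at once. `SharedCache` and `concurrent_hits` are hypothetical names for this example and are not part of cachekit.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical wrapper: the map sits behind an RwLock, while the counter
/// lives beside it as an atomic so read-locked callers can record safely.
pub struct SharedCache {
    map: RwLock<HashMap<u32, u32>>,
    get_hits: AtomicU64,
}

impl SharedCache {
    pub fn new(map: HashMap<u32, u32>) -> Self {
        Self { map: RwLock::new(map), get_hits: AtomicU64::new(0) }
    }

    /// Many threads may hold the read guard at once, so the increment
    /// must be atomic. Relaxed suffices: only the count itself matters.
    pub fn get(&self, key: u32) -> Option<u32> {
        let guard = self.map.read().unwrap();
        let value = guard.get(&key).copied();
        if value.is_some() {
            self.get_hits.fetch_add(1, Ordering::Relaxed);
        }
        value
    }

    pub fn hits(&self) -> u64 {
        self.get_hits.load(Ordering::Relaxed)
    }
}

/// Spawns `threads` readers that each perform `reads` successful lookups,
/// then returns the total recorded after all of them have been joined.
pub fn concurrent_hits(threads: usize, reads: u64) -> u64 {
    let cache = Arc::new(SharedCache::new(HashMap::from([(1, 10)])));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                for _ in 0..reads {
                    cache.get(1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cache.hits()
}
```

Replacing the atomic with a `Cell<u64>` would not compile here, because `Cell` is `!Sync`. Forcing it through with an `unsafe impl Sync` would introduce exactly the data race the contract above forbids.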
Every snapshot struct in src/metrics/snapshot.rs
follows the same shape:
```rust
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct LruMetricsSnapshot {
    pub get_calls: u64,
    pub get_hits: u64,
    pub get_misses: u64,
    pub insert_calls: u64,
    pub insert_updates: u64,
    pub insert_new: u64,
    pub evict_calls: u64,
    pub evicted_entries: u64,
    pub pop_lru_calls: u64,
    pub pop_lru_found: u64,
    pub peek_lru_calls: u64,
    pub peek_lru_found: u64,
    pub touch_calls: u64,
    pub touch_found: u64,
    pub recency_rank_calls: u64,
    pub recency_rank_found: u64,
    pub recency_rank_scan_steps: u64,
    pub cache_len: usize,
    pub insertion_order_len: usize,
    pub capacity: usize,
}
```

Five intentional properties:

- `Copy`. A snapshot is a flat block of `u64`s and `usize`s.
  Copying is a memcpy and snapshots can flow through channels,
  futures, and test assertions without ceremony.
- `Default`. Equivalent to “no operations recorded.” Useful for
  test fixtures and explicit reset comparisons.
- `#[non_exhaustive]`. Adding a new counter (e.g. when a policy
  variant gains a new internal step) is a minor version bump.
  Downstream code matching on the struct must accept new fields
  gracefully, which is the standard `non_exhaustive` discipline.
- `PartialEq + Eq`. Snapshot equality is well-defined and useful in
  tests. Two snapshots compare equal iff every counter matches.
- `serde`. Gated on the `serde` feature, not unconditional, so the
  metrics module doesn’t drag serde into builds that don’t want it.

Gauges (`cache_len`, `insertion_order_len`, `capacity`) live
alongside counters and are captured in the same snapshot. The
Prometheus exporter writes the right `# TYPE` line for each, which
matters for the scraper.

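
The sketch below shows what `Copy`, `Default`, and `Eq` buy in practice. It assumes a trimmed-down three-field stand-in for the snapshot struct; `hit_rate` and `roundtrip` are hypothetical helpers written for this example.

```rust
use std::sync::mpsc;

/// Trimmed-down stand-in for a snapshot struct (three of the real fields).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub struct LruMetricsSnapshot {
    pub get_hits: u64,
    pub get_misses: u64,
    pub cache_len: usize, // gauge, captured alongside the counters
}

/// Hypothetical helper: hit rate derived from a snapshot, or None when
/// no gets were recorded (which includes the Default snapshot).
pub fn hit_rate(s: LruMetricsSnapshot) -> Option<f64> {
    let total = s.get_hits + s.get_misses;
    if total == 0 {
        None
    } else {
        Some(s.get_hits as f64 / total as f64)
    }
}

/// Sends a snapshot through a channel and keeps using the original,
/// which compiles only because the type is Copy.
pub fn roundtrip(s: LruMetricsSnapshot) -> (LruMetricsSnapshot, LruMetricsSnapshot) {
    let (tx, rx) = mpsc::channel();
    tx.send(s).unwrap(); // `s` is copied into the channel, not moved
    (s, rx.recv().unwrap())
}
```

Outside the defining crate, `#[non_exhaustive]` rules out struct-literal construction, so downstream tests would typically start from `Default::default()` and assign the public fields they care about.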
Two operating models coexist:

- The policy calls `m.record_get_hit()` directly. The recorder method
  has the cheapest possible body (one `+= 1`). This is the hot-path
  contract.
- Consumers call `m.snapshot()` whenever they want a value, and
  `MetricsReset::reset_metrics(&self)` when they want to clear.
  Nothing about the policy timing depends on consumption.

Specifically, the policy does not push to the exporter. There is no
observer-pattern hook from the recorder to the exporter, no
synchronous flush on every increment, and no async channel between
them. The pull model lets benches consume at known checkpoints (once
per iteration), and lets production scrapers poll on their own
cadence (every 10 s, every minute, etc.).
The cost of the pull model is that an exporter cannot react to a specific event (e.g. “evictions spiked above N”). cachekit users who need event-driven reactions instrument at the application layer, not the metrics layer.
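
A bench-style consumer under the pull model might look like the following sketch. The trait shapes are simplified guesses based on the names used in this document, and `Metrics`, `Snapshot`, and `run_iterations` are hypothetical. `Cell` is used only so that `reset_metrics` can take `&self`, on the assumption that access is single-threaded.

```rust
use std::cell::Cell;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Snapshot {
    pub get_hits: u64,
    pub get_misses: u64,
}

/// Pull-side traits, simplified from the names used in this document.
pub trait MetricsSnapshotProvider<S> {
    fn snapshot(&self) -> S;
}
pub trait MetricsReset {
    fn reset_metrics(&self);
}

#[derive(Debug, Default)]
pub struct Metrics {
    get_hits: Cell<u64>,
    get_misses: Cell<u64>,
}

impl Metrics {
    // Push side: the hot path does one increment and nothing else.
    pub fn record_get_hit(&mut self) { *self.get_hits.get_mut() += 1; }
    pub fn record_get_miss(&mut self) { *self.get_misses.get_mut() += 1; }
}

impl MetricsSnapshotProvider<Snapshot> for Metrics {
    fn snapshot(&self) -> Snapshot {
        Snapshot { get_hits: self.get_hits.get(), get_misses: self.get_misses.get() }
    }
}

impl MetricsReset for Metrics {
    fn reset_metrics(&self) {
        self.get_hits.set(0);
        self.get_misses.set(0);
    }
}

/// Bench-style consumer: one snapshot per iteration, then a reset.
/// The recorder never learns when, or whether, it is being read.
pub fn run_iterations(m: &mut Metrics, hits_per_iter: &[u64]) -> Vec<Snapshot> {
    let mut out = Vec::new();
    for &hits in hits_per_iter {
        for _ in 0..hits {
            m.record_get_hit();
        }
        m.record_get_miss();
        out.push(m.snapshot()); // checkpoint chosen by the consumer
        m.reset_metrics();
    }
    out
}
```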
The shipped exporter (PrometheusTextExporter in
src/metrics/exporter.rs) writes
the Prometheus text exposition format to any W: Write + Send:
```rust
let exporter = PrometheusTextExporter::new("myapp_cache", io::stdout());
let snapshot = lru_cache.snapshot();
exporter.export(&snapshot);
```

Three design choices worth naming:

- The metric prefix (`myapp_cache`) is set at construction, not per
  call. This keeps the call site simple and enforces a single metric
  namespace per exporter instance.
- `export` does not return a `Result`. The contract is
  “fire-and-forget monitoring”: a transient EPIPE to a metrics
  socket must not interrupt cache operations. Callers who need
  guaranteed delivery should wrap their writer in something with
  retry semantics and accept the cost.
- The writer sits behind a `Mutex<W>`, not `RwLock<W>`. Writing is
  always exclusive; there’s no read path. Using `Mutex` here is
  the right primitive even though most of cachekit uses
  `parking_lot::RwLock`. (Note: this is `std::sync::Mutex`,
  poisoning-aware. `export` panics on poisoning. This is a
  deliberate divergence from `parking_lot`: the exporter is on
  the cold path and the std mutex’s poisoning behaviour is fine
  there.)

Other exporters (StatsD, OpenTelemetry, custom) plug in by
implementing `MetricsExporter<S>` for each snapshot type they
care about. No changes elsewhere in the crate are required.
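
As a sketch of what plugging in an exporter could look like, the example below writes two lines per metric in the Prometheus text format. It is not the shipped `PrometheusTextExporter`: the `MetricsExporter` signature, the `Snapshot` and `TextExporter` types, `into_inner`, and the metric names are all assumptions made for illustration.

```rust
use std::io::Write;
use std::sync::Mutex;

#[derive(Debug, Default, Clone, Copy)]
pub struct Snapshot {
    pub get_hits: u64,    // counter
    pub cache_len: usize, // gauge
}

/// Assumed shape of the exporter trait: takes &self, returns nothing.
pub trait MetricsExporter<S> {
    fn export(&self, snapshot: &S);
}

/// Hypothetical line-oriented exporter in the Prometheus text format.
pub struct TextExporter<W: Write + Send> {
    prefix: String,
    writer: Mutex<W>,
}

impl<W: Write + Send> TextExporter<W> {
    pub fn new(prefix: &str, writer: W) -> Self {
        Self { prefix: prefix.to_string(), writer: Mutex::new(writer) }
    }

    /// Consumes the exporter and hands back the writer (handy in tests).
    pub fn into_inner(self) -> W {
        self.writer.into_inner().unwrap()
    }
}

impl<W: Write + Send> MetricsExporter<Snapshot> for TextExporter<W> {
    fn export(&self, s: &Snapshot) {
        let mut w = self.writer.lock().unwrap(); // panics if poisoned
        // Fire-and-forget: I/O errors are dropped, not returned.
        let _ = writeln!(w, "# TYPE {}_get_hits counter", self.prefix);
        let _ = writeln!(w, "{}_get_hits {}", self.prefix, s.get_hits);
        let _ = writeln!(w, "# TYPE {}_cache_len gauge", self.prefix);
        let _ = writeln!(w, "{}_cache_len {}", self.prefix, s.cache_len);
    }
}
```

Writing into a `Vec<u8>` makes the output easy to assert on in tests; in production the same type would wrap a socket or file handle.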
The entire metrics subsystem is gated on the metrics Cargo
feature:
```rust
// src/lib.rs
#[cfg(feature = "metrics")]
pub mod metrics;
```

Inside each policy, recorder calls are wrapped:

```rust
#[cfg(feature = "metrics")]
self.metrics.record_get_hit();
```

When `metrics` is off:

- The `metrics` module disappears from the build.
- Every `record_*` call site becomes a no-op (the `#[cfg]` block
  compiles away).

When `metrics` is on:

- Each recorder call costs one `u64 += 1` (or one `Cell::set` for
  read-only counters). For a 17-policy `DynCache` that records on
  every `get` / `insert`, the overhead is sub-nanosecond and shows
  up in benches as a flat regression.
- The `metrics::snapshot` and `metrics::exporter` modules are in
  the public API and exporting infrastructure is available.

The trade-off is deliberate. No “low-cardinality always-on,
detailed-on-demand” two-tier scheme exists: every counter is either
always present (feature on) or absent (feature off). The discipline
that keeps “always present” cheap is the recorder contract: methods
do no work beyond incrementing a counter.
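
The gating pattern can be sketched end to end on a toy type. `SlotCache` and this `Metrics` struct are hypothetical; the point is that the field, its type, and every call site carry the same `#[cfg]`, so a build without the feature contains none of them.

```rust
/// Recorder state exists only when the `metrics` feature is enabled.
#[cfg(feature = "metrics")]
#[derive(Debug, Default)]
pub struct Metrics {
    pub get_hits: u64,
    pub get_misses: u64,
}

#[cfg(feature = "metrics")]
impl Metrics {
    pub fn record_get_hit(&mut self) { self.get_hits += 1; }
    pub fn record_get_miss(&mut self) { self.get_misses += 1; }
}

/// Toy single-slot cache used only to show the gating pattern.
#[derive(Default)]
pub struct SlotCache {
    slot: Option<(u32, u32)>,
    #[cfg(feature = "metrics")]
    metrics: Metrics,
}

impl SlotCache {
    pub fn insert(&mut self, key: u32, value: u32) {
        self.slot = Some((key, value));
    }

    pub fn get(&mut self, key: u32) -> Option<u32> {
        match self.slot {
            Some((k, v)) if k == key => {
                // Removed entirely when the feature is off.
                #[cfg(feature = "metrics")]
                self.metrics.record_get_hit();
                Some(v)
            }
            _ => {
                #[cfg(feature = "metrics")]
                self.metrics.record_get_miss();
                None
            }
        }
    }
}
```

The cache's observable behaviour is identical in both configurations; only the presence of the counters differs.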
`StoreMetrics`?

`StoreMetrics` (`src/store/traits.rs`)
is a separate, simpler structure that ships unconditionally
(not behind metrics). It carries the universal counters every
store-layer implementation tracks:
```rust
#[non_exhaustive]
pub struct StoreMetrics {
    pub hits: u64,
    pub misses: u64,
    pub inserts: u64,
    pub updates: u64,
    pub removes: u64,
    pub evictions: u64,
    pub expirations: u64,
}
```

The two systems coexist:

- `StoreMetrics` is the store-layer baseline. Always present, always
  cheap, seven counters. `expirations` stays at 0 on stores that
  do not own a TTL surface.
- `src/metrics/` (feature-gated) is the policy-layer detailed
  metrics: recorder traits, snapshots, exporter, per-policy signals.

A store typically backs `StoreMetrics` with `AtomicU64` counters
(see StoreCounters in src/store/weight.rs),
because stores are often behind concurrent wrappers and the
increment paths can be &self. The split mirrors the
sequential-vs-concurrent split at the trait level
(concurrency.md).
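
One way an atomic backing could map onto the plain struct is sketched below, using three of the seven fields. The `inc_*` and `snapshot` method names are assumptions; the real `StoreCounters` in `src/store/weight.rs` may be organised differently.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Plain value type, mirroring a subset of the StoreMetrics fields.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct StoreMetrics {
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
}

/// Hypothetical atomic backing; increments work through &self.
#[derive(Debug, Default)]
pub struct StoreCounters {
    hits: AtomicU64,
    misses: AtomicU64,
    evictions: AtomicU64,
}

impl StoreCounters {
    pub fn inc_hit(&self) { self.hits.fetch_add(1, Ordering::Relaxed); }
    pub fn inc_miss(&self) { self.misses.fetch_add(1, Ordering::Relaxed); }
    pub fn inc_eviction(&self) { self.evictions.fetch_add(1, Ordering::Relaxed); }

    /// Each field is loaded independently, so with concurrent writers the
    /// result is not an atomic view across counters.
    pub fn snapshot(&self) -> StoreMetrics {
        StoreMetrics {
            hits: self.hits.load(Ordering::Relaxed),
            misses: self.misses.load(Ordering::Relaxed),
            evictions: self.evictions.load(Ordering::Relaxed),
        }
    }
}
```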
Three rules every recorder method follows:

- `+=` always succeeds; wrapping on `u64` overflow is acceptable (at
  one billion increments per second it takes about 585 years).

The corollary: a policy that wants a derived counter (“number of
evictions where the victim’s recency rank was > 10”) computes the
condition itself and calls one of two existing methods accordingly.
Putting the branching inside the recorder would couple metrics to
policy state.
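
The corollary can be sketched as follows. `EvictionMetrics`, its two methods, and `record_eviction` are hypothetical names; the threshold of 10 is the one from the example in the text.

```rust
/// Recorder with two plain counters; neither method inspects policy state.
#[derive(Debug, Default)]
pub struct EvictionMetrics {
    pub evictions_deep_rank: u64,
    pub evictions_shallow_rank: u64,
}

impl EvictionMetrics {
    pub fn record_evict_deep_rank(&mut self) { self.evictions_deep_rank += 1; }
    pub fn record_evict_shallow_rank(&mut self) { self.evictions_shallow_rank += 1; }
}

/// Policy-side code: the threshold comparison lives here, not in the
/// recorder, so the recorder bodies stay a single increment each.
pub fn record_eviction(metrics: &mut EvictionMetrics, victim_recency_rank: usize) {
    const DEEP_RANK_THRESHOLD: usize = 10;
    if victim_recency_rank > DEEP_RANK_THRESHOLD {
        metrics.record_evict_deep_rank();
    } else {
        metrics.record_evict_shallow_rank();
    }
}
```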
Checklist for adding a per-policy counter:

1. Add the field: `u64` if it’s updated on `&mut self` paths;
   `MetricsCell` if it’s updated on `&self` paths. Place it in the
   corresponding `*Metrics` struct under
   `src/metrics/metrics_impl.rs`.
2. Add a method to the matching `*MetricsRecorder` trait (or its
   `*ReadRecorder` counterpart for `&self`).
3. Implement the method with a `+= 1` body.
4. Call it from the policy under `#[cfg(feature = "metrics")]`.
5. Add the field to the snapshot struct in `src/metrics/snapshot.rs`.
   The snapshot’s `From<&*Metrics>` (or equivalent) needs the new
   field.
6. Add a `write_counter` / `write_gauge` call in
   `PrometheusTextExporter::export` for the new field.

Six locations is a lot of friction for a new counter. The friction is
intentional: adding a counter is rarely the right answer to a
debugging question, and the friction encourages reuse of existing
counters where possible.
Histograms and sliding windows are deliberately out of scope. Adding
either is a wider design change:

- Recorders assume `&mut u64 += 1` semantics. A histogram needs
  `observe(value)` semantics and an aggregation strategy.
- Snapshots assume `Copy` and `u64` fields. A histogram snapshot
  needs bucket arrays.

If histograms become needed (the most likely use case is latency
distribution per policy), the design has space: introduce a
HistogramRecorder trait alongside CoreMetricsRecorder and a
matching `HistogramSnapshot`. The existing exporter stays
counter-and-gauge-only; a new `PrometheusHistogramExporter` handles the
new shape. The current omission is a coverage decision, not a
foundation problem.
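
Purely as an illustration of the shape such an extension might take, the sketch below uses fixed bucket bounds. Only the names `HistogramRecorder` and `HistogramSnapshot` come from the paragraph above; the bounds, fields, and `LatencyHistogram` type are assumptions, and nothing like this ships today.

```rust
/// Inclusive upper bounds of the finite buckets, in nanoseconds.
/// One extra overflow bucket follows them.
pub const BUCKET_BOUNDS_NS: [u64; 4] = [100, 1_000, 10_000, 100_000];

/// observe(value) semantics, in contrast to the counters' `+= 1`.
pub trait HistogramRecorder {
    fn observe(&mut self, value: u64);
}

/// A fixed-size array keeps the snapshot Copy, like the counter snapshots.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct HistogramSnapshot {
    pub buckets: [u64; 5], // per-bucket (non-cumulative) counts
    pub count: u64,
    pub sum: u64,
}

#[derive(Debug, Default)]
pub struct LatencyHistogram {
    buckets: [u64; 5],
    count: u64,
    sum: u64,
}

impl HistogramRecorder for LatencyHistogram {
    fn observe(&mut self, value: u64) {
        // First bucket whose bound covers the value, else the overflow bucket.
        let idx = BUCKET_BOUNDS_NS
            .iter()
            .position(|&bound| value <= bound)
            .unwrap_or(BUCKET_BOUNDS_NS.len());
        self.buckets[idx] += 1;
        self.count += 1;
        self.sum = self.sum.wrapping_add(value);
    }
}

impl LatencyHistogram {
    pub fn snapshot(&self) -> HistogramSnapshot {
        HistogramSnapshot { buckets: self.buckets, count: self.count, sum: self.sum }
    }
}
```

Even this small version needs a bucket search on every observation, which is the extra hot-path work that the counter-only recorder contract avoids.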
What the metrics system guarantees:

- `snapshot()` for the same thread.
- `u64` reads on 64-bit platforms.
- `PrometheusTextExporter` on poisoned mutex), or alter cache state.

What it does not guarantee:

- A snapshot may show `hits = 100, misses = 99` while a concurrent
  writer is mid-update; the next snapshot may show
  `hits = 100, misses = 101`. There is no “snapshot epoch.”
- `MetricsCell` recording. `MetricsCell` must not be incremented from
  multiple read-locked callers. Shared read locks do not serialize
  readers, so those paths must use atomics or acquire an exclusive
  lock before recording. Metrics may be best-effort, but the
  implementation still has to be data-race-free.
- `overflow-checks = false`. Counters wrap silently. At one billion
  events per second, `u64` wraps in ~585 years: practically a
  non-issue, formally not a guarantee.

See also:

- `&self` / `&mut self` split that drives the read-vs-mutate recorder fork
- `MetricsCell` may be used
- `src/metrics/`: the canonical implementation
- `src/store/traits.rs`: `StoreMetrics`, the unconditional store-layer counterpart