Hashing and Key Identity

Status: design rationale for hasher choices, key interning, and hash-based routing. Companion to concurrency.md, sharding.md, and the security notes in store/data-structure modules.

cachekit uses hashing in three different roles:

Lookup indexes (HashMapStore, policy maps, ghost indexes).
Compact key identity (KeyInterner).
Shard routing (ShardSelector).

Those roles have different threat models. Some code paths choose FxHash for speed on trusted keys; others default to RandomState or keyed SipHash because untrusted keys can create HashDoS or single-shard contention. This document explains those choices and when callers should override them.

The Decision Matrix

Component	Default hasher	Why	Caller override?
`HashMapStore`	`RandomState`	public store API, safer default	yes, `with_hasher`
`ClockRing`	`RandomState`	can be keyed by user input	yes, with explicit trust acknowledgement
`KeyInterner`	`FxBuildHasher`	hot internal mapping, trusted-key bias	yes, `with_hasher`
`WeightStore`	`FxHashMap`	speed, large-value target	no generic hasher today
Policy internals	mostly `FxHashMap`	hot metadata paths	generally no
`ShardSelector`	keyed SipHash-1-3	routing must resist shard pinning	seed or randomized constructor

The rule: default to DoS-resistant hashing at public boundaries; use faster hashing inside policy metadata when keys are trusted or already admitted.

`RandomState`: Safe Public Default

HashMapStore and ClockRing default to std::collections::hash_map::RandomState. This is the right public default because callers often pass keys derived from request paths, tenant ids, URLs, or filenames. Randomized hashing prevents an attacker from precomputing many keys that collide in one bucket.

The cost is per-hash overhead. For workloads with fully trusted keys (for example, dense integer ids generated by the process), callers can use with_hasher to opt into a faster hasher. That opt-in is intentionally explicit: the call site documents the threat-model decision.

ClockRing goes further by using a KeysAreTrusted acknowledgement for faster non-randomized hashers. The extra marker makes the security trade visible in review rather than hidden in a type alias.

`FxHash`: Hot Internal Default

Many policy internals use rustc_hash::FxHashMap:

LRU-family maps from key to node pointer / slot id.
LFU/MFU frequency maps.
2Q / SLRU / Clock-PRO resident and ghost indexes.
WeightStore’s index.
KeyInterner’s default index.

FxHash is fast and deterministic. It is also non-cryptographic and not HashDoS-resistant. The intended use is trusted, already-admitted keys where the hash map is not directly exposed as an unbounded public endpoint.

The sharp edge is WeightStore: its target use case (variable-size objects like images, documents, blobs) often has user-derived keys. Its module docs call this out directly: pre-hash keys with a keyed hash or use HashMapStore if the key source is adversarial.

`KeyInterner`: Identity Compression, Not Security

KeyInterner maps external keys to compact u64 handles:

index: HashMap<K, u64, S>     keys: Vec<K>
"user:123" -> 0               keys[0] = "user:123"

The design goals:

Avoid repeated key cloning in hot paths.
Use compact handles in policy metadata and frequency maps.
Resolve a handle back to a key in O(1).

Handles are not capability tokens. They are sequential integers. A handle from one interner can silently resolve to a different key in another interner, and handles are reused after clear. Callers that store handles externally must pair them with generation() and reject stale generations.

Security implications:

The default FxBuildHasher is for trusted input.
Use with_hasher / with_capacity_and_hasher with RandomState when keys are derived from untrusted input.
KeyInterner is append-only until clear, so unique-key attacks can drive memory growth. Use try_intern and your own admission bound for untrusted keys.
Debug intentionally omits interned keys to avoid leaking URLs, user ids, or auth material into logs.

`ShardSelector`: Hashing for Routing

Shard routing has a different failure mode than lookup maps. A lookup hash collision slows one map; a routing collision pins the whole workload to one shard and defeats concurrency.

ShardSelector therefore uses keyed SipHash-1-3:

ShardSelector::randomized(shards) draws key material from RandomState. Use this for normal production sharding.
ShardSelector::new(shards, seed) is deterministic and reproducible. Treat seed as secret key material if adversaries can influence keys.

The selector reduces hash output to [0, shards) using fast range reduction rather than %, keeping distribution unbiased and cheap. The shard count is clamped to [1, MAX_SHARDS] to prevent user-controlled configs from allocating an unbounded number of locks or vectors.

Custom Hasher Rules

When adding a hasher parameter to a public type:

Default to RandomState unless the type is clearly internal-only.
Expose with_hasher and try_with_hasher if callers have legitimate trusted-key fast paths.
Document the threat model at the constructor, not only at module level.
Never hide a non-randomized hasher behind a harmless-sounding new.
If the hasher affects shard routing, prefer ShardSelector over ad hoc hashing so the keyed-routing contract stays centralized.

When using FxHashMap internally:

Keep it behind the policy or data-structure boundary.
Do not expose arbitrary insertions from untrusted users without a separate capacity/admission guard.
Mention the assumption in the module’s security notes if keys may be user controlled.

Serialization and Hash Seeds

Do not serialize hash seeds or hasher state unless the type is explicitly a deterministic routing artifact. ShardSelector::new(shards, seed) is the one place where reproducible routing is part of the public contract. RandomState and policy-internal hash maps should be reconstructed on deserialization.

Serializing raw hash-map order is also wrong. Hash-map iteration order changes with seeds and implementation details; serialized cache state should use stable semantic fields (keys, values, policy order) rather than map buckets.

Future Direction: Hasher Audit

The codebase intentionally mixes RandomState, FxHashMap, and SipHash. That mix is valid only while every use site has a documented threat model. A useful future hardening pass:

List every public constructor that accepts a key type.
Classify whether keys are trusted, user-supplied, or mixed.
Ensure user-supplied defaults are randomized.
Add KeysAreTrusted-style acknowledgement to any public non-randomized path.

CacheKit Docs

Hashing and Key Identity

The Decision Matrix

`RandomState`: Safe Public Default

`FxHash`: Hot Internal Default

`KeyInterner`: Identity Compression, Not Security

`ShardSelector`: Hashing for Routing

Custom Hasher Rules

Serialization and Hash Seeds

Future Direction: Hasher Audit

See Also

Hashing and Key Identity

The Decision Matrix

RandomState: Safe Public Default

FxHash: Hot Internal Default

KeyInterner: Identity Compression, Not Security

ShardSelector: Hashing for Routing

Custom Hasher Rules

Serialization and Hash Seeds

Future Direction: Hasher Audit

See Also

`RandomState`: Safe Public Default

`FxHash`: Hot Internal Default

`KeyInterner`: Identity Compression, Not Security

`ShardSelector`: Hashing for Routing