Cache-Aware Patterns¶
When context spans billions of tokens across a GPU cluster, cache misses dominate execution time.
The Key Insight
Cache awareness is a property of the multi-agent system as a whole, not of individual agents or tools. Agents must collaboratively generate plans that respect cache constraints and schedule actions so that, system-wide, cache locality is maximized and costly misses are minimized. Colony provides the infrastructure and coordination mechanisms to enable this transparently.
This distinction matters. You cannot make individual tools or actions or agents "cache-aware" in isolation. The most efficient strategy for managing KV cache pages depends on the data being analyzed at runtime: the structure of the context, the relationships between pages, the current state of the reasoning process. Only the planner, which sees the full picture, can make cache-aware decisions.
Colony's architecture reflects this: cache awareness is woven into the planning layer, not bolted onto the execution layer.
Working Sets as Resources¶
In Colony's Virtual Context Manager (VCM), context pages are limited resources that must be allocated and coordinated across agents, exactly like physical memory pages in an operating system.
```mermaid
graph LR
    subgraph "Agent 1 Working Set"
        WS1A[Page A]
        WS1B[Page B]
        WS1C[Page C]
    end
    subgraph "Agent 2 Working Set"
        WS2B[Page B - Shared]
        WS2D[Page D]
        WS2E[Page E]
    end
    subgraph "GPU KV Cache"
        KV[Limited Capacity]
    end
    WS1A --> KV
    WS1B --> KV
    WS1C --> KV
    WS2B --> KV
    WS2D --> KV
    WS2E --> KV
```
This creates planning constraints that traditional agent frameworks never consider:
- Cache contention: Two agents need different pages that compete for the same KV cache slots
- Working set overflow: An agent's plan requires more pages than cache capacity allows
- Exclusive pages: Some pages must not be evicted while a plan is active (high-frequency access pages)
- Shareable pages: Read-only pages that multiple agents can access without conflict
Plan Conflicts Include Cache Contention
In Colony, conflict detection between agent plans is not limited to logical dependencies (data conflicts, ordering constraints). Cache contention is a first-class conflict type. Two plans that are logically independent may still conflict if their combined working sets exceed cache capacity.
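As a sketch, the capacity side of this check reduces to a set-union test. The function name and the plan representation here are illustrative, not Colony's actual API:

```python
def cache_conflict(ws_a: set[str], ws_b: set[str], cache_capacity: int) -> bool:
    """Two logically independent plans still conflict if their combined
    working set cannot fit in the KV cache at the same time.

    Shared pages are counted once: a page cached for one agent is a
    cache hit for the other.
    """
    combined = ws_a | ws_b  # union: a shared page occupies one slot
    return len(combined) > cache_capacity

# Two plans sharing page B: 5 distinct pages against a 4-page cache.
a = {"A", "B", "C"}
b = {"B", "D", "E"}
print(cache_conflict(a, b, cache_capacity=4))  # True: 5 pages > 4 slots
print(cache_conflict(a, b, cache_capacity=5))  # False: fits exactly
```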
CacheContext in Plans¶
Every plan in Colony includes comprehensive cache context as a core part of the plan that the planner reasons about:
```python
from typing import Any

from pydantic import BaseModel


class CacheContext(BaseModel):
    # What pages does this plan need?
    working_set: list[str]
    working_set_priority: dict[str, float]  # page_id -> 0.0 to 1.0

    # How will pages be accessed?
    estimated_access_pattern: dict[str, int]  # page_id -> access count
    access_sequence: list[str]  # Expected order of access

    # Page graph context
    page_graph_summary: dict[str, Any]  # Clusters, relationships, density

    # Cache requirements
    min_cache_size: int  # Can tolerate some misses
    ideal_cache_size: int  # Optimal performance

    # Coordination with other agents
    shareable_pages: list[str]  # Safe to share
    exclusive_pages: list[str]  # Do not evict

    # Prefetching
    prefetch_pages: list[str]
    prefetch_priority: dict[str, float]
```
The LLM planner uses this context to make decisions that a cache-unaware system would get wrong:
- Action ordering: Group actions that access the same pages together to maximize cache hits
- Prefetch scheduling: Declare page needs upfront so the VCM can load pages before they are needed
- Working set sizing: Scale the plan's ambition to fit available cache capacity
- Cache-conscious revision: When revising a plan mid-execution, preserve cache locality when possible
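The first of these, action ordering, can be sketched as a greedy pass that keeps page-sharing actions adjacent. The helper and action names are hypothetical; it only assumes each action declares the pages it touches:

```python
def order_for_locality(actions: dict[str, set[str]]) -> list[str]:
    """Greedy ordering: repeatedly pick the action whose page set overlaps
    most with the pages touched by the previous action, so consecutive
    actions reuse cached pages."""
    remaining = dict(actions)
    # Start from the action with the largest working set.
    current = max(remaining, key=lambda a: len(remaining[a]))
    order = [current]
    pages = remaining.pop(current)
    while remaining:
        current = max(remaining, key=lambda a: len(remaining[a] & pages))
        pages = remaining.pop(current)
        order.append(current)
    return order

actions = {
    "parse_config": {"config"},
    "read_module": {"mod", "types"},
    "check_types": {"types", "ifaces"},
    "lint_config": {"config"},
}
# Actions sharing pages end up adjacent: the two config actions run
# back to back instead of reloading the config page twice.
print(order_for_locality(actions))
```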
Amortized Cost: From \(O(N^2)\) to \(O(N \log N)\)¶
The initial cost of reasoning over \(N\) pages is \(O(N^2)\): in the first round, queries generated by each page may need to be routed to any other page. This is expensive, but it builds the page attention graph: a record of which pages actually answer queries from which other pages.
As the page attention graph stabilizes over successive reasoning rounds, two efficiency gains emerge:
- Smarter routing: Queries are routed using the graph rather than broadcast to all pages, reducing per-round cost to \(O(N \log N)\)
- Better cache locality: Pages that frequently exchange queries are identified as belonging to the same working set and co-located in cache
The amortized cost per round is:

\[
\frac{O(N^2) + (R - 1)\,O(N \log N)}{R} \;\longrightarrow\; O(N \log N) \quad \text{as } R \to \infty
\]

where \(R\) is the number of reasoning rounds. Deep reasoning tasks inherently require many rounds (\(R\) grows with task complexity), so the quadratic startup cost is amortized away.
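A quick numeric illustration of the amortization. The constants and unit costs are illustrative; only the asymptotic shapes come from the analysis above:

```python
import math

def amortized_cost_per_round(n: int, rounds: int) -> float:
    """First round pays O(N^2) to build the page attention graph;
    every later round routes via the graph at O(N log N)."""
    first = n * n                 # round 1: full routing
    later = n * math.log2(n)      # rounds 2..R: graph-guided routing
    return (first + (rounds - 1) * later) / rounds

n = 1024
print(amortized_cost_per_round(n, 1))    # 1048576.0 -- pure quadratic
print(amortized_cost_per_round(n, 100))  # ~20623 -- dominated by N log N
```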
```mermaid
graph TD
    R1["Round 1: O(N^2) - Full routing"] --> G1[Page Graph: Sparse]
    G1 --> R2[Round 2: Cheaper routing via graph]
    R2 --> G2[Page Graph: Denser]
    G2 --> R3[Round 3: Even cheaper]
    R3 --> G3[Page Graph: Stabilizing]
    G3 --> RN["Round N: O(N log N) amortized"]
    style R1 fill:#ff9999
    style R2 fill:#ffcc99
    style R3 fill:#ffff99
    style RN fill:#99ff99
```
The Working Set Conjecture
Colony hypothesizes that working set drift (measured by Jaccard similarity between page sets at different time steps) exhibits episodic behavior: periods of high stability interspersed with sharp transitions as agents shift focus. Cache-aware scheduling exploits this by accumulating out-of-working-set queries during stable episodes and batching page replacements at transition points.
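The drift measurement behind the conjecture can be sketched directly. The threshold and function names are illustrative:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity between working sets at two time steps."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def is_transition(prev_ws: set[str], cur_ws: set[str],
                  threshold: float = 0.5) -> bool:
    """A sharp drop in similarity marks an episode boundary: the point
    where batched page replacements should be applied."""
    return jaccard(prev_ws, cur_ws) < threshold

stable = ({"A", "B", "C"}, {"A", "B", "C", "D"})  # drift within an episode
shift = ({"A", "B", "C"}, {"D", "E", "F"})        # sharp focus change
print(is_transition(*stable))  # False: similarity 0.75
print(is_transition(*shift))   # True: similarity 0.0
```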
Missing Implementation
- Query routing based on page graph. Does attention-based query routing modify the page graph?
- Query routing should rank all pages by attention/page-graph proximity, and route the query to the top K pages until a satisfactory answer is found or all pages are exhausted. This is a fundamental primitive for cache-aware scheduling, but its implementation status is not clear.
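If implemented, the ranking half of this primitive might look roughly like the sketch below. The proximity representation and signature are assumptions, not the codebase's API; a caller would try the returned pages in order until one answers, falling back to the rest only if none does:

```python
def route_query(query_page: str,
                proximity: dict[tuple[str, str], float],
                all_pages: list[str],
                k: int = 3) -> list[str]:
    """Rank candidate pages by page-graph proximity to the querying page
    and return the top-K routing targets; unseen pairs score 0.0."""
    candidates = [p for p in all_pages if p != query_page]
    candidates.sort(key=lambda p: proximity.get((query_page, p), 0.0),
                    reverse=True)
    return candidates[:k]

proximity = {("A", "B"): 0.9, ("A", "C"): 0.2, ("A", "D"): 0.7}
print(route_query("A", proximity, ["A", "B", "C", "D", "E"], k=2))  # ['B', 'D']
```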
Model-Predictive Control for Plans¶
Colony does not execute plans to completion. It uses a Model-Predictive Control (MPC) approach: execute part of the plan, observe the results, re-evaluate, and adapt.
This is essential for cache-aware planning because:
- Cache state changes: Other agents load and evict pages, invalidating assumptions about what is cached
- Page graph updates: New relationships are discovered during execution, changing the optimal access pattern
- Working set drift: The reasoning process may discover that different pages are more relevant than initially predicted
The MPC approach means plans are living documents. The planner generates a plan with cache context, the executor runs the first few steps, the planner observes actual cache hit rates and page access patterns, and then revises the remaining plan accordingly.
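The execute-observe-revise loop can be sketched abstractly; the step and revise callbacks here are toy stand-ins for the executor and planner, not Colony's real interfaces:

```python
def mpc_execute(plan, horizon, run_step, revise):
    """Model-predictive execution: run only `horizon` steps at a time,
    observe per-step cache stats, then let the planner revise the tail."""
    executed = []
    while plan:
        window, plan = plan[:horizon], plan[horizon:]
        stats = [run_step(step) for step in window]  # e.g. HIT/MISS per step
        executed.extend(window)
        plan = revise(plan, stats)                   # planner adapts the rest
    return executed

# Toy run: after observing a miss, the planner drops remaining accesses
# to the page that turned out to be less useful than expected.
cached = {"A", "B"}
run = lambda step: "HIT" if step in cached else "MISS"
revise = lambda rest, stats: ([s for s in rest if s != "D"]
                              if "MISS" in stats else rest)
print(mpc_execute(["A", "B", "D", "D", "C"], horizon=3,
                  run_step=run, revise=revise))  # ['A', 'B', 'D', 'C']
```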
```mermaid
sequenceDiagram
    participant P as Planner
    participant E as Executor
    participant V as VCM
    participant C as Cache
    P->>V: Query current cache state
    V-->>P: Cached pages, capacity
    P->>P: Generate plan with CacheContext
    P->>V: Prefetch requested pages
    P->>E: Execute steps 1-3
    E->>C: Access Page A (HIT)
    E->>C: Access Page B (HIT)
    E->>C: Access Page D (MISS)
    E-->>P: Results + actual cache stats
    P->>P: Revise plan (page D less useful than expected)
    P->>E: Execute revised steps 4-6
```
Page Affinity Routing¶
As in NUMA-aware scheduling in operating systems, computation should be co-located with the data it needs.
Similarly, an agent can be bound to one or more context pages. The PageAffinityRouter (used by the LLM cluster to route agent spawn requests) selects the Ray node where the most relevant pages are already cached:
- Soft affinity: Best-effort scheduling to nodes with relevant pages. The agent can run elsewhere if no good match exists.
- Hard affinity: The agent must run on a node with specific pages. Used when cache miss penalty would make execution impractical.
Cache-Aware Conflict Resolution¶
When two plans conflict over cache resources, Colony resolves the conflict through a hierarchy of strategies:
- Page sharing: If all overlapping pages are read-only, both plans can proceed. Mark pages as shared.
- Priority-based allocation: Higher-priority plans keep contested pages. Lower-priority plans adjust their working sets.
- Temporal separation: Stagger execution so plans run sequentially, reusing cached pages.
- Working set partitioning: Split contested pages between plans based on access frequency -- assign each page to the plan that will access it more often.
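The last strategy, partitioning by access frequency, is straightforward to sketch. The names and tie-breaking rule are illustrative:

```python
def partition_contested(contested: set[str],
                        freq_a: dict[str, int],
                        freq_b: dict[str, int]) -> tuple[set[str], set[str]]:
    """Assign each contested page to the plan expected to access it more
    often; ties go to plan A."""
    to_a: set[str] = set()
    to_b: set[str] = set()
    for page in contested:
        if freq_a.get(page, 0) >= freq_b.get(page, 0):
            to_a.add(page)
        else:
            to_b.add(page)
    return to_a, to_b

a_freq = {"X": 10, "Y": 2}
b_freq = {"X": 3, "Y": 8}
print(partition_contested({"X", "Y"}, a_freq, b_freq))  # ({'X'}, {'Y'})
```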
Cache Thrashing
The worst outcome is two plans with high working set overlap running simultaneously, causing continuous eviction and reloading of the same pages. Colony's conflict detection flags this as a high-severity conflict when overlap exceeds 70% of either plan's working set.
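The 70% rule can be checked per plan pair; this is a sketch, with the real detector presumably operating on each plan's declared working set:

```python
def thrashing_risk(ws_a: set[str], ws_b: set[str],
                   threshold: float = 0.7) -> bool:
    """Flag a high-severity conflict when the overlap exceeds `threshold`
    of either plan's working set."""
    overlap = len(ws_a & ws_b)
    return (overlap / len(ws_a) > threshold) or (overlap / len(ws_b) > threshold)

a = {"P1", "P2", "P3", "P4"}
b = {"P1", "P2", "P3", "P9", "P10"}
print(thrashing_risk(a, b))  # True: overlap of 3 is 75% of plan A's set
```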
Nonuniform Page Sizes¶
Page size affects cache efficiency in non-obvious ways:
- Too large: Spurious edges in the page graph (pages appear related because they contain unrelated content that happens to share a page). This causes unnecessary page loads.
- Too small: Edges that should be internalized (handled by in-context learning within a single page) are externalized (handled by inter-page context exchange). This increases coordination overhead.
Colony supports nonuniform page sizes to match the natural structure of the data. A concise configuration file might be one page; a large module might be split across several. The page graph captures the actual semantic relationships regardless of page boundaries.
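One way to realize nonuniform pages is to pack natural units (functions, config sections) into pages without ever splitting a unit, so page boundaries follow the data's structure rather than a fixed size. This is a hedged sketch, not Colony's actual paginator:

```python
def paginate(blocks: list[str], max_tokens: int,
             count=lambda s: len(s.split())) -> list[str]:
    """Pack natural units into pages up to a token budget. A unit larger
    than the budget becomes its own oversized page rather than being split."""
    pages: list[str] = []
    current: list[str] = []
    used = 0
    for block in blocks:
        n = count(block)
        if current and used + n > max_tokens:
            pages.append("\n\n".join(current))  # close the current page
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        pages.append("\n\n".join(current))
    return pages

# Small units share a page; a large unit gets a page to itself.
blocks = ["a b c", "d e", "f g h i j k", "l"]
print(len(paginate(blocks, max_tokens=6)))  # 3
```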
What This Looks Like in Practice¶
A concrete example: an agent tasked with understanding a dependency chain across a large codebase.
- The planner analyzes the task and identifies 40 relevant pages (source files, configs, tests)
- Cache capacity on the target node is 25 pages
- The planner generates a plan that:
- Groups the 40 pages into 3 phases based on the page graph (module boundaries)
- Schedules Phase 1 (15 pages) with prefetch hints for Phase 2
- Marks 5 central pages (interfaces, shared types) as exclusive -- they span all phases
- Declares 10 pages as shareable with other agents analyzing the same codebase
- During execution, Phase 1 discovers unexpected dependencies to 3 pages not in the original plan
- The planner revises Phase 2 to include those pages, dropping 3 low-priority pages to stay within cache capacity
- Phase 3 benefits from the stabilized page graph: cache hit rate is 85% vs 40% in Phase 1
The total execution time is dominated by Phase 1's cache misses. Phases 2 and 3 run significantly faster because the page graph guides routing and the cache contains the right pages. This is the amortized cost argument made concrete.