The NoRAG Paradigm¶
Colony rejects retrieval-augmented generation (RAG) as the foundation for deep reasoning over extremely long context.
Why Not RAG?¶
The RAG Problem
RAG activates only sparse subsets of a corpus at a time, missing the dense, cross-cutting connections essential for breakthrough insights. For tasks that require local (sparse) reasoning -- adding type annotations to a codebase, answering factual questions from a knowledge base -- this is fine. But Colony targets a different class of problems: tasks that require global (systemic, dense) reasoning, synthesizing insights from many disparate parts of the context across many iterative passes.
The NoRAG Advantage
In dense reasoning tasks, breakthroughs are unlocked by new insights synthesized from unpredictable combinations of previously known facts. A retrieval system, by definition, must predict which facts are relevant before the reasoning happens. This creates a chicken-and-egg problem: the most valuable connections are precisely the ones a retrieval model would not predict, because they span distant and seemingly unrelated parts of the context.
The major difference between RAG and NoRAG lies in the queries they excel at. RAG optimizes for recall of known-relevant information and suits queries whose relevant context can be predicted up front. NoRAG targets continuous research queries, optimizing for the ability to synthesize new insights from all available information. RAG hides context that "seems" irrelevant -- precisely the context where breakthroughs come from.
The Unifying Idea: Deep Research as a Game¶
Colony reconceptualizes deep research as a game with a large number of possible moves available to agents at every step. One class of moves is combinations of currently known facts that offer the smallest leap to new insights. Because the narrowest leap across the discovery front is often unpredictable -- breakthroughs emerge from unexpected connections between distant pieces of information -- the entire context must remain live, not filtered through retrieval.
A dynamic group of agents iteratively walks a page graph, accumulating state, communicating findings, and coordinating their traversal to maximize KV cache reuse. The page graph itself is built and refined as agents explore, creating a self-improving map of how context relates to itself.
```mermaid
graph TD
    A((Deep Research Task)) --> B[Build Initial Page Graph]
    B --> PG((Page Graph))
    PG --> MAS[Agent Swarm Traverses Graph] --> |New Connections| PG
    MAS --> CAS[Cache-Aware Scheduling] --> MAS
    MAS --> G[Games: Hypothesis, Negotiation, Coalition, Consensus Game] --> MAS
    MAS --> M[Memory Architecture] --> MAS
    MAS --> CAP[Capabilities: Page Attention, Reflection, Refinement, Validation, Grounding] --> MAS
    MAS --> F((Insights Synthesized))
```
Colony views deep research as a game where the moves available to agents are combinations of facts that offer the smallest leap to new insights. This framing has concrete architectural consequences:
- The game state is the full set of live context pages plus accumulated findings
- A move is a synthesis step that connects facts from different pages into a new insight
- The strategy is the order and combination in which pages are visited and cross-referenced
- Winning means reaching the deepest insights that the context can support
For this game to work, the entire context must remain live. You cannot play chess if most of the board is hidden behind a retrieval layer that only shows you the squares it thinks are relevant.
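As a minimal sketch of this framing (all names here -- `GameState`, `Fact`, `move` -- are hypothetical illustrations, not Colony's API), a move connects facts from different live pages into a new finding:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    page_id: str
    text: str

@dataclass
class Insight:
    supporting: tuple   # the facts that were combined
    text: str

@dataclass
class GameState:
    # All context pages stay live -- nothing is hidden behind retrieval.
    pages: dict                                    # page_id -> list[Fact]
    findings: list = field(default_factory=list)   # accumulated insights

def move(state: GameState, facts: tuple, new_text: str) -> GameState:
    # A "move" is a synthesis step crossing page boundaries.
    assert len({f.page_id for f in facts}) > 1, "moves connect facts from different pages"
    state.findings.append(Insight(supporting=facts, text=new_text))
    return state
```

The strategy of the game is then the order in which `move` is applied across page combinations; winning is exhausting the deepest insights the pages can support.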
The Retrieval Trap
Retrieval systems optimize for recall of known-relevant information. Deep reasoning requires discovery of unknown-relevant connections. These are fundamentally different objectives, and optimizing for the first actively harms the second by hiding context that "seems" irrelevant.
Why Not RNNs or State Space Models?¶
Recurrent neural networks (RNNs) and state space models (SSMs) like Mamba offer an alternative to transformers for processing long sequences: they compress context into a fixed-size hidden state. This sounds efficient, but it has a fatal flaw for deep reasoning.
Once an RNN or SSM decides to forget some context, it cannot recover it. The compression is irreversible. Information that seemed unimportant in early layers may turn out to be critical ten reasoning steps later, and there is no mechanism to retrieve it.
LLMs with external memory (Colony's architecture) can always retrieve forgotten context from external storage -- blackboard state, VCM pages, agent findings -- when the reasoning process discovers it is needed. This is the same advantage that random-access memory has over streaming tape: you can go back.
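A toy illustration of the contrast, with a hypothetical `ExternalMemory` class standing in for blackboard state, VCM pages, and agent findings:

```python
class ExternalMemory:
    # Hypothetical stand-in for Colony's external stores (blackboard,
    # VCM pages, agent findings). Unlike an RNN/SSM hidden state, writes
    # are never lost to compression: any later step can re-read them.
    def __init__(self):
        self._store = {}

    def write(self, key, value):
        self._store[key] = value

    def recall(self, key):
        # Random access back into the past -- the "go back" a streaming
        # recurrent state cannot offer.
        return self._store.get(key)

mem = ExternalMemory()
mem.write("page:auth-module", "uses a deprecated hash function")
# ...many reasoning steps later, the detail turns out to be critical:
assert mem.recall("page:auth-module") == "uses a deprecated hash function"
```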
Irreversible Forgetting
This is not a limitation that better training will fix. It is a structural property of recurrent architectures. The hidden state has finite capacity, and any compression scheme must discard information. Deep reasoning over extremely long context requires that nothing be permanently discarded until the task is complete.
Virtual Memory for LLMs¶
If you cannot retrieve-and-forget, you need a system that can manage context at the scale of billions of tokens. Colony's answer is to treat KV cache management like an operating system treats virtual memory.
| OS Virtual Memory | Colony VCM |
|---|---|
| Virtual address space | Virtual context pages |
| Physical RAM | GPU KV cache capacity |
| Page tables | Global page table |
| Page faults | Page faults |
| Working set | Active pages in KV caches of all LLM instances |
| Page replacement (LRU, etc.) | Page replacement (LRU, etc.) |
| Prefetching | Speculative page loading from page graph |
Context is partitioned into pages and managed through a Virtual Context Manager (VCM) that operates at the cluster level -- across GPU nodes, not just within a single device. Pages are loaded into and evicted from KV caches based on access patterns, with a dynamically-updated page attention graph that captures which pages answer queries from which other pages.
This is not a metaphor. Colony implements actual page fault semantics, working set tracking, and cache-aware scheduling -- the same fundamental mechanisms that made virtual memory one of the most successful abstractions in computing history. The difference is that "physical memory" is GPU KV cache capacity distributed across a cluster, and "addresses" are semantic page identifiers rather than integers.
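A minimal sketch of the LRU page-replacement mechanics under these semantics (the `VirtualContextManager` class here is an illustration with hypothetical names, not Colony's implementation):

```python
from collections import OrderedDict

class VirtualContextManager:
    # Sketch: LRU page replacement over a fixed KV-cache capacity.
    # Page identifiers are semantic, not integer addresses.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()   # the "working set": page_id -> tokens
        self.backing = {}               # cluster-wide backing store
        self.faults = 0

    def store(self, page_id: str, tokens):
        self.backing[page_id] = tokens

    def access(self, page_id: str):
        if page_id in self.resident:
            self.resident.move_to_end(page_id)   # LRU bookkeeping on a hit
            return self.resident[page_id]
        self.faults += 1                          # page fault: load from backing store
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict the least recently used page
        self.resident[page_id] = self.backing[page_id]
        return self.resident[page_id]
```

A prefetcher guided by the page attention graph would call `access` speculatively for pages likely to be needed next; the eviction policy and fault accounting stay the same.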
Inference with Suffix
Since LLMs are causal models, KV caches stay valid (and, hence, useful) only if the cached sequence is a prefix of the current input. Colony's infer_with_suffix() API lets agents specify a suffix to append after the cached page content, enabling flexible reuse of cached tokens across different reasoning steps. The page contents themselves are prefixed with a system message that provides context about the page's origin and relevance, and that explains how to reason over sharded, paged context.
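A rough sketch of suffix-based cache reuse; `StubLLM` and its methods are hypothetical stand-ins, and the real `infer_with_suffix()` signature may differ:

```python
class StubLLM:
    # Hypothetical stand-in for an LLM server with KV-cache reuse.
    def __init__(self):
        self.kv_cache = {}  # page_id -> tokens whose KV entries are resident

    def load_page(self, page_id: str, text: str):
        self.kv_cache[page_id] = text.split()

    def infer_with_suffix(self, page_ids, suffix: str):
        # Cached pages form the prefix; the suffix is appended strictly after
        # them, so the cached KV entries stay valid for this causal model.
        prefix = [tok for pid in page_ids for tok in self.kv_cache[pid]]
        prompt = prefix + suffix.split()
        return len(prefix), prompt   # tokens reused from cache, full input
```

The key invariant is ordering: any tokens the agent adds must follow the cached content, never precede or interleave with it, or the cache entries would be invalidated.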
A single LLM has a finite context window. Even with million-token windows, many tasks involve context that exceeds what one model instance can hold. And for tasks requiring dense reasoning, the quality of reasoning degrades well before the context window is technically exhausted.
Colony's answer: distributed reasoning over extremely long context. Multiple agents, each with their own context window, collectively reason over a corpus that no single agent could process. A cache-aware scheduler ensures that agents are routed to nodes where their required pages are already loaded, and a page attention graph guides which pages to load next.
This produces reasoning over unbounded-length context: relationships with arbitrary degree in the knowledge hypergraph, discovered by agents that coordinate their exploration.
Agent-Page (Soft) Affinity
When Colony is deployed with self-hosted LLMs, an agent can be bound to specific pages to optimize context loading and reduce latency. The agent is then constrained to run on the Ray node whose LLM instance holds those pages in its KV cache.
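A sketch of the soft-affinity routing decision, with hypothetical names and a deliberately simple overlap heuristic:

```python
def route_agent(agent_pages: set, node_caches: dict) -> str:
    # Soft affinity: prefer the node whose KV cache already holds the most
    # of the agent's required pages, so fewer pages must be loaded on arrival.
    return max(node_caches, key=lambda node: len(agent_pages & node_caches[node]))

node_caches = {
    "node-a": {"p1", "p2"},
    "node-b": {"p2", "p3", "p4"},
}
assert route_agent({"p3", "p4"}, node_caches) == "node-b"
```

A production scheduler would also weigh load and fault cost, but the cache-overlap signal is the core of cache-aware scheduling.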
The Payoff: Amortized Efficiency¶
The initial cost of reasoning over all pages is high: \(O(N^2)\) for routing queries among \(N\) pages. But as the page attention graph stabilizes over successive reasoning rounds, the amortized cost per round drops to \(O(N \log N)\). Deep reasoning tasks inherently require many rounds, so the graph has time to stabilize and the amortized cost dominates. This is especially true for research tasks, where:

- The context corpus (git repos, docs, KBs) changes slowly (e.g., cumulative knowledge).
- The number of reasoning rounds can be very large.
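For concreteness (taking log base 2 purely for illustration), with \(N = 1000\) pages:

```python
import math

N = 1_000                         # number of context pages
initial = N * N                   # early rounds: all-pairs query routing, O(N^2)
amortized = N * math.log2(N)      # after the page attention graph stabilizes, O(N log N)
assert initial / amortized > 100  # per-round cost drops by about two orders of magnitude
```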
Amortized Efficiency
This is the same insight behind persistent data structures and amortized analysis in algorithms: pay a high upfront cost to build structure that makes all subsequent operations cheaper. Colony applies this principle to multi-agent reasoning over extremely long context.