
The NoRAG Paradigm

Colony rejects retrieval-augmented generation (RAG) as the foundation for deep reasoning over extremely long context.

Why Not RAG?

The RAG Problem

RAG activates only sparse subsets of a corpus at a time, missing the dense, cross-cutting connections essential for breakthrough insights. For tasks that require local (sparse) reasoning -- adding type annotations to a codebase, answering factual questions from a knowledge base -- this is fine. But Colony targets a different class of problems: tasks requiring global (systemic, dense) reasoning that synthesizes insights from many disparate parts of the context across many iterative passes.

The NoRAG Advantage

In dense reasoning tasks, breakthroughs are unlocked by new insights synthesized from unpredictable combinations of previously known facts. A retrieval system, by definition, must predict which facts are relevant before the reasoning happens. This creates a chicken-and-egg problem: the most valuable connections are precisely the ones a retrieval model would not predict, because they span distant and seemingly unrelated parts of the context.

The major difference between RAG and NoRAG is what each optimizes for: RAG optimizes for recall of known-relevant information, which makes it well suited to queries whose relevant information is known in advance, while NoRAG optimizes for synthesizing new insights from all available information, as continuous research queries demand. RAG hides context that "seems" irrelevant -- precisely the context where breakthroughs come from.

[Diagram: RAG vs NoRAG pipelines. RAG: query/task → vector similarity search → top-K chunks (relevance predicted before reasoning happens) → LLM reasons over a sparse context window; cross-cutting connections are missed because the most valuable links span seemingly unrelated pages. It optimizes for recall of known-relevant info and hides context that "seems" irrelevant -- precisely the context where breakthroughs come from. NoRAG (Colony): query/task → WorkingSetCapability coordinates the cluster-wide KV cache (virtual pages ≫ KV cache capacity; the page graph guides selection via centrality, BFS traversal, cache-aware scoring, and eviction candidates), with working-set state shared across all agents on the Blackboard at vcm:working_set:{tenant_id} through request_pages(), release_pages(), score_pages(), and identify_eviction_candidates(). Page faults trigger VCM loads/evictions; agents run infer_with_suffix over loaded pages; scope-aware results land on the Blackboard tagged with source_agent, source_pages, and scope; findings are merged, contradictions detected, and syntheses formed across agents; record_query_resolution() strengthens graph edges. This discovers unknown-relevant connections, and as the graph stabilizes across rounds the cost amortizes from O(N²) to O(N log N).]

The Unifying Idea: Deep Research as a Game

Colony reconceptualizes deep research as a game with a large number of possible moves available to agents at every step. One class of moves is combinations of currently known facts that offer the smallest leap to new insights. Because the narrowest leap across the discovery front is often unpredictable, the entire context must remain live rather than filtered through retrieval: breakthroughs emerge from unpredictable connections between distant pieces of information.

A dynamic group of agents iteratively walks a page graph, accumulating state, communicating findings, and coordinating their traversal to maximize KV cache reuse. The page graph itself is built and refined as agents explore, creating a self-improving map of how context relates to itself.
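The self-improving page graph can be sketched as a weighted digraph whose edges are strengthened each time one page helps resolve a query against another. Colony's diagrams mention a NetworkX DiGraph and a `record_query_resolution()` call; the minimal sketch below uses a plain dict in their place, and the class and method bodies are illustrative, not Colony's implementation:

```python
from collections import defaultdict

class PageAttentionGraph:
    """Minimal sketch of a self-improving page graph (a plain dict stands in
    for the NetworkX DiGraph mentioned in Colony's diagrams)."""

    def __init__(self) -> None:
        # edges[src][dst] = accumulated weight of the src -> dst link
        self.edges: dict[str, dict[str, float]] = defaultdict(dict)

    def record_query_resolution(self, src: str, dst: str, weight: float = 1.0) -> None:
        # Strengthen src -> dst each time a query against src was
        # answered using content found on dst.
        self.edges[src][dst] = self.edges[src].get(dst, 0.0) + weight

    def neighbors_by_strength(self, page_id: str) -> list[str]:
        # Candidate pages to visit next, strongest edges first.
        out = self.edges[page_id]
        return sorted(out, key=out.get, reverse=True)

graph = PageAttentionGraph()
graph.record_query_resolution("repo/1", "doc/1")
graph.record_query_resolution("repo/1", "doc/1")
graph.record_query_resolution("repo/1", "repo/2")
print(graph.neighbors_by_strength("repo/1"))  # ['doc/1', 'repo/2']
```

Each resolution makes the corresponding edge more attractive for future traversal, which is how repeated rounds turn an unstructured corpus into a map.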

```mermaid
graph TD
    A((Deep Research Task)) --> B[Build Initial Page Graph]
    B --> PG((Page Graph))
    PG --> MAS[Agent Swarm Traverses Graph] --> |New Connections| PG
    MAS --> CAS[Cache-Aware Scheduling] --> MAS
    MAS --> G[Games: Hypothesis, Negotiation, Coalition, Consensus] --> MAS
    MAS --> M[Memory Architecture] --> MAS
    MAS --> CAP[Capabilities: Page Attention, Reflection, Refinement, Validation, Grounding] --> MAS
    MAS --> F((Insights Synthesized))
```

Colony views deep research as a game where the moves available to agents are combinations of facts that offer the smallest leap to new insights. This framing has concrete architectural consequences:

  • The game state is the full set of live context pages plus accumulated findings
  • A move is a synthesis step that connects facts from different pages into a new insight
  • The strategy is the order and combination in which pages are visited and cross-referenced
  • Winning means reaching the deepest insights that the context can support
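These four elements can be written down directly. The sketch below is a toy data model under the stated framing, not Colony's actual types; the names `Move`, `GameState`, and the example page identifiers are all illustrative:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Move:
    """A synthesis step: connect facts from several pages into a new insight."""
    source_pages: tuple[str, ...]
    insight: str

@dataclass
class GameState:
    """The full set of live context pages plus accumulated findings."""
    live_pages: set[str]
    findings: list[Move] = field(default_factory=list)

    def play(self, move: Move) -> None:
        # A legal move may only draw on live pages; nothing hidden
        # behind a retrieval layer is playable.
        if not set(move.source_pages) <= self.live_pages:
            raise ValueError("move references a page that is not live")
        self.findings.append(move)

state = GameState(live_pages={"repo/1", "doc/2", "kb/1"})
state.play(Move(("repo/1", "kb/1"), "auth middleware duplicates the kb/1 policy"))
print(len(state.findings))  # 1
```

The guard in `play()` encodes the point of the framing: a move over a page the state does not hold is not merely suboptimal, it is unplayable.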

For this game to work, the entire context must remain live. You cannot play chess if most of the board is hidden behind a retrieval layer that only shows you the squares it thinks are relevant.

The Retrieval Trap

Retrieval systems optimize for recall of known-relevant information. Deep reasoning requires discovery of unknown-relevant connections. These are fundamentally different objectives, and optimizing for the first actively harms the second by hiding context that "seems" irrelevant.

Why Not RNNs or State Space Models?

Recurrent neural networks (RNNs) and state space models (SSMs) like Mamba offer an alternative to transformers for processing long sequences: they compress context into a fixed-size hidden state. This sounds efficient, but it has a fatal flaw for deep reasoning.

Once an RNN or SSM decides to forget some context, it cannot recover it. The compression is irreversible. Information that seemed unimportant early in the sequence may turn out to be critical ten reasoning steps later, and there is no mechanism to retrieve it.

LLMs with external memory (Colony's architecture) can always retrieve forgotten context from external storage -- blackboard state, VCM pages, agent findings -- when the reasoning process discovers it is needed. This is the same advantage that random-access memory has over streaming tape: you can go back.

Irreversible Forgetting

This is not a limitation that better training will fix. It is a structural property of recurrent architectures. The hidden state has finite capacity, and any compression scheme must discard information. Deep reasoning over extremely long context requires that nothing be permanently discarded until the task is complete.
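The irreversibility can be seen in a toy example. Treat the hidden state as a single exponential moving average, a stand-in for any fixed-size recurrent state; the function names and numbers below are purely illustrative:

```python
def rnn_step(state: float, token: float, decay: float = 0.5) -> float:
    """Toy fixed-size 'hidden state': an exponential moving average.
    Any such compression is lossy, so distinct histories can collapse."""
    return decay * state + (1 - decay) * token

def run(tokens: list[float]) -> float:
    state = 0.0
    for t in tokens:
        state = rnn_step(state, t)
    return state

# Two different contexts end in the identical state; once collapsed,
# no later reasoning step can tell them apart or recover the difference.
print(run([4.0, 0.0, 2.0]), run([0.0, 2.0, 2.0]))  # 1.5 1.5
```

An external store avoids this by construction: the original pages remain addressable, so the "hidden state" never has to be the only copy of anything.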

Virtual Memory for LLMs

Merge this section with architecture/virtual-context-memory.md

If you cannot retrieve-and-forget, you need a system that can manage context at the scale of billions of tokens. Colony's answer is to treat KV cache management like an operating system treats virtual memory.

| OS Virtual Memory | Colony VCM |
| --- | --- |
| Virtual address space | Virtual context pages |
| Physical RAM | GPU KV cache capacity |
| Page tables | Global page table |
| Page faults | Page faults |
| Working set | Active pages in the KV caches of all LLM instances |
| Page replacement (LRU, etc.) | Page replacement (LRU, etc.) |
| Prefetching | Speculative page loading from the page graph |

Context is partitioned into pages and managed through a Virtual Context Manager (VCM) that operates at the cluster level -- across GPU nodes, not just within a single device. Pages are loaded into and evicted from KV caches based on access patterns, with a dynamically-updated page attention graph that captures which pages answer queries from which other pages.
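The fault/evict loop can be sketched in a few lines, assuming LRU replacement and an integer slot count standing in for GPU KV-cache capacity. The class and attribute names here are illustrative, not Colony's API:

```python
from collections import OrderedDict

class VirtualContextManager:
    """Sketch of VCM page-fault semantics with LRU replacement.
    `capacity` models KV-cache slots on a node; page ids are semantic."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.loaded: OrderedDict[str, bool] = OrderedDict()  # resident pages
        self.faults = 0

    def access(self, page_id: str) -> None:
        if page_id in self.loaded:
            self.loaded.move_to_end(page_id)              # hit: refresh recency
            return
        self.faults += 1                                  # page fault
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)               # evict the LRU page
        self.loaded[page_id] = True                       # load into KV cache

vcm = VirtualContextManager(capacity=2)
for page in ["repo/1", "doc/1", "repo/1", "kb/1", "doc/1"]:
    vcm.access(page)
print(vcm.faults, list(vcm.loaded))  # 4 ['kb/1', 'doc/1']
```

Everything else in the table above (page table, working set, prefetching) layers on top of this basic access path.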

[Diagram: the virtual context space (N pages such as repo/*, doc/*, kb/*, with N ≫ KV cache capacity) sits above the page attention graph (NetworkX DiGraph; edges from centrality, BFS, and discovered dependencies, weights updated by record_query_resolution()) and the cluster-wide Blackboard working-set state (vcm:working_set:{tenant_id}, results:partial:{tenant_id}:*), where all agents coordinate. The VCM's page table, cache scheduling, page fault handler, and eviction/prefetch logic load pages into and evict them from the KV cache cluster, whose GPU nodes each hold only a subset of the virtual pages -- the current working set. Agents issue infer_with_suffix against loaded pages so cached tokens are reused; WorkingSetCapability coordinates which pages are hot, request_pages() triggers VCM loads, and hard/soft affinity keeps related pages co-located.]

This is not a metaphor. Colony implements actual page fault semantics, working set tracking, and cache-aware scheduling -- the same fundamental mechanisms that made virtual memory one of the most successful abstractions in computing history. The difference is that "physical memory" is GPU KV cache capacity distributed across a cluster, and "addresses" are semantic page identifiers rather than integers.

Inference with Suffix

Since LLMs are causal models, a KV cache stays valid (and hence useful) only if the cached sequence is a prefix of the current input. Colony's infer_with_suffix() API lets agents specify a suffix to append to the cached page content, enabling flexible reuse of cached tokens across different reasoning steps. The page contents themselves are prefixed with a system message that explains the page's origin and relevance, and how to reason over sharded, paged context.
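The prefix constraint can be illustrated at the token level. The real infer_with_suffix() operates on KV caches inside the serving engine; this toy function (its signature is illustrative) only shows which tokens are reusable and which need fresh prefill:

```python
def infer_with_suffix(cached_tokens: list[int], prompt_tokens: list[int]):
    """Toy sketch of prefix-cache reuse for a causal model: cached
    keys/values are valid only up to the longest common prefix."""
    reused = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        reused += 1
    suffix = prompt_tokens[reused:]  # only these tokens need fresh prefill
    return reused, suffix

cached = [1, 2, 3, 4]        # tokens of a loaded page (plus system prefix)
prompt = [1, 2, 3, 4, 9, 8]  # same page content + the agent's new suffix
print(infer_with_suffix(cached, prompt))  # (4, [9, 8])
```

Because agents vary only the suffix, the expensive prefill of the page content is paid once per node rather than once per query.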

A single LLM has a finite context window. Even with million-token windows, many tasks involve context that exceeds what one model instance can hold. And for tasks requiring dense reasoning, the quality of reasoning degrades well before the context window is technically exhausted.

Colony's answer: distributed reasoning over extremely long context. Multiple agents, each with their own context window, collectively reason over a corpus that no single agent could process. A cache-aware scheduler ensures that agents are routed to nodes where their required pages are already loaded, and a page attention graph guides which pages to load next.
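The routing decision reduces to an overlap score between the pages an agent needs and the pages each node already caches. A minimal sketch, with hypothetical node names and a greedy best-overlap policy standing in for the real scheduler:

```python
def route_agent(required_pages: set[str], node_caches: dict[str, set[str]]) -> str:
    """Sketch of cache-aware routing: send the agent to the node whose
    KV cache already holds the most of its required pages."""
    return max(node_caches, key=lambda node: len(required_pages & node_caches[node]))

node_caches = {
    "gpu-node-1": {"repo/1", "repo/2", "doc/3"},
    "gpu-node-2": {"repo/5", "doc/1", "doc/5"},
}
print(route_agent({"doc/1", "doc/5", "kb/3"}, node_caches))  # gpu-node-2
```

Pages the chosen node is still missing (kb/3 above) then fault in through the VCM rather than forcing a full context rebuild elsewhere.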

This produces reasoning over unbounded-length context: relationships of arbitrary degree in the knowledge hypergraph, discovered by agents that coordinate their exploration.

Agent-Page (Soft) Affinity

When Colony is deployed with self-hosted LLMs, an agent can be bound to specific pages to optimize context loading and reduce latency; the agent is then constrained to run on the Ray node whose LLM instance caches those pages in its KV cache.

The Payoff: Amortized Efficiency

Merge this section with philosophy/cache-awareness.md

Merge this section with design-insights/page-graphs.md

The initial cost of reasoning over all pages is high: \(O(N^2)\) for routing queries among \(N\) pages. But as the page attention graph stabilizes over successive reasoning rounds, the amortized cost per round drops to \(O(N \log N)\). Deep reasoning tasks inherently require many rounds, so the graph has time to stabilize and the amortized cost dominates. This is especially true for research tasks, where:

  • The context corpus (git repos, docs, KBs) changes slowly, since knowledge is cumulative.
  • The number of reasoning rounds can be very large.
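A back-of-envelope comparison makes the amortization concrete. The toy cost model below assumes the graph stabilizes after the first round and ignores constant factors, so the numbers are illustrative only:

```python
import math

def total_cost(n_pages: int, rounds: int) -> float:
    """Toy cost model: O(N^2) routing for the first round while the page
    graph is unstructured, then O(N log N) per round once it stabilizes."""
    build = n_pages ** 2
    steady = (rounds - 1) * n_pages * math.log2(n_pages)
    return build + steady

n, rounds = 10_000, 500
naive = rounds * n ** 2            # re-paying O(N^2) every round
amortized = total_cost(n, rounds)
print(f"naive: {naive:.2e}  amortized: {amortized:.2e}")
```

With these (illustrative) parameters the one-time \(N^2\) build dominates the amortized total, yet the whole run still costs orders of magnitude less than re-paying \(N^2\) every round, which is why long-running research tasks favor building the graph.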

Amortized Efficiency

This is the same insight behind persistent data structures and amortized analysis in algorithms: pay a high upfront cost to build structure that makes all subsequent operations cheaper. Colony applies this principle to multi-agent reasoning over extremely long context.