Observability

Colony provides built-in distributed observability for long-running multi-agent sessions. Both traces (structured span data) and logs (standard Python logging) are durably persisted via Kafka and PostgreSQL, enabling post-mortem debugging even after agents have stopped.

Architecture

[Architecture diagram] Traces and logs flow through two parallel pipelines from agents to durable storage:

  • Traces: Agent / Deployment → AgentTracingFacility → SpanProducer → Kafka topic colony.spans → SpanConsumer (dashboard backend) → PostgreSQL spans table (trace_id, agent_id, tokens, duration) → SpanQueryStore
  • Logs: Python logging.* → KafkaLogHandler → Kafka topic colony.logs → LogConsumer (dashboard backend) → PostgreSQL logs table (session_id, level, message, context) → LogQueryStore

Traces

Traces capture structured execution spans (LLM calls, agent steps, tool invocations) with parent-child relationships, token counts, and timing data. See the AgentTracingFacility for details.

Pipeline: AgentTracingFacility → SpanProducer → Kafka (colony.spans) → SpanConsumer → PostgreSQL spans table → SpanQueryStore → Dashboard Traces tab.
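
As a rough sketch of what moves through this pipeline, the snippet below publishes a span-like record to the colony.spans topic using the kafka-python client. The payload field names follow the architecture overview above; the actual SpanProducer schema and client library are assumptions for illustration.

# Hypothetical sketch of a span payload published to colony.spans.
# The schema and the use of kafka-python are assumptions, not the actual SpanProducer.
import json
import os
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BOOTSTRAP"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

span = {
    "span_id": str(uuid.uuid4()),
    "trace_id": str(uuid.uuid4()),
    "parent_span_id": None,        # parent-child relationships link spans into a trace
    "agent_id": "agent-123",
    "name": "llm.call",
    "tokens": 512,
    "duration_ms": 840,
    "started_at": time.time(),
}
producer.send("colony.spans", span)
producer.flush()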

Logs

Every Python log record emitted under the polymathera.colony namespace is captured, enriched with execution context, and durably stored.

How it works

  1. KafkaLogHandler — A standard Python logging.Handler attached to the polymathera.colony root logger during deployment initialization. It intercepts all log records without requiring changes to existing log calls (a sketch of a handler in this style follows this list).

  2. Context enrichment — Each log record is enriched with the current ExecutionContext (tenant_id, colony_id, session_id, run_id, trace_id) when available. This enables filtering logs by session or correlating them with traces.

  3. Async batching — Log records are queued in-process and flushed to Kafka asynchronously in batches (default: 50 records every 2 seconds). The emit() method never blocks the caller.

  4. LogConsumer — Runs in the dashboard backend container. Reads from the colony.logs Kafka topic and batch-inserts into PostgreSQL.

  5. LogQueryStore — Provides filtered, paginated queries over persisted logs:

       • Filter by session, run, trace, actor class, log level
       • Full-text search in messages (case-insensitive)
       • Time range queries
       • Aggregate statistics (error counts, actor summaries)
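
The following is a minimal sketch of a handler in the style described in steps 1–3. The class name, the ExecutionContext lookup, and the thread-based flush loop are assumptions for illustration; the real KafkaLogHandler may be wired differently.

# Minimal sketch of a KafkaLogHandler-style handler; not the actual implementation.
import json
import logging
import os
import queue
import threading

from kafka import KafkaProducer


class SketchKafkaLogHandler(logging.Handler):
    def __init__(self, topic="colony.logs", batch_size=50, flush_interval=2.0):
        super().__init__()
        self.topic = topic
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._queue = queue.Queue()
        self._producer = KafkaProducer(
            bootstrap_servers=os.environ["KAFKA_BOOTSTRAP"],
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def emit(self, record):
        # Never blocks the caller: build an enriched dict and enqueue it.
        entry = {
            "level": record.levelname,
            "logger_name": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "func_name": record.funcName,
            "line_no": record.lineno,
            "pid": record.process,
        }
        # Enrich with the current ExecutionContext when available; reading it off
        # the record is a placeholder for however Colony actually exposes it.
        ctx = getattr(record, "execution_context", None)
        if ctx is not None:
            entry.update(
                tenant_id=ctx.tenant_id,
                colony_id=ctx.colony_id,
                session_id=ctx.session_id,
                run_id=ctx.run_id,
                trace_id=ctx.trace_id,
            )
        self._queue.put(entry)

    def _flush_loop(self):
        # Flush queued records in batches (default: 50 records every 2 seconds).
        while True:
            batch = []
            try:
                batch.append(self._queue.get(timeout=self.flush_interval))
                while len(batch) < self.batch_size:
                    batch.append(self._queue.get_nowait())
            except queue.Empty:
                pass
            for entry in batch:
                self._producer.send(self.topic, entry)
            if batch:
                self._producer.flush()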

Querying logs

The dashboard exposes persistent log endpoints:

GET /api/v1/logs/persistent?session_id=X&level=WARNING&limit=100
GET /api/v1/logs/persistent?run_id=Y&search=timeout
GET /api/v1/logs/persistent?actor_class=StandaloneAgentDeployment&since=1712000000
GET /api/v1/logs/persistent/stats?session_id=X
GET /api/v1/logs/persistent/actors

These work even after the application stops — logs are durably stored in PostgreSQL as long as the dashboard and database containers are running.
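
As a usage sketch, the snippet below queries two of these endpoints with the requests library; the dashboard address and the exact response shape are assumptions.

# Query the persistent log endpoints; host, port, and response shape are assumed.
import requests

BASE = "http://localhost:8080"  # illustrative dashboard address

resp = requests.get(
    f"{BASE}/api/v1/logs/persistent",
    params={"session_id": "sess-123", "level": "WARNING", "limit": 100},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # inspect the returned records directly

stats = requests.get(
    f"{BASE}/api/v1/logs/persistent/stats",
    params={"session_id": "sess-123"},
    timeout=10,
)
print(stats.json())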

Log record schema

Each log record stored in PostgreSQL contains:

Field         Type          Description
log_id        TEXT          Unique identifier
timestamp     TIMESTAMPTZ   When the log was emitted
level         TEXT          DEBUG, INFO, WARNING, ERROR, CRITICAL
logger_name   TEXT          Python logger name (e.g., polymathera.colony.agents.base)
message       TEXT          Log message
module        TEXT          Python module name
func_name     TEXT          Function that emitted the log
line_no       INTEGER       Source line number
pid           INTEGER       Process ID
actor_class   TEXT          Deployment class name (e.g., StandaloneAgentDeployment)
node_id       TEXT          Ray node ID
tenant_id     TEXT          Tenant ID (from execution context)
colony_id     TEXT          Colony ID (from execution context)
session_id    TEXT          Session ID (from execution context)
run_id        TEXT          Run ID (from execution context)
trace_id      TEXT          Trace ID (for correlation with spans)
exc_info      TEXT          Exception traceback, if present
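
For illustration only, a stored record might look roughly like this (every value below is invented):

example_record = {
    "log_id": "log-0001",
    "timestamp": "2025-04-01T12:34:56.789Z",
    "level": "WARNING",
    "logger_name": "polymathera.colony.agents.base",
    "message": "Tool call timed out after 30s",
    "module": "base",
    "func_name": "run_step",
    "line_no": 214,
    "pid": 87,
    "actor_class": "StandaloneAgentDeployment",
    "node_id": "ray-node-abc",
    "tenant_id": "tenant-1",
    "colony_id": "colony-1",
    "session_id": "sess-123",
    "run_id": "run-456",
    "trace_id": "trace-789",
    "exc_info": None,
}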

Indexes

Logs are indexed for fast queries on common access patterns:

  • (session_id, timestamp DESC) — all logs for a session, newest first
  • (run_id, timestamp DESC) — all logs for a specific run
  • (actor_class, timestamp DESC) — all logs from a deployment type
  • (level, timestamp DESC) — find errors/warnings quickly
  • (trace_id, timestamp DESC) — correlate logs with traces
  • (timestamp DESC) — global time-ordered access
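
A minimal sketch of the corresponding DDL, issued here from Python with psycopg2; the index names, table definition, and connection string are assumptions.

# Illustrative index DDL matching the access patterns above; names and DSN are assumed.
import psycopg2

DDL = [
    "CREATE INDEX IF NOT EXISTS idx_logs_session ON logs (session_id, timestamp DESC)",
    "CREATE INDEX IF NOT EXISTS idx_logs_run     ON logs (run_id, timestamp DESC)",
    "CREATE INDEX IF NOT EXISTS idx_logs_actor   ON logs (actor_class, timestamp DESC)",
    "CREATE INDEX IF NOT EXISTS idx_logs_level   ON logs (level, timestamp DESC)",
    "CREATE INDEX IF NOT EXISTS idx_logs_trace   ON logs (trace_id, timestamp DESC)",
    "CREATE INDEX IF NOT EXISTS idx_logs_ts      ON logs (timestamp DESC)",
]

with psycopg2.connect("postgresql://colony:colony@localhost:5432/colony") as conn:
    with conn.cursor() as cur:
        for stmt in DDL:
            cur.execute(stmt)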

Setup

The log pipeline is automatic. When KAFKA_BOOTSTRAP is set in the environment (which it is in all Docker containers), every deployment attaches the KafkaLogHandler during initialization. No configuration required.

To disable the log pipeline, unset the KAFKA_BOOTSTRAP environment variable.
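
Continuing the handler sketch from above, the attachment logic implied by this behavior might look like the following; the actual wiring inside Colony deployments is not shown here.

# Attach the handler only when KAFKA_BOOTSTRAP is configured (sketch).
import logging
import os

if os.environ.get("KAFKA_BOOTSTRAP"):
    handler = SketchKafkaLogHandler()  # illustrative class defined under "How it works"
    logging.getLogger("polymathera.colony").addHandler(handler)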

Infrastructure requirements

Both traces and logs require:

  • Kafka — Message broker for reliable delivery and replay
  • PostgreSQL — Durable storage and indexed queries
  • Dashboard container — Runs the Kafka consumers that sink to PostgreSQL

All three are included in the default colony-env Docker Compose setup.