
Building a Web Cartographer for AI Agents


The problem I kept running into was deceptively simple to describe. Every AI agent I built or worked with that needed to use the web operated the same way: call a fetch tool, get back a wall of raw HTML, try to make sense of it. That works once. It does not work when you need an agent to navigate a documentation site, follow a thread across multiple pages, understand that one page is an API reference and another is a conceptual guide, and know that it has already visited the parent section before drilling into the subtopics.

Agents were fetching the web. They were not understanding it. There is a difference.

What I needed was a cartographer. Something that could map a site the way a researcher would — building a mental model of the structure, classifying what each page is, tracking relationships between pages, building a path that reflects intent rather than just link adjacency. I called it Cortex.

What an agent without a map looks like vs. what Cortex gives it:

Without Cortex: random walks, re-fetching pages it already saw, losing context, taking 15 steps to do a 3-step job.

With Cortex: a structured map, classified pages, the optimal path computed before the first step.

The first version was wrong in a specific way

My initial intuition was that the web requires JavaScript execution to be understood properly. A lot of modern sites render nothing useful in a raw HTTP response — they return a shell document that JavaScript populates. So I launched a headless Chromium instance per crawl session and drove it programmatically.

The approach worked in isolation. In a controlled test against a single documentation site, it produced reasonable maps. Then I tried it at any kind of scale.

The browser process leaked memory across sessions. After a dozen site crawls, resident memory was climbing past 2GB. The startup latency for each page was around 800ms — just the browser spin-up and render time, before any actual analysis. If you are building something that agents call in real time, 800ms is a lifetime. And the failure modes were spectacular: pages that triggered infinite scroll, JavaScript that fired alerts, authentication popups that blocked render entirely. A headless browser is a full user-agent. It inherits all the complexity of being a full user-agent.

I shut it down after two weeks. The architecture was wrong. I had solved for the hard case first and assumed it was the general case. It is not.

The v0.1 browser-first era, approximately: Chromium instance #14 of the day, 2.1GB of RAM consumed, 800ms per page just to start, infinite scroll on page 3.

The HTTP-first rewrite

Version 0.2 inverted the assumption. Instead of launching a browser for every page, Cortex tries HTTP-first: make a direct HTTP request, parse the response, and only escalate to a browser-backed render if the page signals that it needs JavaScript execution.

The signals that trigger browser escalation are specific: an empty body with script tags, explicit noscript fallback content indicating the primary content is JS-driven, a body that is a single root div with no readable text. If none of those signals are present, HTTP is enough. For the majority of sites Cortex needs to handle — API documentation, technical blogs, product pages, research sites — HTTP is enough.
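In sketch form, that escalation check looks something like the function below. The three signals are the ones just listed; the exact conditions, thresholds, and helper names here are illustrative, not the shipped code.

```rust
use scraper::{Html, Selector};

// A sketch of the JS-signal check. The three signals are real; the
// precise conditions and the 80-byte threshold are assumptions.
fn needs_browser_render(raw_html: &str) -> bool {
    let doc = Html::parse_document(raw_html);
    let body = Selector::parse("body").unwrap();
    let script = Selector::parse("script").unwrap();
    let noscript = Selector::parse("noscript").unwrap();

    // All readable text under <body>, whitespace-trimmed.
    let body_text: String = doc.select(&body).flat_map(|b| b.text()).collect();
    let body_text = body_text.trim();

    let has_scripts = doc.select(&script).next().is_some();
    let has_noscript = doc.select(&noscript).next().is_some();

    // Signal 1: empty body with script tags.
    if body_text.is_empty() && has_scripts {
        return true;
    }
    // Signal 2: noscript fallback implying the primary content is JS-driven.
    if has_noscript && body_text.len() < 80 {
        return true;
    }
    // Signal 3: body is a single root div with no readable text.
    if let Some(b) = doc.select(&body).next() {
        let element_children = b.children().filter(|c| c.value().is_element()).count();
        if element_children == 1 && body_text.is_empty() {
            return true;
        }
    }
    false
}
```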

The v0.2 layered acquisition strategy:

1. HTTP fetch — direct request, parse the raw response (~20ms). Has readable content? Done.
2. JS signals detected — empty body, noscript fallback, shell div. Escalate.
3. Browser render — headless Chromium, only when required (~250ms).

Result: 3-4x faster average acquisition, with the browser path exercised rarely.

This changed more than latency. Switching to HTTP-first forced a cleaner separation between acquisition and analysis. The acquisition layer fetches a document. The analysis layer figures out what it is. When those are entangled inside a browser session, you end up with analysis logic scattered through the fetch lifecycle. When they are separate, the analysis layer becomes something you can test independently.

The binary format question

Early versions of Cortex passed data between the daemon and client tools as JSON. JSON is readable, easy to debug, and understood by every language. It is also slow to parse at any volume, verbose on the wire, and schema-less in practice — nothing stops a field from disappearing or changing type between versions.

I designed the CTX binary format. It starts with a 4-byte magic sequence (CTX\0 — ASCII CTX followed by a null byte) that identifies the file type unambiguously. After the header come the page nodes. Each node carries a 128-dimensional f32 feature vector that encodes the semantic characteristics of that page. The content itself is LZ4-compressed and follows the node index. A CRC32 checksum covers the whole structure.

CTX file layout — every byte has a reason to be where it is:

    CTX\0                              4 bytes
    HEADER                             fixed
    NODE INDEX (128-dim f32 vectors)   n × 512 bytes per node
    CONTENT [LZ4]                      variable, compressed
    CRC32                              4 bytes

The format enforces structure that JSON defers. You cannot write a CTX file with a missing feature vector.
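To make that enforcement concrete, here is a minimal sketch of the serialization path. The byte layout matches the diagram; the header contents, the field and function names, and the crc32fast crate are assumptions for illustration.

```rust
// Sketch of the CTX layout above. Header contents, names, and the
// crc32fast dependency are illustrative; only the byte layout is real.
const MAGIC: &[u8; 4] = b"CTX\0";

struct PageNode {
    // 128 x f32 = 512 bytes per node; the fixed-size array is what
    // makes a missing feature vector unrepresentable.
    features: [f32; 128],
}

fn write_ctx(nodes: &[PageNode], lz4_content: &[u8]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(MAGIC); // 4-byte magic
    // Fixed header — reduced here to just a node count.
    buf.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    // Node index: feature vectors, in order.
    for node in nodes {
        for f in &node.features {
            buf.extend_from_slice(&f.to_le_bytes());
        }
    }
    // LZ4-compressed content follows the node index.
    buf.extend_from_slice(lz4_content);
    // CRC32 over everything written so far.
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(&buf);
    buf.extend_from_slice(&hasher.finalize().to_le_bytes());
    buf
}
```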

The feature vector was the hard design decision. 128 dimensions is arbitrary in the sense that I could have chosen 64 or 256. What is not arbitrary is what those dimensions represent. The vector has to capture enough information that a Dijkstra pathfinder running over the graph can make meaningful routing decisions without re-fetching pages. It also has to be dense enough that similarity queries across the map produce useful results. The vector layout took six iterations before it was stable enough to commit.

The Rust problem that cost me a week

I built Cortex in Rust. The async runtime is Tokio. HTML parsing uses the scraper crate.

The scraper crate’s HTML tokenizer is not Send. In Tokio’s multi-threaded executor, this is a problem. Send is Rust’s guarantee that a type can be safely moved across thread boundaries. If a type is not Send, the compiler refuses to let you move it to a different thread while it is in use. Tokio spawns tasks across a thread pool. A non-Send tokenizer cannot be moved between tasks.

Day 4 of the Send-safety problem:

    error[E0277]: `Tokenizer` cannot be sent between threads safely
    the trait `Send` is not implemented for `Tokenizer`

Tried and failed: a single-threaded runtime (bottlenecked under load), alternative parsers (missing CSS selector API), Arc wrapping (still not Send at the root).

The solution I landed on was an unsafe AssertSend wrapper — a newtype wrapper around the scraper tokenizer that manually implements Send using an unsafe impl. The wrapper is used only in contexts where I can verify statically that the tokenizer stays on the thread it was created on. That kind of unsafe code is auditable. What you cannot rule out easily is a future change to the concurrency model that invalidates the assumption. Filed it in the known-risks document.
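Stripped to its essentials, the pattern is a few lines. The shape below is the one described above; the method names are hypothetical.

```rust
// The AssertSend pattern in miniature. SAFETY is asserted, not proven:
// the wrapped value must never actually leave its creation thread.
struct AssertSend<T>(T);

// SAFETY: callers guarantee the inner value stays on the thread that
// created it; this impl only silences the compiler's (correct) concern.
unsafe impl<T> Send for AssertSend<T> {}

impl<T> AssertSend<T> {
    // Unsafe constructor: the caller makes the thread-affinity promise.
    unsafe fn new(value: T) -> Self {
        AssertSend(value)
    }

    fn get_mut(&mut self) -> &mut T {
        &mut self.0
    }
}
```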

The classification problem

A web cartographer needs to know what kind of page it is looking at. Cortex has 32 PageTypes: Index, ApiReference, Conceptual, Tutorial, Changelog, Landing, Product, Blog, Forum, and so on. The heuristic classifier went through six iterations before it was stable enough to trust.
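As a data type, the sketch is simple — the nine variants below are the ones named above; the other 23 are elided here.

```rust
// PageType sketch: the variants named in the text, remainder elided.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageType {
    Index,
    ApiReference,
    Conceptual,
    Tutorial,
    Changelog,
    Landing,
    Product,
    Blog,
    Forum,
    // ...the remaining variants
}
```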

Selected classification failures that required dedicated test fixtures:

- Looked like a documentation page; actually a changelog.
- Looked like a navigation hub; actually a landing page.
- Looked like an API reference; actually marketing copy.

Each of these needed a test fixture and a rule adjustment. Six iterations total.

The classifier is a heuristic pipeline, not a trained model. I considered training a classifier on labeled pages but the labeling cost was high and the edge cases were highly site-specific. The heuristic approach is more brittle in theory but more debuggable in practice — when the classification is wrong, I can trace exactly which signal fired incorrectly and adjust the rule.
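For a flavor of what one such rule looks like, here is an illustrative (not actual) changelog signal — the kind of adjustment the first failure in the table above would force:

```rust
// An illustrative heuristic, not Cortex's shipped rule: a page whose
// headings are dominated by versions or dates is a changelog, however
// documentation-like its prose looks.
fn looks_like_changelog(headings: &[&str]) -> bool {
    if headings.is_empty() {
        return false;
    }
    let dated = headings
        .iter()
        .filter(|h| {
            let h = h.trim_start();
            h.starts_with('v')
                || h.chars().next().map_or(false, |c| c.is_ascii_digit())
        })
        .count();
    // Fire when at least half the headings look like version/date stamps.
    dated * 2 >= headings.len()
}
```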

The auto-discovery problem

Cortex runs as a daemon. Clients connect via a Unix domain socket. The question of where that socket lives seems trivial. It is not.

My first implementation assumed a fixed path: /tmp/cortex.sock. That works on standard Linux. It does not work on macOS, where the system sometimes moves the socket location depending on sandbox context. It does not work in Docker containers where /tmp may be read-only or mounted differently. It does not work for users running multiple Cortex instances for different projects.

The cortex plug auto-discovery protocol checks candidate paths in priority order: the path set in CORTEX_SOCKET, then a path derived from the project root hash, then the fixed /tmp/cortex.sock fallback. If none resolve to a live socket, the error message tells you exactly which paths were checked and why each one failed.

cortex plug discovery sequence:

1. $CORTEX_SOCKET — user-set env var, highest priority. Not set?
2. /tmp/cortex-{project_root_hash}.sock — per-project isolation. Not found?
3. /tmp/cortex.sock — fixed fallback. Not found?

error: tried 3 paths, none found a live socket — here is exactly what was checked
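A sketch of that sequence follows. The three candidate paths are the real ones; the liveness check and the error format are approximations.

```rust
use std::os::unix::net::UnixStream;
use std::path::PathBuf;

// Sketch of the discovery sequence above. Candidate paths are real;
// the connection check and error shape are assumptions.
fn discover_socket(project_root_hash: &str) -> Result<PathBuf, String> {
    let mut candidates = Vec::new();
    // 1. User override, highest priority.
    if let Ok(p) = std::env::var("CORTEX_SOCKET") {
        candidates.push(PathBuf::from(p));
    }
    // 2. Per-project socket derived from the project root hash.
    candidates.push(PathBuf::from(format!(
        "/tmp/cortex-{project_root_hash}.sock"
    )));
    // 3. Fixed fallback.
    candidates.push(PathBuf::from("/tmp/cortex.sock"));

    let mut tried = Vec::new();
    for path in candidates {
        // A socket is "live" only if something accepts a connection on it.
        match UnixStream::connect(&path) {
            Ok(_) => return Ok(path),
            Err(e) => tried.push(format!("{}: {}", path.display(), e)),
        }
    }
    // The error lists every path checked and why each one failed.
    Err(format!(
        "no live Cortex socket; checked:\n  {}",
        tried.join("\n  ")
    ))
}
```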

Silent connection failures in a daemon tool destroy debugging sessions. An error message that says “tried these paths, here is why each failed” is worth the extra code.

What is not finished

The k-means clustering step that runs at map-build time is working, but cluster quality degrades sharply on sites with highly irregular link structure. Sites where every page links to every other page through a shared navigation bar produce clusters that are nearly uniform and useless as routing hints. I know what the fix is: weight the edge structure more heavily in the clustering feature vectors. I have not implemented it.

The MCP server exposes 7 tools: map, get, search, pathfind, classify, similar, and plug. The pathfind tool is the most useful and the most fragile. When edge weights are accurate, it finds non-obvious routes through documentation sites that a keyword search would miss. When they are off, it produces routes that technically connect source to destination but do not reflect actual informational structure.

Concurrency under heavy load has not been stress-tested.

What Cortex is actually capable of, at the ceiling

I want to say something about what this tool can do when it is working correctly, because it is easy to get lost in the failure stories and miss the point.

The agents most people have used are good at answering questions. They are not good at navigating. Give a standard agent a task that requires moving through a documentation site and it thrashes — fetches pages at random following link text, loses track of where it has been, re-fetches things it already saw, and takes fifteen steps to do something a researcher would do in three.

That is not an intelligence problem. It is a navigation problem. The agent has no map.

Cortex gives it a map. The agent calls map once and Cortex returns a structured graph of the site: every page classified, every link weighted by navigational significance, clusters of related content grouped, a pathfinder that can answer “what is the most direct route from the landing page to the rate limiting section” in sub-millisecond time. The agent does not need to explore the site. The site is already understood.
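That pathfinder is, at its core, textbook Dijkstra over the weighted page graph. A self-contained sketch, with an illustrative adjacency-list representation standing in for the real map:

```rust
// Dijkstra over a weighted page graph, as the pathfind tool does.
// adjacency[page] = list of (neighbor page id, edge weight).
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn shortest_route(adjacency: &[Vec<(usize, u32)>], src: usize, dst: usize) -> Option<Vec<usize>> {
    let mut dist = vec![u32::MAX; adjacency.len()];
    let mut prev = vec![usize::MAX; adjacency.len()];
    let mut heap = BinaryHeap::new();
    dist[src] = 0;
    heap.push(Reverse((0u32, src)));

    while let Some(Reverse((d, node))) = heap.pop() {
        if node == dst {
            break;
        }
        if d > dist[node] {
            continue; // stale heap entry
        }
        for &(next, weight) in &adjacency[node] {
            let candidate = d + weight;
            if candidate < dist[next] {
                dist[next] = candidate;
                prev[next] = node;
                heap.push(Reverse((candidate, next)));
            }
        }
    }

    if dist[dst] == u32::MAX {
        return None; // unreachable
    }
    // Walk predecessors back from the destination.
    let mut route = vec![dst];
    while *route.last().unwrap() != src {
        route.push(prev[*route.last().unwrap()]);
    }
    route.reverse();
    Some(route)
}
```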

What Cortex gives every agent that reads the map:

- Classified structure — 32 PageTypes, every node labeled before the agent makes a single request.
- Sub-ms pathfinding — Dijkstra over weighted edges, the optimal route computed before the first step.
- Cacheable, shareable — one .ctx file serves every agent; build the map once and the three-year mental model comes free.

The map for a medium-sized documentation site is a few hundred kilobytes: cached, and incrementally updated as the site changes.

At the ceiling, an agent with Cortex running can navigate a documentation site the way a senior engineer who has used it for three years can navigate it. Not because the agent memorized the site. Because it built a real structural understanding of it and can query that understanding faster than it can re-read a single page.

That is what I am building toward. The current version gets you partway there. The path from partway to all the way is clear. It is just implementation.


I will keep writing about this as it develops.
