Client, Parasite, or Thief: Governing the Agents on Your Site

by lukasz | Jun 7, 2026 | Essays

Field report from the Governance layer of agent-readiness — the business decisions sitting on top of all the technical work, now that most of your traffic isn't human.

Here is the number that ends the argument about whether this matters yet. As of June 2026, Cloudflare measures automated systems generating about 57.5% of all HTTP requests to web content, against 42.5% from humans — the first time machines have crossed that line in the company's records. The main driver isn't training crawlers anymore. It's agentic AI: autonomous programs browsing on behalf of assistants like ChatGPT and Gemini, where a single agent can hit thousands of pages to finish a task a person would do in a handful of clicks.

So the question at the Governance layer isn't hypothetical. Most of what arrives at your site is already not a person. The first five layers of agent-readiness were technical: can the agent read the page, understand the data, find its way in, act, prove who it is. This layer asks something different — not how, but whether, and on what terms. It's the layer of business and policy decisions sitting on top of all the engineering below. And you can have a technically perfect, fully agent-ready site with no governance at all — which is its own kind of decision, usually the worst one.

First, name what's arriving: client, parasite, or thief

You can't set a policy until you can classify the traffic. The cleanest frame is three kinds of agent, by what they give back.

The client acts on behalf of a user who wants what you offer. It buys, books, asks about products, completes a task the person actually wanted done. This is the agent you want to serve as well as possible — for it you build operability, clean product data, structured offerings. A client is revenue with a different user-agent string.

The parasite consumes your content and returns nothing. The training crawler harvesting your articles. The answer engine that summarizes your work for a user who will never visit you. It isn't breaking the law and it isn't malicious — it just takes value without leaving any. This is where the decision is purely commercial: let it in for the brand exposure, block it to protect the asset, or try to charge it.

The thief scrapes paywalled content, hammers your server, performs actions without authorization, routes around your defenses on purpose. Here there's no dilemma — you block, you log, you escalate. The Amazon v. Perplexity case reached federal court precisely because the behavior alleged was this third kind: accessing pages despite explicit refusal and actively working around technical blocks.

The catch that makes governance hard: these three don't wear labels. The same bot can be a client to one site and a parasite to another. A crawler indexing you for a search feature that sends traffic is closer to client; the identical crawler taking your content into a model that never cites you is parasite. Governance starts with your decision about how to classify the traffic arriving at your door — not with a list someone else wrote.

The four moves of governance in practice

Once you can name the traffic, the layer comes down to four decisions.

Access policy: who gets in, to do what. This is robots.txt with directives for named user-agents, but it's also the deeper choice underneath: do you let OpenAI's training crawler (GPTBot) in, admit only the fetcher that acts on a live user request (ChatGPT-User), or block both? Do you allow purchasing agents to act autonomously, or require a human in the loop on every transaction? Refusing to decide defaults you to "everyone, on any terms", which can be a legitimate strategy, but only as a deliberate choice, not as the residue of never having looked.
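To make the access-policy decision concrete, here is a minimal sketch that uses Python's standard urllib.robotparser to check what a given robots.txt tells a named agent. The policy text, the example.com URLs and the agent name SomeShoppingAgent are illustrative assumptions, not a recommendation. Note that this only models what a compliant crawler would conclude: robots.txt is advisory and does nothing against an agent that chooses to ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep GPTBot out entirely, and keep every
# other agent out of /checkout/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /checkout/
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a queryable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)

# What each agent is told about an article page and a checkout page.
gptbot_article = policy.can_fetch("GPTBot", "https://example.com/articles/essay")
shopper_article = policy.can_fetch("SomeShoppingAgent", "https://example.com/articles/essay")
shopper_checkout = policy.can_fetch("SomeShoppingAgent", "https://example.com/checkout/cart")

print(gptbot_article, shopper_article, shopper_checkout)
```

Running a check like this against your live robots.txt is a cheap way to confirm that the file says what you decided, rather than what it happened to say when someone last edited it.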

Monitoring — see it before you judge it. Most site owners have no idea how many agents reach them, from which providers, behaving how. GA4 doesn't show this; it's built for human sessions. Server logs and edge analytics do — but someone has to read them. Monitoring is the foundation the other three moves stand on. Without the data, a policy is just a guess wearing a confident face.
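As a sketch of what "reading the logs" can mean at its simplest, the snippet below counts requests per declared agent. It assumes the combined log format (user-agent as the last quoted field), a hypothetical watch list of user-agent tokens, and invented sample lines. User-agent strings are self-declared and can be spoofed, so treat the counts as a first look, not as proof of identity.

```python
import re
from collections import Counter

# Assumption: combined log format, where the user-agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

# Hypothetical watch list: substrings looked for in the user-agent string.
WATCHED_TOKENS = ("GPTBot", "ChatGPT-User", "PerplexityBot")


def agent_label(log_line: str) -> str:
    """Return the first watched token found in the line's user-agent, else 'other'."""
    match = USER_AGENT_FIELD.search(log_line)
    if match is None:
        return "unparsed"
    user_agent = match.group(1).lower()
    for token in WATCHED_TOKENS:
        if token.lower() in user_agent:
            return token
    return "other"


def count_agents(log_lines) -> Counter:
    """Tally requests per agent label."""
    return Counter(agent_label(line) for line in log_lines)


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [07/Jun/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleFetcher (compatible; GPTBot)"',
    '203.0.113.5 - - [07/Jun/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 734 "-" "ExampleFetcher (compatible; GPTBot)"',
    '198.51.100.7 - - [07/Jun/2026:10:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleFetcher (compatible; PerplexityBot)"',
    '192.0.2.9 - - [07/Jun/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

counts = count_agents(SAMPLE_LOG)
for label, hits in counts.most_common():
    print(label, hits)
```

Even a tally this crude answers the first monitoring question (who is arriving, and how often), which is the input the other three moves need.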

Monetize or protect — what you do with the value agents take. Two clean options and a widening space between them. The open play: let the AI crawlers in, accept that your content feeds models, bank on citations and brand lift over the long term. The closed play: block training crawlers, protect paid content from scraping, look at licensing or pay-per-crawl.

And this is the part that has stopped being an experiment since the dictionary last described it. Cloudflare's pay-per-crawl revived HTTP 402 ("Payment Required", dormant in the spec since the early 1990s) as a live negotiation signal. When a crawler requests a protected page, the server can return a 402 carrying a crawler-price header (a cent a page, say); if the bot agrees, it re-requests with its agreement to pay and gets a 200. Cloudflare acts as merchant of record and settles through Stripe, while Web Bot Auth uses cryptographic signatures so a crawler can't just spoof a friendly name.

The scale is already real: Cloudflare customers send more than a billion 402 responses a day, AI Crawl Control is free on every plan, and new domains now ship with known AI crawlers blocked by default. The honest caveat: pay-per-crawl itself is still gated (closed beta, waitlist or enterprise contract as of early-to-mid 2026), and the economics only make sense at meaningful scale. For a small site, the 402 is a sign of where things are heading more than a revenue line today.
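The 402 exchange can be modeled as a toy handler to show the shape of the negotiation. This is a sketch under stated assumptions, not Cloudflare's implementation: only the crawler-price header name comes from the text; the acceptance header is a deliberately fake placeholder, the "USD 0.01" value format is assumed, and signature verification and settlement are left out entirely.

```python
from dataclasses import dataclass, field

PRICE_HEADER = "crawler-price"              # header name mentioned in the text
ACCEPT_HEADER = "x-example-accepted-price"  # hypothetical placeholder, not a real header
PAGE_PRICE = "USD 0.01"                     # "a cent a page"; the value format is assumed


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)


def handle_crawl(path: str, request_headers: dict, protected_paths: set) -> Response:
    """Decide the status for one crawler request in the toy negotiation."""
    if path not in protected_paths:
        return Response(200)  # unprotected page: serve as usual
    if request_headers.get(ACCEPT_HEADER) == PAGE_PRICE:
        return Response(200)  # crawler accepted the quoted price
    # No acceptance yet (or a different price): quote the price, withhold the page.
    return Response(402, {PRICE_HEADER: PAGE_PRICE})


protected = {"/essays/governance"}

# First request carries no acceptance, so the server quotes a price.
first_try = handle_crawl("/essays/governance", {}, protected)

# The crawler echoes the quoted price back and is served.
second_try = handle_crawl(
    "/essays/governance",
    {ACCEPT_HEADER: first_try.headers[PRICE_HEADER]},
    protected,
)

print(first_try.status, second_try.status)
```

The point of the sketch is the two-step shape: the refusal itself carries the terms, so a crawler that never asked about pricing still learns the price from the 402.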

Regulatory fit — the AI Act, data protection, copyright. The EU AI Act is phasing in obligations on providers and deployers of AI systems. For a site owner the live questions are concrete: do agents acting on your site process users' personal data (data-protection law), do you owe users disclosure when an agent acts for them, can your content be used to train models without consent (copyright). In mid-2026 these don't have settled answers in most jurisdictions. But they will — and the owners who start thinking about them now will be better placed than the ones who start when a regulator or a court forces it.

Why this is the sixth layer, not the first

Governance sits last for a reason, and it's not that it matters least — it's that it depends on everything below. To govern agents sensibly you first need a page an agent can read, data it understands, knowledge of which agents arrive and from where, a grasp of what they can do, and some way to tell whether they are who they claim. Governing traffic you can't read and don't monitor is governing in the dark. The technical layers earn you the right to make the business decision.

There's a closing window here worth naming. Right now, very few sites have any deliberate policy toward agent traffic — not because the topic is unimportant, but because it's new and unobvious. That window closes the moment the first real legal precedents land (the Ninth Circuit hears Amazon v. Perplexity this month). The sites that have a governance posture before that happens will be in a far stronger position than the ones reacting after the fact.

You don't need all four moves at full strength. For most sites, naming the traffic and turning on monitoring already puts you ahead of nearly everyone — the policy and the monetization can follow once you can actually see what's arriving. The first move is the cheapest and the one almost nobody makes: look.

This is the Governance layer of agent-readiness. For the whole map — all six layers, where to start, what's still unsettled — see The Field Guide to Agent-Readiness.
