We Finally Got to See What We Feed the Machines. Half of It Is an Accident.

by lukasz | Jun 8, 2026 | Essays

For the first time there's crawl-scale data on what structured markup the web actually carries — and it says the machine-readable web is narrower, more automated, and more out-of-date than anyone admits.

We have spent two years arguing about how to make websites legible to machines — what schema to add, what markup earns AI citations, which structured data the models prefer. The whole debate ran on guesswork, because nobody outside Google could see what the web actually carries. As of June 2026, we can. Schema.org and Google published the first public dataset of how structured-data vocabulary is really deployed across the web, drawn from Google's crawl infrastructure, and it will be updated monthly.

It is worth reading, because it quietly demolishes three things people believe about the machine-readable web. The common language is tiny. Most of the markup is an accident. And a lot of what remains is optimized for a feature that no longer exists.

The common language of the machine web is 43 words

The dataset covers 5,545 vocabulary terms — 958 Types (like Product, Person, Article) and 4,587 Properties (like price, name, author). That sounds like a rich language. It isn't.

Only 12 Types appear on more than 10 million domains. Just 31 Properties clear the same bar. That's 43 terms — under 0.8% of the entire vocabulary — carrying essentially all the high-volume structured data on the web. Meanwhile, 76.9% of all terms sit below 1,000 domains. More than three-quarters of the vocabulary that was carefully designed, debated, and standardized is, by the measure of actual use, marginal.
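The arithmetic behind those shares is easy to check. A minimal sketch, using only the figures quoted above; the count of marginal terms is an estimate derived from the rounded 76.9% and is not a published number.

```python
# Figures quoted in the essay.
types_total = 958
properties_total = 4_587
vocabulary_total = types_total + properties_total  # 5,545 terms

common_types = 12        # Types on more than 10 million domains
common_properties = 31   # Properties on more than 10 million domains
common_total = common_types + common_properties  # 43 terms

common_share = common_total / vocabulary_total

# 76.9% of terms sit below 1,000 domains. The exact count is not quoted,
# so this is only an estimate back-computed from the rounded percentage.
marginal_estimate = round(0.769 * vocabulary_total)

print(f"vocabulary: {vocabulary_total} terms")
print(f"common terms: {common_total} ({common_share:.2%} of the vocabulary)")
print(f"terms below 1,000 domains (estimate): {marginal_estimate}")
```

The common share comes out just under 0.8%, and the marginal estimate lands a little above 4,200 terms.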

This is not a gentle curve from popular to rare. It's a cliff. Above the thousand-domain line the numbers thin out fast; below it, thousands of terms sit essentially unused. The machine-readable web has a vocabulary of 43 common words and a dictionary full of words almost nobody speaks.

That matters because of what those 43 words are.

Most structured data isn't a decision — it's a default

Look at the dozen Types on 10 million-plus domains: BreadcrumbList, EntryPoint, ImageObject, ListItem, Organization, Person, PropertyValueSpecification, ReadAction, SearchAction, Thing, WebPage, WebSite.

Most of these don't describe what a page is about. They describe its plumbing. BreadcrumbList, ListItem, WebPage, and WebSite are page architecture, and crucially they're generated automatically by CMS platforms and SEO plugins, not hand-authored by anyone making a deliberate choice. The giveaway sits one tier down, in the 1-to-10-million bucket: WPHeader, WPFooter, WPSideBar. Those are Schema.org's generic WebPageElement subtypes for a page's header, footer, and sidebar. They are not WordPress-specific terms, but in practice they are typically stamped onto pages by themes, and WordPress themes are a major source. A large share of all the structured data on the web is a platform such as WordPress describing its own template furniture, automatically, on millions of sites whose owners never decided anything.

This is the finding that should change how you think about the Data layer of agent-readiness. The instinct — add more schema, the machines are reading — assumes structured data is a signal you send on purpose. The data says most of it is exhaust: emitted by the platform, about the platform, regardless of intent. When an agent reads the structured web, a great deal of what it finds is a theme telling it where the sidebar is.

Which flips the practical advice. The win at the Data layer was never "emit more markup." Almost everyone already emits plenty, most of it accidental. The win is making the deliberate assertions — your Product, your Organization, your Article — true, complete, and consistent with what the page visibly says. Signal, not exhaust. The dataset shows how badly the web confuses the two.
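One way to see the split on a single page is to separate the template types from the deliberate ones and then check a deliberate claim against the visible text. The following is a rough sketch, not a validator. It assumes the markup is JSON-LD only (Microdata and RDFa are ignored), treats the structural types named above as the whole "exhaust" set, and uses a simple name-appears-on-page test as a stand-in for consistency. The audit function, the PageReader class, and the sample page are all hypothetical.

```python
import json
from html.parser import HTMLParser

# Types the essay identifies as page architecture or template markers.
# Treating exactly this set as "exhaust" is an assumption for illustration.
STRUCTURAL_TYPES = {
    "BreadcrumbList", "ListItem", "WebPage", "WebSite",
    "WPHeader", "WPFooter", "WPSideBar",
}


class PageReader(HTMLParser):
    """Collects JSON-LD payloads and visible text from one HTML document."""

    def __init__(self):
        super().__init__()
        self.payloads = []    # raw JSON-LD strings
        self.text = []        # text found outside <script> and <style>
        self._buffer = None   # a list while inside a JSON-LD script
        self._hidden = 0      # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._hidden = max(0, self._hidden - 1)
        if tag == "script" and self._buffer is not None:
            self.payloads.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)
        elif self._hidden == 0:
            self.text.append(data)


def iter_nodes(value):
    """Yields every JSON object nested inside a parsed JSON-LD payload."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def audit(html):
    """Sorts typed nodes into exhaust and signal; flags names missing from the page."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    page_text = " ".join(" ".join(reader.text).split()).lower()

    report = {"exhaust": [], "signal": [], "name_not_on_page": []}
    for payload in reader.payloads:
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than guess at them
        for node in iter_nodes(data):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            for schema_type in (t for t in types if isinstance(t, str)):
                kind = "exhaust" if schema_type in STRUCTURAL_TYPES else "signal"
                report[kind].append(schema_type)
                name = node.get("name")
                if kind == "signal" and isinstance(name, str) and name.lower() not in page_text:
                    report["name_not_on_page"].append((schema_type, name))
    return report


# Hypothetical page: the markup says "Blue Kettle", the heading says "Red Kettle".
SAMPLE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "WebSite", "name": "Example Shop"},
  {"@type": "BreadcrumbList", "itemListElement": [
    {"@type": "ListItem", "position": 1, "name": "Home"}]},
  {"@type": "Product", "name": "Blue Kettle",
   "offers": {"@type": "Offer", "price": "39.00", "priceCurrency": "EUR"}}
]}
</script></head>
<body><h1>Red Kettle</h1><p>Our best-selling kettle.</p></body></html>
"""

report = audit(SAMPLE)
print(report)
```

On the sample page, three of the five typed nodes are template output, and the one deliberate claim with a name disagrees with the heading. A real audit would also compare prices, dates, and authors, but a name mismatch is the cheapest thing to catch.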

And the official framing makes the audience explicit. Schema.org's own documentation says the dataset exists partly so toolmakers can "improve website plugins (like WordPress SEO tools)." The institutions behind the structured web are looking straight at the platform generating most of the noise, and asking it to generate less.

The web is still feeding machines a dead feature

Here's the part that connects to something that happened just last month. FAQPage and Question both sit in the 1-to-10-million-domain bucket — millions of sites carrying FAQ structured data. That markup was adopted at scale for one reason: for years it earned an expandable FAQ panel in Google's search results, a reliable visibility win that every SEO guide recommended.

That feature is gone. Google stopped showing FAQ rich results in May 2026. The panel those millions of sites were marking up for no longer renders. The dataset captures the residue: an enormous installed base of markup optimized for a search feature that no longer exists — and, because the data updates monthly, we will now get to watch whether that adoption starts to decay or just sits there, a fossil layer of the machine-readable web, indefinitely.

This is the strongest possible argument for the one habit that actually survives: every recommendation at the Data layer has an expiry date. Millions of sites did exactly what the guides said, and the guides went stale underneath them. The markup that survives a feature deprecation is the markup that asserts something true about the page regardless of what visual treatment Google offers this quarter. The markup that was only ever chasing a panel becomes dead weight the moment the panel disappears — and now there's a public counter ticking on exactly how much dead weight is out there.
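Watching that counter amounts to comparing one monthly snapshot with the next. Here is a hedged sketch of the comparison. It assumes each snapshot can be reduced to a mapping from term to bucket label. The labels, the function names, and both snapshots are invented for illustration. Only the starting buckets for FAQPage, Question, and Organization come from the text above, and the later snapshot is not a real measurement.

```python
# Bucket labels ordered from least to most adoption. These labels are
# hypothetical; the published dataset may name its ranges differently.
BUCKET_ORDER = ["<1K", "1K-1M", "1M-10M", "10M+"]


def bucket_moves(previous, current):
    """Returns {term: (old_bucket, new_bucket)} for terms whose bucket changed."""
    moves = {}
    for term, old_bucket in previous.items():
        new_bucket = current.get(term)
        if new_bucket is not None and new_bucket != old_bucket:
            moves[term] = (old_bucket, new_bucket)
    return moves


def decaying(moves):
    """Lists terms that dropped to a lower-adoption bucket, alphabetically."""
    rank = {label: position for position, label in enumerate(BUCKET_ORDER)}
    return sorted(
        term for term, (old_bucket, new_bucket) in moves.items()
        if rank[new_bucket] < rank[old_bucket]
    )


# Starting buckets follow the essay; the later snapshot is invented.
EARLIER = {"FAQPage": "1M-10M", "Question": "1M-10M", "Organization": "10M+"}
LATER = {"FAQPage": "1K-1M", "Question": "1M-10M", "Organization": "10M+"}

moves = bucket_moves(EARLIER, LATER)
print(decaying(moves))
```

Because the data is bucketed, this only detects a term crossing a bucket edge. A term could lose most of its domains and still sit in the same range, so a flat reading is weak evidence that nothing changed.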

What the dataset is, and what it isn't

A few honest limits, because they matter and because the temptation to over-read this data is real.

It is the web as Google indexes it, not the whole web. Sites that block Google's crawler in robots.txt don't appear at all, which is a quiet reminder that the Signals layer (who you let crawl you) determines whether you show up in measurements like this one. It counts domains, not pages or objects: a site using Product on 500 pages counts once, so this measures breadth of adoption, not intensity. It uses range buckets, not exact numbers. That is deliberate, both to filter daily crawl noise and to stop anyone reverse-engineering Google's crawl patterns. And it doesn't distinguish JSON-LD from Microdata from RDFa; format is invisible here.
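The difference between counting domains and counting pages, and the effect of bucketing, can be sketched in a few lines. This is an illustration under stated assumptions, not the dataset's actual pipeline. The observations are made up, the function names are hypothetical, and the bucket edges are assumed: the text names only the 1,000-domain line, the 1-to-10-million bucket, and the 10-million-plus bucket.

```python
from collections import defaultdict

# Hypothetical crawl observations: (domain, page path, Schema.org type).
OBSERVATIONS = [
    ("shop.example", "/kettles/1", "Product"),
    ("shop.example", "/kettles/2", "Product"),
    ("shop.example", "/kettles/3", "Product"),
    ("blog.example", "/post", "Article"),
    ("news.example", "/story", "Article"),
]

# Assumed bucket edges as (inclusive lower bound, label), highest first.
# Whether the real edges are inclusive is not stated in the essay.
BUCKET_EDGES = [
    (10_000_000, "10M+"),
    (1_000_000, "1M-10M"),
    (1_000, "1K-1M"),
    (0, "<1K"),
]


def domains_per_type(observations):
    """Counts distinct domains per type, ignoring how many pages each has."""
    seen = defaultdict(set)
    for domain, _page, schema_type in observations:
        seen[schema_type].add(domain)
    return {schema_type: len(domains) for schema_type, domains in seen.items()}


def bucket(domain_count):
    """Maps an exact domain count to its coarse range label."""
    for lower_bound, label in BUCKET_EDGES:
        if domain_count >= lower_bound:
            return label
    raise ValueError("domain_count must be non-negative")


counts = domains_per_type(OBSERVATIONS)
for schema_type, count in counts.items():
    print(schema_type, count, bucket(count))
```

In the sample, Product appears on three pages but one domain, so it counts once, while Article appears on two pages across two domains and counts twice. Both land in the same bucket, which is the point: the published ranges say how widely a term has spread, not how heavily any one site uses it.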

And one thing it is not, despite the timing: it is not Google opening its search data under regulatory pressure. The EU's Digital Markets Act proceedings would force Google to share query and click data — what users search and tap — with rival engines. This dataset is a different animal entirely. It describes how the web labels itself, not how users behave. Reading it as Google's answer to Brussels gets the story wrong. The more interesting truth is simpler: a transparency the structured-data community had requested for over a decade finally shipped, and it happens to arrive in the same month the web crossed over to majority-machine traffic. The timing is the story — not a conspiracy, just a field maturing fast enough that the measurement tools are finally catching up to the thing being measured.

Why this is good news, oddly

It would be easy to read all this as bleak: the machine-readable web is shallow, automated, and partly fossilized. But a measurement you can see beats a guess you can't, every time. For the first time, a site owner deciding what structured data to invest in can look at real adoption instead of inferring from rich-results galleries and blog posts. The 43 common terms are the ones worth getting right. The dead FAQ markup is not worth adding more of. The deliberate assertions (who you are, what you sell, what you published) are worth making true, because they're the signal an agent can actually use, sitting in a sea of automated exhaust.

The machines are reading. We finally got to see what we've been handing them. Most of it, it turns out, we handed them by accident — and the most useful thing you can do now is decide, on purpose, what your own page asserts.

This extends the Data layer of agent-readiness — the principle that an agent trusts your data over your prose, now measured at the scale of the whole web. For the full map of how machines read a site, see The Field Guide to Agent-Readiness.