Cloudflare AI crawlers

Capture AI-bot visits and AI-engine referrals at the Cloudflare edge, traffic the browser SDK can't see.

AI crawlers like GPTBot, ClaudeBot and PerplexityBot don't run JavaScript. They hit /robots.txt, sitemaps and raw HTML endpoints, so the TrustData JS SDK never sees them. A Cloudflare Worker deployed on your zone classifies every request at the edge and forwards a small log line to TrustData for the hits that qualify. Each forwarded hit is one of:

AI bot visit: the user agent matches a known AI crawler
AI referral visit: the referrer is an AI engine (chatgpt.com, perplexity.ai, claude.ai, …) and the visitor is human
Traffic sample: a small anonymized share (~2% by default) of everything else, used only as the denominator for AI-share metrics

No other traffic leaves your zone. The Worker matches against a bot list it syncs from TrustData roughly every 6 hours, so brand-new crawlers are picked up without a redeploy.
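The three-way classification above can be sketched as a small function. This is an illustration only, not the collector's actual code: `classifyRequest`, the shape of `lists`, and the substring and hostname matching rules are all assumptions made for the example.

```javascript
// Hypothetical sketch: the real collector's names and matching rules may differ.
// lists.botPatterns and lists.referrerHosts stand in for the list synced from TrustData.
function classifyRequest(req, lists, sampleRate, random = Math.random) {
  const ua = (req.userAgent || "").toLowerCase();
  if (lists.botPatterns.some((p) => ua.includes(p.toLowerCase()))) {
    return "ai_bot";
  }

  let refHost = "";
  try {
    refHost = new URL(req.referer).hostname.toLowerCase();
  } catch {
    // Missing or malformed referrer: treat it as no referrer.
  }
  const isAiReferrer = lists.referrerHosts.some(
    (h) => refHost === h || refHost.endsWith("." + h)
  );
  // Assumption: "human" here only means the user agent did not match a known bot.
  if (isAiReferrer) return "ai_referral";

  // Everything else: forward a small random share, drop the rest.
  return random() < sampleRate ? "sample" : "drop";
}
```

Passing `random` as a parameter keeps the sampling decision deterministic in tests; a hit classified as `"drop"` is never forwarded.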

Setup

1. Issue an API key

In your organization settings, open the Integrations tab → Log ingest keys
Pick Cloudflare Worker (or Cloudflare Logpush for Enterprise) as the provider, name the key, and click Issue ingest key
Optionally pin a default attribution ID. The Worker will tag events with this ID when the payload omits one
Copy the key immediately, as it is shown only once

Keys use the format td_cf_<random>. Only the prefix and a SHA-256 hash are stored in TrustData; the full key cannot be recovered. If lost, revoke it and generate a new one.

2. Send logs to TrustData

Pick your ingest method. Most zones use the Worker; Cloudflare Enterprise customers can forward the same data over Logpush with no Worker. Both authenticate with the td_cf_... key from step 1.

Worker

The TrustData AI-bot collector is an open-source Cloudflare Worker you deploy on your own zone. It runs as middleware on every request, clones the response to read size/status, and fires a JSON payload at TrustData. It never alters the response.

Deploy it straight from the repository, no terminal needed:

→ TrustData AI-bot collector on GitHub

Click Deploy to Cloudflare in the repo's README. The guided setup runs entirely in your browser: it clones the repo to your account, auto-provisions the KV namespace, prompts for your td_cf_... API key (from step 1) and your attribution ID, then deploys the Worker to your zone.

Required after deploy: add your route, or nothing is captured. The Deploy button can't set up routing (Cloudflare's deploy flow never asks which zone you own), so a freshly deployed Worker is live but sees no traffic: it only answers at its *.workers.dev URL, which crawlers never hit. In the Cloudflare dashboard, open Workers & Pages → trustdata-ai-bot-collector → Settings → Domains & Routes → Add route, pick your zone, and set *.yourdomain.com/* (adjust to your traffic shape). Only then does the Worker wrap your site's traffic and start picking up AI bots.
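When adjusting the route to your traffic shape, it helps to check which hostnames a pattern actually covers. The sketch below is a simplified model of Cloudflare's route matching, assuming only that `*` matches zero or more characters; `routeMatches` is a hypothetical helper, and real routes follow additional rules described in Cloudflare's documentation.

```javascript
// Simplified model of route matching: "*" matches zero or more characters,
// everything else is literal. Illustration only, not Cloudflare's implementation.
function routeMatches(pattern, hostAndPath) {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$").test(hostAndPath);
}

// Under this model, "*.yourdomain.com/*" needs a dot before the domain,
// so it covers subdomains but not the bare apex:
//   routeMatches("*.yourdomain.com/*", "www.yourdomain.com/robots.txt") is true
//   routeMatches("*.yourdomain.com/*", "yourdomain.com/robots.txt") is false
//   routeMatches("*yourdomain.com/*", "yourdomain.com/robots.txt") is true
```

If your site is served from the apex domain, a pattern without the leading dot (or a second route for the apex) may be what you need.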

No GitHub or GitLab account?

The Deploy button forks the repo into a Git account. If you don't have one, build the Worker by hand in the Cloudflare dashboard, with copy-paste only, no terminal, no Git:

Workers & Pages → Create → Start with Hello World → Deploy. Open the new Worker and click Edit code.
Open the collector's bundled code, worker.bundle.js on GitHub (viewing needs no account). Click Copy raw file, paste it over the default code in the editor (replace everything), and click Deploy.
Settings → Variables and Secrets: add each row from the table below. Add TRUSTDATA_API_KEY as type Secret (your td_cf_... key from step 1); add the rest as plaintext Text vars.
Settings → Bindings → Add → KV namespace: create one (any name, e.g. webmcp_cache) and bind it as WEBMCP_CACHE. Optional: skip it to disable WebMCP manifest caching.
Settings → Domains & Routes → Add → Route: set *.yourdomain.com/* so the Worker sees traffic.

Variable	Type	Purpose
`TRUSTDATA_INGEST_URL`	var	Pre-filled to `https://t.trustdata.tech/v1/logs/cloudflare_worker`
`TRUSTDATA_ATTRIBUTION_ID`	var	Your attribution ID UUID from Analytics → Properties → your property → Attribution IDs
`TRUSTDATA_BOTLIST_URL`	var	Pre-filled to `https://t.trustdata.tech/v1/config/ai-bots`. The Worker refreshes its edge bot list from here every ~6 hours, so new crawlers are covered without a redeploy
`TRUSTDATA_SAMPLE_RATE`	var	Share of non-AI traffic forwarded as an anonymized sample. Default `0.02` (2%). Set to `0` to forward AI traffic only
`TRUSTDATA_MANIFEST_URL`	var	Pre-filled to `https://app.trustdata.tech/api/v1/webmcp`. Powers WebMCP hosting (see below); leave empty to disable
`TRUSTDATA_API_KEY`	secret	The `td_cf_...` key from step 1
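Before pasting values into the dashboard, you can sanity-check them locally. The helper below is a hypothetical sketch and not part of the collector; it assumes the attribution ID is a standard hyphenated UUID and that the sample rate is a fraction between 0 and 1, as the table suggests.

```javascript
// Hypothetical pre-flight check for the variables in the table above.
// Returns a list of human-readable problems; an empty list means nothing was flagged.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function checkCollectorConfig(env) {
  const problems = [];

  const url = env.TRUSTDATA_INGEST_URL || "";
  if (!url.startsWith("https://") || url.endsWith("/")) {
    problems.push("TRUSTDATA_INGEST_URL should be an https URL with no trailing slash");
  }

  if (!UUID_RE.test(env.TRUSTDATA_ATTRIBUTION_ID || "")) {
    problems.push("TRUSTDATA_ATTRIBUTION_ID should be a UUID");
  }

  // The documented default is 0.02 when the variable is not set.
  const rate = Number(env.TRUSTDATA_SAMPLE_RATE ?? "0.02");
  if (!Number.isFinite(rate) || rate < 0 || rate > 1) {
    problems.push("TRUSTDATA_SAMPLE_RATE should be a number between 0 and 1");
  }

  return problems;
}
```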

Endpoint

POST https://t.trustdata.tech/v1/logs/cloudflare_worker

Header	Value
`X-API-Key`	`td_cf_...`
`Content-Type`	`application/json`

Body is a JSON array of log objects. Today each array carries one element per request; larger batches may follow later without a wire-format change:

[
  {
    "timestamp": 1740000000000,
    "attribution_id": "prop-uuid",
    "host": "example.com",
    "method": "GET",
    "pathname": "/blog/post",
    "query_params": { "utm_source": "chatgpt" },
    "ip": null,
    "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; …)",
    "referer": "",
    "status": 200,
    "bytes": 4821,
    "country": "US",
    "asn": 13335,
    "verified": true,
    "verified_by": "signature"
  }
]

For a matched AI bot, the Worker withholds the raw IP (ip is sent as null) and reports the edge anti-spoofing verdict in verified / verified_by: "signature" for a Web Bot Auth request signature, or "edge_cidr" for a published IP-range match. Anonymized traffic samples instead carry a sample_rate field and a truncated IP; weight them by 1 / sample_rate to estimate total traffic.
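To make the wire format concrete, here is a minimal sketch of how a request to the Worker endpoint could be assembled from the headers and body described above. `botLogEntry`, `buildIngestRequest` and the shape of the `hit` object are hypothetical names for this example; only the endpoint, header names and field names come from this page.

```javascript
// Hypothetical sketch: maps an observed hit onto the documented field names.
function botLogEntry(hit, attributionId) {
  return {
    timestamp: hit.timestampMs,
    attribution_id: attributionId,
    host: hit.host,
    method: hit.method,
    pathname: hit.pathname,
    query_params: hit.queryParams,
    ip: null, // the raw IP is withheld for matched AI bots
    user_agent: hit.userAgent,
    referer: hit.referer,
    status: hit.status,
    bytes: hit.bytes,
    country: hit.country,
    asn: hit.asn,
    verified: hit.verified,
    verified_by: hit.verifiedBy,
  };
}

// Builds the arguments for fetch(url, init) without sending anything.
function buildIngestRequest(
  entries,
  apiKey,
  ingestUrl = "https://t.trustdata.tech/v1/logs/cloudflare_worker"
) {
  return {
    url: ingestUrl,
    init: {
      method: "POST",
      headers: { "X-API-Key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify(entries), // always an array, even for a single hit
    },
  };
}
```

A caller would pass the result to `fetch(url, init)`; keeping the builder separate from the network call makes the payload easy to inspect.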

Logpush

Cloudflare Enterprise can push edge logs straight to TrustData with no Worker. Create a Logpush job to an HTTP destination pointing at the endpoint below, using Cloudflare's native field names. Same td_cf_... key, different payload format.

Endpoint

POST https://t.trustdata.tech/v1/logs/cloudflare_logpush

Authentication accepts either an X-API-Key header or, for Cloudflare's HTTP destination, where custom headers are set through header_-prefixed query parameters on the destination URL, the header-injection query parameter:

?header_X-API-Key=td_cf_...&attribution_id=<prop-uuid>
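A small helper can assemble the full destination URL and URL-encode both values. `logpushDestination` is a hypothetical name; the endpoint and parameter names are the ones shown above.

```javascript
// Hypothetical helper: builds the Logpush HTTP destination URL with the
// header-injection parameter and the attribution ID, URL-encoding both values.
function logpushDestination(apiKey, attributionId) {
  const base = "https://t.trustdata.tech/v1/logs/cloudflare_logpush";
  const query = [
    "header_X-API-Key=" + encodeURIComponent(apiKey),
    "attribution_id=" + encodeURIComponent(attributionId),
  ].join("&");
  return base + "?" + query;
}
```

Treat the resulting URL as a secret, since it embeds the ingest key.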

Body is NDJSON (one log line per newline), using Cloudflare's native field names: EdgeStartTimestamp, ClientRequestHost, ClientRequestMethod, ClientRequestURI, ClientRequestUserAgent, ClientRequestReferer, ClientIP, ClientCountry, EdgeResponseStatus, EdgeResponseBytes. If your zone has Bot Management, also send VerifiedBotCategory: when it carries an AI_* value Cloudflare has already IP-validated the bot, so TrustData trusts it directly and skips its own check.
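For local testing, an NDJSON body can be split back into records with a few lines of code. The sketch below is illustrative only: `parseNdjson` and `isPreVerifiedAiBot` are hypothetical helpers, and the check simply mirrors the `AI_*` rule described above rather than TrustData's actual server logic.

```javascript
// Splits an NDJSON body into one parsed object per non-empty line.
function parseNdjson(body) {
  return body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Assumption: any VerifiedBotCategory value starting with "AI_" counts as
// an AI bot that Cloudflare has already validated.
function isPreVerifiedAiBot(record) {
  return (
    typeof record.VerifiedBotCategory === "string" &&
    record.VerifiedBotCategory.startsWith("AI_")
  );
}
```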

Optional: filter at the edge to cut volume

A plain "HTTP requests" job forwards every line and lets TrustData classify it. To reduce egress, add a Logpush filter so only AI traffic leaves your zone:

VerifiedBotCategory in [AI_CRAWLER, AI_ASSISTANT, AI_SEARCH]
  or ClientRequestReferer contains "chatgpt"
  or ClientRequestReferer contains "perplexity"
  or ClientRequestReferer contains "claude.ai"
  or ClientRequestReferer contains "gemini"
  or ClientRequestReferer contains "copilot"

This filter is static. A new AI crawler that isn't already in VerifiedBotCategory or this referrer list won't be forwarded, so TrustData can't classify it retroactively. The Worker has no such limit, since it refreshes its bot list from TrustData every few hours. Leave the filter off for full retroactive coverage if you don't mind the extra log volume. Logpush also forwards no traffic sample, so AI-share denominators are Worker-only.
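If you create the job through Cloudflare's API rather than the dashboard, the filter is passed as a JSON string. The sketch below assumes Cloudflare's JSON filter shape (a `where` object with `or` clauses made of `key`, `operator` and `value`); `buildAiFilter` is a hypothetical helper, and you should confirm in Cloudflare's Logpush documentation that your dataset supports filtering on these fields and operators.

```javascript
// Hypothetical helper: expresses the filter above as a JSON filter string.
// Assumes the { where: { or: [...] } } shape with "in" and "contains" operators.
function buildAiFilter(botCategories, referrerNeedles) {
  const clauses = [
    { key: "VerifiedBotCategory", operator: "in", value: botCategories },
    ...referrerNeedles.map((needle) => ({
      key: "ClientRequestReferer",
      operator: "contains",
      value: needle,
    })),
  ];
  return JSON.stringify({ where: { or: clauses } });
}

// Example call using the values from the filter above:
//   buildAiFilter(
//     ["AI_CRAWLER", "AI_ASSISTANT", "AI_SEARCH"],
//     ["chatgpt", "perplexity", "claude.ai", "gemini", "copilot"]
//   );
```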

3. Verify

From any terminal, simulate a GPTBot visit:

curl https://your-domain.com/ \
  -H "User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

Within ~60 seconds the event should appear in the Live Events view for your attribution ID, tagged with an AI bot badge. Navigate via Analytics → Properties → your property → Attribution IDs → your attribution ID → Live Events. If not, see Troubleshooting.

What gets captured

Badge	Trigger	Example
AI bot	User agent matches a known AI crawler	`GPTBot`, `PerplexityBot`, `ClaudeBot`, `Bytespider`, `Amazonbot`, `meta-externalagent`
AI referral	Referrer is an AI engine, visitor is human	`perplexity.ai`, `chatgpt.com`, `claude.ai`
Traffic sample	Anonymized ~2% of everything else (Worker only)	the weighted denominator for AI-share metrics
(dropped)	The remaining non-AI traffic	never leaves your zone
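As a worked example of the weighting, the sketch below estimates an AI share from a count of AI hits plus the sampled rows. It assumes that samples cover only non-AI traffic, so the estimated total is the AI hits plus the weighted samples; `estimateAiShare` is a hypothetical helper, and TrustData's dashboards may compute the metric differently.

```javascript
// Each sample row stands in for 1 / sample_rate requests of non-AI traffic.
// Assumption: total traffic = AI hits + weighted samples.
function estimateAiShare(aiHits, samples) {
  const otherTraffic = samples.reduce((sum, s) => sum + 1 / s.sample_rate, 0);
  const total = aiHits + otherTraffic;
  return total === 0 ? 0 : aiHits / total;
}
```

With 100 AI hits and 18 samples at a rate of 0.02, the samples represent about 900 other requests, giving an AI share of roughly 10%.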

The full bot and referrer lists are maintained on TrustData's side and synced into the Worker every few hours, so a new AI crawler is covered automatically with no redeploy. Over Logpush, coverage is retroactive only for the traffic your filter forwards (see the Logpush tab).

Viewing crawler activity

Crawler events surface in the Live Events view for your attribution ID, one row per hit:

Rows with an AI bot badge, where the user agent matched a known AI crawler
Rows with an AI referral badge, where a human visitor arrived from an AI engine
Rows with a small bot tag: browser-SDK events that the server flagged as non-human traffic

Live Events shows the last 30 minutes, refreshing every few seconds. Aggregated per-bot counts by day and page surface in your dashboard and refresh daily.

Troubleshooting

`401 Unauthorized`

Check the header is exactly X-API-Key: td_cf_... (no Bearer prefix)
Verify the key is still Active under Organization → Integrations → Log ingest keys
Confirm no extra whitespace was added to the key when you entered it in the guided deploy form
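If you send logs from your own code, the three checks above can be folded into one small guard. `normalizeApiKey` and `ingestHeaders` are hypothetical helpers written for this example; the only assumption about the key is the td_cf_ prefix described in step 1.

```javascript
// Strips surrounding whitespace and an accidental "Bearer " prefix,
// then checks the key has the td_cf_ prefix and no inner whitespace.
function normalizeApiKey(raw) {
  const key = String(raw).trim().replace(/^Bearer\s+/i, "");
  if (!/^td_cf_\S+$/.test(key)) {
    throw new Error("Expected a key of the form td_cf_<random>");
  }
  return key;
}

// Headers for the ingest endpoint, with the key sent as X-API-Key.
function ingestHeaders(rawKey) {
  return {
    "X-API-Key": normalizeApiKey(rawKey),
    "Content-Type": "application/json",
  };
}
```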

No events appear in the dashboard

Check the Worker is deployed and receiving traffic via its real-time logs: Workers & Pages → trustdata-ai-bot-collector → Logs → Begin log stream in the Cloudflare dashboard
Confirm the UA matches a known bot by trying a curl with User-Agent: GPTBot/1.0
Verify TRUSTDATA_ATTRIBUTION_ID is a real attribution ID UUID from Analytics → Properties → your property → Attribution IDs
Check that TRUSTDATA_INGEST_URL points to https://t.trustdata.tech/v1/logs/cloudflare_worker (no trailing slash)

Events arrive in Live Events but aggregate dashboards lag

Live Events updates within about a minute of each hit. The aggregated per-day / per-page bot dashboards refresh once a day (around 3:00 AM UTC), so those breakdowns can trail Live Events by up to one daily run. If it's been longer than 24 hours, contact support.

Worker causes latency on the customer response

It shouldn't, because the Worker uses ctx.waitUntil() to fire-and-forget the TrustData POST after returning the origin response. If you see added latency, check the Worker's real-time Logs in the Cloudflare dashboard for an errored forward. Transient ingest failures are swallowed, but a bad TRUSTDATA_INGEST_URL can cause DNS resolution delays.
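The fire-and-forget pattern described above looks roughly like this. It is a minimal sketch rather than the collector's source: `makeCollector`, `fetchImpl` and `forward` are hypothetical names, and the origin fetch and the forwarding function are injected so the flow can be exercised without a network. `ctx.waitUntil()` is the standard Workers call that lets a promise finish after the response has been returned.

```javascript
// In a real Worker, the returned object would be the module's default export
// and fetchImpl would be the global fetch.
function makeCollector(fetchImpl, forward) {
  return {
    async fetch(request, env, ctx) {
      // Get the origin response first; the visitor waits only for this.
      const response = await fetchImpl(request);

      // Schedule the forward without awaiting it. Errors are swallowed so
      // an ingest failure can never affect the visitor.
      ctx.waitUntil(
        Promise.resolve()
          .then(() => forward(request, response, env))
          .catch(() => {})
      );

      return response; // returned untouched
    },
  };
}
```

Because the forward is never awaited in the request path, even a failing ingest call only shows up in the Worker's logs, not in response time.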

WebMCP hosting (automatic)

Once deployed, the same Worker also hosts your WebMCP manifest at /.well-known/webmcp.json. AI agents read this file to discover which tools your site exposes (search, add-to-cart, booking, contact…) before loading any page.

You don't need a separate build step. TrustData serves the signed manifest and the Worker caches it on the edge.

How the manifest is served

Agent requests https://<your-zone>/.well-known/webmcp.json
Worker checks Cloudflare KV for a cached copy (webmcp:v1:<attribution_id>, 1-hour TTL)
On cache miss, Worker fetches from TRUSTDATA_MANIFEST_URL + your attribution ID
Response is returned with Content-Type: application/json and Cache-Control: public, max-age=3600
Agent verifies the Ed25519 signature inside the JSON body and calls the declared tools
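The cache-then-fetch steps above can be sketched as follows. This is an illustration under stated assumptions, not the collector's code: `getManifest` is a hypothetical name, the upstream URL is assumed to be the manifest base followed by `/` and the attribution ID, and the function returns a plain object that a Worker would wrap in a `Response`. The `get` and `put` calls with `expirationTtl` follow the Workers KV binding API.

```javascript
// Cache-aside lookup for the WebMCP manifest.
async function getManifest(env, fetchImpl) {
  const id = env.TRUSTDATA_ATTRIBUTION_ID;
  const cacheKey = "webmcp:v1:" + id;

  // KV returns null when the key is missing or expired.
  let body = await env.WEBMCP_CACHE.get(cacheKey);
  if (body === null) {
    // ASSUMPTION: upstream URL is "<TRUSTDATA_MANIFEST_URL>/<attribution id>".
    const upstream = await fetchImpl(env.TRUSTDATA_MANIFEST_URL + "/" + id);
    if (!upstream.ok) return null; // caller can fall through to the origin
    body = await upstream.text();
    await env.WEBMCP_CACHE.put(cacheKey, body, { expirationTtl: 3600 }); // 1 hour
  }

  return {
    body,
    headers: {
      "Content-Type": "application/json",
      "Cache-Control": "public, max-age=3600",
    },
  };
}
```

Signature verification is left to the agent, as in the last step of the list, so the Worker only needs to pass the JSON body through unchanged.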

Enable it

WebMCP hosting is on by default in wrangler.jsonc:

"vars": {
  "TRUSTDATA_MANIFEST_URL": "https://app.trustdata.tech/api/v1/webmcp",
  "TRUSTDATA_ATTRIBUTION_ID": "<your property UUID>"
},
"kv_namespaces": [
  { "binding": "WEBMCP_CACHE", "id": "<your KV namespace id>" }
]

The Deploy to Cloudflare button provisions this KV namespace automatically and writes its ID into wrangler.jsonc, so there's nothing to create by hand. To add or change the binding later, use Workers & Pages → trustdata-ai-bot-collector → Settings → Bindings in the Cloudflare dashboard.

Declare the tools

In TrustData, go to Settings → Attribution Properties → your property → WebMCP and add 2–3 tools:

search_products, input schema: { query: string }
add_to_cart, input schema: { sku: string, quantity: number }
contact_sales, input schema: { email: string, message: string }

Save. On the next Worker cache miss (at most 1 hour later), agents will see your updated manifest. Click Rotate keys to invalidate every cached signature immediately, which is useful if you suspect a leaked key.

Disable WebMCP hosting

Remove the kv_namespaces block and set TRUSTDATA_MANIFEST_URL to an empty string. The Worker falls through to your origin for /.well-known/webmcp.json, so any existing static file you serve there keeps working.
