AI crawlers such as GPTBot, ClaudeBot and PerplexityBot don't run JavaScript. They hit /robots.txt, sitemaps and raw HTML endpoints, so the TrustData JS SDK never sees them. A Cloudflare Worker deployed on your zone wraps 100% of requests and forwards a small log line for each one to TrustData, which classifies each hit as an AI bot, an AI referral, or neither (dropped).
Classification runs server-side, so when a new AI bot appears you get coverage from its next hit onward, with no redeploy of the Worker.
┌──────────────────────┐
│ Your Cloudflare zone │
│ (100% of traffic)    │
└──────────┬───────────┘
           │ every request wrapped by the Worker
           │
           ▼
┌──────────────────────┐   forwards a small log line
│ TrustData Worker     │ ─────────────────────────────► TrustData
│ ai-bot-collector     │
└──────────────────────┘
           │
           ▼
┌──────────────────────────┐
│ TrustData classifies:    │
│ AI bot? AI referral?     │
│ Everything else dropped  │
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│ Live Events view         │
│ (AI bot / AI referral    │
│ badges) + daily roll-ups │
│ in your dashboard        │
└──────────────────────────┘
Keys use the format td_cf_<random>. Only the prefix and a SHA-256 hash are stored in TrustData; the full key cannot be recovered. If lost, revoke it and generate a new one.
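To illustrate why a lost key cannot be recovered, here is a minimal sketch of prefix-plus-hash storage. It is not TrustData's implementation: the `StoredKey` shape, the `PREFIX_LENGTH` value and both function names are assumptions made for the example, and it uses Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record: what a server could keep instead of the raw key.
interface StoredKey {
  prefix: string; // short, non-secret identifier that can be shown in a UI
  sha256: string; // hex digest used to check keys presented later
}

// Assumption: the source only says "the prefix" is stored, not how long it is.
const PREFIX_LENGTH = 10;

function sha256Hex(value: string): string {
  return createHash("sha256").update(value, "utf8").digest("hex");
}

// Derive the stored record from a freshly generated key.
function fingerprintKey(fullKey: string): StoredKey {
  if (!fullKey.startsWith("td_cf_")) {
    throw new Error("expected a key of the form td_cf_<random>");
  }
  return { prefix: fullKey.slice(0, PREFIX_LENGTH), sha256: sha256Hex(fullKey) };
}

// A presented key is accepted when its digest matches the stored digest.
function matchesStoredKey(presented: string, stored: StoredKey): boolean {
  return presented.startsWith("td_cf_") && sha256Hex(presented) === stored.sha256;
}
```

Because SHA-256 is one-way, the stored record is enough to verify a key but not to display it again, which is why the only remedy for a lost key is to revoke it and generate a new one.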
The TrustData AI-bot collector is an open-source Cloudflare Worker you deploy on your own zone. It runs as middleware on every request, clones the response to read size/status, and fires a JSON payload at TrustData. It never alters the response.
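The middleware pattern described above can be sketched roughly as follows. This is an illustration, not the collector's actual source: `Env`, `Ctx`, `Fetcher`, `forwardLog` and `handle` are names invented here, the log entry carries only a subset of fields, and the origin fetch and ingest fetch are injected so the sketch can run outside the Workers runtime.

```typescript
// Minimal local stand-ins for the Workers runtime types (assumed shapes).
interface Env {
  TRUSTDATA_INGEST_URL: string;
  TRUSTDATA_ATTRIBUTION_ID: string;
  TRUSTDATA_API_KEY: string;
}
interface Ctx {
  waitUntil(promise: Promise<unknown>): void;
}
type Fetcher = (input: Request | string, init?: RequestInit) => Promise<Response>;

async function forwardLog(request: Request, response: Response, env: Env, send: Fetcher): Promise<void> {
  // Measure the body on a clone so the response sent to the visitor stays intact.
  const bytes = (await response.clone().arrayBuffer()).byteLength;
  const url = new URL(request.url);
  const entry = {
    timestamp: Date.now(),
    attribution_id: env.TRUSTDATA_ATTRIBUTION_ID,
    host: url.hostname,
    method: request.method,
    pathname: url.pathname,
    user_agent: request.headers.get("user-agent") ?? "",
    referer: request.headers.get("referer") ?? "",
    status: response.status,
    bytes,
  };
  try {
    await send(env.TRUSTDATA_INGEST_URL, {
      method: "POST",
      headers: { "X-API-Key": env.TRUSTDATA_API_KEY, "Content-Type": "application/json" },
      body: JSON.stringify([entry]),
    });
  } catch {
    // Ingest failures are swallowed so they never affect the visitor.
  }
}

async function handle(request: Request, env: Env, ctx: Ctx, origin: Fetcher, send: Fetcher): Promise<Response> {
  const response = await origin(request);
  // Schedule the forward; the runtime keeps it alive after the response is returned.
  ctx.waitUntil(forwardLog(request, response, env, send));
  return response; // returned unmodified
}
```

In a deployed Worker both `origin` and `send` would presumably be the global `fetch`. Passing the forward to `ctx.waitUntil()` rather than awaiting it is what keeps the ingest call off the visitor's critical path.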
# Clone the collector
git clone https://github.com/trstdata/trustdata-integrations
cd trustdata-integrations/cloudflare/ai-bot-collector
npm install
# 1. Edit wrangler.jsonc — replace example.com with your domain in `route`
# 2. Set your attribution ID (Analytics → Properties → your property →
# Attribution IDs) in the TRUSTDATA_ATTRIBUTION_ID var
# 3. Add your API key as a Cloudflare secret
npx wrangler secret put TRUSTDATA_API_KEY
# 4. Deploy
npx wrangler deploy
| Variable | Type | Purpose |
|---|---|---|
| TRUSTDATA_INGEST_URL | var | Pre-filled to https://t.trustdata.tech/v1/logs/cloudflare_worker |
| TRUSTDATA_ATTRIBUTION_ID | var | Your attribution ID UUID from Analytics → Properties → your property → Attribution IDs |
| TRUSTDATA_API_KEY | secret | The td_cf_... key from step 1 |
From any terminal, simulate a GPTBot visit:
curl -X POST \
https://your-domain.com/ \
-H "User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
Within ~60 seconds the event should appear in the Live Events view for your attribution ID, tagged with an AI bot badge. To find it, navigate to Analytics → Properties → your property → Attribution IDs → your attribution ID → Live Events. If the event does not appear, see Troubleshooting.
POST https://t.trustdata.tech/v1/logs/cloudflare_worker
| Header | Value |
|---|---|
| X-API-Key | td_cf_... |
| Content-Type | application/json |
Body is a JSON array of log objects. Today each request carries one element; the array can grow into larger batches without a wire-format change:
[
{
"timestamp": 1740000000000,
"attribution_id": "prop-uuid",
"host": "example.com",
"method": "GET",
"pathname": "/blog/post",
"query_params": { "utm_source": "chatgpt" },
"ip": "1.2.3.4",
"user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; …)",
"referer": "",
"status": 200,
"bytes": 4821,
"country": "US",
"asn": 13335
}
]
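A typed view of this payload makes the batching point concrete. In the sketch below the `LogEntry` field names come from the example above, while the function name `toRequestBodies` and its `maxPerBatch` parameter are hypothetical. The timestamp is assumed to be epoch milliseconds, which is what the example value suggests.

```typescript
// Field names mirror the example payload; types are inferred from its values.
interface LogEntry {
  timestamp: number; // assumed to be epoch milliseconds
  attribution_id: string;
  host: string;
  method: string;
  pathname: string;
  query_params: Record<string, string>;
  ip: string;
  user_agent: string;
  referer: string;
  status: number;
  bytes: number;
  country: string;
  asn: number;
}

// Split entries into request bodies of at most `maxPerBatch` elements.
// Every body is a JSON array, so a batch of one and a batch of many
// share the same wire format.
function toRequestBodies(entries: LogEntry[], maxPerBatch = 1): string[] {
  if (!Number.isInteger(maxPerBatch) || maxPerBatch < 1) {
    throw new RangeError("maxPerBatch must be a positive integer");
  }
  const bodies: string[] = [];
  for (let i = 0; i < entries.length; i += maxPerBatch) {
    bodies.push(JSON.stringify(entries.slice(i, i + maxPerBatch)));
  }
  return bodies;
}
```

With the default of one entry per body this matches today's behaviour; raising `maxPerBatch` changes only the array length, not the shape a receiver has to parse.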
POST https://t.trustdata.tech/v1/logs/cloudflare_logpush
Authentication accepts either an X-API-Key header or, for Cloudflare's HTTP destination which cannot set custom headers, the header-injection query parameter:
?header_X-API-Key=td_cf_...&attribution_id=<prop-uuid>
Body is NDJSON (one log line per newline), using Cloudflare's native field names: EdgeStartTimestamp, ClientRequestHost, ClientRequestURI, ClientRequestUserAgent, ClientRequestReferer, ClientIP, ClientCountry, EdgeResponseStatus, EdgeResponseBytes.
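As a rough illustration of the Logpush variant, the sketch below builds the destination URL with the header-injection parameter and parses an NDJSON body. The helper names `logpushDestination` and `parseNdjson` are invented for this example. `EdgeStartTimestamp` is typed as number or string because its representation depends on the timestamp format configured for the Logpush job.

```typescript
// Field names are Cloudflare's native ones, as listed above.
interface LogpushLine {
  EdgeStartTimestamp: number | string;
  ClientRequestHost: string;
  ClientRequestURI: string;
  ClientRequestUserAgent: string;
  ClientRequestReferer: string;
  ClientIP: string;
  ClientCountry: string;
  EdgeResponseStatus: number;
  EdgeResponseBytes: number;
}

// Build the HTTP destination URL, carrying the API key as a header_* parameter.
function logpushDestination(apiKey: string, attributionId: string): string {
  const url = new URL("https://t.trustdata.tech/v1/logs/cloudflare_logpush");
  url.searchParams.set("header_X-API-Key", apiKey);
  url.searchParams.set("attribution_id", attributionId);
  return url.toString();
}

// Parse an NDJSON body: one JSON object per non-empty line.
function parseNdjson(body: string): LogpushLine[] {
  return body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogpushLine);
}
```

Using `URL.searchParams` rather than string concatenation means any characters in the key or ID that need escaping are percent-encoded automatically.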
| Badge | Trigger | Example |
|---|---|---|
| AI bot | User agent matches a known AI crawler | GPTBot, PerplexityBot, ClaudeBot, Google-Extended, Bytespider, Amazonbot |
| AI referral | Referrer is an AI engine, visitor is human | perplexity.ai, chat.openai.com, claude.ai |
| (dropped) | Everything else | Organic search, direct traffic, social referrals |
The full bot and referrer lists are maintained on TrustData's side. When a new AI crawler appears, you get coverage automatically from the next hit forward — no redeploy of the Worker.
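The classification rules in the table can be approximated with a short function. This is only a sketch: the two lists below are the examples from the table, not TrustData's full server-side lists, and the label strings and the substring and host-suffix matching rules are assumptions.

```typescript
// Hypothetical labels for the three outcomes in the table.
type Classification = "ai_bot" | "ai_referral" | "dropped";

// Illustrative subsets taken from the table; the real lists live on TrustData's side.
const AI_BOT_TOKENS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended", "Bytespider", "Amazonbot"];
const AI_REFERRER_HOSTS = ["perplexity.ai", "chat.openai.com", "claude.ai"];

function classify(userAgent: string, referer: string): Classification {
  // Bots are checked first, so an AI referral always implies a non-bot visitor.
  const ua = userAgent.toLowerCase();
  if (AI_BOT_TOKENS.some((token) => ua.includes(token.toLowerCase()))) {
    return "ai_bot";
  }
  let host = "";
  try {
    host = new URL(referer).hostname.toLowerCase();
  } catch {
    // Empty or malformed referrer: leave host empty.
  }
  // Match the listed host itself or any subdomain of it.
  if (AI_REFERRER_HOSTS.some((h) => host === h || host.endsWith("." + h))) {
    return "ai_referral";
  }
  return "dropped";
}
```

Keeping these lists on the server is what lets a new crawler be recognised without touching the deployed Worker: the Worker only forwards raw user agents and referrers.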
Crawler events surface in the Live Events view for your attribution ID, one row per hit.
Live Events shows the last 30 minutes, refreshing every few seconds. Aggregated per-bot counts by day and page surface in your dashboard and refresh daily.
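To make "per-bot counts by day and page" concrete, here is a minimal sketch of such a roll-up. It does not describe TrustData's pipeline: the `BotHit` shape, the `dailyRollup` name, the composite string key and the choice of UTC day boundaries are all assumptions.

```typescript
// Hypothetical shape of one classified crawler hit.
interface BotHit {
  timestamp: number; // epoch milliseconds
  bot: string;
  pathname: string;
}

// Count hits per (UTC day, bot, page). Keys look like "2025-02-19|GPTBot|/blog/post".
function dailyRollup(hits: BotHit[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const hit of hits) {
    const day = new Date(hit.timestamp).toISOString().slice(0, 10); // YYYY-MM-DD in UTC
    const key = `${day}|${hit.bot}|${hit.pathname}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

A roll-up like this is computed over a completed window, which is consistent with the aggregated views refreshing daily while Live Events shows individual rows.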
401 Unauthorized
- Send the key as X-API-Key: td_cf_... (no Bearer prefix)
- Set the key with wrangler secret put
- Watch the Worker's output with npx wrangler tail
- Test with User-Agent: GPTBot/1.0
- Check that TRUSTDATA_ATTRIBUTION_ID is a real attribution ID UUID from Analytics → Properties → your property → Attribution IDs
- Check that TRUSTDATA_INGEST_URL points to https://t.trustdata.tech/v1/logs/cloudflare_worker (no trailing slash)

Live Events is real-time. The aggregated per-day / per-page bot dashboards refresh once a day (around 3:00 AM UTC), so those breakdowns lag by one run. If it's been longer than 24 hours, contact support.
The Worker should not add latency: it uses ctx.waitUntil() to fire-and-forget the TrustData POST after returning the origin response. If you do see added latency, check wrangler tail for an errored forward. Transient ingest failures are swallowed, but a bad TRUSTDATA_INGEST_URL can cause DNS resolution delays.
Once deployed, the same Worker also hosts your WebMCP manifest at /.well-known/webmcp.json. AI agents read this file to discover which tools your site exposes (search, add-to-cart, booking, contact…) before loading any page.
You don't need a separate build step — TrustData serves the signed manifest and the Worker caches it on the edge.
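1. An agent requests https://<your-zone>/.well-known/webmcp.json
2. The Worker checks its KV cache (key webmcp:v1:<attribution_id>, 1-hour TTL)
3. On a cache miss, it fetches the manifest from TRUSTDATA_MANIFEST_URL + your attribution ID
4. It responds with Content-Type: application/json and Cache-Control: public, max-age=3600

WebMCP hosting is on by default in wrangler.jsonc: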
"vars": {
"TRUSTDATA_MANIFEST_URL": "https://app.trustdata.tech/api/v1/webmcp",
"TRUSTDATA_ATTRIBUTION_ID": "<your property UUID>"
},
"kv_namespaces": [
{ "binding": "WEBMCP_CACHE", "id": "<your KV namespace id>" }
]
Create the KV namespace once (any name works — webmcp_cache is the convention):
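npx wrangler kv:namespace create webmcp_cache
# Newer Wrangler releases use a space instead of the colon:
# npx wrangler kv namespace create webmcp_cache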
Paste the returned id into wrangler.jsonc and redeploy:
npx wrangler deploy
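The edge-caching behaviour that the KV namespace enables can be sketched as a read-through cache. This is an assumption-laden illustration rather than the Worker's real code: `KvLike` is a minimal stand-in for the `WEBMCP_CACHE` binding, `serveManifest` and `manifestCacheKey` are hypothetical names, and the manifest loader is injected because the exact upstream URL layout is not documented here.

```typescript
// Minimal stand-in for a Workers KV binding (only the calls used here).
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

const MANIFEST_TTL_SECONDS = 3600; // 1-hour TTL

function manifestCacheKey(attributionId: string): string {
  return `webmcp:v1:${attributionId}`;
}

async function serveManifest(
  kv: KvLike,
  attributionId: string,
  loadFromTrustData: () => Promise<string>, // hypothetical loader for the signed manifest
): Promise<Response> {
  const key = manifestCacheKey(attributionId);
  let manifest = await kv.get(key);
  if (manifest === null) {
    // Cache miss: load the manifest and store it for the next hour.
    manifest = await loadFromTrustData();
    await kv.put(key, manifest, { expirationTtl: MANIFEST_TTL_SECONDS });
  }
  return new Response(manifest, {
    headers: {
      "Content-Type": "application/json",
      "Cache-Control": "public, max-age=3600",
    },
  });
}
```

With a one-hour expiry on the cached entry, a manifest change made in TrustData becomes visible to agents on the first request after the entry expires.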
In TrustData, go to Settings → Attribution Properties → your property → WebMCP and add 2–3 tools:
- search_products, input schema: { query: string }
- add_to_cart, input schema: { sku: string, quantity: number }
- contact_sales, input schema: { email: string, message: string }

Save. On the next Worker cache miss (at most 1 hour later), agents will see your updated manifest. Click Rotate keys to invalidate every cached signature immediately, which is useful if you suspect a leaked key.
To turn WebMCP hosting off, remove the kv_namespaces block and set TRUSTDATA_MANIFEST_URL to an empty string. The Worker then falls through to your origin for /.well-known/webmcp.json, so any existing static file you serve there keeps working.