
Cloudflare AI crawlers

Capture AI-bot visits and AI-engine referrals at the Cloudflare edge — traffic the browser SDK can't see.

AI crawlers like GPTBot, ClaudeBot and PerplexityBot don't run JavaScript. They hit /robots.txt, sitemaps and raw HTML endpoints, so the TrustData JS SDK never sees them. A Cloudflare Worker deployed on your zone wraps 100% of requests and forwards a small log line to TrustData, which classifies each hit into one of:

  • AI bot visit — user agent matches a known AI crawler
  • AI referral visit — referrer is an AI engine (chat.openai.com, perplexity.ai, claude.ai, …) and the visitor is human
  • Everything else is dropped at ingest — we never store general traffic logs.

Classification runs server-side, so when a new AI bot appears you get coverage from the next hit forward, with no redeploy of the Worker.
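As a rough illustration of the three-way classification described above, here is a minimal sketch. It is not TrustData's implementation: the function name `classify`, the short token and host lists, and the rule that any non-bot user agent counts as human are assumptions made for the example. The real lists are maintained on TrustData's side.

```typescript
// Hypothetical sketch of the classification rule; not TrustData's actual code.
type Classification = "ai_bot" | "ai_referral" | "dropped";

// Illustrative subsets only, taken from the examples on this page.
const AI_BOT_TOKENS: string[] = ["GPTBot", "ClaudeBot", "PerplexityBot"];
const AI_REFERRER_HOSTS: string[] = ["chat.openai.com", "perplexity.ai", "claude.ai"];

function classify(userAgent: string, referer: string): Classification {
  // 1. Bot check: case-insensitive substring match on the user agent.
  const ua = userAgent.toLowerCase();
  if (AI_BOT_TOKENS.some((token) => ua.includes(token.toLowerCase()))) {
    return "ai_bot";
  }

  // 2. Referral check: compare the referrer's hostname against known AI engines.
  let host = "";
  try {
    host = new URL(referer).hostname.toLowerCase();
  } catch {
    // Empty or malformed referrer: treated as "no referrer".
  }
  const fromAiEngine = AI_REFERRER_HOSTS.some(
    (known) => host === known || host.endsWith("." + known)
  );

  // 3. Anything that matches neither rule is dropped.
  return fromAiEngine ? "ai_referral" : "dropped";
}
```

Because the rule lives in one server-side function, extending the lists changes how later hits are classified without touching the deployed Worker.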

How it works

┌──────────────────────┐
│ Your Cloudflare zone │
│  (100% of traffic)   │
└──────────┬───────────┘
           │  every request wrapped by the Worker
           │
           ▼
┌──────────────────────┐      forwards a small log line
│  TrustData Worker    │ ─────────────────────────────►  TrustData
│  ai-bot-collector    │
└──────────────────────┘
                                             │
                                             ▼
                                ┌──────────────────────────┐
                                │  TrustData classifies:   │
                                │  AI bot? AI referral?    │
                                │  Everything else dropped │
                                └────────────┬─────────────┘
                                             │
                                             ▼
                                ┌──────────────────────────┐
                                │  Live Events view        │
                                │  (AI bot / AI referral   │
                                │   badges) + daily roll-  │
                                │   ups in your dashboard  │
                                └──────────────────────────┘

Setup

1. Issue an API key

  1. In your organization settings, open the Integrations tab → Log ingest keys
  2. Pick Cloudflare Worker (or Cloudflare Logpush for Enterprise) as the provider, name the key, and click Issue ingest key
  3. Optionally pin a default attribution ID — the Worker will tag events with this ID when the payload omits one
  4. Copy the key immediately — it is shown only once

Keys use the format td_cf_<random>. Only the prefix and a SHA-256 hash are stored in TrustData; the full key cannot be recovered. If lost, revoke it and generate a new one.
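To show why a lost key cannot be recovered, the sketch below stores only a display prefix and a SHA-256 digest, then verifies a presented key by hashing it again. The function names, the 24-byte random part, and the 10-character stored prefix are assumptions for illustration; the page only states the `td_cf_` format and that a prefix and hash are kept.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface StoredKey {
  prefix: string; // short display prefix, safe to show in a UI
  sha256: string; // hex digest of the full key
}

// Derive the record that would be persisted; the full key itself is not kept.
function toStored(fullKey: string): StoredKey {
  return {
    prefix: fullKey.slice(0, 10), // assumed length, for illustration only
    sha256: createHash("sha256").update(fullKey).digest("hex"),
  };
}

// Hypothetical issuing step: the real random length is not documented.
function issueKey(): { fullKey: string; stored: StoredKey } {
  const fullKey = "td_cf_" + randomBytes(24).toString("hex");
  return { fullKey, stored: toStored(fullKey) };
}

// Verification re-hashes the presented key and compares digests.
function verifyKey(presented: string, stored: StoredKey): boolean {
  const digest = createHash("sha256").update(presented).digest("hex");
  return digest === stored.sha256;
}
```

Since a SHA-256 digest cannot be reversed, the only recovery path is the one described above: revoke the key and issue a new one.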

2. Deploy the Worker

The TrustData AI-bot collector is an open-source Cloudflare Worker you deploy on your own zone. It runs as middleware on every request, clones the response to read size/status, and fires a JSON payload at TrustData. It never alters the response.

# Clone the collector
git clone https://github.com/trstdata/trustdata-integrations
cd trustdata-integrations/cloudflare/ai-bot-collector
npm install

# 1. Edit wrangler.jsonc — replace example.com with your domain in `route`
# 2. Set your attribution ID (Analytics → Properties → your property →
#    Attribution IDs) in the TRUSTDATA_ATTRIBUTION_ID var
# 3. Add your API key as a Cloudflare secret
npx wrangler secret put TRUSTDATA_API_KEY

# 4. Deploy
npx wrangler deploy

Variable                   Type     Purpose
TRUSTDATA_INGEST_URL       var      Pre-filled to https://t.trustdata.tech/v1/logs/cloudflare_worker
TRUSTDATA_ATTRIBUTION_ID   var      Your attribution ID UUID from Analytics → Properties → your property → Attribution IDs
TRUSTDATA_API_KEY          secret   The td_cf_... key from step 1

Logpush (Cloudflare Enterprise) is supported as an alternative to the Worker; see the Logpush endpoint below. Same API key, different payload format.

3. Verify

From any terminal, simulate a GPTBot visit:

# Plain GET, as a crawler would send
curl https://your-domain.com/ \
  -H "User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

Within ~60 seconds the event should appear in the Live Events view for your attribution ID, tagged with an AI bot badge. Navigate via Analytics → Properties → your property → Attribution IDs → your attribution ID → Live Events. If not, see Troubleshooting.


Endpoint reference

Worker ingest

POST https://t.trustdata.tech/v1/logs/cloudflare_worker

Header         Value
X-API-Key      td_cf_...
Content-Type   application/json

Body is a JSON array of log objects. Today the Worker sends one element per request; the array may carry larger batches later without a wire-format change:

[
  {
    "timestamp": 1740000000000,
    "attribution_id": "prop-uuid",
    "host": "example.com",
    "method": "GET",
    "pathname": "/blog/post",
    "query_params": { "utm_source": "chatgpt" },
    "ip": "1.2.3.4",
    "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0; …)",
    "referer": "",
    "status": 200,
    "bytes": 4821,
    "country": "US",
    "asn": 13335
  }
]
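For readers who want to build or validate this payload themselves, the sketch below types the log object and assembles one entry. The field names come from the example payload above; the `WorkerLogEntry`, `RequestFacts`, `ResponseFacts` and `buildLogEntry` names are hypothetical and not part of the published collector.

```typescript
// Field names mirror the example payload; the helper itself is illustrative.
interface WorkerLogEntry {
  timestamp: number; // epoch milliseconds
  attribution_id: string;
  host: string;
  method: string;
  pathname: string;
  query_params: Record<string, string>;
  ip: string;
  user_agent: string;
  referer: string;
  status: number;
  bytes: number;
  country: string;
  asn: number;
}

// Hypothetical input shapes: what a Worker could read from a request/response.
interface RequestFacts {
  url: string;
  method: string;
  ip: string;
  userAgent: string;
  referer: string;
  country: string;
  asn: number;
}
interface ResponseFacts {
  status: number;
  bytes: number;
}

function buildLogEntry(
  req: RequestFacts,
  res: ResponseFacts,
  attributionId: string,
  now: number = Date.now()
): WorkerLogEntry {
  const url = new URL(req.url);
  const query_params: Record<string, string> = {};
  url.searchParams.forEach((value, key) => {
    query_params[key] = value;
  });
  return {
    timestamp: now,
    attribution_id: attributionId,
    host: url.hostname,
    method: req.method,
    pathname: url.pathname,
    query_params,
    ip: req.ip,
    user_agent: req.userAgent,
    referer: req.referer,
    status: res.status,
    bytes: res.bytes,
    country: req.country,
    asn: req.asn,
  };
}

// The request body is always an array, even for a single entry.
function toRequestBody(entries: WorkerLogEntry[]): string {
  return JSON.stringify(entries);
}
```

Wrapping even a single entry in an array keeps the same serializer usable if batching is introduced later.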

Logpush ingest (Enterprise)

POST https://t.trustdata.tech/v1/logs/cloudflare_logpush

Authentication accepts either an X-API-Key header or, for Cloudflare's HTTP destination, which cannot set custom headers directly, the header-injection query parameter:

?header_X-API-Key=td_cf_...&attribution_id=<prop-uuid>
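A small helper can assemble that destination URL so the key and attribution ID are URL-encoded correctly. This is a sketch under assumptions: the function name `logpushDestination` is hypothetical, and only the endpoint and the two query parameters shown above come from this page.

```typescript
// Builds the Logpush HTTP destination URL with the key passed as a
// header-injection query parameter. Helper name is illustrative.
function logpushDestination(apiKey: string, attributionId: string): string {
  const url = new URL("https://t.trustdata.tech/v1/logs/cloudflare_logpush");
  // searchParams.set() percent-encodes values, so unusual characters are safe.
  url.searchParams.set("header_X-API-Key", apiKey);
  url.searchParams.set("attribution_id", attributionId);
  return url.toString();
}
```

Treat the resulting URL as a secret, since it embeds the full API key.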

Body is NDJSON (one log line per newline), using Cloudflare's native field names: EdgeStartTimestamp, ClientRequestHost, ClientRequestURI, ClientRequestUserAgent, ClientRequestReferer, ClientIP, ClientCountry, EdgeResponseStatus, EdgeResponseBytes.
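To make the NDJSON format concrete, the sketch below splits a body into records and maps Cloudflare's field names onto the names used by the Worker payload. The mapping and the `parseNdjson` / `toCommonEntry` helpers are assumptions for illustration, not TrustData's ingest code. The timestamp is passed through untouched because its format depends on how the Logpush job is configured.

```typescript
type RawRecord = Record<string, unknown>;

// Hypothetical normalized shape, reusing the Worker payload's field names.
interface CommonEntry {
  timestamp: unknown; // left as-is: format depends on the Logpush job settings
  host: string;
  pathname: string;
  user_agent: string;
  referer: string;
  ip: string;
  country: string;
  status: number;
  bytes: number;
}

// NDJSON: one JSON object per line; blank lines are skipped.
function parseNdjson(body: string): RawRecord[] {
  return body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as RawRecord);
}

function str(value: unknown): string {
  return typeof value === "string" ? value : "";
}
function num(value: unknown): number {
  return typeof value === "number" ? value : 0;
}

function toCommonEntry(rec: RawRecord): CommonEntry {
  // ClientRequestURI carries path plus query string; keep only the path.
  const uri = str(rec["ClientRequestURI"]);
  const queryStart = uri.indexOf("?");
  return {
    timestamp: rec["EdgeStartTimestamp"],
    host: str(rec["ClientRequestHost"]),
    pathname: queryStart === -1 ? uri : uri.slice(0, queryStart),
    user_agent: str(rec["ClientRequestUserAgent"]),
    referer: str(rec["ClientRequestReferer"]),
    ip: str(rec["ClientIP"]),
    country: str(rec["ClientCountry"]),
    status: num(rec["EdgeResponseStatus"]),
    bytes: num(rec["EdgeResponseBytes"]),
  };
}
```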


What gets captured

Badge         Trigger                                      Example
AI bot        User agent matches a known AI crawler        GPTBot, PerplexityBot, ClaudeBot, Google-Extended, Bytespider, Amazonbot
AI referral   Referrer is an AI engine, visitor is human   perplexity.ai, chat.openai.com, claude.ai
(dropped)     Everything else                              Organic search, direct traffic, social referrals

The full bot and referrer lists are maintained on TrustData's side. When a new AI crawler appears, you get coverage automatically from the next hit forward — no redeploy of the Worker.


Viewing crawler activity

Crawler events surface in the Live Events view for your attribution ID, one row per hit:

  • Rows with an AI bot badge — user agent matched a known AI crawler
  • Rows with an AI referral badge — human visitor arrived from an AI engine
  • Any other event with a small bot tag — traffic the server flagged as non-human from the browser SDK

Live Events shows the last 30 minutes, refreshing every few seconds. Aggregated per-bot counts by day and page surface in your dashboard and refresh daily.


Troubleshooting

401 Unauthorized

  • Check the header is exactly X-API-Key: td_cf_... (no Bearer prefix)
  • Verify the key is still Active under Organization → Integrations → Log ingest keys
  • Confirm no extra whitespace around the token when you ran wrangler secret put

No events appear in the dashboard

  1. Check the Worker is deployed and receiving traffic: npx wrangler tail
  2. Confirm the UA matches a known bot — try a curl with User-Agent: GPTBot/1.0
  3. Verify TRUSTDATA_ATTRIBUTION_ID is a real attribution ID UUID from Analytics → Properties → your property → Attribution IDs
  4. Check that TRUSTDATA_INGEST_URL points to https://t.trustdata.tech/v1/logs/cloudflare_worker (no trailing slash)
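Most of the checks in the two lists above can be run mechanically before deploying. The sketch below is a hypothetical pre-flight helper, not part of the collector: the `checkConfig` name, the message wording, and the assumption that attribution IDs are canonical hyphenated UUIDs are all illustrative.

```typescript
interface CollectorConfig {
  TRUSTDATA_INGEST_URL: string;
  TRUSTDATA_ATTRIBUTION_ID: string;
  TRUSTDATA_API_KEY: string;
}

const EXPECTED_INGEST_URL = "https://t.trustdata.tech/v1/logs/cloudflare_worker";
// Assumes the canonical 8-4-4-4-12 hexadecimal UUID layout.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns a list of human-readable problems; an empty list means "looks fine".
function checkConfig(cfg: CollectorConfig): string[] {
  const problems: string[] = [];

  const key = cfg.TRUSTDATA_API_KEY;
  if (key !== key.trim()) {
    problems.push("API key has leading or trailing whitespace");
  }
  if (key.trim().indexOf("td_cf_") !== 0) {
    problems.push("API key should start with td_cf_ (no Bearer prefix)");
  }

  if (!UUID_RE.test(cfg.TRUSTDATA_ATTRIBUTION_ID)) {
    problems.push("attribution ID is not a UUID");
  }

  const url = cfg.TRUSTDATA_INGEST_URL;
  if (url.endsWith("/")) {
    problems.push("ingest URL has a trailing slash");
  } else if (url !== EXPECTED_INGEST_URL) {
    problems.push("ingest URL differs from " + EXPECTED_INGEST_URL);
  }

  return problems;
}
```

A check like this only catches formatting mistakes. Whether the key is still Active and the Worker is receiving traffic still has to be confirmed in the TrustData UI and with wrangler tail.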

Events arrive in Live Events but aggregate dashboards lag

Live Events is real-time. The aggregated per-day / per-page bot dashboards refresh once a day (around 3:00 AM UTC), so those breakdowns lag by one run. If it's been longer than 24 hours, contact support.

Worker causes latency on the customer response

It shouldn't — the Worker uses ctx.waitUntil() to fire-and-forget the TrustData POST after returning the origin response. If you see added latency, check wrangler tail for an errored forward — transient ingest failures are swallowed, but a bad TRUSTDATA_INGEST_URL can cause DNS resolution delays.
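The fire-and-forget pattern described above can be sketched as follows. This is an assumption-laden illustration, not the collector's source: `respondAndForward` and `WaitUntilContext` are hypothetical names, and the context is reduced to the single `waitUntil` method that a Workers execution context provides.

```typescript
// Minimal stand-in for the Workers execution context: only waitUntil is needed.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// Returns the origin response immediately and schedules the forward in the
// background. Failures of the forward are swallowed so they never reach the visitor.
function respondAndForward<T>(
  originResponse: T,
  ctx: WaitUntilContext,
  forward: () => Promise<unknown>
): T {
  const background = Promise.resolve()
    .then(forward) // also converts a synchronous throw into a rejection
    .catch(() => undefined); // transient ingest failures are ignored
  ctx.waitUntil(background);
  return originResponse;
}

// Inside a Worker's fetch handler this might be used roughly as:
//   const response = await fetch(request);
//   return respondAndForward(response, ctx, () =>
//     fetch(env.TRUSTDATA_INGEST_URL, { method: "POST", headers, body }));
```

The visitor's response is never awaited against the forward, so the forward's outcome cannot change what the visitor receives.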


WebMCP hosting (automatic)

Once deployed, the same Worker also hosts your WebMCP manifest at /.well-known/webmcp.json. AI agents read this file to discover which tools your site exposes (search, add-to-cart, booking, contact…) before loading any page.

You don't need a separate build step — TrustData serves the signed manifest and the Worker caches it on the edge.

How the manifest is served

  1. Agent requests https://<your-zone>/.well-known/webmcp.json
  2. Worker checks Cloudflare KV for a cached copy (webmcp:v1:<attribution_id>, 1-hour TTL)
  3. On cache miss, Worker fetches from TRUSTDATA_MANIFEST_URL + your attribution ID
  4. Response is returned with Content-Type: application/json and Cache-Control: public, max-age=3600
  5. Agent verifies the Ed25519 signature inside the JSON body and calls the declared tools
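Steps 2 to 4 amount to a cache-aside lookup, which might be sketched like this. The code is illustrative only: `serveManifest`, `KvLike` and `ManifestResult` are hypothetical names, the KV interface is reduced to `get` and `put`, and the upstream fetch is injected as a function because the exact way the attribution ID is appended to TRUSTDATA_MANIFEST_URL is not specified here.

```typescript
// Reduced KV interface: just the two calls this sketch needs.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

interface ManifestResult {
  body: string;
  headers: Record<string, string>;
  cacheHit: boolean;
}

const TTL_SECONDS = 3600; // the 1-hour TTL mentioned in step 2

async function serveManifest(
  kv: KvLike,
  attributionId: string,
  fetchManifest: (attributionId: string) => Promise<string> // hypothetical upstream call
): Promise<ManifestResult> {
  const key = `webmcp:v1:${attributionId}`;

  // Step 2: look for a cached copy.
  let body = await kv.get(key);
  const cacheHit = body !== null;

  // Step 3: on a miss, fetch from TrustData and cache for one hour.
  if (body === null) {
    body = await fetchManifest(attributionId);
    await kv.put(key, body, { expirationTtl: TTL_SECONDS });
  }

  // Step 4: headers for the response. In a Worker this result would be
  // wrapped in a Response object.
  return {
    body,
    headers: {
      "Content-Type": "application/json",
      "Cache-Control": `public, max-age=${TTL_SECONDS}`,
    },
    cacheHit,
  };
}
```

Signature verification (step 5) happens on the agent's side, so the Worker only needs to pass the signed JSON body through unchanged.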

Enable it

WebMCP hosting is on by default in wrangler.jsonc:

"vars": {
  "TRUSTDATA_MANIFEST_URL": "https://app.trustdata.tech/api/v1/webmcp",
  "TRUSTDATA_ATTRIBUTION_ID": "<your property UUID>"
},
"kv_namespaces": [
  { "binding": "WEBMCP_CACHE", "id": "<your KV namespace id>" }
]

Create the KV namespace once (any name works — webmcp_cache is the convention):

# Wrangler 3.60+ syntax; older releases use `kv:namespace create`
npx wrangler kv namespace create webmcp_cache

Paste the returned id into wrangler.jsonc and redeploy:

npx wrangler deploy

Declare the tools

In TrustData, go to Settings → Attribution Properties → your property → WebMCP and add 2–3 tools:

  • search_products — input schema: { query: string }
  • add_to_cart — input schema: { sku: string, quantity: number }
  • contact_sales — input schema: { email: string, message: string }

Save. On the next Worker cache miss (at most 1 hour later), agents will see your updated manifest. Click Rotate keys to invalidate every cached signature immediately — useful if you suspect a leaked key.

Disable WebMCP hosting

Remove the kv_namespaces block and set TRUSTDATA_MANIFEST_URL to an empty string. The Worker falls through to your origin for /.well-known/webmcp.json, so any existing static file you serve there keeps working.

