Cloudflare Connect on Tour London - Notes
On Wednesday the 15th I attended Cloudflare Connect London, and it was really good. I learnt a lot about their capabilities, and in quite a short morning and early afternoon I had a lot of high-quality conversations. I wanted to write up my thoughts while they are fresh. (Through much of this post I am going to refer to Cloudflare as CF.)
Cloudflare is an interesting company. By taking on raw infrastructure they have landed in a pivotal place for value delivery: literally sitting between the customer and your company, on an increasingly hostile internet. The opening keynote spent some time acknowledging the downtime they had last year, and I appreciated that. (The BMJ offices were unavailable for a few hours because you needed an app to unlock the front door and the app couldn’t be served during the CF downtime, which just shows how integral they have become to an increasing number of digital experiences.)
There was a balance between hoo-ha and deeply impressive engineering. These folks are engineering at scale. As one of their principal engineers said in one session, they have ended up building things to manage core needs at web scale, and often these things have turned out to be wildly useful for other use cases. I really appreciate this.
Sessions were effectively focussed on either AI or bots, or both, and on how CF is providing infrastructure at scale to help with emerging challenges. Speaking of LLMs, they can present security risks, and Cloudflare One is CF’s layer for security. It is an SSO layer through which your staff (devs and regular staff) access models and things like MCP endpoints; traffic is routed through it, and it can be configured with specific security rules. I was impressed. I briefly talked to the product manager for this product. One challenge for this kind of setup: who manages it? At the moment she is seeing a variety of different kinds of teams in different companies managing it.
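To make the gateway idea concrete, here is a minimal sketch of the kind of rule evaluation such a layer performs. This is entirely my own illustration (the rule shape, group names, and hostnames are invented), not the Cloudflare One API: authenticated traffic to a model or MCP endpoint is checked against configured policies before being let through.

```typescript
// Hypothetical gateway rule check - illustrative only, not the Cloudflare One API.
type Rule = {
  group: string;        // staff group the rule applies to, e.g. "devs"
  destination: string;  // destination suffix, e.g. "mcp.internal.example.com"
  action: "allow" | "block";
};

// First matching rule wins; unmatched traffic is blocked by default.
function evaluate(rules: Rule[], group: string, destination: string): "allow" | "block" {
  for (const rule of rules) {
    if (rule.group === group && destination.endsWith(rule.destination)) {
      return rule.action;
    }
  }
  return "block";
}

const rules: Rule[] = [
  { group: "devs", destination: "mcp.internal.example.com", action: "allow" },
  { group: "staff", destination: "chat.example-llm.com", action: "allow" },
];
```

The deny-by-default design is the point: new model endpoints stay invisible to staff until someone deliberately writes a rule for them.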
Speaking of bots, they are seeing that more than 50% of traffic on the web is bot traffic (7% of that is agentic bot traffic). The scale of agents and bots on the web is a big thing (more on this later in the post). CF is doing a lot to try to understand the intent of agents, and to make that insight easier for customers to understand and act on. There are a lot of challenges with agents, and many of them present significant security and commercial risk. They talked about the threat engine: at the moment one receives a threat score of between 1 and 100, but there is not much transparency into what makes up the scoring, and they are going to bring a bit more transparency here. Bot protection is a hard problem; the more sophisticated bots will emerge from clean IP ranges, try credential spraying, and carry out a host of other activities. Some bots are not necessarily bad, but even good bots might have characteristics that you as a publisher don’t want to put up with. Cloudflare is moving to give more transparency on things like the crawl-to-referrer ratio. It used to be that Google would send one visitor for every two crawls of your site. That has gone down a good bit now, and Anthropic has a crawl-to-referrer ratio of 7,000:1.
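To put those ratios side by side, the crawl-to-referrer ratio is just crawls divided by referred visits. A quick sketch using the figures mentioned in the session:

```typescript
// Crawl-to-referrer ratio: how many crawls a bot makes per visitor it sends back.
function crawlToReferrerRatio(crawls: number, referredVisits: number): number {
  if (referredVisits === 0) return Infinity; // crawls your site but never sends anyone
  return crawls / referredVisits;
}

// Figures from the talk: Google historically ~2 crawls per visitor sent,
// Anthropic around 7,000 crawls per visitor sent.
const google = crawlToReferrerRatio(2, 1);       // 2
const anthropic = crawlToReferrerRatio(7000, 1); // 7000
```

The asymmetry is the whole story for publishers: the higher the ratio, the more a crawler consumes without sending traffic back.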
For bots that you like, they have a feature to serve markdown versions of content to bots: https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/ - the idea is to make context more efficient for agents consuming your site. I think scholarly publishing is ahead of most of the rest of the industry when it comes to structured metadata. I had a quick chat with the person who presented this on stage; he broadly agreed, and mentioned that this is a very early version of an idea.
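One way to picture the mechanism is simple content negotiation: an agent signals what it wants and gets the lean representation. This is my own sketch of the idea, not Cloudflare's actual implementation (the real feature is configured on their platform, not hand-rolled):

```typescript
// Illustrative content negotiation: serve markdown to agents that ask for it.
// A sketch of the idea only - not how Cloudflare implements markdown-for-agents.
function pickRepresentation(acceptHeader: string | null): "markdown" | "html" {
  if (acceptHeader !== null && acceptHeader.includes("text/markdown")) {
    return "markdown"; // agent prefers lean, structured text - cheaper context
  }
  return "html"; // default for browsers and everything else
}
```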
For bad bots, CF are working on improving their bot defences - e.g. giving bots a random set of unpredictable responses, so that a bot is unable to learn how to predict what to do on your site. Their new tooling will allow publishers to set specific rules on bots based on a collection of information, and also let you see which bots or agents have been blocked by others.
Three example defences they mentioned were:
- serving dumb versions of your site (not their wording), where CF will serve a generated version of your page’s content that gives the bot the right context, but that is not valuable for AI training
- crawl labyrinths that cost the crawler/bot resources
- poison content, where CF serves a version of your page with deliberately incorrect information that will poison any LLM trained on it
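The "random unpredictable responses" idea above can be pictured as choosing a defence per request so a crawler can never learn a stable pattern. A toy sketch of my own, not CF's actual engine (the threshold and defence names are invented; the 1-100 score is from the talk):

```typescript
// Toy defence picker for suspected bots. Illustrative only - not Cloudflare's
// threat engine. The 50 threshold and defence names are made up for the sketch.
type Defence = "generated-page" | "labyrinth" | "poisoned-content";

const defences: Defence[] = ["generated-page", "labyrinth", "poisoned-content"];

// threatScore: 1-100 as described in the talk; rng is injected for testability.
function pickDefence(threatScore: number, rng: () => number): Defence | "serve-normally" {
  if (threatScore < 50) return "serve-normally";
  // Unpredictable per-request choice: the bot cannot learn which trap it will hit.
  return defences[Math.floor(rng() * defences.length)];
}
```

Injecting the random source keeps the sketch deterministic under test while the production behaviour stays unpredictable.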
Overall their work on improving bot protection is serious and a welcome tool for publishers; I am very much looking forward to getting our hands on these tools.
I went to two of the developer tracks, one on “Project Think” and one on creating a data lake in R2.
In a nutshell, the Project Think talk is about creating the operating system for agents, allowing different layers of composability at scale. It’s built on top of durable objects and WebSockets, and on these primitives they have created a rich SDK. Some new things include voice, but overall the key things are low latency and hot restarts. https://blog.cloudflare.com/project-think/. One thing that is fun is that this infra allows agents to self-create extensions that, while looking sort of ephemeral, can persist. That sounds like a contradiction, but I think that was the message. You can kind of think of it as serverless but with state. A phrase that stuck with me is that this gives you an execution ladder for the LLM, with increasing “agency” as you walk up the ladder. I am not doing this talk justice at all, but I am looking forward to getting some of the Cloudflare team in to meet with my team for a hackathon where we can learn more about this hands-on. One demo that Sunil Pai showed was a CF MCP with two tools that lets the LLM write code to send to the MCP, where the code acts as a query, making the interaction with the MCP far more efficient and giving the developer a much smaller surface area to explore across the thousands of CF API endpoints.
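To illustrate the "code as query" idea in the demo (this is my own toy version, not Sunil's actual MCP - the catalogue entries and function names are invented): instead of exposing thousands of endpoints as individual tools, the server exposes one tool that runs a small caller-supplied filter over the API catalogue.

```typescript
// Toy version of the "code as query" pattern: the model sends a small filter
// program instead of choosing from thousands of individual endpoint tools.
// Illustrative only; a real implementation would sandbox the submitted code.
type Endpoint = { path: string; method: string; description: string };

const catalogue: Endpoint[] = [
  { path: "/zones", method: "GET", description: "List zones" },
  { path: "/zones/:id/dns_records", method: "GET", description: "List DNS records" },
  { path: "/zones/:id/dns_records", method: "POST", description: "Create DNS record" },
];

// The single MCP-style tool: run a caller-supplied predicate over the catalogue.
function queryCatalogue(predicate: (e: Endpoint) => boolean): Endpoint[] {
  return catalogue.filter(predicate);
}

// The model's "query" arrives as code, e.g. everything DNS-related:
const dnsEndpoints = queryCatalogue((e) => e.path.includes("dns"));
```

One tool plus a query language the model already speaks beats thousands of tool definitions stuffed into the context window.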
The second dev session was about creating a data lake on top of R2. (R2 is so named because it is a “cheaper” version of S3. R <– S, 2 <– 3!)
The key thing is that R2 has no data egress fees, whereas S3 and Google Cloud Storage buckets do.
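Back-of-envelope on what that difference means. The per-GB S3 rate below is an assumption for illustration only - real pricing is tiered and changes - but zero egress on R2 is the point:

```typescript
// Rough egress cost comparison. The 0.09 USD/GB figure is an assumed
// illustrative rate, not a quoted price; R2 charges no egress fees at all.
function egressCostUSD(gigabytes: number, perGBRate: number): number {
  return gigabytes * perGBRate;
}

const monthlyEgressGB = 10_000;
const s3Cost = egressCostUSD(monthlyEgressGB, 0.09); // ≈ 900 USD
const r2Cost = egressCostUSD(monthlyEgressGB, 0);    // 0 USD
```

For a data lake, where analytics tools repeatedly pull data back out, egress is often the cost that dominates - which is exactly why building the lake on R2 is interesting.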
The idea is to build an SQL engine on top of data stored in R2 (there is a small blog post here - https://www.cloudflare.com/en-gb/learning/cloud/what-is-a-data-lake/). I got the impression that this is a very good first start, but it’s not GA yet, and it doesn’t have some of the kinds of formal data governance infrastructure that we need: data cleaning, more complex ETL management (we are looking at moving to dbt, and I wasn’t sure if it supports that), and an enterprise-ready data catalog. On the other hand, they have built a highly performant solution whose cost scales well, so it might be one to consider migrating to at some point in the future.
Key takeaways
So it was a great day, and I learnt a lot. I was really impressed. Here are some bullet points to think about.
- BMJ is one of the top 10 most scraped sites across their network, according to them - but we have not seen the data on that yet.
- Cloudflare’s network scale is currently 500TB per second. (In comparison, Google manages 2B transactions per second.)
- They are seeing more than 50% of traffic on the web is bot traffic (7% of that is agentic bot traffic).
- A lot of the architecture is built around durable objects: https://developers.cloudflare.com/durable-objects/. I had not known about these before. It’s a worker that has a UID and storage. Like a Lambda + Dynamo/S3 combo.
- You can run Claude code in a sandbox - https://developers.cloudflare.com/sandbox/tutorials/claude-code/. Which sounds amazing, but it’s actually just npm install in a worker!
- Their Agents ADK is getting a lot of upgrades.
- They have a tool called “Cloudflare Compare” that is meant to show bot providers how much better their bots would be with access to publisher information.
- Other things worth thinking about - serving markdown, routing MCP through a secure Cloudflare wall, service bots inside your estate to do crufty jobs, the agentic platform, how to save state and save money, egress fees, bots damn bots.
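The durable objects bullet above is easier to see with a sketch. In the real API the runtime hands your class a state object with persistent storage; here I stub the storage with an in-memory map so the example is self-contained and runnable anywhere. The stub and the `Counter` class are mine - only the `get`/`put` shape mirrors the real `DurableObjectStorage` methods:

```typescript
// Sketch of the durable object idea: one addressable object = identity + storage.
// StorageStub is an in-memory stand-in for Cloudflare's DurableObjectStorage.
class StorageStub {
  private data = new Map<string, unknown>();
  async get<T>(key: string): Promise<T | undefined> {
    return this.data.get(key) as T | undefined;
  }
  async put(key: string, value: unknown): Promise<void> {
    this.data.set(key, value);
  }
}

// A counter object: each instance has a unique ID and its own storage,
// much like a Lambda fused with its own little database.
class Counter {
  constructor(readonly id: string, private storage: StorageStub) {}

  async increment(): Promise<number> {
    const current = (await this.storage.get<number>("count")) ?? 0;
    await this.storage.put("count", current + 1);
    return current + 1;
  }
}
```

In real durable objects the runtime routes every request for a given ID to the same live instance, which is what makes the "serverless but with state" framing from the Project Think session work.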