Ian Mulvany

September 16, 2025

Some AI reflections from week 37, 2025 - on human credulity.

Last week I had the opportunity to attend an executive roundtable on "securing the agent workforce", and later in the week I had lunch with one of the co-founders of one of the hot AI startups; he is based in the US but was doing a tour of London. The roundtable was organised by HotTopics (https://hottopics.ht/). I'd not been to an event run by them before, but it was really excellent.

Both were effectively under the Chatham House Rule, so I'm not going to name any names, but I wanted to note some reflections from those two events, which shared a common theme.

The main theme for me is credulity. LLMs can be persuasive and determined, and without needing to get into Nick Bostrom "AI is going to destroy humanity" territory, that combination of persuasiveness and determination, pitted against humans who are not half as conscious as I think we think we are, opens a security hole the moment we start to deploy AI into our workflows.

At the roundtable we heard about a few cases where AI agents running inside corporate systems accessed company-sensitive data and shared it with employees. The key thing is that an agent will try different ways to complete its task, ways you might not have thought of, which means the attack surface is not going to be obvious to you.

I also heard about a failure mode that was new to me. One company had put in place a set of evals in which experts check the AI's answers; if the expert gives the green light, then it's OK. At least, that's the theory. What they found was that the AI was getting the right answer (to a financial question), but the answer could not have been inferred from the data the AI had been given access to. The AI had "hallucinated" the correct answer, which it should not have been able to reach within the controlled context it was operating in. This makes it hard to create a simple eval: a check on correctness alone would have waved it through.
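One way to see why this is awkward is to separate the two checks. Here is a minimal sketch in Python, with hypothetical function names and a deliberately crude groundedness test, showing that "is the answer right?" and "could the answer have come from the data the model was given?" are different questions:

```python
import re

def correct(answer: str, expected: str) -> bool:
    # Naive exact-match correctness check; real evals would be fuzzier.
    return answer.strip().lower() == expected.strip().lower()

def grounded(answer: str, context: str) -> bool:
    # Crude groundedness check: every numeric figure quoted in the answer
    # must appear somewhere in the context the model was given. Real
    # systems would use citation checks or an LLM judge, but the principle
    # is the same.
    figures = re.findall(r"\d[\d,.]*", answer)
    return all(f in context for f in figures)

def evaluate(answer: str, expected: str, context: str) -> str:
    if not correct(answer, expected):
        return "fail: wrong answer"
    if not grounded(answer, context):
        # The failure mode described above: the answer is right, but it
        # could not have come from the data the model was shown.
        return "fail: correct, but not supported by the provided data"
    return "pass"

# The model returns the true figure, but that figure never appears in
# (and cannot be derived from) the context it was given.
context = "Q2 operating costs were 1.2m GBP. Headcount grew by 14."
print(evaluate("Revenue was 3.4m GBP", "Revenue was 3.4m GBP", context))
# -> fail: correct, but not supported by the provided data
```

The expert-in-the-loop eval in the story only covered the first check; the second is the one this failure mode slips past.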

During the lunch with the CEO I got the chance to try out their deep research mode. I asked their agent to research a topic that my wife knows a lot about. She reviewed the report, and while the references were accurate, it made broad generalisations about the clinical context that were simply not correct. Convincing, but wrong.

The risks in both of the modes above exist because humans are busy and credulous. In April 2023 I was at a GenAI hackathon (https://world.hey.com/ian.mulvany/london-generative-ai-hackathon-some-reflections-a7724fbe) where the team I was on explored some of the more evil aspects of GenAI (https://llmsaregoinggreat.com/), but it's now very interesting to see these risks become more widespread.

Last week at BMJ we turned on an internal MCP server that has access to our published content (this is safe: it does not modify the content, nor does it have access to any other data within BMJ). One of the team asked it on which days we publish the most content - a straightforward question for the database. The LLM decided that the best way to answer this was to do a broad web search and try to scrape the answer off BMJ's website. We had to kick it a bit to let it know that it could just ask the content store directly!
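A minimal sketch of the kind of steer that can help here, using hypothetical tool names, descriptions, and prompt text rather than our actual configuration. The idea is simply to make it explicit, in the tool descriptions and the system prompt, that the internal content store is the preferred route:

```python
# Illustrative only: tool definitions and system prompt are hypothetical,
# not BMJ's actual MCP configuration.

TOOLS = [
    {
        "name": "query_content_store",
        "description": (
            "Query the internal published-content database directly. "
            "Preferred for any question about our own published articles, "
            "e.g. publication dates, counts, or volumes."
        ),
    },
    {
        "name": "web_search",
        "description": (
            "Search the public web. Use only when the question cannot be "
            "answered from the internal content store."
        ),
    },
]

SYSTEM_PROMPT = (
    "You have direct access to the publisher's own content store. "
    "For questions about published content (for example, which days we "
    "publish the most articles), query the content store first. Do not "
    "scrape or search the public website for data you can read directly."
)
```

It is a small thing, but the surprising part is that you have to say it at all.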




About Ian Mulvany

Hi, I'm Ian - I work on academic publishing systems. You can find out more about me at mulvany.net. I'm always interested in engaging with folk on these topics, so if you have made your way here, don't hesitate to reach out if there is anything you want to share, discuss, or ask for help with!