Putting Chatbots to the Test

Would you trust an AI chatbot with family planning? Investing $1 million? How about writing your wedding vows? So asks the Wall Street Journal.

I'd like to think that I started this trend with Project Maestro, but more likely LLM technology has simply reached a point where folks are looking past the hype and benchmarking output in the real world.

Stanford has been benchmarking LLMs for a while, but its work is very technical and honestly not that accessible to the layperson. I recently found this Wall Street Journal article that put the top 5 LLMs to the test against more real-world scenarios.

I love the format, and I appreciate that the WSJ, with its broad audience, dedicated some time to showing the practicality (and impracticality) of the current generation of LLMs.

Without a doubt, enthusiasts will declare, "Imagine what it can do next month/year/decade." It's a familiar refrain: the future value of LLMs. It's fun to dream, but technologists need to understand the current capabilities of the tech to make smart and lasting decisions.