In the first part of my review of If Anyone Builds It, Everyone Dies by Eliezer Yudkowsky and Nate Soares, I had a lot of qualms about their simplistic arguments: https://world.hey.com/cipher/if-anyone-builds-it-everyone-dies-review-part-1-5d35f752 . I had only read to page 49 of the 233-page book, but I felt it was important to consider a more realistic context. I still believe that to be true. I will start this post with some additional criticism, before I reveal how they changed my mind. Then I will alternate between criticisms and the book's good points until I wrap things up at the end.
On page 51, the authors tell us a story of OpenAI's o1 model doing some hacking. After describing how the model was prompted to hack a computer system and then proceeded to do so, they say this: "In other words, o1 went hard. It behaved as if it wanted to succeed." I would say that's a natural expression of the model surprising its human users, except that the authors genuinely mean that the model wanted to succeed. Not that it is a beautifully engineered system that, through a combination of luck in the computer testing environment and repeated attempts, hit on a unique chain of hacks that led to the goal. No, it wanted to succeed.
Indeed, the authors refer to AI engineers as "AI-growers", because to them this is more like alchemy or the ineffable mysteries of life itself than a carefully and scientifically designed system. On page 52 they say, "When an AI-grower demands ever-higher performance from an AI on increasingly difficult problems, including ones that the AI had never previously encountered, gradient descent tweaks the AI to make it perform more and more of those useful mental motions, to make it become more and more the sort of thing that plots and plans -- that never gives up; that goes hard."
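To demystify what that quote is describing: gradient descent is an ordinary optimization procedure that repeatedly nudges a model's numeric parameters to reduce a measured error. Here is my own minimal sketch on a toy one-parameter model (nothing from the book or from any real training run):

```python
# Toy illustration of gradient descent: fit a single parameter w so that
# predictions w * x approximate targets y. The mechanism is mundane:
# measure the error, compute its slope, nudge the number downhill.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # made-up (input, target) pairs

w = 0.0              # the single "weight" we are tuning
learning_rate = 0.05

for step in range(200):
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad   # the "tweak" the authors describe

print(f"learned w = {w:.3f}")   # ends up near 2, because the targets are roughly 2 * x
```

Scaled up to billions of parameters (and, in practice, fancier optimizers and training regimes), this is what the "tweaking" in that quote amounts to mechanically.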
The fact is, AI simply has a different skill set and knowledge base than humans. You can still tell the AI to stop or turn off its computer systems. The AI doesn't want to live any more than it wants to hack. For example, what happens if you leave it alone? Does it do anything at all? Or does it lie dormant or dead?
Here again anthropomorphic language fails us. The authors lean on this kind of language heavily. They are explicit about it at times, but I'm not sure that helps. AIs are "grown". They have "wants" and they "go hard". The difficulty is that, well, it's very hard not to use such language. I used it above when I referred to the AI that was left alone as "dormant" or "dead". We don't normally describe software that exists on a system but isn't currently in use as dormant or dead. This matters because it skews our perception of what we're discussing into something with emotions and implied moral duties. Significantly, a lot of prominent voices in this space do not bring up the emotions and moral duties, but they do use the anthropomorphic language. I think everyone needs to be a lot more careful about that.
Take the next part of their chapter as an example, where the authors say, "The behavior that looks like tenacity, to 'strongly want', to 'go hard,' is not best conceptualized as a property of a mind, but rather as a property of moves that win." Unfortunately, this is exactly backwards. Moves that win are things humans can look back on, after the win, and assign intent to. Even if a human solves a problem or breaks a record purely by chance, we tend to think it was not luck but talent. Here, the authors are doing exactly that, except for AI. If an AI solves a puzzle, then it must have wanted to; and if it solves a puzzle in an unconventional way, well, then it must have really wanted to win.
There are other possibilities: maybe the AI got lucky and hit upon something a human likely wouldn't. Its computations are likely different from whatever happens in a human mind. Maybe there was some aspect of the testing environment that the humans didn't notice when they constructed it, but which gave the AI an advantage. Many machines are better than humans at all kinds of tasks. We don't say a weather model running on a supercomputer wants to win, despite the fact that no human would be able to do all the math. Nor do we call the human a loser because they were beaten by the supercomputer. Taken to an extreme, it's almost as if the authors are saying that every human who lost a game to a software system just didn't want it as badly as the software system did. Perhaps the Secret can work for AI too.
Nevertheless, now that I've finished reading the book, and to the authors' credit, they have changed my mind about their central thesis. I now agree with them that superintelligent AI poses an existential threat to humanity, specifically because we don't know how to align such an AI. This is known as the alignment problem, and it is the crux of their entire argument. Alignment here refers to AI that behaves in ways consistent with its creators' intentions, or even that simply always follows orders except those which may harm humans. Superalignment is a related term, referring to the use of AI itself to keep other AI in alignment. Superalignment is being explored because alignment through training and other means is thought to be insufficient. Personally, I think the book does a disservice to its argument by not focusing more on the specific difficulties associated with alignment. Those would likely have been more persuasive for me than all the mystery-mongering mumbo jumbo about LLMs with mental models and wants.
Starting on page 115, part 2 of their book is a fifty-page novella about an AI that escapes its confines and (spoiler alert) ultimately creates a superintelligent version of itself as it undermines human civilization globally. There are plot holes big enough to put a data center through. Despite this, I found the story made it easier to accept the general thrust of the parable: superintelligent AI would most likely end up being fantastically dangerous to humans. What I initially disagreed with was their level of alarm over current AI projects, because I thought they were fundamentally incorrect about how quickly those projects could become an existential threat. But an existential threat to humanity that takes years or decades to reach its climax, instead of weeks or months, is still not something to dismiss. It's important to sound the alarm now.
I want to point out again that even by page 157 the authors don't bother getting into the more complicated questions about morality and consciousness. They also don't differentiate between LLMs and Artificial General Intelligence (AGI), nor do they provide a clear definition of Artificial Super Intelligence (what I refer to as superintelligent AI or simply super AI). Rather, they continue to merely assume things about AI wants and preferences. The major assumption they make is that an AI that can outsmart humans will want to remain on (or alive, for lack of a better term) rather than not particularly caring either way if someone turns it off. I still think this unstated assumption is a major flaw in their work. However, after reading the whole book and reflecting on the full range of their arguments, I now believe they were referring to the AI industry's push to create autonomous, generally intelligent AI agents. It's not necessarily that any and all AI will develop wants. It's that we are building AI to act purposefully, to take actions and make choices for us. And that's why the alignment problem is such a big concern.
The chance that an asteroid could wipe out humanity is small, but not zero. The chance that humanity will develop superintelligent AI in the next 50 years seems much higher to me, and we are far less prepared for it. Then again, we can't even reduce the amount of greenhouse gases we're pumping into the atmosphere each year. In fact, we continue to pump an increasing amount. I now fear that super AI will end up like climate change: something we know is coming, but for which we remain unprepared. Asteroids are much easier to plan against, all the more so if we spot them while they're still far away. Climate change isn't so straightforward. If Anyone Builds It, Everyone Dies does an excellent job of making clear why super AI isn't so straightforward either.
After their novella, the authors bring up engineering examples that highlight the difficulties: space probes, nuclear reactors, and cybersecurity. As I alluded to above, the engineering behind identifying asteroids and potentially deflecting them from hitting Earth is far more tractable than preventing a super AI from running amok. The authors make some enlightening analogies, including what happened at Chernobyl. Their analysis of the Chernobyl meltdown is detailed and focuses on what they refer to as the gap between what is known and how quickly humans can react. The authors could have made a simplistic argument along the lines of: nuclear reactors are inherently unsafe and we never should have built them, just like super AI. They don't do that at all. Instead, they show how the specific design of the Chernobyl reactor systems didn't account for all edge cases and, in particular, how a confluence of edge cases created the conditions for a meltdown to occur in an unintuitive manner.
Importantly, the systems that failed at Chernobyl were meant to control a nuclear reaction, which is stable only under highly precise conditions. Outside a tiny range of stability, the reaction will either decline until it ceases or run exponentially out of control. In the latter case, a meltdown or explosion can happen on the order of microseconds (depending on the control systems in place). Humans can't react in microseconds, and in the case of Chernobyl the result was deadly. With super AI, the authors argue, we could be up against not a runaway reaction but an adversary that can out-compute and out-think us a thousand times over, possibly before we even realize what we've created.
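To make the "too fast for humans" point concrete (this is my own back-of-the-envelope framing, not a calculation from the book): while a supercritical reactor is running away, its power grows roughly exponentially,

```latex
P(t) = P_0 \, e^{t/T}
```

where T is the reactor period, the time it takes power to grow by a factor of e. When T collapses to a small fraction of a second, no operator can intervene in time. The authors' worry is that a superintelligent adversary puts us on the wrong side of an analogous gap, measured in thinking speed rather than neutron flux.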
I think what I disliked about the early parts of the book is that many of their arguments about LLMs seemed simplistic and manipulative. As I read, I wanted to steelman some of their arguments, bring clarity to vagaries, and drill down a lot more on the difference between a safe AI and a dangerous one. The way the book is written, it feels much more like an attack on existing AI systems than a warning call against some particular future version of AI. I understand that they don't want to try to predict a particular future AI. But more clarity on the current state of the art would have helped.
Later in their book they have a short vignette about an alchemist. In it, a king declares that anyone who can change lead into gold should prove it to him and upon doing so will receive riches beyond their imagination. However, if they come to him but are unable to turn lead into gold, then they will be killed and so will their whole village. One alchemist thinks he is close to having a method for turning lead into gold, but his sister pleads with him not to go, because she doesn't want to die if he can't do it. The alchemist, on the other hand, thinks he must go, because of all the alchemists in his village, he believes he is the closest to having a correct method. If he doesn't go, then their village is more likely to perish, because the other alchemists will surely go and fail. To this, the sister says that instead he should go to the village council and ask for a prohibition on alchemists going to the king. As you might imagine, the alchemist protests that the village council won't listen to him and that even if they did listen, it would be too late to prevent it.
The authors use this story as a parable for those who argue we need to rush to be the first to develop super AI. Unfortunately, it leaves AI engineers and the people who study AI academically looking like alchemists and fools. The authors claim that the only statements they know of from people in the AI industry claiming to have solved the alignment problem are a few speeches and social media posts. At first this didn't sound fair to me, but, admittedly, I don't know. The authors have connections in the AI industry and academia. I do not. The authors say that alignment and superalignment programs in the AI industry have resulted in many alignment professionals leaving the industry altogether, either because alignment isn't possible or because companies aren't heeding their advice. As I reflected on this, however, I came to see that it might be true. I'll explain more below.
One thing I already knew is that philosophy has long pondered many important related topics. I'm not an expert in any of this (neither AI nor philosophy), but one philosopher I enjoy reading frequently delves into the intersection of philosophy and AI: Eric Schwitzgebel. A recent blog post of his is about DigiDan, a GPT-3 model that was fine-tuned on the writings of the philosopher Daniel Dennett. The post is titled "DigiDan's 'Mistake': Fidelity vs Novelty in Digital Replicas": https://schwitzsplinters.blogspot.com/2025/09/digidans-mistake-fidelity-vs-novelty-in.html .
What I found fascinating about the post is that they ran a small study to see if people could tell the difference between Dennett's actual responses to a set of questions and DigiDan's responses to the same questions. The results are interesting and I highly recommend the blog post, but one thing stood out. Many of the philosophy experts who participated got one question in particular wrong (i.e., they thought Dennett had given the answer when in fact DigiDan had). It turns out that on that question Dennett had changed his mind, so he answered differently than people familiar with his past writings expected. This was the question:
Could we ever build a robot that has beliefs? What would it take? Is there an important difference between entities, like a chess playing machine, to whom we can ascribe beliefs and desires as convenient fictions and human beings who appear to have beliefs and desires in some more substantial sense?
Daniel Dennett's actual response to this question was, "We've already built digital boxes of truths that can generate more truths, but thank goodness, these smart machines don't have beliefs because they aren't able to act on them, not being autonomous agents. The old-fashioned way of making a robot with beliefs is still the best: have a baby."
In his blog post, Schwitzgebel continues:
On Facebook, Aaron Zimmerman suggested that Dennett changed his mind based on further thinking and that the AI would need to self-query to do that and thereby approach the kind of agency Dennett came to regard as essential to mentality in general.
This is a plausible response. Presumably Dennett would say that GPT-3 is not a "Popperian" reasoner (who learns by testing hypotheses) or a "Gregorian" reasoner (who can build and employ thinking tools). Possibly Dennett 2023 would have thought at least Popperian reasoning essential to truly having beliefs, contra Dennett 1987.
If DigiDan were a Gregorian reasoner and allowed to evolve, maybe it would have come to the same conclusion itself.
And academic researchers have not only been studying the nature of consciousness and thought in light of AI. They have also been studying existential risk. Take this paper by Charles Jones, titled "The AI Dilemma: Growth versus Existential Risk": https://web.stanford.edu/~chadj/existentialrisk.pdf .
I bring this up because it illustrates the kind of research and argumentation people are producing about AI, which I feel the authors of If Anyone Builds It, Everyone Dies elide in favor of their more simplistic argumentation. Perhaps they just didn't want to get into such a detailed philosophical argument, or they wanted to keep their book accessible to a general audience. I get that. The Charles Jones paper I linked to above is rather dense and mathematical. Simplifying things for a general audience is common in all manner of popular scientific writing. Still, in my opinion they go too far, and that does a disservice both to their argument, which would have been more compelling, and to the readers of their book, who will recognize the use of manipulation over solid argumentation. This reminds me of Maureen Dowd's recent opinion piece "We’re All Going to Die — Soonish!" https://www.nytimes.com/2025/09/27/opinion/grok-ai-companions-x.html .
On page 197, in a chapter titled "I Don't Want to Be Alarmist", the authors make a good point about certainty. Using the examples of lead in gasoline and chlorofluorocarbons in refrigerators and air conditioners, they drive home how important it is not to be complacent in the face of uncertainty. Looking back at my first blog post about this book, "If Anyone Builds It, Everyone Dies, Review Part 1": https://world.hey.com/cipher/if-anyone-builds-it-everyone-dies-review-part-1-5d35f752 , I have to admit that I am guilty of that kind of complacency when it comes to the dangers of AI.
On page 198 they briefly mention Toby Ord, whom they describe as "an Oxford philosopher who spent his career studying extreme threats to humanity and who used to advise Google DeepMind". They say Ord "has been quoted as putting the chance that AI destroys humanity at only 10 percent. But if you look into the details, Ord says the reason he estimates 'only' a 10 percent chance of AI destroying humanity is because he expects humanity to come to its senses and get its act together."
This intrigued me, and I found that on February 4, 2024, Ord posted on his website about revising his estimates for AI calamity along with other existential risks to humanity. In it, he remains surprisingly sanguine, explaining that his estimate hasn't changed much: https://www.tobyord.com/writing/the-precipice-revisited . In the post he also points to progress on the international stage, countries working together, and existential risks such as unaligned AI increasingly appearing in public discourse. I very highly recommend reading it.
Here is what Ord says about LLMs (generative AI):
Now, the cutting edge is generative AI: systems that generate images, videos, and (especially) text.
These new systems are not (inherently) agents. So the classical threat scenario of Yudkowsky & Bostrom (the one I focused on in The Precipice) doesn’t directly apply. That’s a big deal.
It does look like people will be able to make powerful agents out of language models. But they don’t have to be in agent form, so it may be possible for first labs to make aligned non-agent AGI to help with safety measures for AI agents or for national or international governance to outlaw advanced AI agents, while still benefiting from advanced non-agent systems.
Interestingly, Ord has a nuanced and subtle take, similar to my own arguments in my first blog post:
Because this was supervised learning with a vast amount of training data (not too far away from all text ever written), this allowed a very rapid improvement of capabilities. But this is not quite the general-purpose acceleration that people like Yudkowsky had been predicting.
He predicted that accelerating AI capabilities would blast through the relatively narrow range of human abilities because there is nothing special about those levels. And because it was rising so fast, it would pass through in a very short time period.
Ord also doesn't hide from examples of misalignment, so his analysis includes some of the same things Yudkowsky and Soares cover. Speaking of Microsoft, Ord says,
They put out a powerful, but badly misaligned system that gave the world its first example of an AI system that turned on its own users.
Some of the most extreme examples were threatening revenge on a journalist for writing a negative story about it, and even threatening to kill an AI ethics researcher. It was a real 2001: A Space Odyssey moment, with the AI able to autonomously find out if you had said negative things about it in other conversations and threaten to punish you for it.
The system still wasn’t really an agent and wasn’t really trying to get vengeance on people. It is more that it was simulating a persona who acts like that. But it was still systematically doing behaviours that if done by humans would be threats of vengeance.
Interestingly, Ord shares some of the same fears:
Microsoft managed to tamp down these behaviours, and ended up with a system that did pose a real challenge for Google on search. [...]
I’m really worried that this new race between big tech companies (who don’t get the risks from advanced AI in the same way that the labs themselves do) will push the labs into riskier deployments, push back against helpful regulation, and make it much harder for the labs to pull the emergency brake if they think their systems are becoming too dangerous.
Ord also makes a cogent argument (although this was before the current US administration came to power) that there "was an amazing shift in the Overton window in just 16 months [between 2023 and 2024]. While there is by no means universal agreement, the idea that existential risk from AI is a major global priority is now a respected standard position by many leaders in AI and global politics, as is the need for national and international regulation of frontier AI."
But to Yudkowsky and Soares' point, the need for regulation is different from actually having effective regulations in place. I think Ord is overly optimistic here, and Yudkowsky and Soares are correct to call it out. We don't know if Ord's projections of AI intelligence and agency will hold. A new discovery in the field of AI could change things overnight. We already have machine learning and other software techniques that enable amazingly advanced AI. We know very well that the major companies working on AI are desperately trying to be first to create super AI. We have no reason to believe that they have figured out the alignment problem or that they are taking the most restrictive precautions (in fact, we know they haven't and that they aren't). We don't have years to work on building political coalitions and then enacting regulations. We should have had them in place already.
Incidentally, a more recent post by Ord highlights the difficulty of regulating AI safety. In it, Ord explains what inference scaling is, how it differs from training scaling, and how it fundamentally changes AI governance in a way that at least some existing AI safety laws can't take into account: https://www.tobyord.com/writing/inference-scaling-reshapes-ai-governance . Of course, you may have also heard this week that California Governor Gavin Newsom signed a new AI bill into law: https://www.nytimes.com/2025/09/29/technology/california-ai-safety-law.html . For Yudkowsky and Soares, I imagine the new law represents a mere band-aid on a gushing wound. After reading their book, I would agree.
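For readers who haven't followed that debate: training scaling means spending more compute once, up front, to produce a more capable model, while inference scaling means spending more compute every time the model answers a query, for example by sampling several candidate answers and keeping the best one. Here is my own minimal sketch of that second idea, with stand-in functions I made up (not how any particular lab, or Ord's post, implements it):

```python
import random

# Toy illustration of inference-time scaling: instead of training a bigger
# model, spend more compute per query by sampling several candidate answers
# and keeping the highest-scoring one. Both functions below are stand-ins.

def toy_model(prompt: str) -> str:
    """Stand-in for a language model call; returns a noisy candidate answer."""
    return f"{prompt} -> candidate #{random.randint(1, 1000)}"

def toy_score(answer: str) -> float:
    """Stand-in for a verifier or reward model scoring an answer's quality."""
    return random.random()

def answer_with_inference_scaling(prompt: str, n_samples: int) -> str:
    """More samples = more compute per query; with a real verifier, usually a better answer."""
    candidates = [toy_model(prompt) for _ in range(n_samples)]
    return max(candidates, key=toy_score)

# Doubling n_samples doubles the per-query cost without changing the trained model.
print(answer_with_inference_scaling("2 + 2 = ?", n_samples=8))
```

The governance point follows directly: a rule that only looks at how much compute went into training never sees the n_samples knob at all.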
If you haven't read Yudkowsky and Soares' book If Anyone Builds It, Everyone Dies, I cannot recommend it enough. Buy it or get it from your library. It will make you think, it will make you angry, it will make you sad. It's a quick and fun 233 pages, despite its seriousness and importance. And it will leave you with that feeling that you need to do something. It's genuinely an inspiring call to action.