Some worry that the development of advanced artificial intelligence will result in existential catastrophe … My current view is that there is a disturbingly substantive chance that a scenario along these lines occurs, and that many people alive today – including myself – live to see humanity permanently disempowered by AI systems we’ve lost control over.
Joe Carlsmith, “Is power-seeking AI an existential risk?”
1. Recap
This is Part 6 of the series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.
Part 1 introduced the series. Parts 2-5 looked at climate risk. Part 2 introduced Toby Ord’s claim that humanity faces a 1/1,000 risk of irreversible existential catastrophe due to climate change by 2100. Parts 3, 4, and 5 of the series drew on work by John Halstead to take a look at all plausible mechanisms by which climate change could pose a serious near-term existential risk, and found them unconvincing.
Part 5 also used discussions of climate risk to illustrate what I have called a regression to the inscrutable in leading discussions of existential risk:
We saw throughout this series what I have called a regression to the inscrutable. Effective altruists begin by chronicling risks that are relatively tractable using ordinary scientific methods, such as crop failure, heat stress, or flooding. Finding these risks to be slight, effective altruists increasingly place most of their confidence in esoteric risks such as tipping cascades or runaway greenhouse effects. These risks are distinguished by the fact that they are increasingly inaccessible to our best scientific methods.
This passage suggests that we see a regression to the inscrutable within risk areas: for example, most climate risk is claimed to come from some of the least scrutable sources. However, I also argued that we see a regression to the inscrutable across risk areas: effective altruists take most existential risk to come from threats far less scrutable than climate risk:
Increasingly, effective altruists invest most of their confidence in relatively inscrutable risks such as bioterrorism and risks from misaligned artificial intelligence. Indeed, Ord puts the combined existential risk from these two sources at over 13% by 2100. As the focus of this series shifts to more inscrutable risks, readers would do well to bear in mind the failure of more scrutable risks, such as climate risk, to stand up to scientific scrutiny, as well as the tendency within more scrutable areas, such as climate risk, to concentrate risk estimates within the least scrutable sources of risk.
As promised, the focus of this series will shift to more speculative risks, beginning with existential risks posed by artificial intelligence. However, I hope that lessons from the more scrutable case of climate risk will not be lost on readers of this series. Part 5 suggested three lessons: in addition to a warning about regressions to the inscrutable, we also saw:
- Risk estimates can be inflated by orders of magnitude, even when offered by leading authors in texts often treated as authoritative. For example, Ord puts climate risk at 1/1,000 by 2100, whereas Halstead struggles to get climate risk above 1/100,000 not just by 2100, but over all time.
- The evidential basis for existential risk estimates is often very slim, in many cases resting on a few thin sentences in a long discussion. For example, Ord’s positive case for climate risk rests entirely on a few sentences about lessons from the paleoclimate, which we saw in Part 4 to be not only thin, but also misleading.
Bearing these lessons in mind, let us turn our attention to AI safety. I will begin my discussion by looking at a report by Joe Carlsmith on the dangers of power-seeking artificial intelligence. But first, I want to situate this discussion in a broader perspective that is often lost in discussions by effective altruists.
2. AI safety
With each passing year, effective altruists invest ever more confidence in the claim that artificial intelligence poses a significant existential risk to humanity. Will MacAskill estimates existential risk from artificial intelligence in this century at 5%. Joe Carlsmith estimates risk by 2070 at 10% or higher. Eliezer Yudkowsky suggests, quite possibly in earnest, that we are doomed:
It’s obvious at this point that humanity isn’t going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.
Eliezer Yudkowsky, “MIRI announces new ‘Death with dignity’ strategy”
Because effective altruists have become more concerned about AI risk, they have begun to devote an increasing fraction of their resources to combatting it. New organizations such as the Center for AI Safety have been founded and have begun pouring unprecedented amounts of money into the field. (An example: $100,000 in prizes for conference papers presented at a single symposium).
I am, for my own part, deeply concerned about what future developments in artificial intelligence may bring. I worry that AI may decimate the labor market, with harms falling heaviest among the most vulnerable workers. I worry that decision-making will become less and less transparent and explainable. But I am not very worried that AI may one day murder us all.
3. The Carlsmith report
One challenge in addressing the issue of AI safety is that there is no single orthodox statement of the concern. However, one of the most prominent recent attempts to articulate it is a report by Joe Carlsmith, “Is power-seeking AI an existential risk?”.
Joe Carlsmith is a serious guy. He is a senior research analyst at Open Philanthropy and is finishing a PhD in philosophy at the University of Oxford. Before that, he graduated with a perfect 4.0 from Yale University and completed a BPhil in philosophy at Oxford.
Carlsmith argues for the conclusion that existential risk from artificial intelligence by 2070 is at least 5% (now updated to “>10%”).
I was one of a panel of reviewers commissioned to comment on Carlsmith’s report. As part of the review process, I was asked to provide my own credence that AI will lead to existential catastrophe by 2070. I returned an estimate of 0.00002%.
This view did not make me especially popular among effective altruists. For a while, the most upvoted comment on the Less Wrong thread containing the Carlsmith reviews opined:

In this spirit, I am often asked if I genuinely meant to suggest that the existential risk from artificial intelligence (AI risk) is so low. I did mean it, with a couple of caveats.
The estimate I offered was a bit generous, as one tends to be when being compensated by the foundation which commissioned a report. And since offering this estimate, I have had more extended contact with arguments for AI risk, which has led me to lower it further. I now think that an estimate of 0.00002% was unduly generous (although, as I will explain shortly, I am also not convinced that we are in a position where estimating AI risk makes good methodological sense). Let me explain why I take AI risk to be lower than Carlsmith does.
Before I begin, we need to talk about methodology. This isn’t an easy conversation to have, but it must be done.
4. Some methodological observations
The study of AI safety raises uncomfortable questions about methodology that are well-known, but rarely discussed in public. I think it is important to be open and honest about these methodological issues because they explain the silence of much of the scholarly community on issues of AI safety as well as the attitude of skepticism with which this work is often viewed.
At the outset of this discussion, I want to make my allegiance plain. I am an academic. I think that scholarly methods are among the richest and most reliable methods on offer for studying the world. I am deeply suspicious of discussions that cannot be brought up to the standards of rigor, clarity and evidence expected of scholarly discussions, even when those discussions are carried out by world-class scholars. Like many academics, I tend to think that such a situation is evidence of a defect in the underlying discussion and a good reason to stay away.
There is some risk that I will be accused of snobbery in making these remarks. I understand and appreciate this risk, which is particularly acute given that a small but growing number of serious academics have taken first steps towards putting the field of AI safety on a secure academic footing (a good development). Nevertheless, I think that these words must be said.
One central issue concerns the degree of permissible speculation. Academics are, as a rule, quite hesitant to discuss matters which can only be approached in a speculative manner. We want clear models, good evidence, and promising theories. When we do not find these things, we think it is best to hold off on speculation, because we think that highly speculative theorizing runs a significant risk of being misleading.
For my own part, I spent more than a year refusing to engage with any work in AI safety, precisely on the grounds that I judged it to be too speculative. This is the stance of most scholars today, even scholars interested in artificial intelligence. My attitude has not changed. I engage in these discussions against my better judgment, and conscious of a significant risk that my own remarks will be speculative and misleading. I do this because I think that effective altruists have placed a dangerous degree of confidence in arguments that artificial intelligence poses a significant existential threat in this century, and because that confidence is being used to misdirect billions of dollars of philanthropic funding away from what I view as worthier causes.
A related issue concerns standards of evidence. Standards of evidence are, as a rule, hard to articulate, but in general scholars impose high standards of evidence before they are willing to invest significant confidence in a theory. At many points, the most helpful complaint that I will find myself able to make is that existing arguments for existential risks from artificial intelligence (AI risks) do not meet what many would regard as minimal standards of evidence.
A third issue concerns the notion of a scholarly literature. I am often told by AI safety researchers, even those holding academic positions, that there is a robust and flourishing scholarly literature on AI safety. When I ask them to show me this literature, I am shown a few published papers intermixed with a large volume of blog posts, forum discussions, Discord chats, and other nontraditional media. I was once asked, quite possibly in earnest, to read and cite a 700-page transcript of blog debates bearing the illustrious title ‘AI go FOOM’.
To my mind, a scholarly literature consists primarily of research articles and monographs published by reputable journals and publishers after a process of pre-publication peer review. The authors of scholarly articles typically hold PhDs in their fields, as do their reviewers. I am hesitant to recognize the current assemblage of blog posts, forum discussions, and the like, often written by authors without terminal degrees in their fields, as constituting a scholarly literature.
The notion of a scholarly literature will come to the foreground at one key point in our discussion (Section 2.2 of the Carlsmith report), where confident predictions of strong progress in artificial intelligence are made almost entirely without argument. Instead, we are asked to defer to the literature. What literature? Here we are offered citations to three unpublished reports, all by the same foundation that commissioned the Carlsmith report (Cotra 2020, Davidson 2021, Roodman 2020), and a selection of mostly unpublished surveys carried out almost entirely by effective altruist organizations.
This is not a scholarly literature, and it is not a literature worthy of our trust. The authors of all three of the cited reports lack terminal degrees in their fields; have no academic affiliation; and have little record of serious scholarship. These reports and surveys have not been through rigorous processes of peer review (a point I return to below), and have largely been self-published by a few philanthropic foundations which are known for pushing very strong views about existential risk from artificial agents. I do not, and will not, defer to such an assemblage of writings, nor am I terribly excited to read them.
(Edit: A reader points out that some authors have some of the marks of authority emphasized above. For example, Roodman has several well-respected and well-cited publications under his belt, and Carlsmith is finishing a PhD at Oxford. It is important to bear in mind that scholarly authority is not an all-or-nothing question, and to be clear about what marks of authority an author or foundation does or does not have).
A fourth issue concerns the nature of peer review. Scholarly articles are typically evaluated through processes of pre-publication peer review, in which independent editors commission reviews from leading specialists in the field. Reviewers have the power to decide whether and in what form the article is published. Reviews are financially uncompensated, or in special cases poorly compensated, and typically private.
Carlsmith’s report did not go through any comparable process, and would likely not have made it through peer review at a leading journal. It was subject only to post-publication review, in which staff members of Open Philanthropy (which commissioned the report) commissioned reviews from a range of authors. Reviewers had no power to alter or reject the report. Reviews were generously financially compensated, and were posted in public. As we saw, public reviews can have real consequences for reviewers (a public guffaw met my report), so we are typically more reluctant to speak our minds, especially when we are being paid by the foundation which commissioned the report.
Many scholars view articles which have not passed processes of peer review with deep suspicion. A report such as the Carlsmith report would be viewed in much the same light as a post on a good blog – that is to say, a far cry from a contribution to the scholarly literature. (Although I hope that readers will enjoy and learn from this blog, I would never dream of demanding that it be cited in scholarly discussions. If you want to learn about the ethics of artificial intelligence, you can read any of the many excellent papers on this subject written by world-leading scholars).
A final issue concerns explicitness of reasoning. Reviewers are notorious for demanding exact and precise statements of the reasons underlying all claims, even comparatively minor claims. A danger of circumventing ordinary processes of peer review is that reports can be published even when a substantial part of the reasoning in support of key premises is not specified.
We have already seen one instance of this in our discussion of Ord’s views on climate risk, beginning in Part 2 of the series Exaggerating the risks. We saw that Ord asserts, almost entirely without argument, a 1/1,000 risk of irreversible existential catastrophe from climate change by 2100. And we saw the dangers of this strategy: it took us a great deal of time to chase down every possible argument that Ord could have had in mind, since Ord did not tell us. We found that none of these arguments came anywhere close to grounding the conclusions that Ord gave.
At key points in my response to Carlsmith, the complaint will be much the same complaint I have made against Ord: central premises of the argument are defended with little, if any, explicit reasoning. That is, to my mind, good cause for skepticism.
5. Carlsmith’s argument
Carlsmith argues that humanity faces at least a 5% probability of existential catastrophe from power-seeking artificial intelligence by 2070 (updated to 10% in March 2022). Here is how Carlsmith outlines the argument, with probabilities reflecting Carlsmith’s weaker, pre-2022 view (“|” represents conditionalization; descriptive titles for premises are added but otherwise the following is a direct quotation from Carlsmith).
By 2070:
1. (Possibility) 65% It will become possible and financially feasible to build AI systems with the following properties:
- Advanced capability: They outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering and persuasion/manipulation).
- Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
- Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
(Call these “APS” – Advanced, Planning, Strategically aware – systems).
2. (Incentives) 80% There will be strong incentives to build and deploy APS systems | (1).
3. (Alignment difficulty) 40% It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway | (1)-(2).
4. (Damage) 65% Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars worth of damage) because of problems with their objectives. | (1)-(3).
5. (Disempowerment) 40% Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).
6. (Catastrophe) 95% This disempowerment will constitute an existential catastrophe | (1)-(5).
Aggregate probability: 65% * 80% * 40% * 65% * 40% * 95% ≈ 5%.
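To make the arithmetic behind the headline figure explicit, here is a minimal sketch in Python (my own illustration, not anything from the report) of how the six conditional probabilities multiply together:

```python
# Carlsmith's six premise probabilities, as quoted above (pre-2022 figures).
premises = {
    "Possibility": 0.65,
    "Incentives": 0.80,
    "Alignment difficulty": 0.40,
    "Damage": 0.65,
    "Disempowerment": 0.40,
    "Catastrophe": 0.95,
}

# Each probability is conditional on all of the preceding premises, so the
# unconditional probability of the final outcome is simply their product.
total = 1.0
for p in premises.values():
    total *= p

print(f"Aggregate probability: {total:.4f}")  # 0.0514, i.e. roughly 5%
```

Note that because the estimate is a simple product, it is linear in each factor: lowering any one premise probability lowers the headline figure in direct proportion.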
6. Looking ahead
In the next few installments of this series, I will take a look at Carlsmith’s argument for these six claims. I will focus on questions such as the following:
- What argument is given for (Alignment Difficulty)? How much evidence do we really have that artificial intelligence is likely to be substantially misaligned?
- How bad is the misalignment expected to be? Everyone should grant that AI systems will sometimes fail to do what we want them to. They already have. But why think they will take over the world and disempower us all?
- Why are we so confident in (Possibility)? Optimists have a long history of promising that artificial general intelligence is just around the corner. After so many failed predictions, why should we take them at their word today?
Do let me know in the comments if there are other questions regarding the Carlsmith report that you would like to hear discussed. I am also interested in hearing which other arguments for existential risk from artificial intelligence, beyond the Carlsmith report, readers would like to hear discussed.