We should focus on building institutions that both reduce existing AI risks and put us in a robust position to address new ones as we learn more about them … [But] let’s focus on the things we can study, understand and control—the design and real-world use of existing AI systems, their immediate successors, and the social and political systems of which they are part.
Lazar, Howard and Narayanan, “Is avoiding extinction from AI really an urgent priority?”
This is Part 8 of the series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.
Part 6 introduced the Carlsmith report on power-seeking artificial intelligence, and Part 7 discussed the role of instrumental convergence in Carlsmith’s argument. While Part 7 expresses the heart of my disagreement with Carlsmith, I think it might be helpful to discuss a few other places where I wanted to hear more from Carlsmith.
At many points, the complaint will be the same complaint that I raised against Carlsmith’s discussion of instrumental convergence: there isn’t enough argument given to support the conclusion. As I said in Part 6 of this series:
At key points in my response to Carlsmith, the complaint will be much the same complaint I have made against Ord. Central premises of the argument are defended by little, if any explicit reasoning. That is, to my mind, good cause for skepticism.
In particular, I want to look at what is driving the views about AI timelines used to get Carlsmith’s concern off the ground, as well as the link between the potential for practical PS-misalignment and the damages (permanent disempowerment of humanity) that Carlsmith thinks are likely to follow from the deployment of practically PS-misaligned agents.
First, let’s briefly review Carlsmith’s argument. Readers familiar with the argument may want to skip ahead to Section 3.
2. Carlsmith’s argument
Carlsmith argues that humanity faces at least a 5% probability of existential catastrophe from power-seeking artificial intelligence by 2070 (updated to 10% in March 2022). Here is how Carlsmith outlines the argument, with probabilities reflecting Carlsmith’s weaker, pre-2022 view (“|” represents conditionalization).
1. (Possibility) 65% It will become possible and financially feasible to build AI systems with the following properties:
- Advanced capability: They outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering and persuasion/manipulation).
- Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
- Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
(Call these “APS” – Advanced, Planning, Strategically aware – systems).
2. (Incentives) 80% There will be strong incentives to build and deploy APS systems | (1).
3. (Alignment difficulty) 40% It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway | (1)-(2).
4. (Damage) 65% Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars worth of damage) because of problems with their objectives. | (1)-(3).
5. (Disempowerment) 40% Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).
6. (Catastrophe) 95% This disempowerment will constitute an existential catastrophe | (1)-(5).
Aggregate probability: 65% * 80% * 40% * 65% * 40% * 95% ≈ 5%.
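The aggregate figure is simply the product of the six conditional probabilities. As a quick check of the arithmetic, here is a minimal Python sketch. The dictionary keys are my own labels for the premises, not Carlsmith’s.

```python
from math import prod

# Carlsmith's pre-2022 conditional probabilities for premises (1)-(6),
# as listed above. The keys are illustrative labels only.
premises = {
    "possibility": 0.65,
    "incentives": 0.80,
    "alignment_difficulty": 0.40,
    "damage": 0.65,
    "disempowerment": 0.40,
    "catastrophe": 0.95,
}

# Each probability is conditional on the premises before it,
# so the chain multiplies out to the aggregate estimate.
aggregate = prod(premises.values())
print(f"{aggregate:.4f}")  # 0.0514
```

Because the estimate is a product, it is only as well supported as its individual factors: lowering any one premise lowers the aggregate in the same proportion.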
Let’s begin today’s discussion with Premise 1 (Possibility).
3. AI timelines
An old joke about artificial general intelligence (AGI) is that it has been twenty years around the corner for at least seven decades.
In 1955, a crack team of ten scientists converged on Dartmouth with the bold claim that they could take humanity a significant way towards AGI in two months:
We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.
Ten years later, in 1965, Nobel Laureate Herbert Simon proclaimed:
Machines will be capable, within twenty years, of doing any work a man can do.
Five years later, in 1970, AI pioneer Marvin Minsky strengthened Simon’s claim:
In from three to eight years we will have a machine with the general intelligence of an average human being.
And that takes us only up to 1970. More recently, it is not just AI experts who have been wrong. Our best forecasters appear to have been caught up in the hype. A recent analysis of predictions on the forecasting platform Metaculus found that while most non-AI predictions were reasonably accurate, AI predictions had a Brier score of 0.24, almost indistinguishable from the chance-level score of 0.25.
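For readers unfamiliar with the measure: on yes/no questions, the Brier score is the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not), so lower is better. A forecaster who answers 50% on every question scores exactly 0.25 whatever happens, which is why 0.25 serves as the chance-level benchmark. Here is a minimal Python sketch; the outcomes are made up for illustration and are not Metaculus data.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical resolutions for five yes/no questions (not real data).
outcomes = [1, 0, 0, 1, 0]

# Answering 50% on everything scores 0.25 regardless of what happens.
coin_flip = brier_score([0.5] * len(outcomes), outcomes)

# A perfectly confident and correct forecaster scores 0.
perfect = brier_score([1.0, 0.0, 0.0, 1.0, 0.0], outcomes)

print(coin_flip, perfect)  # 0.25 0.0
```

On this scale, a score of 0.24 sits almost on top of the uninformative 50% baseline.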
One would think that the lesson from history would be caution on the part of experts and forecasters. But in many circles today, we hear much the same thing as we did before. Elon Musk told us in 2022:
2029 feels like a pivotal year. I’d be surprised if we don’t have AGI by then.
and Ray Kurzweil makes a similar prediction for 2029.
Looking at these predictions, one cannot help but wonder if we have failed to learn from a history of overconfidence about AI timelines. There is no denying that impressive progress has been made, and will continue to be made, in the field of artificial intelligence, but we must not be too hasty to project the most extreme scenarios. Here I am tempted to agree with Rodney Brooks, writing for the MIT Technology Review:
AI has been overestimated again and again, in the 1960s, in the 1980s, and I believe again now.
On timelines, some effective altruists are thankfully more modest than Musk and Kurzweil. For example, Carlsmith projects a 65% chance of APS systems becoming possible and financially feasible by 2070, although I suspect this chance would increase with Carlsmith’s recent confidence update.
Given the history of failed predictions on this subject, one would expect Carlsmith to present a detailed body of independent evidence in support of his predictions. What Carlsmith gives us is something less than this. Here is the entirety of Carlsmith’s discussion of the likelihood of APS systems being possible and feasible by 2070:
My own views on this topic emerge in part from a set of investigations that Open Philanthropy has been conducting (see, for example, Cotra (2020), Roodman (2020), and Davidson (2021), which I encourage interested readers to investigate. I’ll add, though, a few quick data points.
- Cotra’s model, which anchors on the human brain and extrapolates from scaling trends in contemporary machine learning, put >65% on the development of “transformative AI” systems by 2070 …
- Metaculus, a public forecasting platform, puts a median of 54% on “human-machine intelligence parity by 2040” and a median of 2038 for the “date the first AGI is publicly known” (as of mid-April 2021) …
- Depending on how you ask them, experts in 2017 assign a median probability of >30% or >50% to “unaided machines can accomplish every task better and more cheaply than human workers” by 2066, and a 3% or 10% to the “full automation of labor” by 2066 (though their views in this respect are notably inconsistent, and I don’t think the specific numbers should be given much weight).
Personally, I’m at something like 65% on “developing APS systems will be possible/financially feasible by 2070.” I can imagine going somewhat lower, but less than e.g. 10% seems to me weirdly confident (and I don’t think the difficulty of forecasts like these licenses assuming that probability is very low, or treating it that way implicitly).
Carlsmith’s concession to the possibility of lowering his forecast at the end of this discussion is welcome. Otherwise, this discussion raises many if not most of the methodological worries discussed in Part 6 of this series.
- It asks us to defer our opinions to an assemblage of unpublished writings that do not deserve the title of a scholarly literature: Cotra (2020), Roodman (2020), Davidson (2021), and Grace et al. (2016), together with Metaculus forecasts.
- Three of these four writings were published by the same foundation (Open Philanthropy) that published Carlsmith’s report, and the fourth is from an organization (AI Impacts) that has been heavily supported by Open Philanthropy.
- Only one of these pieces has been published in a scholarly journal, and there is no evidence that the others will ever be submitted for publication.
- The authors of all but Grace et al. (2016) lack terminal degrees in any field, let alone a relevant field.
- We just saw that Metaculus predictions of AI questions hover near chance-level.
Given these methodological challenges, what we have been offered is a thin collection of reports funded essentially by the same organization that commissioned the Carlsmith report, and largely lacking the credentials to be taken seriously by scholars. These are supplemented with predictions that we have good reason to believe are hovering near chance-level.
I cannot, and will not, defer to such evidence, nor will the vast majority of educated readers.
4. From misalignment to disempowerment
There is no doubt that artificial agents will, at some point, seek to gain some type of power or influence over humans. Your news feed may seek to keep you scrolling, your gig-economy app to keep you working, and your car to keep you from driving recklessly. The question is not why we should think that APS systems would seek power and influence over humans in response to certain inputs. The question is why we should think that APS systems will seek, and gain, so much power that they permanently disempower humanity.
Like many conclusions in the Carlsmith report, this conclusion hovers somewhat uncomfortably in the margins of the text, or perhaps between the words on the page. Let’s start with what Carlsmith says about the possibility of massive damages, totaling at least $1 trillion (Premise 4), given the first three premises:
I’m going to say: 65%. In particular, I think that once we condition on [premises] 2 and 3, the probability of high-impact post-deployment failures goes up a lot, since it means that we’re likely building systems that would be practically PS-misaligned if deployed, but which are tempting – to some at least, especially in light of the incentives at stake in 2 — to deploy regardless.
So far, we’ve heard no words about damages. We’ve only been told about the likelihood of deploying misaligned systems. Will we hear anything about damages? Carlsmith continues:
The 35% on this premise being false comes centrally from the fact that (a) I expect us to have seen a good number of warning shots before we reach really high-impact practical PS-alignment failures, so this premise requires that we haven’t responded to those adequately, (b) the time-horizons and capabilities of the relevant practically PS-misaligned systems might be limited in various ways, thereby reducing potential damages, and (c) practical PS-alignment failures on the scale of trillions of dollars (in combination) are major mistakes, which relevant actors will have strong incentives, other things equal, to avoid/prevent (from market pressure, regulation, self-interested and altruistic concern, and so forth).
So far, we have been given some sketchlike reasons to think that high damages are not likely, but no reasons to think that high damages are likely. Carlsmith continues:
However, there are a lot of relevant actors in the world, with widely varying degrees of caution and social responsibility, and I currently feel pessimistic about prospects for international coordination (cf. climate change) or adequately internalizing externalities (especially since the biggest costs of PS-misalignment failures are to the long-term future). Conditional on 1-3 above, I expect the less responsible actors to start using APS systems even at the risk of PS-misalignment failure; and I expect there to be pressure on others to do the same, or get left behind.
The second sentence is an argument for the likelihood of deploying PS-misaligned agents, and says nothing about possible damages. The first sentence might be interpreted as an argument about damages, but if so, it is highly nonspecific: it tells us that some actors are less cautious than others, and that existing forms of coordination or the internalization of externalities won’t deter them.
Surely that cannot be the argument for a confidence level of 65% that misaligned AI systems will cause a trillion dollars of damages by 2070. What we need is a specific and detailed discussion of how AI misalignment should be expected to lead to damages of this magnitude by 2070.
Is that argument contained in the passages that come before? It is hard to tell. No section or subsection of the Carlsmith report is devoted to the discussion of damages from misalignment. It is possible that some of Carlsmith’s earlier arguments could be repurposed to support damage estimates, but it is hard to tell how this would go given that Carlsmith does not tell us anything about these earlier arguments in explaining his confidence of 65% that misaligned AI will cause a trillion dollars of damages.
I must confess at this point of the argument a feeling of frustration: what we have been given is an overlapping collection of fears, principles, and speculative future scenarios which are meant to coalesce somehow in the mind of the reader to support a high level of confidence that power-seeking APS systems will soon cause high levels of harm. How are these fears meant to coalesce to that conclusion? We are not told.
A similar story accompanies Carlsmith’s fifth premise, that power-seeking AI will lead to the permanent disempowerment of humanity. Carlsmith writes:
I’m going to say: 40%. There’s a very big difference between >$1 trillion dollars of damage (~6 Hurricane Katrinas) and the complete disempowerment of humanity; and especially in slower take-off scenarios, I don’t think it at all a foregone conclusion that misaligned power-seeking that causes the former will scale to the latter. But I also think that conditional on reaching a scenario with this level of damage from high-impact practical PS-alignment failures (as well as the other previous premises), things are looking dire. It’s possible that the world gets its act together at that point, but it seems far from certain.
Is there an argument in this passage? I have to be honest: I cannot find one. There is, as Carlsmith notes, a gap between the idea that misaligned AI may cause considerable damage (say, an unreversed ‘flash crash’) and the idea that AI may disempower humanity. There is a further gap between the idea that AI may temporarily disempower humanity and the idea that near-term AI systems could permanently wrest power from humanity and keep it for the rest of human history.
What bridges this gap? As we saw in Part 7, at times Carlsmith seems to treat doom as the default conclusion: “it’s possible that the world gets its act together”, but if we do not get our act together, then we are invited to assume that our fate is sealed. This isn’t an argument for the conclusion that PS-misaligned systems would seek to permanently disempower humanity. It is an assertion of the doctrine that PS-misaligned systems would seek to permanently disempower humanity, combined with a previous exploration of ways that we might try and fail to stop them.
Readers may at this point become frustrated, and protest that evidence of the type I would like to see is not possible to produce in such matters. Or, alternatively, they might say that Carlsmith was just stating his own views, and that those views should not be taken as gospel for others. On both points, I agree.
It is nigh on impossible to produce solid evidence that we are on the cusp of being permanently disempowered by AI systems. And given this impossibility, it may be most charitable to interpret Carlsmith’s concluding arguments as statements of personal belief, rather than arguments intended to offer strong evidence that others should believe the same.
But if this is what we are being given, we should be honest about it. Carlsmith has not given us much, if anything, in the way of evidence which should compel belief that APS systems are on the verge of disempowering humanity, just as he has not given us much in the way of evidence that such systems are soon to be developed. What we are left with more closely resembles a position statement than an argument, and it would be inappropriate to treat it as anything else.
Parts 2-5 of this series looked at one of the more scrutable existential risks: climate risk. Because climate risk is at least somewhat scrutable using our best scientific methods, I used those methods to take a look at climate risk and argue that climate risk is substantially lower than many effective altruists take it to be.
In Part 5, I expressed concern about a regression to the inscrutable, in which effective altruists invest increasing confidence in the least scrutable risks. The challenge is that it is very hard to know what to say about such risks because they are nearly inscrutable.
This suggests that effective altruists will have a difficult time motivating the claim that highly inscrutable phenomena pose high levels of existential risk, since the very inscrutability of these phenomena makes it hard to get a detailed risk argument off the ground. If that is right, then a very good strategy for pushing back against less scrutable risks such as AI risk is to carefully examine the arguments and show that (as we might have expected) they provide little evidence for the risk claims they purport to support.
We saw an illustration of this strategy in Parts 7-8 of this series, where I argued that key elements of Carlsmith’s argument (instrumental convergence; AI timelines; implications of instrumental convergence) are relatively undersupported.
“That’s not fair!” you say. “It is almost always going to be hard to construct a plausible argument that inscrutable phenomena pose high levels of existential risk. Therefore skeptics will nearly always be able to point at gaping holes in arguments for AI risk and other inscrutable risks.”
Exactly. That is why it is so hard for me and many others to believe the arguments made by effective altruists for high levels of AI risk. Arguments for inscrutable risks tend, by construction, to have gaping holes in them.
If we do not, and likely cannot, come to possess significant evidence that artificial intelligence poses a high level of existential risk to humanity, then we should not believe that artificial intelligence poses a high level of existential risk to humanity. Speculative arguments may take us a few strides beyond our meager evidence, but there is only so much that can be done without more evidence.