I don’t believe we need to be obsessively worried by a hypothesised existential risk to humanity [from artificial intelligence]. Why? Because, for the risk to become real, a sequence of things all need to happen, a sequence of big ifs. If we succeed in building human equivalent AI and if that AI acquires a full understanding of how it works, and if it then succeeds in improving itself to produce super-intelligent AI, and if that super-AI, accidentally or maliciously, starts to consume resources, and if we fail to pull the plug, then, yes, we may well have a problem. The risk, while not impossible, is improbable.
Prof. Alan Winfield, “Artificial intelligence will not turn into a Frankenstein’s monster”
1. Recap
This is Part 7 of the series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.
Part 1 introduced the series. Parts 2, 3, 4 and 5 looked at climate risk and drew lessons from this discussion.
Part 6 introduced the Carlsmith report on power-seeking artificial intelligence, and situated the Carlsmith report within a discussion of the recent history and methodology of AI safety research, as well as a more general “regression to the inscrutable” within discussions of existential risk.
Today, I want to get to the heart of the Carlsmith report by focusing on the reason why Carlsmith expects that artificial agents may seek to disempower humanity in the first place. This is the instrumental convergence thesis, which we will meet below, and which may be familiar to many readers from the writings of Nick Bostrom and others in the AI safety community.
But first, let us begin on a positive note.
2. What I like about the Carlsmith report
There are many things I disagree with in the Carlsmith report. But there are also several things I like about the report. It is no accident that, out of many possible discussions of existential risk from artificial intelligence, I chose to discuss Carlsmith’s report first. Let me begin this discussion by listing some things I like and admire about the Carlsmith report.
First, Carlsmith is a good philosopher. He is finishing his Ph.D. at Oxford University and is widely considered to be intelligent, well-read and well-spoken [Edit: A reader notes that this particular line of praise may draw on and reinforce a number of unfortunate prejudices. I should not have written it, and I am sorry.]. This means that Carlsmith’s writing is often of high quality, and I have personally found his work to be better argued and better constructed than many comparable discussions of AI risk.
Second, there are few extraneous details to be found. For the most part, every step in Carlsmith’s argument is brief, necessary, and contributes to the overall argument. This makes Carlsmith’s report a dense and informative read.
Third, Carlsmith has a good command of recent arguments for existential risk from artificial agents. When combined with the recency of the manuscript, this makes Carlsmith’s report a good introduction to recent thinking by effective altruists on AI risk.
That is what I like about the Carlsmith report. Next, I’ll review the main argument of the Carlsmith report and outline the portion of the argument that I focus on in this post.
3. Carlsmith’s argument
Carlsmith argues that humanity faces at least a 5% probability of existential catastrophe from power-seeking artificial intelligence by 2070 (updated to 10% in March 2022). Here is how Carlsmith outlines the argument, with probabilities reflecting Carlsmith’s weaker, pre-2022 view (“|” represents conditionalization).
By 2070:
1. (Possibility) 65% It will become possible and financially feasible to build AI systems with the following properties:
- Advanced capability: They outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering and persuasion/manipulation).
- Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
- Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
(Call these “APS” – Advanced, Planning, Strategically aware – systems).
2. (Incentives) 80% There will be strong incentives to build and deploy APS systems | (1).
3. (Alignment difficulty) 40% It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway | (1)-(2).
4. (Damage) 65% Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars worth of damage) because of problems with their objectives. | (1)-(3).
5. (Disempowerment) 40% Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).
6. (Catastrophe) 95% This disempowerment will constitute an existential catastrophe | (1)-(5).
Aggregate probability: 65% * 80% * 40% * 65% * 40% * 95% ≈ 5%.
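For readers who want the multiplication spelled out, here is a quick sketch of the arithmetic behind the headline figure. The six values are Carlsmith’s pre-2022 estimates from the premises above; the code simply multiplies them.

```python
# Carlsmith's six premise probabilities (pre-2022), multiplied in sequence.
premises = {
    "Possibility": 0.65,
    "Incentives": 0.80,
    "Alignment difficulty": 0.40,
    "Damage": 0.65,
    "Disempowerment": 0.40,
    "Catastrophe": 0.95,
}

p = 1.0
for prob in premises.values():
    p *= prob

print(f"Aggregate probability: {p:.3f}")  # ~0.051, i.e. roughly 5%
```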
Today, I want to focus on the third premise, Alignment Difficulty. Why think that it will be difficult to build APS systems that would not seek to gain power in unintended ways?
This discussion will necessarily blur into the fourth and fifth premises, Damage and Disempowerment. These premises say that misaligned power-seeking will be extremely costly, causing at least a trillion dollars in damage and permanently disempowering humanity. And that is important.
Nobody should deny that machines will sometimes fail to do what we want them to, or that there will be consequences to these failures. Self-driving cars will crash, autonomous weapons will misfire, and algorithmic traders will go haywire. What is surprising about Carlsmith’s claim is that these failures are meant to be so severe that they will cause the equivalent of a trillion dollars of damage and permanently disempower humanity. That is a strong claim, and it had better be supported by a strong argument. By contrast, we will see that Carlsmith provides little in the way of argument for Alignment Difficulty when the difficulty is understood in this way.
4. Alignment
To get a grip on Carlsmith’s argument for Alignment Difficulty, we need to introduce some definitions.
Carlsmith defines misaligned behavior as “unintended behavior that arises specifically in virtue of problems with an AI system’s objectives”. Systems are fully aligned if they don’t engage in misaligned behavior in response to any physically possible inputs.
Full alignment isn’t always a useful target. My car could turn into a bomb in response to some physically possible inputs: for example, you could fill it with gasoline or put a bomb in the trunk. But we wouldn’t get much mileage out of calling my car a potential car bomb in light of the fact that it could turn into a car bomb if a bomb were loaded into it. What’s needed is a definition that focuses more on the inputs that a system is likely to receive in practice, rather than all inputs it could in theory receive.
Carlsmith takes a system to be practically aligned if it doesn’t engage in misaligned behavior on any of the inputs it will in fact receive. That is the type of alignment we should be interested in: we want to know whether our computer systems will in fact go haywire, or whether our cars will in fact become bombs.
Carlsmith is interested in a particular kind of misaligned behavior: power-seeking. Hence Carlsmith reformulates the notions of full and practical alignment in terms of power-seeking:
- A system is fully PS-aligned if it doesn’t engage in misaligned power-seeking in response to any physically possible inputs.
- A system is practically PS-aligned if it doesn’t engage in misaligned power-seeking in response to any inputs it will in fact receive.
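To make the quantifier difference between full and practical PS-alignment vivid, here is a schematic sketch (my own rendering, not Carlsmith’s formalism). A “system” is modeled simply as a function from inputs to a verdict about whether it engages in misaligned power-seeking on that input; the toy system below misbehaves only on an exotic input it never in fact receives, much like the car from the earlier example.

```python
# Schematic sketch (mine, not Carlsmith's) of full vs. practical PS-alignment.
# A "system" is a function: input -> does it engage in misaligned power-seeking?

def fully_ps_aligned(system, physically_possible_inputs):
    # Full PS-alignment quantifies over every physically possible input.
    return all(not system(x) for x in physically_possible_inputs)

def practically_ps_aligned(system, inputs_actually_received):
    # Practical PS-alignment quantifies only over inputs actually received.
    return all(not system(x) for x in inputs_actually_received)

# Toy system: it misbehaves only on one exotic input that it never in fact receives.
exotic_input = "a bomb is loaded into the trunk"
toy_system = lambda x: x == exotic_input

print(fully_ps_aligned(toy_system, ["ordinary use", exotic_input]))  # False
print(practically_ps_aligned(toy_system, ["ordinary use"]))          # True
```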
Carlsmith begins, I think, with the assumption that APS systems are unlikely to be fully aligned. I say that I think Carlsmith begins with this assumption, because it is hard to see how he will make use of the Instrumental Convergence Thesis we will shortly introduce without assuming that APS systems are unlikely to be fully aligned. However, I can’t locate anything in the text resembling an argument against the likelihood of full alignment. And that matters.
There are some very broad readings of full alignment on which full alignment is nearly unachievable. For example, I could count behavior as “unintended” if it did not precisely satisfy the designer’s intentions, so that a robot vacuum which sometimes missed a spot due to problems with its objectives would count as misaligned. But that would be uninteresting, because this type of misaligned behavior is not very threatening.
Similarly, I could count behavior as misaligned if humans could use the system to achieve ends not intended by its designer. For example, I might say that an APS military planning system engaged in misaligned behavior if it were stolen by the enemy and turned on its designers. But that would not immediately ground more suspicion of APS systems than of any other device which can be misused by humans.
By contrast, if Carlsmith wants to read the notion of full alignment in a narrower way that will be load-bearing in the argument to come, then it would really be better for Carlsmith to tell us a bit about how he is understanding full alignment and why he thinks full alignment is unlikely.
5. Power seeking
Carlsmith links the possibility of misalignment to power-seeking misalignment using a slightly nonstandard formulation of what is known as the Instrumental Convergence Thesis:
Instrumental Convergence: Carlsmith’s Version (ICC): If an APS AI system is less-than-fully aligned, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to be less-than-fully PS-aligned too.
Carlsmith’s ICC principle holds that if APS systems are not fully aligned, they are unlikely to be fully PS-aligned. In other words, APS systems will be disposed to engage in misaligned power-seeking.
Carlsmith’s ICC formulation of instrumental convergence completes a movement, already present in Bostrom’s original formulation, towards blurring two importantly distinct claims. Bostrom’s original formulation of Instrumental Convergence combined two claims:
Instrumental Convergence: Bostrom’s Version (ICB): Several instrumental values can be identified which are convergent in the sense that (1) their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, (2) implying that these instrumental values are likely to be pursued by many intelligent agents.
Nick Bostrom, “The superintelligent will”
Let’s separate these claims and give them names:
Likelihood of Goal Satisfaction (LGS): Instrumentally convergent values would increase the chances of the agent realizing its goals.
Goal Pursuit (GP): Agents (of a suitable type) would be likely to pursue instrumentally convergent goals.
It is important to separate Likelihood of Goal Satisfaction (LGS) from Goal Pursuit (GP). For suitably sophisticated agents, (LGS) is a nearly trivial claim. Most agents, including humans, superhumans, toddlers, and toads, would be in a better position to achieve their goals if they had more power and resources under their control. For this reason, arguments for (ICB) had better concentrate on (GP).
Carlsmith’s statement of instrumental convergence sweeps (LGS) under the rug, focusing almost entirely on a claim resembling (GP). Substituting Carlsmith’s definitions of full alignment and full PS-alignment from above gives:
Instrumental Convergence: Carlsmith v2 (ICC-2): If an APS AI system would engage in misaligned behavior in response to some physically possible inputs, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to engage in misaligned power-seeking in response to some physically possible inputs.
Now, Carlsmith is not interested in just any kind of power-seeking. Surely agents will engage in some types of power-seeking: for example, social media algorithms may seek to keep us scrolling. For (ICC-2) to be strong enough to serve as an input to (Disempowerment), Carlsmith needs to argue that APS systems will be disposed towards seeking enough power that they would, if not checked, entirely disempower humanity. That is:
Instrumental Convergence: Carlsmith v3 (ICC-3): If an APS AI system would engage in misaligned behavior in response to some physically possible inputs, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to engage in misaligned power-seeking in response to some physically possible inputs, such that some of this power-seeking, if not prevented, would permanently disempower all of humanity by 2070.
(ICC-3) is a very strong claim. (ICC-3) is, for the most part, a strengthened form of (GP): it says that an AI system is disposed to pursue a certain goal (lots of power). (ICC-3) is not reducible to, and does not in any way follow from, (LGS). From the fact that wresting power from humanity would help a human, toddler, superhuman or toad to achieve some of their goals, it does not yet follow that the agent is disposed to actually try to disempower all of humanity.
It would therefore be disappointing, to say the least, if Carlsmith were to primarily argue for (LGS) rather than for (ICC-3). However, that appears to be what Carlsmith does.
6. Carlsmith’s argument for instrumental convergence (ICC)
In my original review of Carlsmith’s report, I complained that I wasn’t able to locate much of a positive argument for instrumental convergence:
I wanted a much more extended positive argument for the instrumental convergence claim. I realize this is hard to do. But right now, I kept waiting for a long argument for that claim with the feeling that it was coming along, and instead what came was a reply to some objections you might make to the instrumental convergence claim, some examples of misaligned behavior that might arise, and a really nice discussion in Section 4.3 of the difficulty of ensuring PS alignment. Is the discussion in Section 4.3 the argument for the instrumental convergence claim?
Carlsmith’s reply on this point is terse and telling:
No, the argument for the instrumental convergence claim is supposed to come centrally in section 4.2.
Let’s be clear about what happened here. Carlsmith’s argument for instrumental convergence is so minimal that a paid reviewer with a PhD, with every interest in locating and assessing the argument, couldn’t find it. Given the centrality of instrumental convergence to Carlsmith’s argument, this is not what we would like to see.
What is the argument in Section 4.2 for instrumental convergence? Here is, on my best reading, everything that Section 4.2 does:
1. Define PS-misalignment and introduce instrumental convergence (ICC).
2. Give a “basic reason” why (ICC) might hold: power is useful.
3. Give examples of the types of power that might be useful for APS systems to seek.
4. Give an example of resource acquisition by present-day AI systems trained to play hide-and-seek, and draw lessons from this example.
5. Clarify the notions of power-seeking and (ICC).
6. Answer two objections to (ICC).
Okay, so what’s the argument for (ICC)? (1) isn’t an argument, it’s a definition. (5) isn’t an argument, but rather a clarification. And (6) isn’t an argument: it’s a response to objections.
The “basic reason” (2) why (ICC) might hold is that power is useful. This is an argument for (LGS): power would increase the likelihood of an agent achieving its goals. It’s not yet an argument for (GP): that agents would be disposed to pursue power in misaligned ways, particularly to the degree of permanently disempowering humanity.
(3) doesn’t add anything to (2) except an illustration, and a familiar one at that, drawing on a classic discussion by Omohundro (2008).
So what is the argument for (GP)? Perhaps it’s (4). Carlsmith writes:
We see examples of rudimentary AI systems “discovering” the usefulness of e.g. resource acquisition already. For example: when OpenAI trained two teams of AIs to play hide and seek in a simulated environment that included blocks and ramps that the AI could move around and fix in place, the AIs learned strategies that depended crucially on acquiring control of the blocks and ramps in question – despite the fact that they were not given any direct incentives to interact with those objects (the hiders were simply rewarded for avoiding being seen by the seekers; the seekers, for seeing the hiders).
It’s tempting to take this as more confirmation of (LGS): resource acquisition can help agents to achieve their goals, such as not being seen in a game of hide-and-seek. Is there an argument for (GP) here? It is true that a simple agent trained only to avoid being seen sought a kind of power: after all, there was literally nothing in its objectives that told it to avoid power-seeking. But that’s not much more than an operationalization of (LGS).
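To see why this observation stops at (LGS), consider a deliberately simple toy planner (my own illustration, not Carlsmith’s or OpenAI’s, with made-up numbers): its objective mentions only staying hidden, yet it selects the plan that seizes a resource, because the resource is instrumentally useful for hiding. Nothing in this, of course, speaks to seizing enough power to disempower humanity.

```python
# Toy illustration (mine): a planner rewarded only for staying hidden still
# picks the plan that grabs a block, because the block helps it stay hidden.
plans = {
    "hide_in_the_open":    {"p_seen": 0.9, "grabs_block": False},
    "grab_block_and_hide": {"p_seen": 0.1, "grabs_block": True},
}

def expected_reward(plan):
    # The objective says nothing about blocks: +1 for staying hidden, 0 otherwise.
    return 1.0 * (1.0 - plan["p_seen"])

best_plan = max(plans, key=lambda name: expected_reward(plans[name]))
print(best_plan)  # "grab_block_and_hide": resource acquisition falls out of (LGS)
                  # plus planning; no further claim about (GP) at scale follows.
```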
Carlsmith seems to hold that the hide-and-seek example can ground a deeper lesson about (ICC):
Of course, this is a very simple, simulated environment, and the level of agentic planning it makes sense to ascribe to these AIs isn’t clear. But the basic dynamic that gives rise to this type of behavior seems likely to apply in much more complex, real-world contexts, and to much more sophisticated systems as well. If, in fact, the structure of a real-world environment is such that control over things like money, material goods, compute power, infrastructure, energy, skilled labor, social influence, etc. would be useful to an AI system’s pursuit of its objectives, then we should expect the planning performed by a sufficiently sophisticated, strategically aware AI to reflect this fact. And empirically, such resources are in fact useful for a wide variety of objectives.
We are told that the same “basic dynamic” involved in the hide-and-seek example should generalize to other agents and environments. But what is the argument? We are told to assume that something like (LGS) holds: “if, in fact, the structure of a real-world environment is such that control over things like money … would be useful”. Fair enough.
Then, we are told, “We should expect the planning performed by a sufficiently sophisticated, strategically aware AI to reflect this fact”. What does the phrase “reflect this fact” mean? Does Carlsmith mean that the APS system should be aware of, and take into account, the instrumental value of resource acquisition? Fair enough. But now we are still at (LGS) + an agent’s awareness of (LGS).
What we need is an argument that artificial agents for whom power would be useful, and who are aware of this fact, are likely to go on to seek enough power to disempower all of humanity. And so far we have literally not seen an argument for this claim.
What is going on here? I am loath to psychologize, but I think that one part of Carlsmith’s statement of instrumental convergence (ICC) may be revealing:
Instrumental Convergence: Carlsmith’s Version (ICC): If an APS AI system is less-than-fully aligned, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to be less-than-fully PS-aligned too.
Carlsmith holds that, in general and by default, we should expect APS systems to be PS-misaligned, in virtue of the fact that power is useful to them. Why might it be that Carlsmith (and, as we will see later in this series, other effective altruists as well) provides so little argument linking claims like (LGS) to the eventual likelihood of artificial agents seizing power? In practice, it is because the claim of misaligned power-seeking acquires a quasi-default status in discussions by effective altruists. It is not something to be argued for, but something to be treated as a general default, with the burden placed on opponents to refute it.
I cannot prove that this is Carlsmith’s intention. But it would go a long way towards explaining why Carlsmith passes so quickly from a minimal argument for (ICC) towards considering a variety of objections to (ICC), and a variety of ways in which humans could mitigate the power-seeking behavior of APS systems. Carlsmith hasn’t done much to convince us that APS systems are likely to seek power, and to a large extent Carlsmith isn’t trying to do this. Carlsmith is taking for granted the idea that APS systems are likely to seek power, and seeing what follows from this.
I said earlier that it would be disappointing if Carlsmith provided only a limited argument for (ICC), and if most of that argument were offered in support of the weak claim (LGS), which may be accepted without any direct fear of losing control of humanity’s future to power-seeking artificial agents. But we have seen in this section that this disappointing reality came to pass: Carlsmith offers so little argument for (ICC) that I was initially unable to locate the argument, and most of the argument he does offer supports only the weaker claim (LGS), perhaps assuming a default status for the further claim that the instrumental usefulness of power will lead to misaligned power-seeking. I expected more from Carlsmith, and I was disappointed not to get it.
7. Looking ahead
So far in this series, we have introduced the notion of existential risk from artificial agents (AI risk) and some methodological challenges in studying AI risk (Part 6). Today, we extended this discussion by taking a look at the Carlsmith report, which argues that there is at least a 5% chance of existential catastrophe by 2070, in which humanity is permanently disempowered by artificial systems.
In particular, we looked at the natural objection to Carlsmith’s argument: there is little if any discernible argument in favor of the main claim (ICC) driving fears of misaligned power-seeking. It seems very much as though the hypothesis of misaligned power-seeking is given a default status in this discussion, treated not as a scientific hypothesis to be proven by evidence and experiment but rather as a default view which opponents are invited to disconfirm.
That is disappointing, and it is the main reason that I struggle to know what to say about the Carlsmith report beyond that I wish Carlsmith had provided a more extensive argument for his view. There are, I think, a few other points in Carlsmith’s argument which may be productive to discuss. I will do this in future weeks. But for the most part, the heart of my response to Carlsmith is contained in this post: I would appreciate an extended argument for the animating concern of the report.