Exaggerating the risks (Part 7: Carlsmith on instrumental convergence)

I don’t believe we need to be obsessively worried by a hypothesised existential risk to humanity [from artificial intelligence]. Why? Because, for the risk to become real, a sequence of things all need to happen, a sequence of big ifs. If we succeed in building human equivalent AI and if that AI acquires a full understanding of how it works, and if it then succeeds in improving itself to produce super-intelligent AI, and if that super-AI, accidentally or maliciously, starts to consume resources, and if we fail to pull the plug, then, yes, we may well have a problem. The risk, while not impossible, is improbable.

Prof. Alan Winfield, “Artificial intelligence will not turn into a Frankenstein’s monster”

1. Recap

This is Part 7 of the series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.

Part 1 introduced the series. Parts 2, 3, 4 and 5 looked at climate risk and drew lessons from this discussion.

Part 6 introduced the Carlsmith report on power-seeking artificial intelligence, and situated the Carlsmith report within a discussion of the recent history and methodology of AI safety research, as well as a more general ‘regression to the inscrutable’ within discussions of existential risk.

Today, I want to get to the heart of the Carlsmith report by focusing on the reason why Carlsmith expects that artificial agents may seek to disempower humanity in the first place. This is the instrumental convergence thesis, which we will meet below, and which may be familiar to many readers from the writings of Nick Bostrom and others in the AI safety community.

But first, let us begin on a positive note.

2. What I like about the Carlsmith report

There are many things I disagree with in the Carlsmith report. But there are also several things I like about the report. It is no accident that, out of many possible discussions of existential risk from artificial intelligence, I chose to discuss Carlsmith’s report first. Let me begin this discussion by listing some things I like and admire about the Carlsmith report.

First, Carlsmith is a good philosopher. He is finishing his Ph.D. at Oxford University and is widely considered to be intelligent, well-read and well-spoken [Edit: A reader notes that this particular line of praise may draw on and reinforce a number of unfortunate prejudices. I should not have written it, and I am sorry.]. This means that Carlsmith’s writing is often of a high quality compared to similar arguments, and I have personally found his work to be better argued and better constructed than many similar arguments.

Second, there are no extraneous details to be found. For the most part, every step in Carlsmith’s argument is brief, necessary, and contributes to the overall argument. This makes Carlsmith’s report an informative read, packed full of useful information and argumentation.

Third, Carlsmith has a good command of recent arguments for existential risk from artificial agents. When combined with the recency of the manuscript, this makes Carlsmith’s report a good introduction to recent thinking by effective altruists on AI risk.

That is what I like about the Carlsmith report. Next, I’ll review the main argument of the Carlsmith report and outline the portion of the argument that I focus on in this post.

3. Carlsmith’s argument

Carlsmith argues that humanity faces at least a 5% probability of existential catastrophe from power-seeking artificial intelligence by 2070 (updated to 10% in March 2022). Here is how Carlsmith outlines the argument, with probabilities reflecting Carlsmith’s weaker, pre-2022 view (“|” represents conditionalization).

By 2070:

1. (Possibility) 65% It will become possible and financially feasible to build AI systems with the following properties:

  • Advanced capability: They outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering and persuasion/manipulation).
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
  • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.

(Call these “APS” – Advanced, Planning, Strategically aware – systems).

2. (Incentives) 80% There will be strong incentives to build and deploy APS systems | (1).

3. (Alignment difficulty) 40% It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway | (1)-(2).

4. (Damage) 65% Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars worth of damage) because of problems with their objectives. | (1)-(3).

5. (Disempowerment) 40% Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).

6. (Catastrophe) 95% This disempowerment will constitute an existential catastrophe | (1)-(5).

Aggregate probability: 65% * 80% * 40% * 65% * 40% * 95% ≈ 5%.
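The aggregate figure is just the product of the six conditional credences. As a quick check, here is a minimal Python sketch of that arithmetic. The dictionary and the helper name `aggregate` are my own, and the final variation (halving premise 3) is a purely hypothetical illustration of how the estimate scales, not a figure from the report.

```python
from math import prod

# Carlsmith's pre-2022 credences for premises (1)-(6), as listed above.
# Each is conditional on the premises before it.
premises = {
    "Possibility": 0.65,
    "Incentives": 0.80,
    "Alignment difficulty": 0.40,
    "Damage": 0.65,
    "Disempowerment": 0.40,
    "Catastrophe": 0.95,
}


def aggregate(credences):
    """Multiply a chain of conditional probabilities together."""
    return prod(credences.values())


total = aggregate(premises)
print(f"Aggregate probability: {total:.1%}")  # prints "Aggregate probability: 5.1%"

# Hypothetical variation (an assumption for illustration only):
# halving the credence in premise 3 halves the aggregate estimate.
lower = aggregate({**premises, "Alignment difficulty": 0.20})
print(f"With premise 3 at 20%: {lower:.1%}")
```

Because the estimate is a simple product, any change to the credence in a single premise, such as Alignment Difficulty, passes through proportionally to the final number.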

Today, I want to focus on the third premise, Alignment Difficulty. Why think that it will be difficult to build APS systems that would not seek to gain power in unintended ways?

This discussion will necessarily blur into the fourth and fifth premises, Damage and Disempowerment. These premises say that misaligned power-seeking will be extremely costly, causing at least a trillion dollars in damage and permanently disempowering humanity. And that is important.

Nobody should deny that machines will sometimes fail to do what we want them to, or that there will be consequences to these failures. Self-driving cars will crash, autonomous weapons will misfire, and algorithmic traders will go haywire. What is surprising about Carlsmith’s claim is that these failures are meant to be so severe that they will cause the equivalent of a trillion dollars of damage and permanently disempower humanity. That is a strong claim, and it had better be supported by a strong argument. By contrast, we will see that Carlsmith provides little in the way of argument for Alignment Difficulty when the difficulty is understood in this way.

4. Alignment

To get a grip on Carlsmith’s argument for Alignment Difficulty, we need to introduce some definitions.

Carlsmith defines misaligned behavior as “unintended behavior that arises specifically in virtue of problems with an AI system’s objectives”. Systems are fully aligned if they don’t engage in misaligned behavior in response to any physically possible inputs.

Full alignment isn’t always a useful target. My car could turn into a bomb in response to some physically possible inputs: for example, you could fill it with gasoline or put a bomb in the trunk. But we wouldn’t get much mileage out of calling my car a potential car bomb in light of the fact that it could turn into a car bomb if a bomb were loaded into it. What’s needed is a definition that focuses more on the inputs that a system is likely to receive in practice, rather than all inputs it could in theory receive.

Carlsmith takes a system to be practically aligned if it doesn’t engage in misaligned behavior on any of the inputs it will in fact receive. That is the type of alignment we should be interested in: we want to know whether our computer systems will in fact go haywire, or whether our cars will in fact become bombs.

Carlsmith is interested in a particular kind of misaligned behavior: power-seeking. Hence Carlsmith reformulates the notions of full and practical alignment in terms of power-seeking:

  • A system is fully PS-aligned if it doesn’t engage in misaligned power-seeking in response to any physically possible inputs.
  • A system is practically PS-aligned if it doesn’t engage in misaligned power-seeking in response to any inputs it will in fact receive.

Carlsmith begins, I think, with the assumption that APS systems are unlikely to be fully aligned. I say that I think Carlsmith begins with this assumption, because it is hard to see how he will make use of the Instrumental Convergence Thesis we will shortly introduce without assuming that APS systems are unlikely to be fully aligned. However, I can’t locate anything in the text resembling an argument against the likelihood of full alignment. And that matters.

There are some very broad readings of full alignment on which full alignment is nearly unachievable. For example, I could count behavior as ‘unintended’ if it did not precisely satisfy the designer’s intentions, so that a robot vacuum which sometimes missed a spot due to problems with its objectives would count as misaligned. But that would be uninteresting, because this type of misaligned behavior is not very threatening.

Similarly, I could count behavior as misaligned if humans could use the system to achieve ends not intended by its designer. For example, I might say that an APS military planning system engaged in misaligned behavior if it were stolen by the enemy and turned on its designers. But that would not immediately ground more suspicion of APS systems than of any other device which can be misused by humans.

By contrast, if Carlsmith wants to read the notion of full alignment in a narrower way that will be load-bearing in the argument to come, then it would really be better for Carlsmith to tell us a bit about how he is understanding full alignment and why he thinks full alignment is unlikely.

5. Power seeking

Carlsmith links the possibility of misalignment to power-seeking misalignment using a slightly nonstandard formulation of what is known as the Instrumental Convergence Thesis:

Instrumental Convergence: Carlsmith’s Version (ICC): If an APS AI system is less-than-fully aligned, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to be less-than-fully PS-aligned too.

Carlsmith’s ICC principle holds that if APS systems are not fully aligned, they are unlikely to be fully PS-aligned. In other words, APS systems will be disposed to engage in misaligned power-seeking.

Carlsmith’s ICC formulation of instrumental convergence completes a movement in Bostrom’s original formulation towards the blurring of two importantly distinct claims. Bostrom’s original formulation of Instrumental Convergence combined two claims:

Instrumental Convergence: Bostrom’s Version (ICB): Several instrumental values can be identified which are convergent in the sense that (1) their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, (2) implying that these instrumental values are likely to be pursued by many intelligent agents.

Nick Bostrom, “The superintelligent will”

Let’s separate these claims and give them names:

Likelihood of Goal Satisfaction (LGS): Instrumentally convergent values would increase the chances of the agent realizing its goals.

Goal Pursuit (GP): Agents (of a suitable type) would be likely to pursue instrumentally convergent goals.

It is important to separate Likelihood of Goal Satisfaction (LGS) from Goal Pursuit (GP). For suitably sophisticated agents, (LGS) is a nearly trivial claim. Most agents, including humans, superhumans, toddlers, and toads, would be in a better position to achieve their goals if they had more power and resources under their control. For this reason, arguments for (ICB) had better concentrate on (GP).

Carlsmith’s statement of instrumental convergence sweeps (LGS) under the rug, focusing almost entirely on a claim resembling (GP). Substituting Carlsmith’s definitions of full alignment from above gives:

Instrumental Convergence: Carlsmith v2 (ICC-2): If an APS AI system would engage in misaligned behavior in response to some physically possible inputs, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to engage in misaligned power-seeking in response to some physically possible inputs.

Now, Carlsmith is not interested in just any kind of power-seeking. Surely agents will engage in some types of power-seeking: for example, social media algorithms may seek to keep us scrolling. For (ICC-2) to be strong enough to serve as an input to (Disempowerment), Carlsmith needs to argue that APS systems will be disposed towards seeking enough power that they would, if not checked, entirely disempower humanity. That is:

Instrumental Convergence: Carlsmith v3 (ICC-3): If an APS AI system would engage in misaligned behavior in response to some physically possible inputs, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to engage in misaligned power-seeking in response to some physically possible inputs, such that some of this power-seeking, if not prevented, would permanently disempower all of humanity by 2070.

(ICC-3) is a very strong claim. (ICC-3) is, for the most part, a strengthened form of GP: it says that an AI system is disposed to pursue a certain goal (lots of power). (ICC-3) is not reducible to, and does not in any way follow from (LGS). From the fact that wresting power from humanity would help a human, toddler, superhuman or toad to achieve some of their goals, it does not yet follow that the agent is disposed to actually try to disempower all of humanity.
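To make the gap between (LGS) and (GP) concrete, here is a minimal Python sketch of a toy decision problem. Everything in it is an illustrative assumption rather than anything drawn from Carlsmith’s report: the `Action` class, the `aversion` term, the scoring rule in `choose`, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    p_goal: float    # chance the agent's goal is realized (what LGS is about)
    aversion: float  # how strongly the agent's own values penalize the action


def choose(actions, goal_value):
    """Return the action with the highest overall score (what GP is about)."""
    return max(actions, key=lambda a: a.p_goal * goal_value - a.aversion)


modest = Action("work within limits", p_goal=0.6, aversion=0.0)

# LGS holds in both cases: seizing power raises p_goal from 0.6 to 0.9.
averse_options = [modest, Action("seize power", p_goal=0.9, aversion=5.0)]
indifferent_options = [modest, Action("seize power", p_goal=0.9, aversion=0.0)]

print(choose(averse_options, goal_value=10.0).name)       # work within limits
print(choose(indifferent_options, goal_value=10.0).name)  # seize power
```

In both cases power would make the goal more likely to be realized, so (LGS) is satisfied. Whether the agent actually pursues power depends on the rest of its evaluation, which is exactly the part that an argument for (LGS) leaves open.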

It would therefore be disappointing, to say the least, if Carlsmith were to primarily argue for (LGS) rather than for (ICC-3). However, that appears to be what Carlsmith does.

6. Carlsmith’s argument for instrumental convergence (ICC)

In my original review of Carlsmith’s report, I complained that I wasn’t able to locate much of a positive argument for instrumental convergence:

I wanted a much more extended positive argument for the instrumental convergence claim. I realize this is hard to do. But right now, I kept waiting for a long argument for that claim with the feeling that it was coming along, and instead what came was a reply to some objections you might make to the instrumental convergence claim, some examples of misaligned behavior that might arise, and a really nice discussion in Section 4.3 of the difficulty of ensuring PS alignment. Is the discussion in Section 4.3 the argument for the instrumental convergence claim?

Carlsmith’s reply on this point is terse and telling:

No, the argument for the instrumental convergence claim is supposed to come centrally in section 4.2.

Let’s be clear about what happened here. Carlsmith’s argument for instrumental convergence is so minimal that a paid reviewer with a PhD, with every interest in locating and assessing the argument, couldn’t find it. Given the centrality of instrumental convergence to Carlsmith’s argument, this is not what we would like to see.

What is the argument in Section 4.2 for instrumental convergence? Here is, on my best reading, everything that Section 4.2 does:

  1. Define PS-misalignment and introduce instrumental convergence (ICC).
  2. Give a ‘basic reason’ why (ICC) might hold: power is useful.
  3. Give examples of the types of power that might be useful for APS systems to seek.
  4. Give an example of resource acquisition by present-day AI systems trained to play hide-and-seek, and draw lessons from this example.
  5. Clarify the notions of power-seeking and (ICC).
  6. Answer two objections to (ICC).

Okay, so what’s the argument for (ICC)? (1) isn’t an argument, it’s a definition. (5) isn’t an argument, but rather a clarification. And (6) isn’t an argument: it’s a response to objections.

The ‘basic reason’ (2) why (ICC) might hold is that power is useful. This is an argument for (LGS): power would increase the likelihood of an agent achieving its goals. It’s not yet an argument for (GP): that agents would be disposed to pursue power in misaligned ways, particularly to the degree of permanently disempowering humanity.

(3) doesn’t add anything to (2) except an illustration, and a familiar one at that, drawing on a classic discussion by Omohundro (2008).

So what is the argument for GP? Perhaps it’s (4). Carlsmith writes:

We see examples of rudimental AI systems “discovering” the usefulness of e.g. resource acquisition already. For example: when OpenAI trained two teams of AIs to play hide and seek in a simulated environment that included blocks and ramps that the AI could move around and fix in place, the AIs learned strategies that depended crucially on acquiring control of the blocks and ramps in question – despite the fact that they were not given any direct incentives to interact with those objects (the hiders were simply rewarded for avoiding being seen by the seekers; the seekers, for seeing the hiders).

It’s tempting to take this as more confirmation of LGS: resource acquisition can help agents to achieve their goals, such as not being seen in a game of hide-and-seek. Is there an argument for (GP) here? It is true that a simple agent trained only to avoid being seen sought a kind of power: after all, there was literally nothing in its objectives that told it to avoid power-seeking. But that’s not much more than an operationalization of (LGS).

Carlsmith seems to hold that this example can ground a deeper lesson about (ICC):

Of course, this is a very simple, simulated environment, and the level of agentic planning it makes sense to ascribe to these AIs isn’t clear. But the basic dynamic that gives rise to this type of behavior seems likely to apply in much more complex, real-world contexts, and to much more sophisticated systems as well. If, in fact, the structure of a real-world environment is such that control over things like money, material goods, compute power, infrastructure, energy, skilled labor, social influence, etc. would be useful to an AI system’s pursuit of its objectives, then we should expect the planning performed by a sufficiently sophisticated, strategically aware AI to reflect this fact. And empirically, such resources are in fact useful for a wide variety of objectives.

We are told that the same “basic dynamic” involved in the hide-and-seek example should generalize to other agents and environments. But what is the argument? We are told to assume that something like (LGS) holds: “if, in fact, the structure of a real-world environment is such that control over things like money … would be useful”. Fair enough.

Then, we are told, “We should expect the planning performed by a sufficiently sophisticated, strategically aware AI to reflect this fact”. What does the phrase “reflect this fact” mean? Does Carlsmith mean: the APS system should be aware of, and take into account the instrumental value of resource acquisition? Fair enough. But now we are still at (LGS) + an agent’s awareness of (LGS).

What we need is an argument that artificial agents for whom power would be useful, and who are aware of this fact, are likely to go on to seek enough power to disempower all of humanity. And so far we have literally not seen an argument for this claim.

What is going on here? I am loath to psychologize, but I think that one part of Carlsmith’s statement of instrumental convergence (ICC) may be revealing:

Instrumental Convergence: Carlsmith’s Version (ICC): If an APS AI system is less-than-fully aligned, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to be less-than-fully PS-aligned too.

Carlsmith holds that in general and by default, we should expect APS systems to be PS-misaligned, in virtue of the fact that power is useful to them. Why might it be that Carlsmith (and, as we will see later in this series, other effective altruists as well) provides little argument linking claims like (LGS) to the eventual likelihood of artificial agents seizing power? In practice, it is because the claim of misaligned power-seeking acquires a quasi-default status in discussions by effective altruists. It is not something to be argued for, but something to be treated as a general default, and the burden is now placed on opponents to refute it.

I cannot prove that this is Carlsmith’s intention. But it would go a long way towards explaining why Carlsmith passes so quickly from a minimal argument for (ICC) towards considering a variety of objections to (ICC), and a variety of ways in which humans could mitigate the power-seeking behavior of APS systems. Carlsmith hasn’t done much to convince us that APS systems are likely to seek power, and to a large extent Carlsmith isn’t trying to do this. Carlsmith is taking for granted the idea that APS systems are likely to seek power, and seeing what follows from this.

I said earlier that it would be disappointing if Carlsmith had provided a limited argument for (ICC), and if most of that argument were offered only in support of the weak claim (LGS) which may be accepted without any direct fear of losing control of humanity’s future to power-seeking artificial agents. But we have seen in this section that this disappointing reality came to pass: Carlsmith offers so little argument for (ICC) that I was initially unable to locate the argument, and most of the argument is offered only in support of the weaker claim (LGS), perhaps assuming a default status for later fears that the instrumental usefulness of power will lead to misaligned power-seeking. I expected more from Carlsmith, and I was disappointed not to get it.

7. Looking ahead

So far in this series, we have introduced the notion of existential risk from artificial agents (AI risk) and some methodological challenges in studying AI risk (Part 6). Today, we extended this discussion by taking a look at the Carlsmith report, which argues for the claim that there is at least a 5% chance of existential catastrophe by 2070, in which humanity is permanently disempowered by artificial systems.

Today, we looked at the natural objection to Carlsmith’s argument: there is little if any discernible argument in favor of the main claim (ICC) driving fears of misaligned power-seeking. It seems very much that the hypothesis of misaligned power-seeking is given a default status in this discussion, treated not as a scientific hypothesis to be proven by evidence and experiment but rather as a default view which opponents are invited to disconfirm.

That is disappointing, and it is the main reason that I struggle to know what to say about the Carlsmith report beyond that I wish Carlsmith had provided a more extensive argument for his view. There are, I think, a few other points in Carlsmith’s argument which may be productive to discuss. I will do this in future weeks. But for the most part, the heart of my response to Carlsmith is contained in this post: I would appreciate an extended argument for the animating concern of the report.


21 responses to “Exaggerating the risks (Part 7: Carlsmith on instrumental convergence)”

  1. Violet Hour

    Thanks for this! I’ll probably leave a few comments as I slowly digest more of the piece. However, I thought it was worth noting that the first version of ICC seems okay to me, so I thought I’d provide an argument for it. Here’s ICC again:

    ICC: If (A) an APS AI system (‘Alice’) is less-than-fully aligned, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then (B) in general and by default, we should expect it to be less-than-fully PS-aligned.

    So we want to show that (B) follows from (A). Here’s one argument for a related claim:

    – Premise 1: Capable agents (like Alice) pursue their goals. (Assumption)
    – Premise 2: If Alice is less-than-fully aligned, there exist some physical inputs on which Alice pursues some misaligned goal G. (Definition, from Carlsmith)
    – Premise 3: ‘Power’ is an instrumental goal which, if achieved, would be useful for a wide variety of final goals. Call goals of this type: ‘instrumentally convergent sub-goals’. (Assumption)
    – Thus, Conclusion 1: If Alice is in an environment where she can gain more power to achieve G at little relative cost, she will attempt to accrue power in pursuit of G. (P1, P3)

    From C1, we can argue:
    – Premise 4: It is “physically possible” for Alice to be in an environment where she can accrue more power to achieve G at little relative cost. (Assumption)
    – Premise 5: If G is a misaligned goal, Alice’s pursuit of power to achieve G is not what the designers intended. (Assumption)
    – Premise 6: If Alice attempts to gain power in a way that designers did not intend, then Alice is PS-misaligned. (Def, from Carlsmith)
    – Thus, Conclusion 2: If Alice is less-than-fully aligned, Alice is less-than-fully PS-aligned. (C1, P4-6)

    P1 feels trivial, and you seem happy with P3. I doubt you’d object to P4 either, though perhaps you think that we should focus on a more practically relevant concept than ‘physical possibility’. (I’m sympathetic, but that sounds more like claiming that the original formulation of ICC is practically irrelevant, rather than objecting to the claim)

    Perhaps you’d object to P5, by stating that its plausibility depends pretty crucially on the degree of misalignment. For very small levels of misalignment (e.g., something like your robot vacuum example), it might be that pursuing power in pursuit of misaligned goal G and pursuing power in pursuit of the intended (aligned) goal G* look extremely similar. However, if the degree of misalignment involves “strategically-aware agentic planning in pursuit of problematic objectives” (as Carlsmith specifies), then P5 does look quite plausible to me. It would feel (at least) prima facie surprising if power-seeking in “pursuit of problematic objectives” resulted in behaviors intended by the designer. With that caveat in mind, expecting PS-misalignment “in general and by default” seems right (though I don’t love the phrase).

    Admittedly, the stronger versions of ICC you outline (e.g. ICC-3) do seem less plausible. But those variants seem to have more force against Premises 4 and 5 of Carlsmith’s original argument, so I thought it was worth providing a more explicit defense of (what seems like) a core argument supporting Carlsmith’s Premise 3.

    1. David Thorstad

      Thanks Violet Hour!

      In general, this is exactly the sort of thing I’d like to see from effective altruists: efforts to construct detailed arguments for key premises of AI risk arguments. Instrumental convergence is perhaps the most controversial and load-bearing premise of many AI risk arguments, so it is good to start with constructing arguments for instrumental convergence.

      I do think it is important to make sure we consider a version of ICC where the relevant notion of power-seeking matches the kind of power-seeking that will conclude the full argument of the paper. As I have said, it is plausible that AI systems (like humans) will engage in some forms of power-seeking. AI systems already try to keep us scrolling. And if we were to discover that, say, chocolate is carcinogenic, then perhaps a future medical robot would seek the power to keep us from consuming chocolate (because otherwise, I’d never give up chocolate!). But we can’t just try to establish that kind of power-seeking with ICC, because then future premises would be a bait-and-switch: we’d claim to have shown that agents are likely to be less-than-fully PS-aligned, in the sense that they will seek to permanently disempower humanity, whereas all we’ve shown is that they’re liable to keep us scrolling or steal our Twix bars.

      (A small question: can we define the notion of a misaligned goal? I think that perhaps when you try to define this you will see that it isn’t the notion you want unless you assume that this single goal has overriding weight in decisionmaking, but I’m not sure about that. Anyways, I think this is avoidable.)

      I think I would object to Premise 4, and would definitely want to see a detailed argument for Premise 4. I worry that Premise 4 mostly relocates the claim that is driving the argument, but calls it an assumption. However, let me make sure I understand the formulation of Premise 4 (and Conclusion 1 before it), so I can make sure I am pushing in the right place.

      There is a notion introduced in Conclusion 1 that was not present in any of the premises (1)-(3): the notion of the “relative cost” of acquiring power. I think it is worth asking what this notion means. We can try to read cost in a narrow sense, as tracking something like energy expenditure or computational difficulty. But that can’t be what Conclusion 1 means, since nobody thinks that AGIs will act only to minimize energy expenditure or computational difficulty.

      On the other hand, we can try to read cost in a very broad sense, as something like “a measure of how averse the agent is to doing X without good cause” or “a measure of how strongly the agent’s goals penalize X, not counting the benefits of X”. I’m not exactly sure what that means, but I don’t want to be too picky. Let’s run with it.

      Now Premise 4 says something like: “It is physically possible for an APS system to be in an environment where she can accrue total dominion over humanity at little relative cost, which is to say that she would not have any strong aversion to becoming completely dominant over humanity, even setting aside the potential benefits.” This is, I think, the bulk of the controversial claim made by instrumental convergence arguments, so it would probably help to provide an argument for Premise 4. Otherwise, we’ve just passed the buck to Premise 4.

      To see the gap here, imagine the agent in question is me, rather than an AGI. Premises (1)-(3) are certainly true of me, and if we understand Conclusion 1 as a (possibly vacuous) conditional claim then we might well grant it. But Premise (4) isn’t true of me. Why not?

      I don’t want to be misunderstood here. I’m not proposing any specific mechanisms as critical or indispensable explanations of why I would be very averse to having total dominion over humanity. I’m just trying to illustrate some possible explanations to show why someone might doubt the relevant application of Premise 4 to me. One reason I might count it as a pretty bad thing, ignoring benefits, to gain dominion over humanity is that I think this is a substantially morally wrong thing to do. Another reason I might count it as very bad is that it wouldn’t be very enjoyable to loom over everyone like that. Still another reason could be that I’ve evolved certain aversive reactions (maybe you’d like to call them “moral intuitions”, though I don’t like this term) to doing something like this.

      For exactly the same reason that Premise (4) could well fail for me, Premise (4) could also easily fail for an AGI system. Any number of features of the AGI system as an agent, or its basic decisionmaking process, could give it extreme reluctance to disempower all of humanity, ignoring the benefits of doing so.

      The basic lesson of this discussion is that claims like Premise (4) need to say something very specific about how an agent is built and how its decisionmaking process works. We can’t really tell whether an agent will (like me) be very averse to dominating humanity, or whether it will (like a psychopath) have little aversion to this prospect until we know how the agent works (is she more like me, or more like a psychopath, or quite different from both of us?). So we really need to argue for claims like Premise (4) on the basis of detailed claims about the nature of artificial agents, not based on abstract claims like LGS that apply to most all agents.

      This puts AI risk arguments in an awkward place since most authors don’t take themselves to know much at all about how exactly an AGI agent would be built, how it would make decisions, and the like. This means they risk not having much to say in defense of Premise 4 until the state of scientific knowledge advances.

      I think that this problem has pushed some authors (not Carlsmith) to lean on existing characterizations of AI systems, such as deep neural networks trained by familiar forms of reinforcement learning with relatively simple reward functions. That would take us to a rather different argument than Carlsmith’s, so I think it would be best not to say too much about that here.

      1. Michael St. Jules

        “This puts AI risk arguments in an awkward place since most authors don’t take themselves to know much at all about how exactly an AGI agent would be built, how it would make decisions, and the like. This means they risk not having much to say in defense of Premise 4 until the state of scientific knowledge advances.”

        I would guess most authors think there’s a decent chance deployed AGI agents will be (like) deep learning models with the kinds of choices they make at least in part determined from training with some objective function, like deep reinforcement learning. These kinds of models seem to be the closest to AGIs among those available now (AFAIK, nothing else seems close or promising), and are already being used and deployed. They are mostly currently fairly limited in their output behaviour (e.g. text output in LLMs), but this could change as their financial, power-increasing (e.g. for government) and/or scientific upsides increase, and people are already giving LLMs access to other things. These models are also effectively black boxes right now, and we don’t know how to ensure that, even if we specify the right objective function (outer alignment, avoiding Goodhart’s law), they will follow the objective reasonably closely when deployed (inner alignment, avoiding misgeneralization). So, the current path towards AGI seems to involve systems we don’t know how to align to prevent catastrophe with extremely high confidence.

        Maybe authors believe or assume this without making this sufficiently explicit?

        I’d guess there’s a decent chance that the modelling will be learned flexibly with deep learning or similar, but the search and decision-making will be more explicitly and transparently programmed in, or at least have explicit constraints on it. Maybe a separate model will be used to check outputs for safety/constraints and reject bad ones. But it’s not obvious to me that they would be safe (not seek power in a way that disempowers humanity) with probability very close to 1. Furthermore, these approaches to search and decision-making may have lower capabilities than more blackbox-like approaches, so there may be incentives to deploy blackbox AGI agents instead. Or, maybe some company or government will build a blackbox AGI agent even if it’s less safe.

        1. Michael St. Jules

          *Or, maybe some company or government will build a blackbox AGI agent even if it’s less safe just because it’s more familiar, easier to build for whatever reason, or out of curiosity.

        2. Violet Hour

          On defining the concept of ‘misaligned goals’: while I don’t think that I’ve got a very crisp understanding of what it is for something to be a ‘goal’, here’s a first pass.

          A system has a goal when: at some level of description, the system’s behavior can be better predicted by assuming the system is competently trying to realize G.

          So, we can say that I have the goal ‘eat chocolate cake’, when there’s some level of description (say, folk psychology) at which you can better predict my behavior (walking to the chocolate cake shop, moving to the chocolate cake shop next door when the first one is closed, preparing my special pre-chocolate-cake-eating ritual, etc) by assuming that I’m competently trying to move the world towards states where I’m eating chocolate cake. An AI’s goal is *misaligned* to the extent that it is trying to steer the world towards some state that the designers did not intend.

          I don’t think my definition of ‘misaligned goal’ requires that the goal has overriding weight in decision-making, and I don’t think it undercuts my argument. Happy to hear if you disagree on this point!

          1. David Thorstad

            Thanks Violet Hour! That’s a helpful clarification.

            Many people have a picture on which agents have several different goals G1, G2, … Gn, say. On a simple version of this picture, agents act to realize some function f(G1, … ,Gn) of these goals, say a weighted sum of goal satisfaction.

            If this is something like the picture you have in mind (maybe it isn’t), I wasn’t sure I understood your definition: “An AI’s goal is *misaligned* to the extent that it is trying to steer the world towards some state that the designers did not intend.” Typically, we would understand what an agent is trying to do as a function of f(G1, …, Gn) and not of some particular goal G_i, so we wouldn’t read your definition as a definition of a single goal being misaligned, but rather of the agent being misaligned. (Honestly, I think we maybe should just talk about agents rather than goals being misaligned, which is why I suggested that change).

            You might try to restrict your definition to a single goal G_i in various ways. So for example, you might say: “goal G_i is misaligned in system S to the extent that S’s possession of G_i disposes it to steer the world towards some unintended state”. But now the word “disposes” is tricky: it could be read in two senses.

            On the one hand, we might have a local sense of disposition in mind: “goal G_i is misaligned in system S to the extent that, if S were just acting to maximize G_i, then S would be disposed to steer the world towards some unintended state”. That’s what I was worried the definition meant, because if that’s what we mean by a misaligned goal, then it doesn’t follow from the fact that an agent has a misaligned goal that she is PS-misaligned. After all, she isn’t just trying to maximize G_i, but to maximize f(G1, …, Gn).

            On the other hand, we might have a global sense of disposition in mind: “goal G_i is misaligned in system S to the extent that S is disposed to steer the world towards some unintended state in virtue of S’s possession of G_i”. But now it looks like we could do just as well by doing away with the notion of a misaligned goal: “System S is misaligned to the extent that S is disposed to steer the world towards some unintended state”. And we’d solve a bunch of tricky problems about blame assignment among various goals.
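
            One way to make the contrast between the local and global readings concrete is a toy sketch in Python. Everything in it is an assumption made for illustration: the goal names, the weights, the scores, and the choice of a weighted sum as the aggregation function f. It is not a claim about how any real system combines its goals. It only shows that an agent can hold a goal that, maximized alone, favors power-seeking, while the agent's aggregate objective favors something else.

```python
from typing import Dict

# Hypothetical illustration: all names and numbers below are invented.
GoalScores = Dict[str, float]          # goal name -> how well an action satisfies it
ActionTable = Dict[str, GoalScores]    # action name -> per-goal scores


def aggregate(weights: Dict[str, float], goal_scores: GoalScores) -> float:
    """Weighted sum of goal satisfaction: a simple stand-in for f(G1, ..., Gn)."""
    return sum(weights[goal] * goal_scores[goal] for goal in weights)


def best_for_single_goal(goal: str, actions: ActionTable) -> str:
    """Local reading: the action chosen if the agent maximized one goal alone."""
    return max(actions, key=lambda action: actions[action][goal])


def best_overall(weights: Dict[str, float], actions: ActionTable) -> str:
    """Global reading: the action chosen when all goals are aggregated."""
    return max(actions, key=lambda action: aggregate(weights, actions[action]))


# Assumed weights: the agent cares much more about G2 than about G1.
weights = {"G1": 0.2, "G2": 0.8}

# Assumed scores: power-seeking serves G1 perfectly but does nothing for G2.
actions = {
    "seek_power": {"G1": 1.0, "G2": 0.0},
    "cooperate": {"G1": 0.4, "G2": 0.9},
}

local_choice = best_for_single_goal("G1", actions)
global_choice = best_overall(weights, actions)
print(local_choice, global_choice)
```

            With these made-up numbers, G1 taken alone recommends power-seeking, but the agent acting on the weighted sum does not. Different weights would of course give a different result, which is the point: what matters is the aggregate disposition, not any single goal.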

            Is there a different way to read the notion of goal misalignment? Is there anything in particular driving us to read instrumental convergence in terms of the misalignment of goals, rather than agents?

      2. Violet Hour

        Thanks for the interesting response, and sorry for the late reply here! I’ll respond to this in two comments. First, you say:

        “I … think it is important to make sure we consider a version of ICC where the relevant notion of power-seeking matches the kind of power-seeking that will conclude the full argument of the paper.”

        I agree that, in order to assess Carlsmith’s overall argument, we need to introduce a stronger notion of power-seeking than the notion I worked with. However, my primary interest in the previous comment was in defending the more minimal version of ICC, which moves from a (very weak) version of misalignment to a (similarly weak) version of PS-misalignment. My hope was that we’d either disagree on even very minimal versions of ICC, or agree that a weaker version of ICC can be granted, allowing us to more explicitly discuss the premises needed to move from weaker versions of ICC to stronger versions. (That said, it’s important to avoid a bait-and-switch, and you should definitely call me out if I later engage in something that looks like a bait-and-switch)

        You then (fairly) pick up on some sloppiness wrt the way I introduce Conclusion 1. The implicit premise I had in mind was something like “if you have some goal G, there is some small cost ɛ, such that you’re willing to pursue G at cost ɛ.” To me, this seems like a criterion for having anything worth labeling as a ‘goal’ at all. If there is no cost ɛ such that pursuing the goal is worth ɛ, I find it hard to see how it’s meaningful to say that you have that goal. My intended sense of ‘cost’ is more in line with your second reading, where cost is construed in a “very broad sense”: perhaps something like ‘other things equal, the agent would always disprefer “the status quo + ɛ” to “the status quo”’.

        I intended Premise 4 to be much weaker than your suggested reading. And, look, David, I like you, so hopefully you won’t be too offended if I state that I think Premise 4 *is* true of you.

        Granted, I believe this because I’m reading Premise 4 in a rather weak way. If we’re talking about pursuing a misaligned goal given some “physically possible input”, this presumably includes (e.g.) injecting me with heroin, and consequently leading to a situation where I sacrifice my current goals (where ‘my current goals’ is interpreted de re) to receive the heroin fix. In the situation where I become a heroin addict, I wouldn’t be aligned with my *present* goals, let alone the goals of wider humanity. I imagine that there is some analogous case we can present for you. The space of all physically possible inputs is pretty wide.

        To hopefully help clarify, here’s the rough picture I had in mind for how to make progress on evaluating Carlsmith’s claims:

        (1) First, see if a minimal version of ICC is correct: this would allow us to establish at least one possible case in which misalignment leads to PS-misalignment.
        (2) Then, we could discuss whether there are good reasons for believing that artificial agents are actually likely to face any of the “physically possible inputs” which lead to artificial agents being PS-misaligned.

        After Step 2, we’ll have a sense of the plausibility of P3 in Carlsmith’s argument. Then, possibly, we could discuss the damage accrued when systems are faced with these inputs, allowing us to discuss P4 and P5.

        1. David Thorstad

          Thanks Violet Hour, this is very helpful!

          You’re absolutely right that we can and should discuss weaker versions of ICC (made weaker by weakening the amount of power-seeking that we require to deem a system PS-misaligned). As I’ve mentioned, like many people I am concerned about some less dramatic ways in which AI might be very dangerous, including gaining some power over humanity: for example, police robots might arrest us and autonomous systems might decide some court cases.

          In principle, you are absolutely right about the possibility of first establishing a weak version of ICC, then seeing whether this weak version of ICC can be used to establish a stronger power-seeking worry. Carlsmith’s argumentative structure leaves it open for him to do that. The problem is that Carlsmith does not explicitly, or in detail, argue in this way (are there any passages that could support a different reading?), so if this is what Carlsmith intends, we would have a textual bait-and-switch: Carlsmith doesn’t do very much to show that systems which would be disposed to seek, say, the power to deprive us of carcinogenic chocolates would be disposed to permanently disempower humanity, so Carlsmith would owe us more argument for this. I think it is probably more charitable to read Carlsmith as trying to establish a strong version of ICC initially than as pulling a bait-and-switch, so that is how I have read him, though I am open to other readings.

          The clarification of cost is helpful. Currently, you have defined the notion of “is a cost” but not yet an ordinal or cardinal measure of cost. However, I think it is fairly clear how one might extend your definition in either direction if desired.

          Your example of injecting someone with heroin raises an important point, which I meant to raise in my discussion at the end of Section 4. There are some very broad ways in which we can understand the notion of full alignment on which full alignment is nearly unachievable.

          One way to do that is to say that a system is misaligned unless it perfectly satisfies the designer’s intentions (you’re not doing this). Another way to do that is to say that a system is misaligned if humans could use it to do bad things (you’re not doing this, but might be doing something similar. See below.).

          But another way to say that a system is misaligned is to say that I could change it to make it misaligned. In the case of humans, I can give them drugs or maybe in the future use neurosurgery to change how their minds work. And in the case of machines, I can hack them or introduce a virus. These facts are known, and quite true, but it’s not usually understood to be the kind of worry that EAs are pointing to: everyone knows that a system can be hacked and that we need to be concerned about what humans might do with a system they hack. (This would be like running the AI risk argument by saying that someone could intentionally use a dastardly reward function during training).

          One helpful comparison here might be the example of the stolen military device I mentioned. We could call a military planning system PS-misaligned because it might be stolen and turned on its designers. But that’s not really the kind of worry we meant to be pushing. Similarly, we can call a system PS-misaligned if it can be hacked and asked to dominate some people. But that’s just the same kind of worry, with hacking taking the place of theft.

  2. Michael St. Jules

    “What we need is an argument that artificial agents for whom power would be useful, and who are aware of this fact are likely to go on to seek enough power to disempower all of humanity. And so far we have literally not seen an argument for this claim.”

    One way we might try to complete the argument is that
    1. because it’s useful and the system is an agent that pursues goals (maybe assuming/arguing further that it aims to maximize in particular), and
    2. if there are not stronger reasons not to do it (incentives like risks of being caught and losing far more, or an interest in not disempowering humans, opportunity costs),
    then we should expect the system to seek power in a way that disempowers humanity.

    1. David Thorstad

      Thanks Michael!

      It’s generally true that suitably sophisticated and rational agents will tend to do what they take themselves to have most reason to do (not quite the same thing as what they have most reason to do, for example because AGI systems plausibly don’t have most reason to do something horrible such as killing us all, so formulating the argument in terms of what agents actually have most reason to do would scuttle it.)

      (There may be some fussiness in articulating the relevant notion of taking oneself to have a reason, but I don’t want to push on that here.)

      Precisely because it is relatively uncontroversial that AGI systems of the relevant sort will tend to do what they take themselves to have most reason to do, most authors pushing worries about power-seeking AI have aimed to give evidence for claims about what AGI systems might take themselves to have most reason to do.

      From the fact that power would be useful to AGI systems, it doesn’t follow that they might take themselves to have most reason to permanently disempower us all. It follows only that they would take themselves to have some reasons to permanently disempower us all.

      Is there a reason why we should think AGI systems are likely to take themselves to have most reason to do something as drastic as permanently disempowering us all?

      1. Michael St. Jules

        “Is there a reason why we should think AGI systems are likely to take themselves to have most reason to do something as drastic as permanently disempowering us all?”

        Eliminating competitors would secure its control over (more) resources, and it might have objectives that are more satisfied the more resources it controls (e.g. building more things that do X) or more likely to be (more) satisfied if it continues to control any resources at all (e.g. preserving its own existence). This is basically just a version of instrumental convergence.

        It’s hard to say that this definitely outweighs the risks/downsides from the perspective of any particular system, but it’s also hard to say that it definitely doesn’t outweigh the risks/downsides, and we just need one system with most reason to do something that permanently disempowers us all, even if all others don’t, and then succeeds in actually doing so. Conditional on a system with the capability to disempower us all having reasons to disempower us all, I would take it as not very unlikely (maybe >5%?) that it would do so as long as it doesn’t have strong enough reasons not to do so. It seems hard to be very confident (say >99%) that it would have strong enough reasons not to do so (e.g. how would we justify such confidence?), so if we combine the probabilities, extremely low probabilities for disempowerment seem hard to defend.
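
        As a rough sketch of what “combining the probabilities” could look like, the snippet below multiplies the two illustrative figures from the paragraph above. The function name and the treatment of the figures as exact values are assumptions made for illustration. The numbers are guesses offered for the sake of argument, not estimates, and the result is only a lower bound, because it ignores any chance that the system acts despite having strong reasons not to.

```python
def lower_bound_on_acting(p_lacks_strong_counter_reasons: float,
                          p_acts_given_no_counter_reasons: float) -> float:
    """Lower bound on P(system acts), conditional on it having the capability
    and some reason to act.

    By the law of total probability,
        P(acts) >= P(lacks strong counter-reasons) * P(acts | lacks them),
    since the omitted term (acting despite strong counter-reasons) is non-negative.
    """
    for p in (p_lacks_strong_counter_reasons, p_acts_given_no_counter_reasons):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_lacks_strong_counter_reasons * p_acts_given_no_counter_reasons


# Illustrative figures only, taken from the guesses in the comment:
# at most 99% confidence that strong counter-reasons exist, and
# a 5% chance of acting when they do not.
p_lacks = 1.0 - 0.99
p_acts = 0.05

bound = lower_bound_on_acting(p_lacks, p_acts)
print(f"lower bound: {bound:.4%}")
```

        On these figures the bound comes out at roughly 0.05%, or about 1 in 2,000. Whether that counts as “extremely low” depends on the stakes, and a skeptic can of course reject either input.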

        It’s worth mentioning that while humans have some preferences to avoid doing things that seem drastic, such a system might not have any such “preferences” or strong enough ones, or might not view disempowering humans as drastic at all or as having any particular special status to the system. Similarly, humans don’t seem to view our use of most of the world’s habitable land for agriculture and artificial selection of chickens and dogs as drastic enough to not do, despite their impact on and the disempowering of many nonhuman animals, wild and farmed. We’ve also wiped out mosquitoes in some regions as threats to us, or other animal species for similar or other reasons.

        1. David Thorstad

          Thanks Michael!

          It is definitely true that acquisition of power tends to promote resource acquisition. This will probably come as little surprise to defenders of instrumental convergence arguments, since such arguments are often run with respect to the pursuit of a number of goals at once (including both power and resource acquisition). Likewise, it will probably not substantially move opponents, who also think that instrumental convergence arguments face problems when formulated in terms of resource-seeking rather than power-seeking. (As you say, the appeal to resource-seeking is “basically a version of instrumental convergence”).

          It is certainly true that there is no principled reason why any possible AI system must place a premium on avoiding catastrophic harms to humanity. It is precisely for this reason that I have avoided the claim that power-seeking AI is impossible.

          The question at issue is whether we should assign nontrivial probability to the claim that near-term AI systems will in fact be disposed to seek to disempower humanity. To support this claim, it is not enough to say that power-seeking is possible, or to propose some probability estimates. What is needed is a substantive argument in favor of the likelihood of power-seeking. The arguments offered so far will not do much to move skeptics, since they do not tell most skeptics anything they did not already know.

    2. Michael St. Jules

      And we only need one system to seek power in a way that disempowers all of humanity, even if the vast majority do not. We shouldn’t model separate systems’ probabilities of doing this as statistically independent, but, still, the more systems like this there are, the more likely it is for one such system to do this, all else equal (e.g. ignoring how knowledge of other such systems might affect things).
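
      A small sketch can show how much the independence assumption matters here. It compares two extremes for the chance that at least one of n systems does this: full independence, and perfect correlation (either every system does it or none does). The per-system probability used below is an arbitrary placeholder and the function names are made up for illustration. If the systems’ dispositions are positively correlated, the true figure should fall somewhere between the two extremes.

```python
def p_any_independent(p: float, n: int) -> float:
    """P(at least one of n systems acts) if each acts independently with probability p."""
    return 1.0 - (1.0 - p) ** n


def p_any_fully_correlated(p: float, n: int) -> float:
    """P(at least one of n systems acts) if all systems act together or not at all."""
    return p if n > 0 else 0.0


# Arbitrary placeholder, not an estimate of any real system's behavior.
P_SINGLE_SYSTEM = 0.001

for n in (1, 10, 100, 1000):
    correlated = p_any_fully_correlated(P_SINGLE_SYSTEM, n)
    independent = p_any_independent(P_SINGLE_SYSTEM, n)
    print(f"n={n:>4}  fully correlated: {correlated:.4f}  independent: {independent:.4f}")
```

      Under independence the probability climbs steadily with the number of systems, while under perfect correlation it does not move at all. So the force of the “we only need one” point depends heavily on how correlated the systems are.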

  3. ZCHuang

    It’s weird to me that no one posted literally peer reviewed or high status papers on artificial intelligence.

    a. If the power-seeking part of the Carlsmith report is load-bearing like you say, then this paper by Turner et al. 2021 @ NeurIPS formalises power-seeking behaviour and shows why some level of convergence happens:

    b. Heuristic-y but still emblematic is the behaviour exhibited by GPT-4 as shown by the Alignment Research Centre’s Eval team in the GPT-4 system card and the accompanying report.

    c. Early corrigibility work showing at least a directional shift towards it being really, really hard is The Off-Switch Game 2017 @ IJCAI:

    On a meta level I will note you’re not learning the bitter lesson of cognitive science and seem to really love to lean towards credentialism in your own domain. It is quite cringe to read that one of the things you like about Carlsmith is that he’s an Oxford PhD and is well-spoken from all accounts (this just seems like he’s grown up in the Anglosphere and is privileged?). I think this criticism would be better if you tried out coding and replicating some machine learning papers to get a better object level feel.

    1. David Thorstad

      Thanks ZCHuang!

      As you know, I am a big fan of peer review. I think it is among the most reliable methods that we have for reaching the truth in a way that builds upon existing knowledge and articulates clear and persuasive arguments for correct conclusions. I am happy to see some papers about AI risk being submitted for peer review. I hope to see more peer-reviewed papers coming out of developments such as the special issue on AI Safety in Philosophical Studies, as well as the generally excellent class of philosophy fellows at the Center for AI Safety.

      I polled my readers to ask which of the existing arguments for AI risk I should discuss first, and the consensus was that I should start with Carlsmith, which is why I have discussed Carlsmith first. At some of the most load-bearing places in the Carlsmith report, such as instrumental convergence, Carlsmith did not provide as much argument as I would have liked. That makes my task a bit difficult: in general, what I need to do is to say why the argument offered is insufficient, and what more would be needed.

      I am happy in the future to review other works. I am thankful for your particular suggestions, and I will add both to the list of potential papers to discuss in later iterations of this series. I would be interested to hear if other readers have opinions on these or other papers that could be discussed after the Carlsmith report.

      I am a bit upset with myself for having written the sentence about Carlsmith’s credentials that you correctly take issue with. While academic assessment does place some importance on educational background, it is important to acknowledge the many barriers to admission to top programs, and in particular that many Oxford PhDs are offered without funding, presenting a substantial barrier to entry. I am very upset with myself for the line about Carlsmith being well-spoken. That was not good. I will correct it in the text, and I am sorry.

      I do try to learn the mathematical fundamentals of machine-learning techniques such as reinforcement learning, backpropagation, and gradient descent. I will continue to do this in the future. I have a degree in mathematics and think that it is essential to understand the underlying mathematics as well as possible. I certainly have no aversion to learning to code (I can only code a bit of Python), but I would like to hear a bit more about the specific relevance of coding experience to discussions of instrumental convergence before committing hundreds of hours to developing the relevant coding experience. Could you say more about why this might be a good investment?

      1. ZCHuang

        On peer review:

        This might seem harsh but I don’t believe you have a good grasp on what you actually like about peer review or differences between fields on peer review. On its face it’s a bit wild to choose your self-selected audience and ask them for papers, and likely they are just median normie EAs who don’t have sharp enough domain knowledge to cite things. In turn, you’re complaining about the lack of peer review in the field but chose to review something that isn’t peer reviewed.

        Canonical papers in machine learning are rejected all the time because of how fast the field moves and how ill-defined the difference in papers at ICLR, ICML, and NeurIPS is. I also don’t think you understand the topic specific issues people have with machine learning peer review. A lot of the core canon papers that drove the field forward were rejected (e.g. Hinton

        A few examples of epistemic trip wires you hit that I just want to note:

        a. In response to OpenPhil having a 5k word count you said this betrays a want for shallow reports (this is the usual length of a NeurIPS paper and is not a sign of shallowness).

        b. You treated Lesswrong as a standalone entity and misunderstand the epistemic egalitarianism it represents. The NeurIPS paper I linked was written by a prolific Lesswrong poster with much lesser credentials and generous funding from LTFF. His posting on Lesswrong distilling his research should instead be seen as helping junior aspiring researchers. Specifically, he mentors in SERI-MATS and the starting points of the paper were written on Lesswrong.

        c. You conflate the grant officers who carry out execution-level OpenPhil actions with external advisors like Jacob Steinhardt and Paul Christiano (who I would guess by your lights are well-credentialed enough to defer to?).

        My personal take is also you’ve been cruel on twitter:

        a. I am kind of shocked that you don’t see the cruelty of using a paper rejection (at the chair level) to dunk on a field. Imagine if we saw a junior researcher in philosophy post openly that they were rejected from Mind or Nous and someone from a different field said that they were a crank and delegitimising peer review. I am personally friends with Richard so part of this does come from a tribal place but I am also a young researcher and I lock my twitter for a reason — I’m also sure Richard will resubmit and be fine so I don’t even understand dunking on someone for complaining about peer review.

        b. From watching your interactions and posts it bothers me seeing the contradiction between inclusion and diversity in philosophy but no extension of that generosity towards the AI Safety community. The open sourcing and posting ethos of informal research is not about escaping scrutiny (most of the time) but about making sure there are low resource toy models for individuals to use and attempt research with. An analogy here would be that Lesswrong and adjacent communities should sometimes be treated as a group of undergrad outreach and that’s ok. There will be cringe and wrong pieces and many will not have graduate degrees but that’s ok.

        c. You keep pondering the worst of EAs’ switch to AI Safety rather than considering what they’ve said themselves. The switch from RCTs to hits-based giving is an OpenPhil thing and Holden was openly critical of MIRI initially. You insinuate a lot of bad faith instead of asking the core question: why does someone start GiveWell and then switch completely to AI Safety (so much so as to take a sabbatical from OP). Perhaps the arguments are correct or perhaps Holden Karnofsky is a wife guy.

        d. If you want to feel viscerally what you sound like, you sound like a non-philosopher looking down on a Rutgers philosophy PhD because it’s from Rutgers (a public university with low prestige in undergrad). I think actually replicating the code of a paper will give you more appreciation into how “real” the field is.

        I will still tune in and read this blog and this was probably a bit more cutting than I wanted to write but c’est la vie.

        1. David Thorstad


          Every academic knows the importance of fundamental epistemic institutions including peer review, doctoral education and disciplinary methodology. We are deeply committed to them and happy to explain why academic fields have converged upon these commitments. A post about the importance of peer review has already been drafted for my series on epistemics and should be posted within a few months.

          I am aware of the difficulty of publishing work in fast-moving fields. Leading journals in my field average a 2-3% acceptance rate, and my own personal acceptance rate is sitting at a probably-lucky number of 12-13%. Many of my papers have been rejected many times, including papers that I hope will become important references in the discussions they contributed to.

          There is a good deal of interest in issues surrounding the safety and governance of emerging technologies, including interest from academic researchers. The position I will take up at Vanderbilt involves several teaching and research commitments related to normative issues in artificial intelligence, many of which could be classified as issues of safety or governance. It is, as you say, a growing field, and all are welcome, but fields have rules and those rules must be respected. Attempts to sidestep or question core epistemic institutions and practices will not be looked on kindly.

          I do worry about the tension between the need to promote an inclusive and welcoming environment and the need to speak clearly about what is permissible and what is not. Often, as you note, I dedicate significant time to promoting inclusion and belonging: for example, I have organized several workshops for underrepresented students in the field, and I chair my institute’s diversity initiative. I will continue to do this because inclusivity and belonging are important. When I find myself acting against the interest of inclusion and belonging, I correct my actions: you will note that I immediately edited the post in response to your earlier comments. This isn’t the first mistake I have made, and it won’t be the last, but I try to learn and grow.

          Sometimes sharp language is needed. This can be in response to issues of exclusion (for example, racism or sexual harassment), but it can also be an apt response to attacks on epistemic institutions within the field. Effective altruists often react badly to sharp language of any kind. This is a mistake.

          I would react negatively to (and speak sharply to) a non-philosopher looking down on a Rutgers PhD. For one thing, no one should look down on someone because they received their PhD from a public university. For another, as you mention, Rutgers has an excellent reputation within philosophy and its faculty as well as its students have earned the respect of the field. Someone who looked down on a philosophy PhD from Rutgers would be behaving in an ignorant and prejudiced manner and I would tell them as much.

          1. ZCHuang

            I think you misunderstand the behaviours of yours that I take issue with. The analogy to non-philosophers looking down upon Rutgers is the same derision you often have for EAs without credentials questioning their lack of terminal degrees — some of the best in the field like Chris Olah are dropouts. You often betray a fundamental misunderstanding of ML as a field trying to use academic philosophy markers (e.g. papers).

            The “epistemic institutions” that should not be questioned in philosophy are very different norm-wise in ML. You should look up on Twitter how many canonical papers are rejected from ICML and NeurIPS for the very reason Richard’s was: it’s quite a lot. Moreover, it’s a norm to complain about peer review itself and to ask why people can’t just post on arXiv.

            Ironically, I set the motion for the US Universities Debating Championship on abolishing peer review this year and was not really driven by any of the EA concerns you criticised but more about the deep history in ethnic studies about the harms of predatory journal companies (which I must say is happening right now with the Journal of Political Philosophy).

  4. Joe

    Hi David,

    This is Joe Carlsmith, the author of the report. Thanks for your engagement here (and for your earlier review).

    Re: “I can’t locate anything in the text resembling an argument against the likelihood of full alignment.” Section 4.3, on the difficulty of making AI systems practically PS-aligned, is also an argument for the difficulty of making them fully aligned, since full alignment is strictly harder than practical PS-alignment, by a lot. I didn’t focus on full alignment because it seems like (a) an extremely difficult condition to satisfy, and (b) not a necessary one. I wanted to highlight it mostly because I think some of the alignment discourse (for example, Yudkowsky’s discussion of the “omni-test”) implicitly assumes that full alignment is necessary, and I disagree.

    Re: the broader worry that I haven’t done enough to justify the role of instrumental convergence in the argument: it looks like one key criticism here is that “gaining blah form of power would increase an agent’s ability to achieve its goals, and the agent is aware of that fact” does not itself imply “the agent will in fact seek this form of power.” For example, maybe in principle my goals would benefit from being president, but that doesn’t mean I’m going to run; maybe in principle I would benefit from owning the money in my local bank, but that doesn’t mean I will try to steal it; and so on. And plausibly, this distinction applies with special force to forms of power at the scale of disempowering all of humanity, especially given humanity’s incentives to prevent such an outcome.

    I agree, and I think this is a source of hope. For example, it may be that we end up with misaligned AI systems who are aware that in principle they would benefit from disempowering humanity, but whose incentives are such that they shouldn’t go for it in practice. Indeed, in the report I discuss a variety of methods for trying to ensure that less-than-fully-PS-aligned systems are of this type — e.g., limiting the temporal scope of their goals (section, trying to limit their capabilities (section 4.3.2), and trying to control their options and incentives (section 4.3.3). As I note in the report, though, I think that each of these methods also has significant problems.

    (I also think, more generally, that it may just be not-that-hard to prevent problematic forms of power-seeking in practice, via suitable forms of training, even if we have less-than-perfect control over the objectives of our AI systems more generally. I discuss this in the section on power-seeking — see the quote starting “The in-principle possibility of strategic, agentic misalignment without PS-misalignment is important…” — and in my response to Ben Garfinkel’s review, which I think pushes on this point in some valuable ways.)

    Now, strictly, “will misaligned AI agents who would benefit from blah form of power, and who are aware of this fact, actually seek blah form of power in practice?” isn’t the question at stake in section 4.2, on instrumental convergence. Rather, that section is focused on full alignment rather than practical alignment, e.g., ICC-2: “If an APS AI system would engage in misaligned behavior in response to some physically possible inputs, and some of its misaligned behavior involves strategically-aware agentic planning in pursuit of problematic objectives, then in general and by default, we should expect it to engage in misaligned power-seeking in response to some physically possible inputs.”

    ICC-2 continues to seem quite plausible to me. For example, in the cases above, if being president would serve my goals, and I’m aware of this, it seems like there are indeed physically possible circumstances where I would choose to seek the presidency — for example, ones where winning would be suitably easy. ICC-2, in this sense, isn’t actually a very strong claim. Indeed, as far as I can tell, you don’t actually disagree with it?

    Rather, part of where you’re pushing back is that ICC-2 isn’t enough to get the conclusion that agents will be practically PS-misaligned in practice, especially at existentially threatening scales. And there, as I noted above, I agree: I think that part does indeed require further argument, which I aim to offer in section 4.3, “the challenge of practical PS-alignment,” and to some extent in all of sections 4.4 through 6 (my response to you re: “the argument for the instrumental convergence claim is supposed to come centrally in section 4.2” was meant to apply to ICC-2 specifically). That is, I think that assessing whether, in practice, we will end up with practically-PS-misaligned systems that actually try to disempower humanity as a whole, is a task that requires looking in more detail at both the technical challenges of understanding and controlling the objectives and behavior of AI systems as they scale (possibly quite rapidly) in sophistication, and at the social challenges involved in dynamics around deployment, coordination, and broader corrective response, given realistic take-off speeds, timelines, etc. In a sense, it’s the topic at stake in quite a lot of the report; and I’m happy to admit that the conclusion that “yes, there is significant risk here” is not some clean conceptual deduction, but rather a product of looking at the landscape of technical and social challenges as a whole.

    The main work I’m trying to do, in section 4.2, is to explain why you might worry, at all, about AI systems with problematic goals seeking power in particular (though I’m happy to acknowledge that ICC-2 is a bit of a janky way to do this, and that section 4.2 in general could be quite a bit stronger). It’s true that there’s a gap between “would benefit from blah form of power and is aware of this” and “actually seeks blah form of power.” But there’s also a close connection between them: namely, the former implies that *under certain conditions*, we’d get the latter as well. The question that a lot of the rest of the report addresses is whether those conditions will actually hold. To me, it seems disturbingly plausible that they will. For example, I think that “it’s suitably easy to gain the relevant forms of power” does a lot to bridge the gap, and I worry that for sufficiently advanced, superhumanly-intelligent misaligned AI systems, it will indeed be suitably easy.

    Indeed, I feel quite skeptical about the degree of hope you seem to be deriving from the gap in question — e.g., from the idea that “we might have misaligned, superhumanly-intelligent AI systems that would benefit from disempowering humanity in principle and are aware of this fact, but who don’t go for it in practice.” In your review, you end up putting only .1% probability on premise 3, Alignment Difficulty (though it sounds like you were uncertain about where to locate your disagreement re: premise 3 vs. premise 4), and from your comments in your review and on this post it seems like some perceived deficit of argument re: instrumental convergence is your central objection. To me, though, and even absent engagement with the more detailed argument in sections 4.3 through 6, it seems like “maybe the misaligned superhumanly-intelligent AI systems who would in principle benefit from disempowering humanity and who are aware of this don’t go for it in practice” is ill-suited to justifying 99.9% confidence that ensuring practical-PS-alignment won’t be difficult in the sense at stake in premise 3.

    1. David Thorstad

      Thanks Joe! Just wanted to let you know that I’ll get back to you on this soon. Didn’t want to rush a response and it turned out to be a crazier week than I expected.

    2. David Thorstad

      Hi Joe,

      Thanks for a thorough and thoughtful response! I should be clear, if I have not already, that I reviewed your report first because I think it’s probably the best of the bunch. I have considerable skepticism about the genre, but given the existence of the genre, I am happy that you are writing in it.

      I was pleased to see some points of agreement in your response, for example leaving open the possibility that it may just be not-that-hard to prevent problematic forms of power-seeking in practice, and the importance of achieving greater conceptual clarity and argumentative detail (a fault from which my own post is not immune, as we will see below).

      Something you may not know about me is that I am generally quite skeptical of public philosophy. I don’t think it makes for clear, accurate, well-grounded or thoughtful communication, and I’m quite worried that this comment of mine will be insufficiently clear, thoughtful and the like.

      I think that philosophers should generally aim to communicate with one another in scholarly journals. I think that the most helpful way to have this discussion would be if you and others were to publish your views in a number of tightly-scoped scholarly papers in leading journals: for example, you might write a paper defending instrumental convergence, or even just clarifying different notions of instrumental convergence.

      That would make it easier for me to respond. Right now, my problem is that I cannot write scholarly articles responding to most AI safety arguments, because journals are hesitant to publish refutations of ideas that haven’t already been developed extensively in good journals. I can do no better than to start a blog and blog about AI safety, despite actively protesting the idea of doing philosophy in this way (oh, the irony!).

      In the meantime, I’ll try to write out some thoughts. I tend to be a bit hesitant about the value of long comment chains – in fact, I limit the size of comment chains on this blog to five. I would be happy to continue this discussion through one more iteration of blog comments, but I’d also suggest some other possibilities. Of course, my preference would be to communicate in the form of scholarly papers. But if you would like, you are welcome to publish a blog-style response to anything I have written here, and I hereby promise that (on your request) I will put it up as a post on my blog. It’s also of course possible to leave things where they are.

      **On full alignment and Section 4.3**

      As I mentioned, it may help to say what exactly we are to count as less-than-full alignment.

      There are some quite weak ways to read the notion on which no sane person should doubt the likelihood of less-than-full alignment. For example, we might count any kind of unintended behavior (however minor) that arises in virtue of a problem with the agent’s objectives as sufficient for less-than-full-alignment. Alternatively, we might have a very broad conception of an input so that, as one commentator seemed to suggest, the possibility of hacking a system and using it in unintended ways makes it less-than-fully aligned.

      If we have these conceptions in mind, then there would be no point in asking for an argument in support of the difficulty of full alignment. But precisely because the claim is so clear, it might not be of much use as a premise in support of the difficulty of full PS-alignment, and readers would have some questions about why instrumental convergence is not simply stated as a claim about the difficulty of full PS-alignment with no mention of the difficulty of full alignment.

      By contrast, if you mean something stronger by full alignment, then the difficulty of ensuring full alignment might become controversial and it will be important to trace the argument for this difficulty. Your response helpfully suggested that the argument for the difficulty of ensuring full alignment comes in Section 4.3.

      The progression of the textual argument is from (1) the difficulty of full alignment –> (2) the difficulty of full PS-alignment –> (3) the difficulty of practical PS-alignment. Most of the material in Section 4.3 is focused on arguing from (2) to (3), which makes sense because Section 4.2 was focused on arguing from (1) to (2).
      It might be possible to repurpose some material from Section 4.3 to support (1). Perhaps you could say a bit more about which remarks in Section 4.3 best support (1)?

      Section 4.3 discusses three approaches to ensuring full alignment: controlling objectives (4.3.1), controlling capabilities (4.3.2), and controlling circumstances (4.3.3).
      The difficulty of controlling circumstances (4.3.3) could not be an argument for (1), since the need to control circumstances only arises once we assume that there are circumstances in which the agent will act in a misaligned way, i.e. that the agent is not fully aligned.

      The difficulty of controlling capabilities (4.3.2) might be pressed as an argument for (1), though the link between capabilities and full alignment is not immediately apparent. Would the idea be that more capable agents are less likely to be aligned? That would take us largely beyond the text, and might not be straightforward to justify, though it could be possible.

      Alternatively, you might take the difficulty of some strategies (such as specialization and scaling) discussed in Section 4.3.2 as an argument for (1). This would leave (1) undersupported. While it may well be true that it is easier for general, scaled-up agents to seek power than specialized, scaled-down agents, (1) needs to be justified against a quite general request to know why agents are unlikely to be aligned in the first place. The argument for (1) cannot just tell us that the system will be sophisticated (general, scaled-up, etc.) but also has to tell us something about its motivations and decisionmaking processes, and how these combine in some circumstance(s) to produce misaligned behavior.

      I think perhaps you meant the difficulty of controlling objectives (Section 4.3.1) to support (1). This may well be the best way for you to go. It still begins the argument a bit too far forward from the starting line. Supporting (1) by appealing to the difficulty of controlling objectives only works against a view on which there are lots of plausible objectives that AI systems could be trained on which would lead them to become less-than-fully aligned – then Section 4.3.1 kicks in to explain why it is hard to avoid training agents on these or relevantly similar objectives.

      **On goals, benefits and costs**

      My earlier discussion with Violet Hour brought out a distinction that I wish I had made in the text, between weak and strong readings of talk about an agent’s goals (and corresponding claims about what is a benefit or a cost for them).
      Consider (LGS), “Instrumentally convergent values would increase the chances of the agent realizing its goals.” On a weak reading, this means roughly: there are some goals X which are important to the agent, such that achieving instrumentally convergent values would increase the agent’s chance of achieving X. I intended something like this weak reading.

      Alternatively, we might go for a strong reading, on which goals-talk is just a way of talking about what the agent will do, or what she considers all-things-considered best. It’s not straightforward to state (LGS) in this way, since LGS involves probabilities, but we can state the strong reading if we help ourselves to a particular assumption about how the agent decides: say, that she maximizes expected utility. Then the strong reading of LGS says roughly: the agent assigns positive expected value to the achievement of instrumentally convergent goals.

      I had initially taken you to argue only for (LGS), but pointed out that (LGS) does not imply (GP). This is only a good description of my view on the weak reading of (LGS).
      On a weak reading of (LGS), the objection to (GP) is that of course some values (say, power) can increase an agent’s chance of realizing many important goals X (wealth, saving lives, hedonic pleasure, …). Nevertheless, that does not imply that there is some circumstance C on which the agent will be disposed to seek instrumentally convergent values to an arbitrarily high degree. To borrow one of your helpful examples: “Maybe in principle I would benefit from owning the money in my local bank, but that doesn’t mean I will try to steal it”. This is quite plausible if “benefit” takes a weak sense like “fulfill many of my goals to a high degree”.

      By contrast, on a strong reading of “benefit” as “count it an all-things-considered better outcome, or a utility-increasing outcome,” a different response would be needed. Here we need to explain my behavior by denying that I would benefit, in the relevant sense, from robbing a bank. That brings us back to the strong reading of (LGS).
      If we opt for the strong reading of (LGS) rather than the weak reading, then the passage from (LGS) to (GP) could only fail as a matter of the inputs an agent is likely to receive in practice. I think that you read me as having in mind the strong reading of (LGS), which is why you took me to be arguing against the possibility of practical PS-misalignment. I should have been clearer that this isn’t my concern.

      (In fact, I think at least one of the things that you wrote uses a strong reading of “benefit”, similar to the strong reading of “goal” in the strong version of (LGS). In:
      “It’s true that there’s a gap between ‘would benefit from blah form of power and is aware of this’ and ‘actually seeks blah form of power.’ But there’s also a close connection between them: namely, the former implies that *under certain conditions*, we’d get the latter as well.”
      the implication only follows if “would benefit from” takes an all-things-considered reading.)

      You might take yourself to have established something like the strong reading of (LGS) in the text. But this wouldn’t follow from Omohundro-style arguments: those just tell us that there are lots of goals X which would be better achieved given power, which is the weak reading of (LGS).

      I take it that you might take the strong reading of (LGS) to be established by something like the same arguments that suggested full alignment was unlikely. (Is this right?) If so, then this makes the case for full alignment very important, because it now absorbs most of the weight of the argument. For example, I would not want to take Section 4.3 as showing that there are likely to be circumstances C in which APS systems count the act of disempowering humanity as expected-value maximizing or otherwise all-things-considered pursuitworthy. (Just as I wouldn’t want to take the fact that I could fulfill many of my ambitions by having more money as proof that under some circumstances, I would count the act of robbing a bank as expected-value-maximizing or all-things-considered pursuitworthy.)

      Now of course, the example of robbing a bank breaks down here because there are circumstances (for example, needing to feed many starving children) in which I might consider robbing a bank. Here too, the point might be that there are likely to be circumstances (for example, ?) under which even the best APS system might consider permanently disempowering humanity. I suppose I would want to hear about what circumstances we substitute in for the question mark above. If we substitute something like “humanity is about to destroy the universe” I might be more sympathetic, though we would have to talk about whether this should count as misaligned behavior (on the most relevant definition, not the textual definition) or as an existential catastrophe.

      **On (ICC-2) and (ICC-3)**

      One important question that I had in working through the report is whether you meant to be arguing for something like (ICC-2) in Section 4.2, or rather for something stronger like (ICC-3). This is important, because on some readings (ICC-2) could be satisfied by quite minimal forms of power-seeking, for example the power to keep humans scrolling, and might already be true.

      You didn’t say in your response that Section 4.2 was only concerned with (ICC-2), but I did note that you were careful to talk about (ICC-2) rather than (ICC-3). Could you say if you take Section 4.2 to be establishing (ICC-2) or (ICC-3), and if the former, what level of power-seeking you have in mind with (ICC-2)?

      The challenge is that if you take Section 4.2 as showing (ICC-2), understood in a sense significantly weaker than (ICC-3), you will then need to point to other passages in the text that help to bridge from (ICC-2) to (ICC-3). I wasn’t able to find many passages that could plausibly do the job. You do say a few words about premises (4) and (5) in the final part of the report, where you assign overall probabilities. Is that the passage from (ICC-2) to (ICC-3)? Does the passage come somewhere else? It might be that something after Section 4.2 could bridge from (ICC-2) to (ICC-3), but I’m having a hard time seeing what that is.
