
Exaggerating the risks (Part 8: Carlsmith wrap-up)

We should focus on building institutions that both reduce existing AI risks and put us in a robust position to address new ones as we learn more about them … [But] let’s focus on the things we can study, understand and control—the design and real-world use of existing AI systems, their immediate successors, and the social and political systems of which they are part.

Lazar, Howard and Narayanan, “Is avoiding extinction from AI really an urgent priority?”

1. Recap

This is Part 8 of the series Exaggerating the risks. In this series, I look at some places where leading estimates of existential risk look to have been exaggerated.

Part 1 introduced the series. Parts 2, 3, 4 and 5 looked at climate risk and drew lessons from this discussion.

Part 6 introduced the Carlsmith report on power-seeking artificial intelligence, and Part 7 discussed the role of instrumental convergence in Carlsmith’s argument. While Part 7 expresses the heart of my disagreement with Carlsmith, I think it might be helpful to discuss a few other places where I wanted to hear more from Carlsmith.

At many points, the complaint will be the same complaint that I raised against Carlsmith’s discussion of instrumental convergence: there isn’t enough argument given to support the conclusion. As I said in Part 6 of this series:

At key points in my response to Carlsmith, the complaint will be much the same complaint I have made against Ord. Central premises of the argument are defended by little, if any explicit reasoning. That is, to my mind, good cause for skepticism.

In particular, I want to look at what is driving the views about AI timelines used to get Carlsmith’s concern off the ground, as well as the link between the potential for practical PS-misalignment and the damages (permanent disempowerment of humanity) that Carlsmith thinks are likely to follow from the deployment of practically PS-misaligned agents.

First, let’s briefly review Carlsmith’s argument. Readers familiar with the argument may want to skip ahead to Section 3.

2. Carlsmith’s argument

Carlsmith argues that humanity faces at least a 5% probability of existential catastrophe from power-seeking artificial intelligence by 2070 (updated to 10% in March 2022). Here is how Carlsmith outlines the argument, with probabilities reflecting Carlsmith’s weaker, pre-2022 view (“|” represents conditionalization).

By 2070:

1. (Possibility) 65% It will become possible and financially feasible to build AI systems with the following properties:

  • Advanced capability: They outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering and persuasion/manipulation).
  • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
  • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.

(Call these “APS” – Advanced, Planning, Strategically aware – systems).

2. (Incentives) 80% There will be strong incentives to build and deploy APS systems | (1).

3. (Alignment difficulty) 40% It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway | (1)-(2).

4. (Damage) 65% Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars worth of damage) because of problems with their objectives | (1)-(3).

5. (Disempowerment) 40% Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)-(4).

6. (Catastrophe) 95% This disempowerment will constitute an existential catastrophe | (1)-(5).

Aggregate probability: 65% * 80% * 40% * 65% * 40% * 95% ≈ 5%.
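The headline figure is just the product of the six premise probabilities. As a quick arithmetic check (an illustrative sketch; the labels follow the outline above):

```python
# Reproducing Carlsmith's aggregate estimate by multiplying the six
# conditional probabilities from the outline above (pre-2022 values).
from math import prod

premises = {
    "Possibility": 0.65,
    "Incentives": 0.80,
    "Alignment difficulty": 0.40,
    "Damage": 0.65,
    "Disempowerment": 0.40,
    "Catastrophe": 0.95,
}

aggregate = prod(premises.values())
print(f"Aggregate probability: {aggregate:.1%}")  # about 5%
```

The exact product is closer to 5.1%, which Carlsmith rounds to 5%.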

Let’s begin today’s discussion with Premise 1, (Possibility).

3. AI timelines

An old joke about artificial general intelligence (AGI) is that it has been twenty years around the corner for at least seven decades.

In 1955, a crack team of ten scientists proposed to converge on Dartmouth the following summer, boldly claiming that they could take humanity a significant way towards AGI in two months:

We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.

Ten years later, in 1965, Nobel Laureate Herbert Simon proclaimed:

Machines will be capable, within twenty years, of doing any work a man can do.

Five years later, in 1970, AI pioneer Marvin Minsky strengthened Simon’s claim:

In from three to eight years we will have a machine with the general intelligence of an average human being.

And that takes us only up to 1970. More recently, it is not just AI experts who have been wrong. Our best forecasters appear to have been caught up in the hype. A recent analysis of predictions on the forecasting platform Metaculus found that while most non-AI predictions were reasonably accurate, AI predictions had a Brier score of 0.24, almost indistinguishable from the chance-level score of 0.25.
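For readers unfamiliar with the metric: the Brier score is the mean squared error between probabilistic forecasts and binary outcomes, so lower is better, and a forecaster who always answers 50% scores exactly the chance-level 0.25. A minimal sketch (the example forecasts here are hypothetical):

```python
# Brier score for binary questions: mean squared error between
# forecast probabilities and realized outcomes (0 or 1).
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Always answering 50% yields the chance-level score of 0.25,
# no matter how the questions resolve.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25

# Confident, well-calibrated forecasts score far lower.
print(brier_score([0.9, 0.1, 0.8, 0.7], [1, 0, 1, 1]))  # 0.0375
```

A score of 0.24 on AI questions, against this 0.25 baseline, is the sense in which the Metaculus predictions were almost indistinguishable from chance.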

One would think that the lesson from history would be caution on the part of experts and forecasters. But in many circles today, we hear much the same thing as we did before. Elon Musk tells us in 2022 that:

2029 feels like a pivotal year. I’d be surprised if we don’t have AGI by then.

and Ray Kurzweil makes a similar prediction for 2029.

Looking at these predictions, one cannot help but wonder if we have failed to learn from a history of overconfidence about AI timelines. There is no denying that impressive progress has been made, and will continue to be made, in the field of artificial intelligence, but we must not be too hasty to project the most extreme scenarios. Here I am tempted to agree with Rodney Brooks, writing for the MIT Technology Review:

AI has been overestimated again and again, in the 1960s, in the 1980s, and I believe again now.

On timelines, some effective altruists are thankfully more modest than Musk and Kurzweil. For example, Carlsmith projects a 65% chance of APS systems becoming possible and financially feasible by 2070, although I suspect this chance would increase with Carlsmith’s recent confidence update.

Given the history of failed predictions on this subject, one would expect Carlsmith to present a detailed body of independent evidence in support of his predictions. What Carlsmith gives us is something less than this. Here is the entirety of Carlsmith’s discussion of the likelihood of APS systems being possible and feasible by 2070:

My own views on this topic emerge in part from a set of investigations that Open Philanthropy has been conducting (see, for example, Cotra (2020), Roodman (2020), and Davidson (2021)), which I encourage interested readers to investigate. I’ll add, though, a few quick data points.

  • Cotra’s model, which anchors on the human brain and extrapolates from scaling trends in contemporary machine learning, put >65% on the development of “transformative AI” systems by 2070 …
  • Metaculus, a public forecasting platform, puts a median of 54% on “human-machine intelligence parity by 2040” and a median of 2038 for the “date the first AGI is publicly known” (as of mid-April 2021) …
  • Depending on how you ask them, experts in 2017 assign a median probability of >30% or >50% to “unaided machines can accomplish every task better and more cheaply than human workers” by 2066, and a 3% or 10% to the “full automation of labor” by 2066 (though their views in this respect are notably inconsistent, and I don’t think the specific numbers should be given much weight).

Personally, I’m at something like 65% on “developing APS systems will be possible/financially feasible by 2070.” I can imagine going somewhat lower, but less than e.g. 10% seems to me weirdly confident (and I don’t think the difficulty of forecasts like these licenses assuming that probability is very low, or treating it that way implicitly).

Carlsmith’s concession to the possibility of lowering his forecast at the end of this discussion is welcome. Otherwise, this discussion raises many if not most of the methodological worries discussed in Part 6 of this series.

  • It asks us to defer our opinions to an assemblage of unpublished writings that do not deserve the title of a scholarly literature: Cotra (2020), Roodman (2020), Davidson (2021), and Grace et al. (2016), together with Metaculus forecasts.
  • Three of these four writings were published by the same foundation (Open Philanthropy) that published Carlsmith’s report, and the fourth is from an organization (AI Impacts) that has been heavily supported by Open Philanthropy.
  • Only one of these pieces has been published in a scholarly journal, and there is no evidence that the others will ever be submitted for publication.
  • The authors of all but Grace et al. (2016) lack terminal degrees in any field, let alone a relevant field.
  • We just saw that Metaculus predictions of AI questions hover near chance-level.

Given these methodological challenges, what we have been offered is a thin collection of reports funded essentially by the same organization that commissioned the Carlsmith report, and largely lacking the credentials to be taken seriously by scholars. These are supplemented with predictions that we have good reason to believe are hovering near chance-level.

I cannot, and will not, defer to such evidence, nor will the vast majority of educated readers.

4. From misalignment to disempowerment

There is no doubt that artificial agents will, at some point, seek to gain some type of power or influence over humans. Your news feed may seek to keep you scrolling, your gig-economy app to keep you working, and your car to keep you from driving recklessly. The question is not why we should think that APS systems would seek power and influence over humans in response to certain inputs. The question is why we should think that APS systems will seek, and gain, so much power that they permanently disempower humanity.

Like many conclusions in the Carlsmith report, this conclusion hovers somewhat uncomfortably in the margins of the text, or perhaps between the words on the page. Let’s start with what Carlsmith says about the possibility of massive damages, totaling at least $1 trillion (Premise 4), given the first three premises:

I’m going to say: 65%. In particular, I think that once we condition on [premises] 2 and 3, the probability of high-impact post-deployment failures goes up a lot, since it means that we’re likely building systems that would be practically PS-misaligned if deployed, but which are tempting – to some at least, especially in light of the incentives at stake in 2 — to deploy regardless.

So far, we’ve heard no words about damages. We’ve only been told about the likelihood of deploying misaligned systems. Will we hear anything about damages? Carlsmith continues:

The 35% on this premise being false comes centrally from the fact that (a) I expect us to have seen a good number of warning shots before we reach really high-impact practical PS-alignment failures, so this premise requires that we haven’t responded to those adequately, (b) the time-horizons and capabilities of the relevant practically PS-misaligned systems might be limited in various ways, thereby reducing potential damages, and (c) practical PS-alignment failures on the scale of trillions of dollars (in combination) are major mistakes, which relevant actors will have strong incentives, other things equal, to avoid/prevent (from market pressure, regulation, self-interested and altruistic concern, and so forth).

So far, we have been given some sketch-like reasons to think that high damages are not likely, but no reasons to think that high damages are likely. Carlsmith continues:

However, there are a lot of relevant actors in the world, with widely varying degrees of caution and social responsibility, and I currently feel pessimistic about prospects for international coordination (cf. climate change) or adequately internalizing externalities (especially since the biggest costs of PS-misalignment failures are to the long-term future). Conditional on 1-3 above, I expect the less responsible actors to start using APS systems even at the risk of PS-misalignment failure; and I expect there to be pressure on others to do the same, or get left behind.

The second sentence is an argument for the likelihood of deploying PS-misaligned agents, and says nothing about possible damages. The first sentence might be interpreted as an argument about damages, but if so, it is highly nonspecific: it tells us that some actors are less cautious than others, and that existing forms of coordination or the internalization of externalities won’t deter them.

Surely that cannot be the argument for a confidence level of 65% that misaligned AI systems will cause a trillion dollars of damages by 2070. What we need is a specific and detailed discussion of how AI misalignment should be expected to lead to damages of this magnitude by 2070.

Is that argument contained in the passages that come before? It is hard to tell. No section or subsection of the Carlsmith report is devoted to the discussion of damages from misalignment. It is possible that some of Carlsmith’s earlier arguments could be repurposed to support damage estimates, but it is hard to tell how this would go given that Carlsmith does not tell us anything about these earlier arguments in explaining his confidence of 65% that misaligned AI will cause a trillion dollars of damages.

I must confess at this point of the argument a feeling of frustration: what we have been given is an overlapping collection of fears, principles, and speculative future scenarios which are meant to coalesce somehow in the mind of the reader to support a high level of confidence that power-seeking APS systems will soon cause high levels of harm. How are these fears meant to coalesce to that conclusion? We are not told.

A similar story accompanies Carlsmith’s fifth premise, that power-seeking AI will lead to the permanent disempowerment of humanity. Carlsmith writes:

I’m going to say: 40%. There’s a very big difference between >$1 trillion dollars of damage (~6 Hurricane Katrinas) and the complete disempowerment of humanity; and especially in slower take-off scenarios, I don’t think it at all a foregone conclusion that misaligned power-seeking that causes the former will scale to the latter. But I also think that conditional on reaching a scenario with this level of damage from high-impact practical PS-alignment failures (as well as the other previous premises), things are looking dire. It’s possible that the world gets its act together at that point, but it seems far from certain.

Is there an argument in this passage? I have to be honest: I cannot find an argument in this passage at all. There is, as Carlsmith notes, a gap between the idea that misaligned AI may cause considerable damage (say, an unreversed ‘flash crash’) and the idea that AI may disempower humanity. There is, going forward, another gap between the idea that AI may temporarily disempower humanity and the idea that near-term AI systems could manage to permanently wrest power from humanity and keep that power for the rest of human history.

What bridges this gap? As we saw in Part 7, at times Carlsmith seems to treat doom as the default conclusion: “it’s possible that the world gets its act together”, but if we do not get our act together, then we are invited to assume that our fate is sealed. This isn’t an argument for the conclusion that PS-misaligned systems would seek to permanently disempower humanity. It is an assertion of the doctrine that PS-misaligned systems would seek to permanently disempower humanity, combined with a previous exploration of ways that we might try and fail to stop them.

Readers may at this point become frustrated, and protest that evidence of the type I would like to see is not possible to produce in such matters. Or, alternatively, they might say that Carlsmith was just stating his own views, and that those views should not be taken as gospel for others. On both points, I agree.

It is nigh on impossible to produce solid evidence that we are on the cusp of being permanently disempowered by AI systems. And given this impossibility, it may be most charitable to interpret Carlsmith’s concluding arguments as statements of personal belief, rather than arguments intended to offer strong evidence that others should believe the same.

But if this is what we are being given, we should be honest about it. Carlsmith has not given us much, if anything, in the way of evidence which should compel belief that APS systems are on the verge of disempowering humanity, just as he has not given us much in the way of evidence that such systems are soon to be developed. What we are left with more closely resembles a position statement than an argument, and it would be inappropriate to treat it as anything else.

5. Conclusion

Parts 2-5 of this series looked at one of the more scrutable existential risks: climate risk. Because climate risk is at least somewhat scrutable using our best scientific methods, I used those methods to take a look at climate risk and argue that climate risk is substantially lower than many effective altruists take it to be.

In Part 5, I expressed concern about a regression to the inscrutable, in which effective altruists invest increasing confidence in the least scrutable risks. The challenge is that it is very hard to know what to say about such risks because they are nearly inscrutable.

This suggests that effective altruists will have a difficult time motivating the claim that highly inscrutable phenomena pose high levels of existential risk, since the very inscrutability of these phenomena makes it hard to get a detailed risk argument off the ground. If that is right, then a very good strategy for pushing back against less scrutable risks such as AI risk is to carefully examine the arguments and show that (as we might have expected) the arguments provide little evidence for the risk claims they purport to support.

We saw an illustration of this strategy in Parts 7-8 of this series, where I argued that key elements of Carlsmith’s argument (instrumental convergence; AI timelines; the link from misalignment to disempowerment) are relatively undersupported.

“That’s not fair,” you say! “It is almost always going to be hard to construct a plausible argument that inscrutable phenomena pose high levels of existential risk. Therefore skeptics will nearly always be able to point at gaping holes in arguments for AI risk and other inscrutable risks.”

Exactly. That is why it is so hard for me and many others to believe the arguments made by effective altruists for high levels of AI risk. Arguments for inscrutable risks tend, by construction, to have gaping holes in them.

If we do not, and likely cannot come to possess significant evidence that artificial intelligence poses a high level of existential risk to humanity, then we should not believe that artificial intelligence poses a high level of existential risk to humanity. Speculative arguments may take us a few strides beyond our meager evidence, but there is only so much that can be done without more evidence.


7 responses to “Exaggerating the risks (Part 8: Carlsmith wrap-up)”

  1. JWS

    Hey David, leaving my thoughts below. Once again, thanks for being a good-faith but critical voice, willing to engage with EA on its merits but point out where you think it (or its proponents) make mistakes.

    As for my thoughts:

    1) I think the base-rate argument is quite a weak one, and I generally hate getting involved with reference class tennis. There are cases of people thinking that AI will never be able to do ‘X’ only to be later disproved – scepticism of AI’s ability to play Chess and Go comes to mind here. A lot of the scepticism about AI never being able to do ‘X’ with language is also being disproven by the most advanced LLMs (and in response, sceptics of AI are saying that LLMs don’t really ‘understand’, but that certainly looks like shifting goalposts to me). In any case, I think that base rates should often be held quite lightly, and that, while they are useful, people often over-anchor to them in arguments.

    2) As for Carlsmith’s Report itself (I’ve only skimmed it, so I’m happy to trust your presentation of it here), it does seem rather lacking in terms of being a persuasive piece. Maybe it works more as a statement of Carlsmith’s own subjective credences, and useful as a place to highlight where assessments of AI risk differ? I know some think that the ‘chain of conditionals’ approach is flawed, but I don’t have a similar aversion to it. I do think that it’s a bit concerning if EAs simply defer to this report though, without reading other arguments for AI risk. I think we probably have a lot of similar ground here.

    [3) I think our greatest disagreement is going to be about epistemics – but this section was getting quite long so I’m going to break it out into a separate comment]

    Anyway, regardless of my criticisms, I think this is another good blog post. You really seem to be on a roll with them recently! Do you have any advice for how you manage to be so productive in both quantity and quality? In trying to write my own thoughts/arguments/posts down I seem to be lacking in both :’)

    1. David Thorstad

      Thanks JWS! As always, it is good to hear from you.

      **On (1)**

      Could you say a bit more about what you mean by the base rate argument? I think that people often classify quite a broad variety of things as base-rate arguments, and that it’s often best not to group them together (since they have different merits). So for example, I think that in your next comment you might be using the phrase base-rate argument in a different sense (where even questioning authors’ credentials counts as a base-rate argument).

      I think you might mean the base-rate argument to be the evidence on AI forecasting track records cited in Section 3. But if that is what you are after, your proposal might be more controversial than you meant it to be. This section aims to judge forecasters by their track records, and to judge the forecastability of a phenomenon by the track record of the best forecasters. Judging forecasters and problems by their forecasting record is something that effective altruists often recommend, so it would be a bit surprising if they were to pull away from that practice in precisely the area (AI forecasting) where forecasting records are poor.

      **On (2)**

      The summary of Carlsmith is a direct quotation of his own summary, except for the premise labels which I have added. Sorry, I should have made that clear! I think I said this in Part 6 and then forgot to repeat it in Parts 7-8. My bad on that.

      I have to admit that I sometimes agree with your interpretation on which the Carlsmith report should be taken as a statement of position, rather than a thorough argument for a view. That’s why I wrote: “It may be most charitable to interpret Carlsmith’s concluding arguments as statements of personal belief, rather than arguments intended to offer strong evidence that others should believe the same.”

      I have to admit that I also have similar thoughts about many other pieces. For example, here’s what I wrote in Part 2 about Ord’s discussion of climate risk in The Precipice:

      “That’s not fair”, you say. “Ord wasn’t claiming to have given us an argument for his existential risk numbers. They were just subjective reports of Ord’s own views, which were never intended to be treated as statements of fact or as reflections of printed arguments.” Exactly.

      I have to say, I don’t think it would be very good for effective altruists if it turned out that canonical texts like the Carlsmith Report and The Precipice were mostly statements of personal opinion or manifestos rather than developed arguments meant to persuade others. Effective altruists often take themselves to be in possession of solid, rigorous arguments for their views, and take those texts to be among the best arguments. If the headline arguments for those views turn out really not to be strong arguments at all, or maybe not even to be intended as strong arguments, then the idea that EA views on existential risk and other controversial topics are supported by a good deal of rigorous argument and evidence might need revision.

      I don’t want to speak for Carlsmith, but I suspect that he would want to see his report as doing something more than this. Certainly, others see Carlsmith as doing this: I reviewed the Carlsmith report because I asked around for the best case for AI risk, and was told frequently that the Carlsmith report was the best case. But if this is the best case …

      **On advice**

      Thanks, I’m flattered! I’m afraid I don’t have any terribly good writing advice. On quantity, I’m just not very good at work-life balance, which tends to solve problems with quantity of output. On quality, I think probably I’ve been spoiled by having a few years of dedicated research time to consider these issues, and I’m mostly just drawing on the fruits of that time more than anything else.

      To be honest it’s a bit of a learning process for me. I think that I’ve gotten better at blogging with time – I’m not particularly proud of my earliest posts – and I hope I’ll get better in the future.

      1. JWS

        **On base rates**
        Ah, apologies for not being clear – I was talking about what the base rate for the claim of AGI occurring should be; let’s use Carlsmith’s ‘By 2070 it will become possible and financially feasible to build AI systems with the following [APS] properties’. One could say that this belongs in a reference class of previous failed predictions for AI progress, such as the Dartmouth Conference statement. If correct, then an empirical prior for this belief would be 0/N (where N is the number of failed AGI predictions in the reference class) along with some Laplace smoothing. Thus, we would hold a very sceptical prior on any claim here – far lower than Carlsmith’s 65%.
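For concreteness, the smoothing JWS describes is Laplace’s rule of succession: with s successes in N trials, estimate (s + 1) / (N + 2). A quick sketch (the reference-class sizes here are hypothetical):

```python
# Laplace's rule of succession: with s successes in N trials,
# estimate the probability of success as (s + 1) / (N + 2).
def laplace_prior(successes: int, trials: int) -> float:
    return (successes + 1) / (trials + 2)

# With zero successful AGI predictions, the smoothed prior shrinks
# as the count of failed predictions in the reference class grows.
for n in (10, 50, 100):
    print(f"N={n}: {laplace_prior(0, n):.3f}")
```

Even the smallest of these hypothetical reference classes puts the prior around 8%, well below Carlsmith’s 65%.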

        The issue of ‘reference class tennis’ is what is the correct reference class. It could be the set of claims that humans achieving something is impossible. Now, most things of that class did turn out to be false – but we do have cases such as Szilard disproving Rutherford’s ‘moonshine’ comment, or the long-standing belief that the 4-minute mile was impossible, etc. Depending on how you calculate it, that prior could be high enough to be concerning and warrant further investigation.

        A lot of arguments (on various topics) I see go back-and-forth on which reference classes are more appropriate, whereas I think it’s much more productive to look at the arguments given by each side – such as what you’ve done here with Carlsmith.

        **On the Report Itself**
        I don’t think that I have much to disagree about on your assessment of Carlsmith’s (lack of) arguments here. I think we’re both of the same mind that a good step the ‘AI Safety’ camp [I don’t actually like buying into the us-vs-them dynamic with these labels, but it’s the shortest way to refer to the set of people we’re talking about] could take to demonstrate good faith to those sceptical of them, now they’ve got so much attention, is to work on collecting what is a scattered and diverse literature into developed arguments legible to those sceptics. For what it’s worth I don’t think it’s the ‘best case’; the work that’s made the most sense to me is Stuart Russell’s research agenda (summarised in ‘Human Compatible’), and I admit I find it surprising it hasn’t got much traction in EA/AI-Safety circles in the last few months.

        **On Epistemics**
        I think I’ll leave the debate until your upcoming blog post on peer review, so I will leave those claims until then. What might inform your understanding of my perspective is that my recent dives into epistemology have led me more and more toward the Popperian/Deutschian framing of science as a process of conjecture and refutation of arguments and explanations. So yes, one should expect arguments made in leading journals to be much more reliable than other sources, but the fundamental source of that reliability comes from their ability to identify good explanations. Most people sending letters to maths departments claiming to have solved some theorem are completely wrong, but every so often you do get a Ramanujan.

        Anyway, that’s not particularly relevant to the points you made in our other thread (and I think that we’re basically in agreement on the non-epistemics part of this post) so feel free not to respond to this last part. I look forward to your upcoming post on peer review and discussing it with you.

  2. JWS

    Ok, here’s the epistemics section. I think we have some significant disagreements here, and I’ve tried to walk the line between holding firm on my pushback without going over the line of the comment policy, but I will apologise in advance if I have, and hope our previous engagements mean I still have ‘good-faith’ chips to cash in on this blog 🙂

    * I don’t think you’re being asked to ‘defer’ to the AI Risk literature – only to consider its arguments, so the claim that “I cannot, and will not defer to such evidence, nor will the vast majority of educated readers” seems a bit overdramatic and, if I may be so bold, a little bit pompous? Especially the ‘majority of educated readers’ bit, which might be unnecessary, unkind, and potentially empirically false to boot! I think it’s totally fine to say you’ve assessed the Carlsmith Report and find its arguments lacking to motivate a high xRisk from AI, and I wouldn’t disagree!

    * The methodological challenges you raise at the end of section 3 (without recapitulating part 6 of the series), seem to be based on base-rates again? That because the authors lack terminal degrees, their arguments are worth discounting? Or the fact that research funded by OP is less likely to be true/useful than all other non OP-funded research? I don’t have the same intuitive response here, and beyond a certain level of ‘conservation of academic energy’ whether Cotra’s arguments are convincing (for example) is to be found in the arguments themselves, and not the fact that she doesn’t have a PhD, which at best is only going to feed in indirectly to the truth of the argument.

    * To add to the above, I think at times you seem to be implying (in this blog and elsewhere) that unless research and arguments are published in peer-reviewed journals, they don’t count as evidence. I think that this is obviously false (ask Ramanujan, Eddington, or basically any scientific advance before the modern system of peer review was set up after the Second World War). I agree with you that EA-aligned researchers should submit their research for peer-review more, and engage more with the academic critiques of their work, but I think you perhaps lean far too hard on this from an epistemological standpoint (though perhaps from understandable frustration).

    * Coming into the conclusion, I think inscrutability over AI doesn’t just work against effective altruists, but everyone involved in AI. This is a historically new phenomenon for which we have vanishingly small historical track record, and flimsy-at-best analogies with other technologies. But that doesn’t mean the xRisk is ~0% from this perspective, it would be better to say that it is undefined. And under this deep uncertainty, humanity will still have to act. Governments, NGOs, and Private Companies will adopt norms, rules, and institutions to govern the development of increasingly advanced AI, and they will never have a definite answer on the xRisk from the scientific literature (at least, not for a long time).

    * Related to the above, there’s an implied assumption that ‘inscrutable’ phenomena can’t pose an xRisk, which doesn’t follow. Of course, you could argue that we simply shouldn’t believe in the xRisk even if it does exist, but that does cut against some other epistemological intuitions. I think you are also overgeneralising from the gaps in Carlsmith’s report to the case for AI risk as a whole (though again, I appreciate that might be ‘conservation of academic energy’ as opposed to unjustified induction).

    * Finally, I think that your arguments for not believing in AI risk arguments are a lot stronger than those of other critics. I think one of the primary reasons AI risk concerns are gaining ground recently is because the counter arguments are simply really poor, and when people without a background in the debate or a dog in the various fights look at the state of the arguments, they’re able to pick the more convincing side. I’m fully willing to buy that the above is wishful thinking on my part – but I think it accounts for at least some of the changing trends in AI risk discourse.

    Just want to sign off by saying that I appreciate your work, this blog, and I hope we can debate these issues in good faith and come to mutual understanding in the absence of mutual agreement.

    1. David Thorstad

      Thanks JWS! You have plenty of good-faith chips – I appreciate the pushback, and you’re well within the limits of acceptable discourse. It’s important to me that people be able to speak their minds.

      **On deference**

      When the main work of a section (such as Carlsmith’s section on AI timelines) is meant to be carried by the work of others, there are two importantly different ways that the cited work can be presented.

      On the one hand, the main views and arguments of the cited works can be described in sufficient detail to allow readers to engage with the arguments on their own merits. In this case, no deference is required: readers are asked to engage with the arguments on their own merits. But that’s not what Carlsmith does.

      On the other hand, the views can be quickly summarized and no substantial arguments given. That’s what Carlsmith does – I’ve quoted the relevant passage from Carlsmith to show that this is what Carlsmith does.

      In this case, readers aren’t being asked to evaluate the arguments on their own merits, since they aren’t given the material to do so. We could say either (as I have) that readers are being asked to defer to the cited views. Or we could say that no argument has been given, but that readers have been encouraged to look elsewhere and make up their own minds. I thought it would be uncharitable to say that Carlsmith doesn’t make an argument for his AI timelines, but I would be happy if readers wanted to interpret the Carlsmith report in this way.

      **On expertise**

      I’ve been increasingly troubled by the willingness of effective altruists to move beyond, and even attack standard markers of authority and expertise.

      Increasingly, effective altruists question the value of standard credentials (such as academic degrees in a relevant field) as indicators of authority to speak on the subject that they are credentials for. For example, you seem surprised that I urge more respect for the opinions of authors who hold terminal degrees in a relevant field than for the opinions of authors who don’t.

      Effective altruists also increasingly attack the reliability of reputable venues for publication, such as academic journals and mainstream media outlets, touting the merits of less-traditional sources such as forum posts, podcasts and blog posts. For example, after the TIME Magazine article on the treatment of women in effective altruism, Eliezer Yudkowsky told me directly that he thinks an average blogger is more reliable than a mainstream media outlet such as TIME Magazine. Yudkowsky and many others used this claim to cast doubt on the TIME Magazine report and on survivors of sexual harassment and abuse.

      It is certainly true that neither a fancy degree nor a fancy publication venue is necessary to do good work. But that does not mean that the education given by leading academic programs is worthless, or that the vetting process at leading publication venues is unreliable. Precisely for this reason, those with reputable credentials publishing in reputable venues are more likely to be doing reliable and high-quality work, and those interested in learning to reliably do high-quality work would be well-advised to pursue standardly accepted credentials.

      I will speak about many of these issues in my series on epistemics. The post on peer review is already written and should be out soon. I very much hope that effective altruists will come around on the importance of standard markers of expertise. Those working on short-termist causes (global health and development) know the importance of speaking to experts and reading publications in good journals. I hope that longtermists will join them.

      **Inscrutability for everyone**

      You are absolutely right that the long-term future of AI is inscrutable for everyone, not merely for those working on a particular aspect of the long-term future of AI (existential risk). It is precisely for this reason that I opened this post with a passage by Seth Lazar and colleagues urging us to “focus on the things we can study, understand and control—the design and real-world use of existing AI systems, their immediate successors, and the social and political systems of which they are part.” And that is the same reason why I emphasized that the track record of forecasting on AI timelines and many other long-term-ish AI-related questions has been generally poor.

      **Inscrutability and existential risk**

      The challenge posed by inscrutability isn’t that inscrutable risks could not possibly happen, but rather that it is hard to get much evidence bearing on inscrutable risks. This means that it is hard to come to be significantly confident that an inscrutable phenomenon poses an existential risk to humanity based on evidence. If effective altruists are confident today that artificial agents pose a significant existential risk, this view is driven largely by prior beliefs and fears about artificial intelligence.

      In this case, I think it may be helpful to return to our earlier discussion about whether discussions such as the Carlsmith report should be treated more as expressions of personal opinion than as arguments meant to persuade the unconvinced. It might well be true that the Carlsmith report is a good way of expressing the views held by those who think that artificial agents pose a significant existential risk to humanity. And it might well be true that, because of the inscrutability of the phenomenon, this really is more of a manifesto or an expression of personal viewpoint than a piece of outside-facing argumentation.

      **On critics**

      Like you, I am quite disappointed by the arguments made by some critics of AI risk arguments. Something I think we agree on is that the inscrutability of AI risks affects *everyone*. In light of the inscrutability of AI risk, people offering speculative arguments against particular AI risk scenarios aren’t likely to do any better than those offering speculative arguments for those scenarios. Add to that a general grumpiness on the part of critics and the fact that some critics have spent less time thinking about AI risk than EAs have, and it may well be that critics engaging in speculative argument are making worse arguments than the EAs are.

      Something you will notice is that I take inscrutability to heart in refusing to engage in detailed speculation about most AI risk scenarios. If I’m right about the degree of inscrutability here, then that is a fool’s errand. The right thing to do is to look at particular arguments for AI risk and show they aren’t as strong as people take them to be.

      One thing I would urge EAs to remember is that EA critics are a self-selected bunch. Most people don’t spend their time reading and critiquing any particular social movement. For that reason, I would urge EAs to form their opinions about artificial intelligence by thinking broadly about what experts on artificial intelligence say, write and do, and not narrowly about the particular experts who have been drawn into arguments on Twitter.

      **Some concluding remarks**

      Thanks, as always, JWS, for your constructive and helpful comments. Within reason, it’s of course okay to speak your mind – sometimes it’s necessary to speak a bit sharply, and as you note I’ve done that a time or two before. While I hope we can avoid gratuitous rudeness, these are high-stakes and emotionally laden epistemic issues that touch on very personal questions, such as who should be listened to and what should be read, that inevitably draw strong reactions.

      I appreciate your continued readership and engagement, and I look forward to hearing what you think of future posts!

  3. Joe

    Hi David,

    This is Joe Carlsmith, the author of the report. Thanks again for your engagement. A few quick responses:

    — Re: timelines: I wasn’t, in this report, trying to cover arguments about AI timelines in any depth, and it’s a giant topic in its own right. (Though even granted dissatisfaction with the sources I cite, I feel a bit confused about your own take on the topic. In your review of my report, for example, you give 75% that the timelines premise is true — which is actively *higher* than the 65% number you’re objecting to here.)

    — Re: damages and full disempowerment: it seems like you’re mostly focusing on my comments in section 8, where I offer my own probabilities for the premises in question. But this section isn’t supposed to be where the bulk of the report’s argumentation for any of the premises takes place (rather, it’s meant to briefly gesture at the considerations motivating my specific probabilities). The main argumentation comes earlier — and in particular, re: premises 4 and 5 (on damages and disempowerment), the argument is in sections 5 and 6, which together are about 15 pages long. The basic thought there is that by hypothesis, practically PS-misaligned systems are trying to seek power over humans, and if their goals would benefit from disempowering humans fully (I do think we can wonder about how much they’ll value power at different scales, but I don’t think this source of possible comfort is enough to dismiss the damages/disempowerment premises in the way you seem inclined to), then we will need to actively *stop them* from disempowering us if they get deployed. Section 5 argues that they might well get deployed. Section 6 argues that we might well not stop them.

    — These smaller points aside, I think our biggest disagreement probably comes down to some more fundamental difference in epistemic orientation. In particular, I think you are much more excited than I am about patterns of reasoning of the form “I have not yet seen an argument for p that meets blah standard (and in particular: standards I associate with academia), therefore p should be assigned low probability.” I disagree about how badly the report does on various object-level standards here, and I’m less excited about academia than you are more generally, but I’m typically going to be happy to say “sure, this wasn’t an open-and-shut deductive argument, there are ways blah premise could be false, and so on.” But some of the probabilities you are reaching (at least in your review) on the basis of your dissatisfaction (e.g., 1/1000 on the alignment premise, 1/500 on the disempowerment premise, one in five million on doom) seem to me unreasonably low.

    — (As a sidenote, I also wonder whether your probabilities adequately reflect reasonable probabilities that you will see blah standard reached in future. E.g., what’s your probability that a paper making the case for AI risk, or AI timelines, or whatever, meets blah standard of peer review, or not-being-affiliated-with-Open-Phil, or whatever, sometime before 2070? If this would make a sufficiently substantive difference to your probability, and is decently likely, then on Bayesian grounds, low probabilities become hard to sustain — more on these dynamics here:

    1. David Thorstad

      Thanks Joe!

      **On timelines**

      We’re definitely agreed that the report isn’t aiming to make a strong argumentative case for any particular view about AI timelines. My remarks here were meant to illustrate why I think a strong argument has not been made, and I think we agree on that.

      The question of whether it is appropriate to cite claims about AI timelines to others and move on, instead of arguing for them, may be more of a crux. Generally, this is considered appropriate when the sources cited are reputable and have been appropriately reviewed, and when they reflect the consensus of experts in the field. These conditions have not been met. As a result, I’m not quite sure what to do with a report that leaves views about timelines largely unargued for. Of course, I’m happy to move on to the rest of the report and treat this as a conditional claim, but then it needs to be noted that this is something that hasn’t yet been satisfactorily established.

      As for my own timelines, I’m honestly not sure what to say. I’m deeply skeptical of long-term forecasting and I don’t think I’m in an especially privileged position to predict the future of technology in 2070. I would only give a probability estimate if forced, as I was by the questions asked to me on the review form. To be honest, I wrote the review that you quote because I needed the money. I didn’t have any background in these issues to speak of, and I hadn’t thought much about them. I’m not entirely sure what I was thinking when I gave a 75% probability estimate here, and I’m not confident that my thoughts would be worth reconstructing.

      It’s possible that I was working with a relatively weak notion of an APS system. I suspect that many effective altruists think AI will advance considerably beyond a minimal APS system by 2070, and I also suspect that many of your arguments later in the report may require a stronger claim about timelines than that some APS system or other will be developed – it’s not so clear that a minimally capable APS system poses a high risk of permanently disempowering humanity.

      It might be that you mean the “A” in APS systems to bake in a very high degree of capability so that any APS system would by definition have a good shot at permanently disempowering humanity if it tried. That wasn’t how I read your definition of advanced (“they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation)”). But if that’s what you mean, I would want to revise my probability assignment radically downwards.

      **On damages**

      I’m a bit surprised to hear you say that Sections 5-6 make the argument that practically PS-misaligned APS systems pose a high risk of permanently disempowering humanity. As you say, Section 5 deals with the likelihood that APS systems will be deployed, not what they will do once they are deployed. And Section 6 doesn’t deal with the question of what APS systems will do, but rather with whether we can stop them from doing those things.

      To argue that practically PS-misaligned systems pose a high risk of permanently disempowering humanity, most people would think it necessary to argue that (a) practically PS-misaligned APS systems are likely to intend to permanently disempower humanity, and (b) they’re likely to be capable of pulling it off unless we do a great deal to stop them.

      Sections 5-6 do not establish (a), since they tell us nothing about what practically PS-misaligned systems are likely to intend. As discussed in Part 7 of this series, you could take (a) as already established by reading instrumental convergence in a strong way (I think I called this (ICC-3)), but then the question would be precisely what part of the argument for instrumental convergence established (a). We don’t get (a) from the simple observation that power would help AI achieve many things it could want to achieve.

      Sections 5-6 also do not establish (b). You could hold that any APS system would be capable of permanently disempowering humanity, but it is hard to see how this argument would go. I suspect you need to strengthen the claims about timelines in Section 2 to say that we’ll develop something far beyond a minimal APS system by 2070. Or perhaps you have a strong reading of “A” in mind, as discussed earlier. Either of these suspicions, if true, would speak in favor of giving an argument for timelines.

      You might instead say that the discussion of controlling capabilities in Section 4.3.2 speaks to (b), but this would be something of a stretch. The discussion of specialization just says that APS systems might be general, but doesn’t say what general APS systems can do. The discussion of scaling suggests that alignment might not be preserved under scaling, but doesn’t tell us that systems will in fact be scaled up to the point where they are powerful enough to disempower humanity. The same goes for the discussion of preventing problematic improvements: it would need to be established that agents can self-improve to become quite powerful.

      It would help to spell out something like the following reasoning in the report. You wrote in the comment above that:

      “The basic thought there is that by hypothesis, practically PS-misaligned systems are trying to seek power over humans, and (a?) if their goals would benefit from disempowering humans fully (I do think we can wonder about how much they’ll value power at different scales, but I don’t think this source of possible comfort is enough to dismiss the damages/disempowerment premises in the way you seem inclined to), (b?) then we will need to actively *stop them* from disempowering us if they get deployed”

      I think that you’ll probably need to establish (a)/(b) at the points marked in this passage, but I’m not entirely sure. In any case, I found the passage helpful, because the reasoning in this passage was not communicated in the report, and now I have a somewhat better idea of what your argument might be.

      **On epistemic orientation**

      Effective altruists make a number of very striking claims. One such claim is that AI may soon kill or permanently disempower us all. Another is that AI may soon become radically superintelligent. Another is the time of perils hypothesis. These claims are on their face quite improbable, so if we do not have significant evidence for those claims, we should not invest much probability in them. For this reason, I focus on showing that we don’t have much evidence for these claims.

      I know that many EAs invest significant prior probability in such claims. They should not do so.

      **On future papers**

      I think there are several people with the capability, and possibly the motivation, to write a paper making the case for AI risk that could clear peer review at a tip-top journal (special issues don’t count, though I’m glad to see the Phil Studies special issue in the works!). I think that several of the CAIS philosophy fellows have a shot at doing this, and I think that you might as well. I hope that those able to write such papers will be motivated to do so. It will do a good deal to put the discussion on a solid epistemic footing.

      I hope you will do this!
