AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.
"" AI assistants will need to coordinate with other AI assistants and with humans other than their principal users. This chapter explores the societal risks associated with the aggregate impact of AI assistants whose behaviour is aligned to the interests of particular users. For example, AI assistants may face collective action problems where the best outcomes overall are realised when AI assistants cooperate but where each AI assistant can secure an additional benefit for its user if it defects while others cooperate""(p. 138)
Sub-categories (5)
Equality and inequality
"AI assistant technology, like any service that confers a benefit to a user for a price, has the potential to disproportionately benefit economically richer individuals who can afford to purchase access (see Chapter 15). On a broader scale, the capabilities of local infrastructure may well bottleneck the performance of AI assistants, for example if network connectivity is poor or if there is no nearby data centre for compute. Thus, we face the prospect of heterogeneous access to technology, and this has been known to drive inequality (Mirza et al., 2019; UN, 2018; Vassilakopoulou and Hustad, 2023). Moreover, AI assistants may automate some jobs of an assistive nature, thereby displacing human workers; a process which can exacerbate inequality (Acemoglu and Restrepo, 2022; see Chapter 17). Any change to inequality almost certainly implies an alteration to the network of social interactions between humans, and thus falls within the frame of cooperative AI. AI assistants will arguably have even greater leverage over inequality than previous technological innovations. Insofar as they will play a role in mediating human communication, they have the potential to generate new ‘in-group, out-group’ effects (Efferson et al., 2008; Fu et al., 2012). Suppose that the users of AI assistants find it easier to schedule meetings with other users. From the perspective of an individual user, there are now two groups, distinguished by ease of scheduling. The user may experience cognitive similarity bias whereby they favour other users (Orpen, 1984; Yeong Tan and Singh, 1995), further amplified by ease of communication with this ‘in-group’. Such effects are known to have an adverse impact on trust and fairness across groups (Chae et al., 2022; Lei and Vesely, 2010). Insomuch as AI assistants have general-purpose capabilities, they will confer advantages on users across a wider range of tasks in a shorter space of time than previous technologies. While the telephone enabled individuals to communicate more easily with other telephone users, it did not simultaneously automate aspects of scheduling, groceries, job applications, rent negotiations, psychotherapy and entertainment. The fact that AI assistants could affect inequality on multiple dimensions simultaneously warrants further attention (see Chapter 15)."
6.1 Power centralization and unfair distribution of benefitsCommitment
"The landscape of advanced assistant technologies will most likely be heterogeneous, involving multiple service providers and multiple assistant variants over geographies and time. This heterogeneity provides an opportunity for an ‘arms race’ in terms of the commitments that AI assistants make and are able to execute on. Versions of AI assistants that are better able to credibly commit to a course of action in interaction with other advanced assistants (and humans) are more likely to get their own way and achieve a good outcome for their human principal, but this is potentially at the expense of others (Letchford et al., 2014). Commitment does not carry an inherent ethical valence. On the one hand, we can imagine that firms using AI assistant technology might bring their products to market faster, thus gaining a commitment advantage (Stackelberg, 1934) by spurring a productivity surge of wider benefit to society. On the other hand, we can also imagine a media organisation using AI assistant technology to produce a large number of superficially interesting but ultimately speculative ‘clickbait’ articles, which divert attention away from more thoroughly researched journalism. The archetypal game-theoretic illustration of commitment is in the game of ‘chicken’ where two reckless drivers must choose to either drive straight at each other or swerve out of the way. The one who does not swerve is seen as the braver, but if neither swerves, the consequences are calamitous (Rapoport and Chammah, 1966). If one driver chooses to detach their steering wheel, ostentatiously throwing it out of the car, this credible commitment effectively forces the other driver to back down and swerve. Seen this way, commitment can be a tool for coercion. Many real-world situations feature the necessity for commitment or confer a benefit on those who can commit credibly. If Rita and Robert have distinct preferences, for example over which restaurant to visit, who to hire for a job or which supplier to purchase from, credible commitment provides a way to break the tie, to the greater benefit of the individual who committed. Therefore, the most ‘successful’ assistants, from the perspective of their human principal, will be those that commit the fastest and the hardest. If Rita succeeds in committing, via the leverage of an AI assistant, Robert may experience coercion in the sense that his options become more limited (Burr et al., 2018), assuming he does not decide to bypass the AI assistant entirely. Over time, this may erode his trust in his relationship with Rita (Gambetta, 1988). Note that this is a second-order effect: it may not be obvious to either Robert or Rita that the AI assistant is to blame. The concern we should have over the existence and impact of coercion might depend on the context in which the AI assistant is used and on the level of autonomy which the AI assistant is afforded. If Rita and Robert are friends using their assistants to agree on a restaurant, the adverse impact may be small. If Rita and Robert are elected representatives deciding how to allocate public funds between education and social care, we may have serious misgivings about the impact of AI-induced coercion on their interactions and decision-making. These misgivings might be especially large if Rita and Robert delegate responsibility for budgetary details to the multi-AI system. The challenges of commitment extend far beyond dyadic interpersonal relationships, including in situations as varied as many-player competition (Hughes et al., 2020), supply chains (Hausman and Johnston, 2010), state capacity (Fjelde and De Soysa, 2009; Hofmann et al., 2017) and psychiatric care (Lidz, 1998). Assessing the impact of AI assistants in such complicated scenarios may require significant future effort if we are to mitigate the risks. The particular commitment capabilities and affordances of AI assistants also offer opportunities to promote cooperation. Abstractly speaking, the presence of commitment devices is known to favour the evolution of cooperation (Akdeniz and van Veelen, 2021; Han et al., 2012). More concretely, AI assistants can make commitments which are verifiable, for instance in a programme equilibrium (Tennenholtz, 2004). Human principals may thus be able to achieve Pareto-improving outcomes by delegating decision-making to their respective AI representatives (Oesterheld and Conitzer, 2022). To give another example, AI assistants may provide a means through which to explore a much larger space of binding cooperative agreements between individuals, firms or nation states than is tractable in ‘face-to-face’ negotiation. This opens up the possibility of threading the needle more successfully in intricate deals on challenging issues like trade agreements or carbon credits, with the potential for guaranteeing cooperation via automated smart contracts or zero-knowledge mechanisms (Canetti et al., 2023)."
7.1 AI pursuing its own goals in conflict with human goals or valuesCollective action problems
"Collective action problems are ubiquitous in our society (Olson Jr, 1965). They possess an incentive structure in which society is best served if everyone cooperates, but where an individual can achieve personal gain by choosing to defect while others cooperate. The way we resolve these problems at many scales is highly complex and dependent on a deep understanding of the intricate web of social interactions that forms our culture and imprints on our individual identities and behaviours (Ostrom, 2010). Some collective action problems can be resolved by codifying a law, for instance the social dilemma of whether or not to pay for an item in a shop. The path forward here is comparatively easy to grasp, from the perspective of deploying an AI assistant: we need to build these standards into the model as behavioural constraints. Such constraints would need to be imposed by a regulator or agreed upon by practitioners, with suitable penalties applied should the constraint be violated so that no provider had the incentive to secure an advantage for users by defecting on their behalf. However, many social dilemmas, from the interpersonal to the global, resist neat solutions codified as laws. For example, to what extent should each individual country stop using polluting energy sources? Should I pay for a ticket to the neighbourhood fireworks show if I can see it perfectly well from the street? The solutions to such problems are deeply related to the wider societal context and co-evolve with the decisions of others. Therefore, it is doubtful that one could write down a list of constraints a priori that would guarantee ethical AI assistant behaviour when faced with these kinds of issues. From the perspective of a purely user-aligned AI assistant, defection may appear to be the rational course of action. Only with an understanding of the wider societal impact, and of the ability to co-adapt with other actors to reach a better equilibrium for all, can an AI assistant make more nuanced – and socially beneficial – recommendations in these situations. This is not merely a hypothetical situation; it is well-known that the targeted provision of online information can drive polarisation and echo chambers (Milano et al., 2021; Burr et al., 2018; see Chapter 16) when the goal is user engagement rather than user well-being or the cohesion of wider society (see Chapter 6). Similarly, automated ticket buying software can undermine fair pricing by purchasing a large number of tickets for resale at a profit, thus skewing the market in a direction that profits the software developers at the expense of the consumer (Courty, 2019). User-aligned AI assistants have the potential to exacerbate these problems, because they will endow a large set of users with a powerful means of enacting self-interest without necessarily abiding by the social norms or reputational incentives that typically curb self-interested behaviour (Ostrom, 2000; see Chapter 5). Empowering ever-better personalisation of content and enaction of decisions purely for the fulfilment of the principal’s desires runs ever greater risks of polarisation, market distortion and erosion of the social contract. This danger has long been known, finding expression in myth (e.g. Ovid’s account of the Midas touch) and fable (e.g. Aesop’s tale of the tortoise and the eagle), not to mention in political economics discourse on the delicate braiding of the social fabric and the free market (Polanyi, 1944). Following this cautionary advice, it is important that we ascertain how to endow AI assistants with social norms in a way that generalises to unseen situations and which is responsive to the emergence of new norms over time, thus preventing a user from having their every wish granted. AI assistant technology offers opportunities to explore new solutions to collective action problems. Users may volunteer to share information so that networked AI assistants can predict future outcomes and make Pareto-improving choices for all, for example by routing vehicles to reduce traffic congestion (Varga, 2022) or by scheduling energy-intensive processes in the home to make the best use of green electricity (Fiorini and Aiello, 2022). AI assistants might play the role of mediators, providing a new mechanism by which human groups can self-organise to achieve public investment (Koster et al., 2022) or to reach political consensus (Small et al., 2023). Resolving collective action problems often requires a critical mass of cooperators (Marwell and Oliver, 1993). By augmenting human social interactions, AI assistants may help to form and strengthen the weak ties needed to overcome this start-up problem (Centola, 2013)."
5.2 Loss of human agency and autonomyInstitutional responsibilities
"Efforts to deploy advanced assistant technology in society, in a way that is broadly beneficial, can be viewed as a wicked problem (Rittel and Webber, 1973). Wicked problems are defined by the property that they do not admit solutions that can be foreseen in advance, rather they must be solved iteratively using feedback from data gathered as solutions are invented and deployed. With the deployment of any powerful general-purpose technology, the already intricate web of sociotechnical relationships in modern culture are likely to be disrupted, with unpredictable externalities on the conventions, norms and institutions that stabilise society. For example, the increasing adoption of generative AI tools may exacerbate misinformation in the 2024 US presidential election (Alvarez et al., 2023), with consequences that are hard to predict. The suggestion that the cooperative AI problem is wicked does not imply it is intractable. However, it does have consequences for the approach that we must take in solving it. In taking the following approach, we will realise an opportunity for our institutions, namely the creation of a framework for managing general-purpose AI in a way that leads to societal benefits and steers away from societal harms. First, it is important that we treat any ex ante claims about safety with a healthy dose of scepticism. Although testing the safety and reliability of an AI assistant in the laboratory is undoubtedly important and may largely resolve the alignment problem, it is infeasible to model the multiscale societal effects of deploying AI assistants purely via small-scale controlled experiments (see Chapter 19). Second, then, we must prioritise the science of measuring the effects, both good and bad, that advanced assistant technologies have on society’s cooperative infrastructure (see Chapters 4 and 16). This will involve continuous monitoring of effects at the societal level, with a focus on those who are most affected, including non-users. The means and metrics for such monitoring will themselves require iteration, co-evolving with the sociotechnical system of AI assistants and humans. The Collingridge dilemma suggests that we should be particularly careful and deliberate about this ‘intelligent trial and error’ process so as both to gather information about the impacts of AI assistants and to prevent undesirable features becoming embedded in society (Collingridge, 1980). Third, proactive independent regulation may well help to protect our institutions from unintended consequences, as it has done for technologies in the past (Wiener, 2004). For instance, we might seek, via engagement with lawmakers, to emulate the ‘just culture’ in the aviation industry, which is characterised by openly reporting, investigating and learning from mistakes (Reason, 1997; Syed, 2015). A regulatory system may require various powers, such as compelling developers to ‘roll back’ an AI assistant deployment, akin to product recall obligations for aviation manufacturers."
6.5 Governance failureRunaway processes
The 2010 flash crash is an example of a runaway process caused by interacting algorithms. Runaway processes are characterised by feedback loops that accelerate the process itself. Typically, these feedback loops arise from the interaction of multiple agents in a population... Within highly complex systems, the emergence of runaway processes may be hard to predict, because the conditions under which positive feedback loops occur may be non-obvious. The system of interacting AI assistants, their human principals, other humans and other algorithms will certainly be highly complex. Therefore, there is ample opportunity for the emergence of positive feedback loops. This is especially true because the society in which this system is embedded is culturally evolving, and because the deployment of AI assistant technology itself is likely to speed up the rate of cultural evolution – understood here as the process through which cultures change over time – as communications technologies are wont to do (Kivinen and Piiroinen, 2023). This will motivate research programmes aimed at identifying positive feedback loops early on, at understanding which capabilities and deployments dampen runaway processes and which ones amplify them, and at building in circuit-breaker mechanisms that allow society to escape from potentially vicious cycles which could impact economies, government institutions, societal stability or individual freedoms (see Chapters 8, 16 and 17). The importance of circuit breakers is underlined by the observation that the evolution of human cooperation may well be ‘hysteretic’ as a function of societal conditions (Barfuss et al., 2023; Hintze and Adami, 2015). This means that a small directional change in societal conditions may, on occasion, trigger a transition to a defective equilibrium which requires a larger reversal of that change in order to return to the original cooperative equilibrium. We would do well to avoid such tipping points. Social media provides a compelling illustration of how tipping points can undermine cooperation: content that goes ‘viral’ tends to involve negativity bias and sometimes challenges core societal values (Mousavi et al., 2022; see Chapter 16). Nonetheless, the challenge posed by runaway processes should not be regarded as uniformly problematic. When harnessed appropriately and suitably bounded, we may even recruit them to support beneficial forms of cooperative AI. For example, it has been argued that economically useful ideas are becoming harder to find, thus leading to low economic growth (Bloom et al., 2020). By deploying AI assistants in the service of technological innovation, we may once again accelerate the discovery of ideas. New ideas, discovered in this way, can then be incorporated into the training data set for future AI assistants, thus expanding the knowledge base for further discoveries in a compounding way. In a similar vein, we can imagine AI assistant technology accumulating various capabilities for enhancing human cooperation, for instance by mimicking the evolutionary processes that have bootstrapped cooperative behavior in human society (Leibo et al., 2019). When used in these ways, the potential for feedback cycles that enable greater cooperation is a phenomenon that warrants further research and potential support."
7.1 AI pursuing its own goals in conflict with human goals or valuesOther risks from Gabriel et al. (2024) (69)
Capability failures
7.3 Lack of capability or robustnessCapability failures > Lack of capability for task
7.3 Lack of capability or robustnessCapability failures > Difficult to develop metrics for evaluating benefits or harms caused by AI assistants
6.5 Governance failureCapability failures > Safe exploration problem with widely deployed AI assistants
7.3 Lack of capability or robustnessGoal-related failures
7.1 AI pursuing its own goals in conflict with human goals or valuesGoal-related failures > Misaligned consequentialist reasoning
7.3 Lack of capability or robustness