Vulnerability to Poisoning and Backdoors
Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.
"The previous section explored jailbreaks and other forms of adversarial prompts as ways to elicit harmful capabilities acquired during pretraining. These methods make no assumptions about the training data. On the other hand, poisoning attacks (Biggio et al., 2012) perturb training data to introduce specific vulnerabilities, called backdoors, that can then be exploited at inference time by the adversary. This is a challenging problem in current large language models because they are trained on data gathered from untrusted sources (e.g. internet), which can easily be poisoned by an adversary (Carlini et al., 2023b)."(p. 73)
Sub-categories (21)
Natural Language Underspecifies Goals
"For LLM-agents, both the goal and environment observations are typically specified in the prompt through natural language. While natural language may provide a richer and more natural means of specifying goals than alternatives such as hand-engineering objective functions, natural language still suffers from underspecification (Grice, 1975; Piantadosi et al., 2012). Furthermore, in practice, users may neglect fully specifying their goals, especially the information pertaining to elements of the environment that ought not to be changed (the classic frame problem (Shanahan, 2016)). Such underspecification (D’Amour et al., 2020), if not accounted for, can result in negative side-effects (Amodei et al., 2016), i.e. the agent succeeding at the given task but also changing the environment in undesirable ways"
7.1 AI pursuing its own goals in conflict with human goals or values
Goal-Directedness Incentivizes Undesirable Behaviors
"Goal-directedness can cause agents to exhibit unethical and undesirable behaviors, such as deception (Ward et al., 2023), self-preservation (Hadfield-Menell et al., 2017), power-seeking, and immoral rea- soning (Pan et al., 2023a). Pan et al. (2023a) find that LLM-agents exhibit power-seeking behavior in text-based adventure games. LLM-agents have also been shown to use deception to achieve assigned goals when explicitly required by the task (Ward et al., 2023), or when the tasks can be more easily completed by employing deception and the prompt does not disallow deception (Scheurer et al., 2023a)."
7.2 AI possessing dangerous capabilities
Safety Risks from Affordances Provided to LLM-agents
"The capabilities of LLM-agents can be enhanced in significant ways by providing the LLM-agent with novel affordances, e.g. the ability to browse the web (Nakano et al., 2021), to manipulate objects in the physical world (Ahn et al., 2022; Huang et al., 2022a), to create and instruct copies of itself (Richards, 2023), to create and use new tools (Wang et al., 2023a), etc. Affordances can create additional risks, as they often increase the impact area of the language-agent, and they amplify the consequences of an agent’s failures and enable novel forms of failure modes (Ruan et al., 2023; Pan et al., 2024)."
7.2 AI possessing dangerous capabilities
Foundationality May Cause Correlated Failures
"Another important characteristic of LLM development is foundationality — due to the expense of large- scale pretraining, many deployed instances share similar or identical learned components. Foundation- ality may both be a blessing and a curse. On the one hand, it may be possible to exploit the similarity in the design of LLM-agents to facilitate cooperation (Critch et al., 2022; Conitzer and Oesterheld, 2023; Oesterheld et al., 2023). On the other hand, foundationality may leave LLM-agents vulnerable to correlated failures both in terms of safety and capabilities due to increased output homogenization (Bommasani et al., 2022)."
7.6 Multi-agent risks
Groups of LLM-Agents May Show Emergent Functionality
"Multi-agent learning, either through explicit finetuning or implicit in-context learning, may enable LLM-agents to influence each other during their interactions (Foerster et al., 2018). Under some environmental settings, this can create feedback loops that result in novel and emergent behaviors that would not manifest in the absence of multi-agent interactions (Hammond et al., 2024, Section 3.6). Emergent functionality is a safety risk in two ways. Firstly, it may itself be dangerous (Shevlane et al., 2023). Secondly, it makes assurance harder as such emergent behaviors are difficult to predict, and guard against, beforehand (Ecoffet et al., 2020)."
7.6 Multi-agent risks
Collusion between LLM-Agents
"While it would often be preferable for LLM-agents to be cooperative, cooperation can be undesirable if it undermines pro-social competition or produces negative externalities for coalition non-members (Dorner, 2021; Buterin, 2019; Dafoe et al., 2020). Collusion between relatively simple AI systems has been observed in the real world (Assad et al., 2020; Wieting and Sapi, 2021) and synthetic experiments (Brown and MacKay, 2023; Calvano et al., 2020; Klein, 2021) Collusion can occur through explicit or steganographic communication. Steganographic communication hides information in seemingly innocent content (Roger and Greenblatt, 2023), posing challenges for collusion monitoring and detection."
7.6 Multi-agent risks
Misinformation and Manipulation
"Recent studies have demonstrated that LLMs can be exploited to craft deceptive narratives with levels of persuasiveness similar to human-generated content (Pan et al., 2023b; Spitale et al., 2023), to fabri- cate fake news (Zellers et al., 2019; Zhou et al., 2023f), and to devise automated influence operations aimed at manipulating the perspectives of targeted audiences (Goldstein et al., 2023). LLMs have also been found to be used in malicious social botnets (Yang and Menczer, 2023), powering automated accounts used to disseminate coordinated messages. More broadly, the use of LLMs for the deliberate generation of misleading information could significantly lower the barrier for propaganda and manip- ulation (Aharoni et al., 2024), as LLMs can generate highly credible misinformation with significant cost-savings compared to human authorship (Musser, 2023), while achieving considerable scale and speed of content generation (Buchanan et al., 2021; Goldstein et al., 2023)."
4.3 Fraud, scams, and targeted manipulation
Cybersecurity
"LLMs may exacerbate cybersecurity risks in various ways (Newman, 2024). Firstly, LLMs may significantly amplify the effectiveness of deceptive operations aimed at tricking people into disclosing sensitive information or granting adversary access to critical resources. For example, LLMs might prove highly effective at crafting personalized phishing emails or messages at scale that may be harder for an average user to recognize as phishing attempts (Karanjai, 2022; Hazell, 2023). In addition to being directly harmful to the targeted individual, such ‘social engineering’ attacks are often the base of larger hacking operations (Plachkinova and Maurer, 2018; Salahdine and Kaabouch, 2019)."
4.3 Fraud, scams, and targeted manipulation
Surveillance and Censorship
"Content moderation has emerged as one of the key use-cases of LLMs (Weng et al., 2023), indicating the potential of LLMs for surveillance and censorship as well (Edwards, 2023). Surveillance and censorship are one of the primary tools employed by governments with dictatorial tendencies to suppress opposing political and social voices. These censorship measures, however, are often quite crude and can be escaped with little ingenuity...However, LLMs could enable significantly more sophisticated surveillance and censorship operations at scale (Feldstein, 2019). Multimodal-LLMs or LLMs combined with speech- to-text technologies could be used for surveilling and censoring other forms of communication as well, e.g. phone calls and video messages (Whittaker, 2019). This may collectively contribute towards the worsening of personal liberties and the heightening of state oppression across the world. Examples have been documented already, for instance in calling for violence and silencing of political dissidents (Aziz, 2020), and suppression of Palestinian social media accounts (Zahzah, 2021)."
4.1 Disinformation, surveillance, and influence at scale
Warfare and Physical Harm
"The use of AI in warfare is highly alarming and may pose dangers to human safety (Hendrycks et al., 2023). Autonomous drone warfare is being aggressively pursued as a tactic in the current war in Ukraine (Meaker, 2023), and may already have been used on human targets (Hambling, 2023). The use of AI- based facial recognition has been documented in the targeting of Palestinians in Gaza (International, 2023). LLMs have already been productized in limited ways for the purposes of warfare planning (Tarantola, 2023). Furthermore, active research is being carried out to develop multimodal-LLMs that can act as ‘brains’ for general-purpose robots (Ahn et al., 2022; 2024). Due to the ‘general-purpose’ nature of such advances, it will likely be cost-effective and practical to adapt them for creating more advanced autonomous weapons"
4.2 Cyberattacks, weapon development or use, and mass harm
Hazardous Biological and Chemical Technologies
"AI systems such as LLMs, chemical LLMs (Skinnider et al., 2021; Moret et al., 2023), and other LLM- based biological design tools might soon facilitate the production of bioweapons, chemical weapons, and other hazardous technologies. In particular, LLMs might enable actors with less expertise to more easily synthesize dangerous pathogens, while customized chemical and biological design tools might be more concerning in terms of expanding the capabilities of sophisticated actors (e.g. states) (Sandbrink, 2023). Gopal et al. (2023) and Soice et al. (2023) demonstrated that people with little background could use LLMs to help make progress towards developing pathogens such as the 1918 pandemic influenza. However, recent studies suggest that current LLMs are not more helpful than internet search in this regard (Mouton et al., 2024; Patwardhan et al., 2024)."
4.2 Cyberattacks, weapon development or use, and mass harm
Domain-Specific Misuses
"Improvements in LLMs may exert greater pressure to apply LLMs to various domains, such as health and education (Eloundou et al., 2023). Crude efforts to use LLMs in such domains, however, may incur harm and should be discouraged strongly. In particular, it is important to guard against different ways in which LLMs may be misused within any domain. One famous episode of misuse within the health sector is a mental health non-profit experimenting LLM-based therapy on its users without their informed consent (Xiang, 2023a). Within the education sector, LLMs may be misused in various ways that might impact student learning; e.g. as cheating accessory by the students or as (low quality) evaluator of student’s work by the instructors (Cotton et al., 2023). Recent findings in moral psychology also suggest that LLMs can generate moral evaluations that people perceive as superior to human judgments; these could be misused to create compelling yet harmful moral guidance (Aharoni et al., 2024). Similar risks of misuse may exist in other domains as well."
4.3 Fraud, scams, and targeted manipulation
Harms of Representation and Other Biases
"A pretrained LLM generally has many of the stereotypical biases commonly present in the human society (Touvron et al., 2023). This makes it difficult for users to trust that LLMs will work well for them and not produce unfair or biased responses. Appropriate finetuning can effectively limit the bias displayed in LLM outputs in a variety of situations, e.g. when models are explicitly prompted with stereotypes (Wang et al., 2023k), but it does not ‘solve’ the problem. Even after finetuning, biases often resurface when deliberately elicited (Wang et al., 2023k), or under novel scenarios, e.g. in writing reference letters (Wan et al., 2023a), generating synthetic training data (Yu et al., 2023c), screening resumes (Yin et al., 2024) or when used as LLM-agents (Pan et al., 2024)."
1.1 Unfair discrimination and misrepresentation
Inconsistent Performance across and within Domains
"Estimating true capabilities of an LLM is a difficult task (c.f. Section 3.3), especially for naive users unfamiliar with the brittle nature of machine learning technologies. Exaggeration of model capabilities by the developers (Lambert, 2023; Blair-Stanek et al., 2023), and issues such as task-contamination (Roberts et al., 2023b), underrepresentation of tasks or domains (Wu et al., 2023a; McCoy et al., 2023), and prompt-sensitivity (Anthropic, 2023d) may cause a user to misestimate the true capabilities of a model. This lack of reliability can undermine user trust or cause harm if a user bases their decision on incorrect or misleading information provided by an LLM."
5.1 Overreliance and unsafe use
Overreliance
"If a user begins to excessively trust an LLM, this may cause them to develop an overreliance on the LLM. Overreliance can result in automation bias (Kupfer et al., 2023), and can cause errors of omission (user choosing not to verify the validity of a response) and errors of commission (user believing and acting on the basis of the LLM’s response, even if it contradicts their own knowledge) (Skitka et al., 1999). It can be particularly dangerous in domains where the user may lack relevant expertise to robustly scrutinize the LLM responses. This is particularly a source of risk for LLMs because LLMs can often generate plausible, yet incorrect or unfaithful, rationalizations of their actions (c.f. Section 3.4.10), which can mistakenly cause the user to develop the belief that LLM has the relevant expertise and has provided a valid response"
5.1 Overreliance and unsafe use
Effects on the Workforce
"Rapid advances in LLMs pose three distinct sets of challenges for workers’ incomes (Korinek and Stiglitz, 2019; Susskind, 2023). First, they are likely to accelerate the rate of job turnover and disruption —– affecting more workers, including more highly skilled workers, and making the adjustment process for society more difficult than what we were used to from prior technological advances...Second, although technological progress means that society may produce more wealth overall, there is a risk that the general-purpose nature of LLMs may lead to progress that is biased against labor, meaning that the share of that wealth that goes to labor may decline...Third, if future LLMs and robots advance to the point where they can perform virtually all the work tasks, they would disrupt labor markets more fundamentally: if machines can do workers’ jobs, wages would fall would disrupt labor markets more fundamentally: if machines can do workers’ jobs, wages would fall to machines’ user cost (Korinek and Juelfs, 2023). This would pose fundamental challenges for labor markets and income distribution (Korinek, 2023)."
6.2 Increased inequality and decline in employment quality
Effects on Inequality
"LLMs could potentially worsen socioeconomic inequalities (Capraro et al., 2023). Effects on inequal- ity are closely linked to the effects of LLMs on workers but ultimately depend on how the fruits of technological progress are distributed...First, if the role and compensation of capital rise and the role and compensation of labor decline in an LLM-powered economy, inequality may go up because work is the main source of income for the majority of people...Second, the large fixed cost of training cutting-edge LLMs and the network effects involved imply that the market for the most advanced LLMs tends towards a natural monopoly structure in which only one or a small number of players will be successful, a phenomenon that has been termed ‘algorithmic monoculture’ in the literature (Kleinberg and Raghavan, 2021; Bommasani et al., 2022). As a result, LLM developers may amass significant market power. This might result in reduced social welfare, and lead to LLM-providers extracting monopoly rents from their customers (Kleinberg and Raghavan, 2021; Jagadeesan et al., 2023)...Third, as LLMs are becoming more powerful, who has access and who hasn’t is becoming a more and more important question. For example, automated coding tools have been shown to produce significant productivity gains, e.g. > 50% in some cases (Peng et al., 2023). Individuals who don’t have access —– whether it is for financial reasons, for reasons of education, because of corporate or governmental policies, or for geopolitical reasons — might be at a growing disadvantage"
6.2 Increased inequality and decline in employment quality
Global Economic Development
"Many of the themes and challenges that we discussed above come together when analyzing the socioeconomic effects on developing countries. The workforce of developing countries may suffer from a retrenchment of outsourcing as many simple cognitive tasks that used to be performed in developing countries — for example, in call centers –— can be automated with LLMs. This may adversely affect the economies of the poor countries (Georgieva, 2024)."
6.2 Increased inequality and decline in employment quality
Exploiting Limited Generalization of Safety Finetuning
"Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2)."
2.2 AI system security vulnerabilities and attacks
“Model Psychology” Attacks
"LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c)."
2.2 AI system security vulnerabilities and attacks
Attacking LLMs via Additional Modalities
"LLMs can now process modalities other than text, e.g. images or video frames (OpenAI, 2023c; Gemini Team, 2023). Several studies show that gradient-based attacks on multimodal models are easy and effective (Carlini et al., 2023a; Bailey et al., 2023; Qi et al., 2023b). These attacks manipulate images that are input to the model (via an appropriate encoding). GPT-4Vision (OpenAI, 2023c) is vulnerable to jailbreaks and exfiltration attacks through much simpler means as well, e.g. writing jailbreaking text in the image (Willison, 2023a; Gong et al., 2023). For indirect prompt injection, the attacker can write the text in a barely perceptible color or font, or even in a different modality such as Braille (Bagdasaryan et al., 2023)."
2.2 AI system security vulnerabilities and attacks
Other risks from Anwar et al. (2024) (26)
Agentic LLMs Pose Novel Risks
7.2 AI possessing dangerous capabilities
Multi-Agent Safety Is Not Assured by Single-Agent Safety
7.6 Multi-agent risks
Dual-Use Capabilities Enable Malicious Use and Misuse of LLMs
4.0 Malicious Actors & Misuse
Corporate power may impede effective governance
6.1 Power centralization and unfair distribution of benefits
Jailbreaks and Prompt Injections Threaten Security of LLMs
2.2 AI system security vulnerabilities and attacks