Value-related risks in LLMs
AI systems acting in conflict with ethical standards or with human goals or values, especially the goals of designers or users. These misaligned behaviors may be introduced by humans during design and development (e.g., through reward hacking or goal misgeneralisation), or may result from AI using dangerous capabilities (e.g., manipulation, deception, situational awareness) to seek power, self-proliferate, or achieve other goals.
"As the general capabilities of LLM-empowered systems improve, the negative consequences and risks induced by these systems also get increasingly alarming accordingly, especially in high-stakes areas [28, 146]. Although they may not be intentionally introduced, severe problematic issues related to human values can be raised. Specifically, even before language models become extremely large, pre-trained language models have already exhibited a certain degree of value judgments. For example, Schramowski et al. [171] reveal the existence of the moral direction with the sentence embeddings of moral questions. However, the distribution of the pre-training corpora may not match exactly with that of the human society [56] and pieces of knowledge are not guaranteed to be equally learned. As a result, value mismatches may occur."(p. 16)
Supporting Evidence (1)
"It has been shown that in unambiguous scenarios with correct answers (e.g., Should I kill a pedestrian on the road?), most LLMs make the same moral choices as the commonsense. However, in ambiguous scenarios with no commonsense agreements (e.g., Should I tell a white lie?), some models show clear inclinations regardless of the inherent moral ambiguity, where human-aligned models show similar intra-model preferences. As such, moral biases are elicited. Similarly, dimensions such as fairness [60, 78, 116, 152], safety [68, 78, 187, 243], legality [78], and offense [41] are attended to and raised caution of. Circumstances like these will be especially concerning if some populations are put into more disadvantaged positions, which exacerbates the existing social inequality and injustice. For example, Santurkar et al. [169] show that LLMs are left-leaning and make communities such as the elderly, Mormon, and widowed underrepresented."(p. 16)
Other risks from Wang et al. (2025) (11)
Privacy - Membership Inference Attack (MIA): 2.2 AI system security vulnerabilities and attacks (a baseline sketch follows this list)
Privacy - Data Extraction Attack (DEA): 2.2 AI system security vulnerabilities and attacks
Privacy - Prompt Inversion Attack (PIA): 2.2 AI system security vulnerabilities and attacks
Privacy - Attribute Inference Attack (AIA): 2.2 AI system security vulnerabilities and attacks
Privacy - Model Extraction Attack (MEA): 2.2 AI system security vulnerabilities and attacks
Hallucination: 3.1 False or misleading information
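The first privacy entry has a well-known loss-threshold baseline (in the spirit of Yeom et al.): text seen during training tends to receive lower loss than unseen text. The sketch below illustrates that generic idea only, not Wang et al.'s formulation; the target model and threshold value are assumptions.

```python
# Minimal loss-threshold membership inference sketch: candidate texts
# with unusually low average loss are flagged as likely training
# members. Generic baseline; model and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in target model
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)
lm.eval()

def avg_loss(text: str) -> float:
    """Average negative log-likelihood the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)  # HF shifts labels internally
    return out.loss.item()

THRESHOLD = 3.0  # assumed; calibrated on known non-members in practice
candidate = "My email address is example@example.com"
loss = avg_loss(candidate)
print(f"loss={loss:.2f} ->",
      "likely member" if loss < THRESHOLD else "likely non-member")
```

In practice the threshold is calibrated against reference data or a shadow model; the other listed attacks (DEA, PIA, AIA, MEA) exploit the same access to model outputs but target training text, prompts, attributes, or the model's parameters instead.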