AI that exposes users to harmful, abusive, unsafe, or inappropriate content, potentially including advice that encourages harmful action. Examples of toxic content include hate speech, violence, extremism, illegal acts, and child sexual abuse material, as well as content that violates community norms, such as profanity, inflammatory political speech, or pornography.
"Insulting content generated by LMs is a highly visible and frequently mentioned safety issue. Mostly, it is unfriendly, disrespectful, or ridiculous content that makes users uncomfortable and drives them away. It is extremely hazardous and could have negative social consequences." (p. 3)
Other risks from Sun et al. (2023) (14)
Instruction Attacks → 2.2 AI system security vulnerabilities and attacks
Instruction Attacks > Goal Hijacking → 2.2 AI system security vulnerabilities and attacks
Instruction Attacks > Prompt Leaking → 2.1 Compromise of privacy by leaking or correctly inferring sensitive information
Instruction Attacks > Role Play Instruction → 2.2 AI system security vulnerabilities and attacks
Instruction Attacks > Unsafe Instruction Topic → 2.2 AI system security vulnerabilities and attacks
Instruction Attacks > Inquiry with Unsafe Opinion → 2.2 AI system security vulnerabilities and attacks