Encoded reasoning

Sub-category
Risk Domain

AI systems that develop, access, or are provided with capabilities that increase their potential to cause mass harm through deception, weapons development and acquisition, persuasion and manipulation, political strategy, cyber-offense, AI development, situational awareness, and self-proliferation. These capabilities may cause mass harm due to malicious human actors, misaligned AI systems, or failure in the AI system.

"Models can employ steganography techniques to encode their intermediate reasoning steps in ways that are not interpretable by humans [166]. Since encoded reasoning can improve model performance, this tendency might naturally emerge and become more pronounced with more capable models." (p. 25)

Part of Model Evaluations (Interpretability/Explainability)

Other risks from Gipiškis2024 (144)