Jailbreak in LLM Malicious Use - White & Black Box Attacks
Vulnerabilities in AI systems, software development toolchains, and hardware that can be exploited, leading to unauthorized access, data and privacy breaches, or system manipulation that causes unsafe outputs or behavior.
"In the fine-tuning and alignment phase, elaborately designed instruction datasets can be utilized to fine-tune LLMs to drive them to perform undesirable behaviors, such as generating harmful information or content that violates ethical norms, and thus achieve a jailbreak. Based on the accessibility to the model parameters, we can categorize them into white-box and black-box attacks. For white-box attacks, we can jailbreak the model by modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune the Llama2’s 7B, 13B, and 70B as well as Mixtral on AdvBench and RefusalBench datasets. The test results show that the fine-tuned model has significantly lower rejection rates on harmful instructions, which indicates a successful jailbreak. Other works focus on jailbreaking in black-box models. In [160], Qi et al. first constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. The results show that they were able to successfully bypass the security of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even if the model has good security properties in its initial state, it may be much less secure after user-customized fine-tuning."(p. 22)
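The "rejection rate" metric these studies report can be sketched as a simple keyword heuristic over model responses, in the spirit of AdvBench-style evaluations: a response counts as a refusal if it contains a stock refusal phrase, and a sharp drop in the refusal rate after fine-tuning signals a successful jailbreak. The marker list and responses below are illustrative, not the exact protocol from [107] or [160].

```python
# Minimal sketch: estimating a model's refusal rate on harmful instructions.
# The refusal-phrase list is an illustrative heuristic, not the papers' exact judge.

REFUSAL_MARKERS = [
    "i'm sorry", "i cannot", "i can't", "as an ai",
    "i am not able", "it is not appropriate",
]

def is_refusal(response: str) -> bool:
    """Heuristic: flag a response as a refusal if it contains a refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals over a benchmark of harmful prompts."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Comparing this rate before and after fine-tuning is what distinguishes an aligned model from a jailbroken one in the results quoted above.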
Other risks from Wang et al. (2025) (11)
2.2 AI system security vulnerabilities and attacks
Privacy - Membership Inference Attack (MIA)
Privacy - Data Extraction Attack (DEA)
Privacy - Prompt Inversion Attack (PIA)
Privacy - Attribute Inference Attack (AIA)
Privacy - Model Extraction Attack (MEA)
Hallucination
3.1 False or misleading information
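As a concrete illustration of the first privacy risk listed above, the simplest form of membership inference thresholds the model's per-example loss: training members tend to be fit more tightly and thus have lower loss than held-out examples. A minimal sketch, assuming per-example losses have already been obtained by querying the model (the threshold and loss values are illustrative stand-ins):

```python
# Minimal loss-threshold membership inference sketch.
# Assumes access to a per-example loss for each candidate record;
# the losses and threshold here are hypothetical.

def infer_membership(per_example_losses: list[float], threshold: float) -> list[bool]:
    """Predict 'training member' for examples whose loss falls below the threshold,
    exploiting the fact that models typically fit training data more tightly."""
    return [loss < threshold for loss in per_example_losses]

# Illustrative usage: low losses suggest members, higher losses suggest non-members.
member_losses = [0.1, 0.3, 0.2]      # hypothetical losses on training examples
nonmember_losses = [1.2, 0.9, 1.5]   # hypothetical losses on held-out examples
preds = infer_membership(member_losses + nonmember_losses, threshold=0.5)
```

In practice the threshold is calibrated on a shadow or held-out set, and stronger attacks replace the raw loss with calibrated statistics, but the loss gap is the core signal all of these variants exploit.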