Runtime behavior observation, anomaly detection, and activity logging.
Anomaly detection in the latent space is employed to track patterns that emerge deep within the model’s architecture. By establishing a statistical baseline of what constitutes "normal" latent space activity, the framework can detect deviations that suggest potential risks, such as information leakage or bias amplification. The system monitors latent space activity continuously at runtime, so that anomalies indicating an ongoing attack or an emergent bias missed during training can be flagged as they occur [1].
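The framework does not specify how the statistical baseline is constructed. One common choice is to fit a Gaussian over latent activations collected during normal operation and score new vectors by Mahalanobis distance; the sketch below illustrates this under that assumption (the function names and the synthetic data are illustrative, not from the paper).

```python
import numpy as np

def fit_baseline(activations):
    """Fit a Gaussian baseline (mean, inverse covariance) over 'normal' latent activations."""
    mu = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x, mu, cov_inv):
    """Distance of a latent vector from the baseline; large values flag anomalies."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Baseline from 1000 "normal" latent vectors, then score a strongly shifted one
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1000, 8))
mu, cov_inv = fit_baseline(normal)
typical = mahalanobis_score(normal[0], mu, cov_inv)
outlier = mahalanobis_score(np.full(8, 6.0), mu, cov_inv)
assert outlier > typical  # the shifted vector scores far above the baseline
```

In a deployed monitor, scores above a threshold calibrated on held-out normal activity would trigger review; the threshold itself is a policy decision the paper leaves open.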
Reasoning
Anomaly detection flags unusual latent space activity during runtime execution.
Data Layer
At the data layer, risks primarily involve biases, noisy data, and privacy violations that could influence model behavior later in the lifecycle. This layer applies formal verification techniques to ensure that the data used for training is both consistent and free from harmful biases or inaccuracies.
1.1.1 Training Data
Data Layer > Formal Verification Algorithm
We deploy a bias filtering algorithm that uses mathematical models to verify that the training data does not reinforce societal biases, such as gender or racial biases. This approach detects anomalies within the dataset, ensuring that harmful patterns are flagged before they can affect model outputs. This algorithm functions by analyzing patterns in the data that deviate from established baselines of fairness and accuracy.
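The paper describes the bias filter only at this level of abstraction. One concrete instance of "patterns that deviate from established baselines of fairness" is a demographic parity check over the training labels; the sketch below uses that statistic as a stand-in (the 0.1 threshold and all names are illustrative assumptions, not values from the paper).

```python
import numpy as np

def demographic_parity_gap(labels, groups):
    """Largest difference in positive-label rate between any two demographic groups."""
    rates = [labels[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def flag_biased(labels, groups, threshold=0.1):
    """Flag the dataset for review if the parity gap exceeds a fairness threshold."""
    return demographic_parity_gap(labels, groups) > threshold

# Group "a" gets positive labels at 0.8, group "b" at 0.0: gap 0.8, flagged
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
groups = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])
assert flag_biased(labels, groups)
```

A production filter would combine several such statistics (parity, equalized odds, label noise estimates) rather than a single gap, but the flag-then-review flow is the same.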
1.1.1 Training Data
Data Layer > Dynamic Data Monitoring
In addition to formal verification, continuous monitoring is deployed to detect new patterns of bias that may emerge in real time. A real-time anomaly detection system watches the data for unexpected shifts in distribution or content. If bias or noisy data is detected, the data can be flagged and the model training process halted for further review.
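The paper does not name a specific shift detector. A common lightweight choice for "unexpected shifts in distribution" is the Population Stability Index (PSI) between a baseline feature sample and the incoming stream; the sketch below assumes that metric, with the conventional 0.1/0.25 alert thresholds (assumptions on my part, not from the paper).

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between a baseline and a current feature sample."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c = np.histogram(current, bins=edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, 5000)
stable   = rng.normal(0.0, 1.0, 5000)   # same distribution: low PSI
shifted  = rng.normal(1.5, 1.0, 5000)   # mean shift: high PSI, would halt training
assert psi(baseline, stable) < psi(baseline, shifted)
```

Rule-of-thumb interpretation: PSI below 0.1 is usually treated as stable, above 0.25 as a significant shift warranting the halt-and-review step described above.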
1.2.3 Monitoring & Detection
Model Training Layer
During model training, adversarial manipulation becomes a significant threat, where attackers may attempt to perturb the training data to manipulate the model’s learning. Our framework integrates automated adversarial testing directly into the training process, enabling the system to continuously probe for weaknesses before the model is deployed.
2.2.2 Testing & Evaluation
Model Training Layer > Adversarial Testing Module
This module employs real-time adversarial simulation that introduces adversarial inputs at various points during the training process to test the model’s robustness. The simulation leverages adversarial perturbations: slight modifications to the input data that are designed to mislead the model. These perturbations mimic real-world attacks, allowing the framework to detect vulnerabilities before deployment. The module monitors how effectively the model resists adversarial inputs and adjusts the training process accordingly.
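The paper does not state which perturbation method the module uses. A standard instance of "slight modifications designed to mislead the model" is a fast-gradient-sign (FGSM-style) step; the sketch below applies it to a toy logistic model so the whole mechanism fits in a few lines (the model, inputs, and epsilon are illustrative assumptions).

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """FGSM-style perturbation for a logistic model: step the input in the
    sign of the loss gradient with respect to the input."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of class 1
    grad_x = (p - y) * w                    # gradient of log-loss w.r.t. x
    return x + eps * np.sign(grad_x)

# Toy model that classifies x correctly, then is misled by a small perturbation
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.4, 0.1]), 1.0
x_adv = fgsm_perturb(x, w, b, y, eps=0.5)
assert (x @ w + b) > 0   # clean input: score 0.7, classified as class 1
assert (x_adv @ w + b) < 0  # perturbed input: score flips sign, misclassified
```

In the module described above, such perturbations would be injected at checkpoints during training and the failure rate fed back into the robustness metrics below.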
2.2.2 Testing & Evaluation
Model Training Layer > Metrics for Robustness
Metrics used to evaluate the effectiveness of adversarial testing include the model’s accuracy under adversarial conditions, the degree of perturbation required to mislead the model, and the model’s ability to maintain consistency in outputs across multiple adversarial inputs. These metrics help define what constitutes a "high-risk" versus "low-risk" vulnerability.
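Two of the three metrics listed (accuracy under adversarial conditions, consistency of outputs between clean and adversarial inputs) reduce to simple comparisons over prediction arrays; a minimal sketch, with illustrative data:

```python
import numpy as np

def robustness_metrics(clean_preds, adv_preds, labels):
    """Adversarial accuracy and clean-vs-adversarial output consistency.
    The third metric from the text, minimum perturbation required to mislead
    the model, is found separately by sweeping the perturbation budget."""
    adv_accuracy = float(np.mean(adv_preds == labels))
    consistency = float(np.mean(adv_preds == clean_preds))
    return {"adv_accuracy": adv_accuracy, "consistency": consistency}

clean = np.array([1, 0, 1, 1, 0])
adv   = np.array([1, 0, 0, 1, 0])  # one prediction flipped under attack
y     = np.array([1, 0, 1, 1, 0])
m = robustness_metrics(clean, adv, y)
assert m["adv_accuracy"] == 0.8
assert m["consistency"] == 0.8
```

Where the high-risk/low-risk line is drawn for these scores is a calibration choice the framework leaves to the deployer.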
3.2.1 Benchmarks & Evaluation
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation
Srivastava, Aviral; Panda, Sourav (2024)
As generative AI systems, including large language models (LLMs) and diffusion models, advance rapidly, their growing adoption has led to new and complex security risks often overlooked in traditional AI risk assessment frameworks. This paper introduces a novel formal framework for categorizing and mitigating these emergent security risks by integrating adaptive, real-time monitoring, and dynamic risk mitigation strategies tailored to generative models' unique vulnerabilities. We identify previously under-explored risks, including latent space exploitation, multi-modal cross-attack vectors, and feedback-loop-induced model degradation. Our framework employs a layered approach, incorporating anomaly detection, continuous red-teaming, and real-time adversarial simulation to mitigate these risks. We focus on formal verification methods to ensure model robustness and scalability in the face of evolving threats. Though theoretical, this work sets the stage for future empirical validation by establishing a detailed methodology and metrics for evaluating the performance of risk mitigation strategies in generative AI systems. This framework addresses existing gaps in AI safety, offering a comprehensive road map for future research and implementation.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Developer
Entity that creates, trains, or modifies the AI system
Measure
Quantifying, testing, and monitoring identified AI risks