Application

The Risks of Machine Learning Systems

Tan, Taeihagh & Baxter (2022)

Sub-category

"This is the risk posed by the intended application or use case. It is intuitive that some use cases will be inherently "riskier" than others (e.g., an autonomous weapons system vs. a customer service chatbot)."(p. 5)

Supporting Evidence (10)

1.
"Application domain: As alluded to above, the intended purpose of the ML system can be a major risk factor, holding all other variables constant. Other than the specific use case, the domain could also contribute to the application risk. For example, it is intuitive that the negative consequences are more severe for an image classification system used to aid melanoma diagnoses than one used for Lego brick identification."(p. 5)
2.
"Consequentiality of system actions: The impact of the ML system’s actions on the affected community members is another important factor in the system’s application. For example, a slightly inaccurate automated text scoring system carries relatively minor consequences if used only for providing feedback on ungraded homework, compared to being used for grading school assignments. While inaccuracies in the latter use case may affect a student’s annual ranking, it carries a lower risk compared to using the same system to grade national exams that determine a student’s future, where even minor inaccuracies can unfairly impact their ability to enter their desired university or major [65]."(p. 5)
3.
"Protected populations impacted: Most societies have special protections for certain population groups such as children, the elderly, disabled, or ethnic minorities. For example, in the US, the Child Online Privacy Protection Act imposes stricter requirements on operators of websites or online services directed to children under 13 years of age [66]. Similarly, some social groups may be more vulnerable to the negative impacts of an ML system and lower thresholds for harm may therefore be necessary for them. The US Federal Trade Commission has warned of penalties against companies that sell or use biased AI systems that harm protected groups [67]."(p. 5)
4.
"Effect on existing power differentials and inequalities: Use cases that entrench or amplify power differentials between the organization employing the system and the affected population should be assigned a higher risk from a human rights perspective. This can take the form of increased surveillance, which increases the organization’s power over the public but not vice-versa. Other applications may amplify systemic inequalities due to the ease, scale, and speed with which predictions can now be made [62]. Additionally, the act of codifying it in a potentially black-boxed ML system may entrench these learned biases when humans fail to question their predictions [120]."(p. 5)
5.
"Scope of deployment environment: A system operating in an open environment, such as the outdoors, will often have to account for more uncertainties than in a closed one, such as an apartment. Consequently, there is a higher likelihood of failure in the former. For example, an autonomous cleaning robot deployed in a park will be exposed to a significantly more diverse range of inputs than one used in an apartment. In the latter, the system does not need to handle significant changes in weather conditions and seasons. Additionally, the ability to navigate uneven and unstable terrain will likely be less critical for an indoor cleaning robot compared to one deployed in a park. We refer to this “openness” as the deployment environment’s scope: a wider scope presents more potential points of failure and, therefore, a higher risk."(p. 5)
6.
"Scale of deployment: The scale of a use case will also significantly affect its risk. For example, a system that affects a community of 42 will likely have a lower upper bound of negative consequences compared to being deployed worldwide."(p. 5)
7.
"Presence of relevant evaluation techniques/metrics: Although held-out accuracy is commonly used to evaluate ML models developed for research, this assumes that the training distribution and the deployment environment’s distribution are identical. Such evaluation will be insufficient for ML systems meant to be used in the real world since this assumption is often violated. The result is poor system robustness to distributional variation with various second-order consequences (see Section 4.5). Therefore, any evaluation of an application’s risk must consider the availability of metrics to evaluate performance on the dimensions relevant to the application or deployment environment [181]. For example, a task-oriented chatbot should not only be evaluated using the success rate of the held-out validation set, but also its ability to cope with misspellings, grammatical variation, and different dialects, and generate sentences in the appropriate register. The lack of appropriate metrics reduces the ability to detect such flaws before deployment and increases the risk of negative consequences. Similarly, it is difficult to predict the impact of a risk on the real world. For example, group-level F1 scores for a face recognition system are not indicative of the magnitude of the system’s impact on an individual when it is wrong in the real world (e.g., the consequences of arresting a wrongly identified but innocent minority [3])."(p. 5)
8.
"Optionality of interaction: The ability to opt-out of interacting with or being affected by an ML system can limit its negative impacts on a person. For example, choosing to interact with a human customer service agent rather than a chatbot may reduce the risk of being misunderstood if the chatbot has not been specifically trained on the customer’s language variety. Inversely, being unable to opt-out of the interaction may increase the likelihood and frequency that an individual will experience negative consequences resulting from the ML system. For example, replacing human agents with automated ones as interfaces to essential services may unintentionally prevent the underprivileged from using them due to linguistic barriers. This is a real possibility when the agents have trained on the prestige variety of a language, but the people most in need of access to social welfare services only speak a colloquial variety."(p. 6)
9.
"Accountability mechanisms: From an organizational perspective, mechanisms that hold the actors accountable for the systems they build reduce the likelihood of negative consequences. For example, an organization might create explicit acceptability criteria, such as comparable accuracy across social groups, reward engineers for meeting these criteria, and block deployment when the system falls short. However, this will only work when acceptance criteria are not in conflict (e.g., engineers being rewarded more for increased user engagement than meeting an acceptable bias threshold)."(p. 6)
10.
"Stakeholders’ machine learning literacy: To give useful feedback and seek remediation, the affected community member might require basic knowledge of how ML systems work and the ways they could be impacted. For example, someone unaware of how recommendation algorithms work (or even the existence of such algorithms) may be unable to appreciate the extent to which their political views are influenced by their consumption of social media and video streaming sites [10, 15, 79, 151].2 The affected individual will hence be unaware that they are in an echo chamber, resulting in an inability to break free or give appropriate feedback to the product developers [96]. Research has also shown a person’s knowledge of AI to affect their interpretation of machine-generated explanations [59]."(p. 6)

Part of First-Order Risks

Other risks from Tan, Taeihagh & Baxter (2022) (17)