New York City's teacher evaluation system used value-added models and group measures to rate teachers based on standardized test scores, resulting in inconsistent and potentially unfair evaluations that affected thousands of educators.
New York City implemented a teacher evaluation system that relied heavily on value-added measurement (VAM) and standardized test scores to rate teachers from 2012 onwards. The system evaluated over 12,000 teachers who taught fourth- through eighth-grade English or math between 2007 and 2010, and in the 2015-16 school year, 53 percent of NYC teachers were evaluated using group measures, meaning they were judged by test scores from subjects or students they did not teach. The VAM system used complex statistical models to predict student performance and rated teachers on how their students performed relative to those predictions. In practice, it produced highly inconsistent results, with teachers receiving vastly different ratings for teaching the same students in the same year: one teacher, for example, scored 97 out of 100 in language arts but only 2 out of 100 in math with identical students. The average confidence interval for these estimates was 35 percentile points in math and 53 in English Language Arts, indicating substantial measurement error. Teachers like Sheri Lederman, whose students consistently scored above state averages, were nonetheless rated 'ineffective'. The system led hundreds of teachers in districts such as Syracuse and Rochester to plan appeals of their evaluations, with union leaders reporting that 40 percent of teachers in those districts received the two lowest ratings.
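The core VAM mechanic described above (predict each student's score, attribute the residual to the teacher, and attach a confidence interval) can be sketched in a few lines. This is a minimal illustration under assumed parameters, not the actual NYC model, which was far more complex; the prediction function, class data, and coefficients here are all hypothetical.

```python
# Minimal sketch of value-added measurement (VAM) logic.
# Everything here is hypothetical illustration, not the NYC model.
import random
import statistics

random.seed(0)

def predict_score(prior_score):
    # Hypothetical prediction model: expected current-year score
    # as a function of the student's prior-year score.
    return 0.8 * prior_score + 10

def value_added(prior_scores, actual_scores):
    # A teacher's value-added estimate: the mean of the
    # (actual - predicted) residuals across their students,
    # with a 95% confidence interval from the standard error.
    residuals = [a - predict_score(p)
                 for p, a in zip(prior_scores, actual_scores)]
    mean = statistics.mean(residuals)
    se = statistics.stdev(residuals) / len(residuals) ** 0.5
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# With a class of 25 simulated students, the interval is wide --
# the kind of measurement error the summary describes.
priors = [random.gauss(70, 10) for _ in range(25)]
actuals = [predict_score(p) + random.gauss(2, 8) for p in priors]
va, (low, high) = value_added(priors, actuals)
print(f"value-added: {va:.1f}, 95% CI: ({low:.1f}, {high:.1f})")
```

Because the confidence interval shrinks only with the square root of class size, estimates from a single classroom of students remain noisy, which is consistent with the wide intervals reported above.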
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that fail to perform reliably or effectively under varying conditions, exposing users to errors and failures that can have significant consequences, especially in critical applications or domains that require moral reasoning.
Entity: AI system (due to a decision or action made by an AI system)
Intent: Intentional (due to an expected outcome from pursuing a goal)
Timing: Post-deployment (occurring after the AI model has been trained and deployed)