A Kaggle competition participant discovered that a pre-trained VGG network achieved an unexpectedly high 99% accuracy on fisheries monitoring training data despite having been trained on dissimilar ImageNet data, but performed much worse on the actual test set.
In the Nature Conservancy Fisheries Monitoring Kaggle competition, participants were tasked with classifying fish species from images taken by automatic cameras on ships, to verify that only legitimate fish like tuna were caught and not protected species like sharks. The competition involved challenging real-world images of fish randomly positioned on boats, with some images containing multiple fish or no fish at all.

One participant discovered that applying a pre-trained VGG network (originally trained on ImageNet data) to the training set achieved around 99% accuracy, which was surprising given that ImageNet images are quite different from the boat-based fish images. However, when the model's predictions were submitted to Kaggle for evaluation on the test set, performance was much lower than the 99% training accuracy, revealing a significant discrepancy between training and test performance.

The competition used a two-stage format in which 1,000 test images were initially available, with an additional 12,000 images released one week before the deadline. Only around 300 of the original 2,300 participants made submissions in the second stage.
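The train/test gap described above can be illustrated with a minimal sketch (this is not the actual VGG/Kaggle setup): a model that effectively memorizes its training data scores perfectly on that data while performing at chance on held-out data. Here a 1-nearest-neighbour classifier stands in for the memorizing model, and the labels are random so nothing can generalize:

```python
import numpy as np

rng = np.random.default_rng(0)
# Features carry no real signal for the labels, so generalization is impossible.
X_train, y_train = rng.normal(size=(400, 16)), rng.integers(0, 2, 400)
X_test, y_test = rng.normal(size=(400, 16)), rng.integers(0, 2, 400)

def predict_1nn(X, X_ref, y_ref):
    # 1-nearest-neighbour "model": pure memorization of the reference set.
    dists = ((X[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=-1)
    return y_ref[dists.argmin(axis=1)]

# Each training point is its own nearest neighbour, so training accuracy is 1.0.
train_acc = (predict_1nn(X_train, X_train, y_train) == y_train).mean()
# Held-out accuracy hovers near chance (0.5).
test_acc = (predict_1nn(X_test, X_train, y_train) == y_test).mean()
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Evaluating only on the training set, as in the incident, reports the first number and hides the second; the discrepancy only surfaces once predictions are scored against genuinely held-out data.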
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that fail to perform reliably or effectively under varying conditions, leading to errors and failures that can have significant consequences, especially in critical applications or domains that require moral reasoning.
AI system
Due to a decision or action made by an AI system
Unintentional
Due to an unexpected outcome from pursuing a goal
Post-deployment
Occurring after the AI model has been trained and deployed