Statistical tests for comparing Machine Learning models
When comparing ML / DL models and drawing conclusions, it is important to gather as much evidence as possible and then back it with statistical tests. Below is a cheatsheet I put together while trying to find the appropriate test for each case.
- Classification, and I have access to the test predictions → build a contingency table.

My label type is binary. Number of classifiers to compare:

- Two → McNemar’s test
- Three or more → Cochran’s Q test

Both are sketched right below.
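A minimal sketch of both tests using statsmodels; the contingency table and the 0/1 outcome matrix are made-up numbers, invented purely for illustration.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

# McNemar's test: two classifiers on the same test set, binary labels.
# Rows = classifier A (correct, wrong); columns = classifier B (correct, wrong).
table = np.array([[320, 18],
                  [32,  30]])  # made-up counts
print(mcnemar(table, exact=False, correction=True))  # chi-square version
print(mcnemar(table, exact=True))                    # exact version for small cell counts

# Cochran's Q test: three or more classifiers, binary outcomes.
# One row per test sample, one column per classifier; 1 = correct, 0 = wrong.
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=(100, 3))  # made-up outcomes for 3 models
print(cochrans_q(outcomes))
```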
My label type is multi-class → Stuart-Maxwell test (sketched below).
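statsmodels covers this case through its marginal-homogeneity test on a square contingency table, whose default method is the Stuart-Maxwell statistic; here is a sketch on a made-up 3-class table of paired predictions.

```python
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# k x k table for two classifiers on the same test set:
# rows = class predicted by model A, columns = class predicted by model B.
table = np.array([[45,  5,  3],
                  [ 6, 38,  4],
                  [ 2,  7, 40]])  # made-up counts for 3 classes
result = SquareTable(table).homogeneity()  # default method is Stuart-Maxwell
print(result)
```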
- Models compared on mean performance values (e.g. mean test accuracy, average reward, success rate)

First ask: is your sample size large enough, and do you have insight into the data and your samples? The answers decide between the parametric and the non-parametric branch below.
Parametric methods: I can assume the samples are normally distributed.

Two models:

- Student’s t-test (assumes equal variances; there is also a version for paired samples)
- Welch’s t-test (does not assume equal variances)

Both are sketched below.
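A sketch of both with SciPy; the accuracy arrays stand in for scores from repeated runs (e.g. different random seeds) of two hypothetical models.

```python
import numpy as np
from scipy import stats

# Made-up test accuracies from 10 independent runs per model.
acc_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82])
acc_b = np.array([0.84, 0.83, 0.86, 0.82, 0.85, 0.83, 0.84, 0.85, 0.82, 0.86])

print(stats.ttest_ind(acc_a, acc_b, equal_var=True))   # Student's t-test
print(stats.ttest_ind(acc_a, acc_b, equal_var=False))  # Welch's t-test
print(stats.ttest_rel(acc_a, acc_b))                   # paired version (same folds/seeds)
```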
Three or more models:

- ANOVA (F-test), or repeated-measures ANOVA for paired samples (sketch below)
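A sketch of both variants, using SciPy's f_oneway for the one-way ANOVA and statsmodels' AnovaRM for the repeated-measures (paired) case; all scores are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
# Made-up accuracies from 10 runs of each of three models.
acc_a = rng.normal(0.80, 0.01, 10)
acc_b = rng.normal(0.82, 0.01, 10)
acc_c = rng.normal(0.83, 0.01, 10)

print(stats.f_oneway(acc_a, acc_b, acc_c))  # one-way ANOVA (F-test)

# Paired version: the same 10 folds evaluated by every model.
df = pd.DataFrame({
    "fold": np.tile(np.arange(10), 3),
    "model": np.repeat(["a", "b", "c"], 10),
    "acc": np.concatenate([acc_a, acc_b, acc_c]),
})
print(AnovaRM(df, depvar="acc", subject="fold", within=["model"]).fit())
```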
Non-parametric methods: I cannot assume the samples are normally distributed.

Two models:

- Mann-Whitney U test (independent samples)
- Wilcoxon signed-rank test (paired samples)

Both are sketched below.
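Both tests are sketched below with SciPy, on made-up reward scores from ten runs per model.

```python
import numpy as np
from scipy import stats

# Made-up reward scores from 10 runs of each model.
rew_a = np.array([102, 98, 110, 95, 104, 99, 107, 101, 96, 103])
rew_b = np.array([108, 105, 112, 101, 109, 104, 111, 107, 103, 110])

# Independent runs:
print(stats.mannwhitneyu(rew_a, rew_b, alternative="two-sided"))
# Paired runs (e.g. the same seeds/environments for both models):
print(stats.wilcoxon(rew_a, rew_b))
```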
Three or more models:

- Kruskal-Wallis test (independent samples)
- Friedman test (paired samples); see the sketch below
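A sketch with SciPy on made-up success rates; Kruskal-Wallis treats the runs as independent groups, while Friedman assumes the runs are matched (same folds or seeds across models).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up success rates from 12 runs of each of three models.
sr_a = rng.uniform(0.60, 0.75, 12)
sr_b = rng.uniform(0.65, 0.80, 12)
sr_c = rng.uniform(0.70, 0.85, 12)

print(stats.kruskal(sr_a, sr_b, sr_c))            # independent runs
print(stats.friedmanchisquare(sr_a, sr_b, sr_c))  # paired runs (same folds/seeds)
```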