Statistical tests for comparing Machine Learning models
When comparing ML / DL models and drawing conclusions, it is important to gather as much evidence as possible and then back it with statistical tests. Below is a cheatsheet I put together while trying to find the appropriate test for each case.
- Classification, and I have access to the test predictions → build a contingency table.

My label type is binary. Number of classifiers to compare:

- Two → McNemar’s test
- Three or more → Cochran’s Q test

Both are sketched right below.
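A minimal sketch of both tests using statsmodels; the contingency table and the 0/1 outcome matrix are made-up numbers, invented purely for illustration.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

# McNemar's test: two classifiers on the same test set, binary labels.
# Rows = classifier A (correct, wrong); columns = classifier B (correct, wrong).
table = np.array([[320, 18],
                  [32,  30]])  # made-up counts
print(mcnemar(table, exact=False, correction=True))  # chi-square version
print(mcnemar(table, exact=True))                    # exact version for small cell counts

# Cochran's Q test: three or more classifiers, binary outcomes.
# One row per test sample, one column per classifier; 1 = correct, 0 = wrong.
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=(100, 3))  # made-up outcomes for 3 models
print(cochrans_q(outcomes))
```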
My label type is multi-class → Stuart-Maxwell test (sketched below).
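statsmodels covers this case through its marginal-homogeneity test on a square contingency table, whose default method is the Stuart-Maxwell statistic; here is a sketch on a made-up 3-class table of paired predictions.

```python
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# k x k table for two classifiers on the same test set:
# rows = class predicted by model A, columns = class predicted by model B.
table = np.array([[45,  5,  3],
                  [ 6, 38,  4],
                  [ 2,  7, 40]])  # made-up counts for 3 classes
result = SquareTable(table).homogeneity()  # default method is Stuart-Maxwell
print(result)
```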
- Models compared on mean performance values (e.g. mean test accuracy, average reward, success rate)

First ask: is your sample size large enough, and do you have insight into the data and your samples? The answers decide between the parametric and the non-parametric branch below.
Parametric methods: I can assume the samples are normally distributed.

Two models:

- Student’s t-test (assumes equal variances; there is also a version for paired samples)
- Welch’s t-test (does not assume equal variances)

Both are sketched below.
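A sketch of both with SciPy; the accuracy arrays stand in for scores from repeated runs (e.g. different random seeds) of two hypothetical models.

```python
import numpy as np
from scipy import stats

# Made-up test accuracies from 10 independent runs per model.
acc_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82])
acc_b = np.array([0.84, 0.83, 0.86, 0.82, 0.85, 0.83, 0.84, 0.85, 0.82, 0.86])

print(stats.ttest_ind(acc_a, acc_b, equal_var=True))   # Student's t-test
print(stats.ttest_ind(acc_a, acc_b, equal_var=False))  # Welch's t-test
print(stats.ttest_rel(acc_a, acc_b))                   # paired version (same folds/seeds)
```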
Three or more models:

- ANOVA (F-test), or repeated-measures ANOVA for paired samples (sketch below)
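A sketch of both variants, using SciPy's f_oneway for the one-way ANOVA and statsmodels' AnovaRM for the repeated-measures (paired) case; all scores are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
# Made-up accuracies from 10 runs of each of three models.
acc_a = rng.normal(0.80, 0.01, 10)
acc_b = rng.normal(0.82, 0.01, 10)
acc_c = rng.normal(0.83, 0.01, 10)

print(stats.f_oneway(acc_a, acc_b, acc_c))  # one-way ANOVA (F-test)

# Paired version: the same 10 folds evaluated by every model.
df = pd.DataFrame({
    "fold": np.tile(np.arange(10), 3),
    "model": np.repeat(["a", "b", "c"], 10),
    "acc": np.concatenate([acc_a, acc_b, acc_c]),
})
print(AnovaRM(df, depvar="acc", subject="fold", within=["model"]).fit())
```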
Non-parametric methods: I cannot assume the samples are normally distributed.

Two models:

- Mann-Whitney U test (independent samples)
- Wilcoxon signed-rank test (paired samples)

Both are sketched below.
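Both tests are sketched below with SciPy, on made-up reward scores from ten runs per model.

```python
import numpy as np
from scipy import stats

# Made-up reward scores from 10 runs of each model.
rew_a = np.array([102, 98, 110, 95, 104, 99, 107, 101, 96, 103])
rew_b = np.array([108, 105, 112, 101, 109, 104, 111, 107, 103, 110])

# Independent runs:
print(stats.mannwhitneyu(rew_a, rew_b, alternative="two-sided"))
# Paired runs (e.g. the same seeds/environments for both models):
print(stats.wilcoxon(rew_a, rew_b))
```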
Three or more models:

- Kruskal-Wallis test (independent samples)
- Friedman test (paired samples); see the sketch below
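A sketch with SciPy on made-up success rates; Kruskal-Wallis treats the runs as independent groups, while Friedman assumes the runs are matched (same folds or seeds across models).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up success rates from 12 runs of each of three models.
sr_a = rng.uniform(0.60, 0.75, 12)
sr_b = rng.uniform(0.65, 0.80, 12)
sr_c = rng.uniform(0.70, 0.85, 12)

print(stats.kruskal(sr_a, sr_b, sr_c))            # independent runs
print(stats.friedmanchisquare(sr_a, sr_b, sr_c))  # paired runs (same folds/seeds)
```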