written by Eric J. Ma on 2019-03-24 | tags: data science machine learning
Variance explained, as a regression quality metric, is one that I have begun to like a lot, especially when used in place of a metric like the coefficient of determination, $r^2$.
Here's variance explained, defined:

$$\text{variance explained} = 1 - \frac{\mathrm{Var}(y_{\mathrm{true}} - y_{\mathrm{pred}})}{\mathrm{Var}(y_{\mathrm{true}})}$$
Why do I like it? It’s because this metric gives us a measure of the scale of the error in predictions relative to the scale of the data.
The numerator in the fraction calculates the variance in the errors, in other words, the scale of the errors. The denominator in the fraction calculates the variance in the data, in other words, the scale of the data. By subtracting the fraction from 1, we get a number that is upper-bounded at 1 (the best case) and unbounded towards negative infinity.
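To make the definition concrete, here's a minimal sketch in NumPy, assuming `y_true` and `y_pred` are arrays of true and predicted values (scikit-learn ships an equivalent as `sklearn.metrics.explained_variance_score`):

```python
import numpy as np


def variance_explained(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Variance explained: 1 - Var(errors) / Var(data)."""
    return 1 - np.var(y_true - y_pred) / np.var(y_true)
```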
Here are a few interesting scenarios.
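As a sketch (entirely made-up, normally-distributed data, with scikit-learn's `explained_variance_score` implementing the formula above), three cases fall straight out of the definition:

```python
import numpy as np
from sklearn.metrics import explained_variance_score

rng = np.random.default_rng(42)
y_true = rng.normal(loc=0.0, scale=3.0, size=1000)

# Perfect predictions: zero error variance, so the score hits its ceiling of 1.
print(explained_variance_score(y_true, y_true))  # 1.0

# Predicting the mean everywhere: error variance equals data variance, giving ~0.
y_mean = np.full_like(y_true, y_true.mean())
print(explained_variance_score(y_true, y_mean))  # ~0.0

# Predictions noisier than the data: error variance exceeds data variance,
# so the score goes negative (it is unbounded below).
y_noisy = y_true + rng.normal(scale=6.0, size=y_true.shape)
print(explained_variance_score(y_true, y_noisy))  # negative
```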
A really nice thing about variance explained is that it can be used to compare related machine learning tasks that have different unit scales, when we want to know how well one model performs across all of the tasks. Mean squared error makes this an apples-to-oranges comparison, because each task's unit scale is different. Variance explained, on the other hand, is unit-less.
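As an illustration (a hypothetical setup with synthetic data): two tasks with the same relative prediction quality, but unit scales three orders of magnitude apart, give wildly different MSEs yet nearly identical variance explained:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error

rng = np.random.default_rng(0)

# Two hypothetical tasks with the same relative prediction quality,
# but on very different unit scales.
for scale in (1.0, 1000.0):
    y_true = scale * rng.normal(size=500)
    y_pred = y_true + scale * 0.3 * rng.normal(size=500)  # same relative error
    print(
        f"scale={scale}: "
        f"MSE={mean_squared_error(y_true, y_pred):.2f}, "
        f"var. explained={explained_variance_score(y_true, y_pred):.2f}"
    )
# MSE blows up with the unit scale, while variance explained stays comparable.
```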
Now, we know that single metrics can have failure points, as does the coefficient of determination $r^2$, as shown in Anscombe's quartet and the Datasaurus Dozen:
Fig. 1: Anscombe's quartet, taken from Wikipedia
Fig. 2: Datasaurus Dozen, taken from Revolution Analytics
One place where variance explained can fail is if the predictions are systematically shifted off from the true values. Let's say the predictions were shifted off by 2 units.
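Here's a quick sketch of that scenario with synthetic data, again using scikit-learn's `explained_variance_score`:

```python
import numpy as np
from sklearn.metrics import explained_variance_score

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
y_pred = y_true + 2  # predictions systematically shifted off by 2 units

# The errors are a constant, so Var(errors) = 0 and the score is a perfect 1,
# even though every single prediction is off by 2 units.
print(explained_variance_score(y_true, y_pred))  # 1.0
```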
There's no variance in the errors, even though the predictions are systematically shifted off from the true values. Like $r^2$, variance explained fails here.
As usual, Anscombe's quartet, like the Datasaurus Dozen, gives us a pertinent reminder that visually inspecting your model predictions is always a good thing!
h/t to my colleague, Clayton Springer, for sharing this with me.