In this notebook I explore the usefulness of data visualization and the fact that simple summary statistics can be misleading: they are not always capable of capturing what is actually going on in the data. To do this, I examine the famous Anscombe’s Quartet, a set of four data sets analyzed in the remainder of this R Markdown file.
Load the data:
data(anscombe)
head(anscombe)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
So this is a set of four data sets, each consisting of a predictor variable \(x_i\) and a response variable \(y_i\) with 11 observations apiece.
Let us begin by splitting the data into four explicit data sets.
dataset.1 <- anscombe[,c("x1", "y1")]
dataset.2 <- anscombe[,c("x2", "y2")]
dataset.3 <- anscombe[,c("x3", "y3")]
dataset.4 <- anscombe[,c("x4", "y4")]
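(As an aside, the same split can be done programmatically; here is a sketch that builds a list of four data frames with generic column names x and y, rather than the explicit dataset.1 through dataset.4 objects used below.)
# build the four data sets in a list; datasets[[i]] has columns x and y
datasets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})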
Now let’s do some exploratory data analysis by looking at some basic descriptive statistics for each data set: the mean of x (\(\bar{x}\)), the mean of y (\(\bar{y}\)), the variance of x (\(s_x^2\)), the variance of y (\(s_y^2\)), and Pearson’s correlation between x and y (\(r_{xy}\)).
mean(dataset.1$x1)
## [1] 9
mean(dataset.2$x2)
## [1] 9
mean(dataset.3$x3)
## [1] 9
mean(dataset.4$x4)
## [1] 9
All \(\bar{x}_i\) are exactly equal.
mean(dataset.1$y1)
## [1] 7.500909
mean(dataset.2$y2)
## [1] 7.500909
mean(dataset.3$y3)
## [1] 7.5
mean(dataset.4$y4)
## [1] 7.500909
All \(\bar{y}_i\) are equal up to the third digit after the decimal point.
var(dataset.1$x1)
## [1] 11
var(dataset.2$x2)
## [1] 11
var(dataset.3$x3)
## [1] 11
var(dataset.4$x4)
## [1] 11
All \(s^2_{x_i}\) are exactly equal.
var(dataset.1$y1)
## [1] 4.127269
var(dataset.2$y2)
## [1] 4.127629
var(dataset.3$y3)
## [1] 4.12262
var(dataset.4$y4)
## [1] 4.123249
All \(s^2_{y_i}\) are equal up to the second digit after the decimal point.
cor(dataset.1$x1, dataset.1$y1)
## [1] 0.8164205
cor(dataset.2$x2, dataset.2$y2)
## [1] 0.8162365
cor(dataset.3$x3, dataset.3$y3)
## [1] 0.8162867
cor(dataset.4$x4, dataset.4$y4)
## [1] 0.8165214
All \(r_{x_iy_i}\) are equal up to the third digit after the decimal point.
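As an aside, all of the statistics above can be reproduced in one pass; here is a sketch that loops over the four column pairs of anscombe directly:
# compute mean, variance, and correlation for each of the 4 data sets at once
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.x = mean(x), mean.y = mean(y),
    var.x = var(x), var.y = var(y),
    cor.xy = cor(x, y))
})
round(stats, 3)  # each row is nearly constant across the four columns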
So it seems that these data sets should be pretty similar, right? Now that we have done some exploratory analysis, let’s go ahead and fit a simple linear regression to each of the data sets and examine the \(\hat{\beta}\) coefficients and \(R^2\).
lm1 <- lm(y1 ~ x1, data=dataset.1)
lm2 <- lm(y2 ~ x2, data=dataset.2)
lm3 <- lm(y3 ~ x3, data=dataset.3)
lm4 <- lm(y4 ~ x4, data=dataset.4)
coef(lm1)
## (Intercept) x1
## 3.0000909 0.5000909
coef(lm2)
## (Intercept) x2
## 3.000909 0.500000
coef(lm3)
## (Intercept) x3
## 3.0024545 0.4997273
coef(lm4)
## (Intercept) x4
## 3.0017273 0.4999091
Once again the estimates are approximately equal across the data sets: every fitted line is roughly \(\hat{y} = 3 + 0.5x\).
summary(lm1)$r.squared
## [1] 0.6665425
summary(lm2)$r.squared
## [1] 0.666242
summary(lm3)$r.squared
## [1] 0.666324
summary(lm4)$r.squared
## [1] 0.6667073
And the \(R^2\) values are approximately equal once more.
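As one more summary-level check, here is a sketch comparing the residual standard errors of the four fits via summary()$sigma:
# residual standard error for each model; all four are essentially identical
sapply(list(lm1, lm2, lm3, lm4), function(m) summary(m)$sigma)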
Ok, it seems that each regression line fits its data set to approximately the same degree. Let’s make some non-extrapolative predictions, say \(\hat{y}\) at \(x = \bar{x}_1 + s_{x_1}\); since the means and variances are approximately equal across the data sets, the same input value is reasonable for all four models.
input_value <- mean(dataset.1$x1) + sd(dataset.1$x1)
pred.lm1 <- predict(lm1, newdata=data.frame(x1 = input_value))
pred.lm2 <- predict(lm2, newdata=data.frame(x2 = input_value))
pred.lm3 <- predict(lm3, newdata=data.frame(x3 = input_value))
pred.lm4 <- predict(lm4, newdata=data.frame(x4 = input_value))
input_value
## [1] 12.31662
pred.lm1
## 1
## 9.159523
pred.lm2
## 1
## 9.159221
pred.lm3
## 1
## 9.157408
pred.lm4
## 1
## 9.15892
Let’s see what this all looks like: the data sets, the fitted lines, and the predicted points.
par(mfrow = c(2, 2), # 2x2 layout
mar = c(4, 4, 2, 1), # margins: bottom, left, top, right
oma = c(1, 1, 2, 1)) # outer margins
xlim <- c(2, 20)
ylim <- c(2, 14)
plot_with_lm <- function(x, y, model, pred_y, dataset_number) {
plot(x, y, xlim = xlim, ylim = ylim, main = paste("Data Set", dataset_number),
xlab = paste0("x", dataset_number), ylab = paste0("y", dataset_number), pch = 16, col = "gray30")
abline(model, col = "steelblue", lwd = 2)
points(input_value, pred_y, col = "red", pch = 19, cex = 1.5)
legend("topleft", legend = c("Data", "Regression Line", "Prediction"),
col = c("gray30", "steelblue", "red"), pch = c(16, NA, 19),
lty = c(NA, 1, NA), pt.cex = c(1, NA, 1.5), bty = "n", cex = 0.8)
}
plot_with_lm(dataset.1$x1, dataset.1$y1, lm1, pred.lm1, "1")
plot_with_lm(dataset.2$x2, dataset.2$y2, lm2, pred.lm2, "2")
plot_with_lm(dataset.3$x3, dataset.3$y3, lm3, pred.lm3, "3")
plot_with_lm(dataset.4$x4, dataset.4$y4, lm4, pred.lm4, "4")
mtext("Anscombe's Quartet with Regression Lines & Predictions", outer = TRUE, cex = 1.5)
These are obviously very different data sets when visualized, and some of the predictions don’t make much sense (e.g. the one for data set 4, which falls in a region containing no observed data at all). I think better models could be chosen for each data set if a bit of effort went into visualization up front. Some changes that could yield better models:
Data Set 2: Fit a polynomial model of the form \(y = \beta_0 + \beta_1x + \beta_2x^2\) (see the sketch after this list).
Data Set 3: Address the outlier point (\(y = 12.74\)) and refit the regression line (also sketched below).
Data Set 4: Address the outlier point (\(x = 19\)) and note that all of the remaining x values are identical, so there is no x-variation left to regress on.
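Here is a minimal sketch of the first two fixes. The object names (lm2.quad, outlier.row, lm3.refit) are my own; the refit for data set 3 drops the row with the largest absolute residual, which is exactly the \(y = 12.74\) point:
# quadratic fit for data set 2; y2 is (almost exactly) a parabola in x2
lm2.quad <- lm(y2 ~ x2 + I(x2^2), data = dataset.2)
summary(lm2.quad)$r.squared

# refit data set 3 without the outlier (the row with the largest residual)
outlier.row <- which.max(abs(resid(lm3)))
lm3.refit <- lm(y3 ~ x3, data = dataset.3[-outlier.row, ])
coef(lm3.refit)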
If you made it to the end, thanks for reading. I hope this was interesting and ideally useful in thinking about data, statistics, predictive modelling, and data visualization. If I had to name one takeaway from this notebook, albeit perhaps an obvious one, it would be: don’t neglect data visualization.