In this notebook I explore the usefulness of data visualization and the fact that simple summary statistics can be misleading: they are not always capable of capturing what is actually going on in the data. To do this, I examine the famous Anscombe’s Quartet, a set of four data sets analyzed in the remainder of this R Markdown file.
Load the data:
data(anscombe)
head(anscombe)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
So this is a set of four data sets, each consisting of a predictor variable \(x_i\) and a response variable \(y_i\) with 11 observations apiece.
Let us begin by splitting the data into four explicit data sets.
dataset.1 <- anscombe[,c("x1", "y1")]
dataset.2 <- anscombe[,c("x2", "y2")]
dataset.3 <- anscombe[,c("x3", "y3")]
dataset.4 <- anscombe[,c("x4", "y4")]
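(As an aside, the same split can be done programmatically; here is a sketch that builds a list of four data frames with generic column names x and y, rather than the explicit dataset.1 through dataset.4 objects used below.)
# build the four data sets in a list; datasets[[i]] has columns x and y
datasets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})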
Now let’s do some exploratory data analysis by looking at some basic descriptive statistics for each data set: the mean of x (\(\bar{x}\)), the mean of y (\(\bar{y}\)), the variance of x (\(s_x^2\)), the variance of y (\(s_y^2\)), and Pearson’s correlation between x and y (\(r_{xy}\)).
mean(dataset.1$x1)
## [1] 9
mean(dataset.2$x2)
## [1] 9
mean(dataset.3$x3)
## [1] 9
mean(dataset.4$x4)
## [1] 9
All \(\bar{x}_i\) are exactly equal.
mean(dataset.1$y1)
## [1] 7.500909
mean(dataset.2$y2)
## [1] 7.500909
mean(dataset.3$y3)
## [1] 7.5
mean(dataset.4$y4)
## [1] 7.500909
All \(\bar{y}_i\) are equal up to the third digit after the decimal point.
var(dataset.1$x1)
## [1] 11
var(dataset.2$x2)
## [1] 11
var(dataset.3$x3)
## [1] 11
var(dataset.4$x4)
## [1] 11
All \(s^2_{x_i}\) are exactly equal.
var(dataset.1$y1)
## [1] 4.127269
var(dataset.2$y2)
## [1] 4.127629
var(dataset.3$y3)
## [1] 4.12262
var(dataset.4$y4)
## [1] 4.123249
All \(s^2_{y_i}\) are equal up to the second digit after the decimal point.
cor(dataset.1$x1, dataset.1$y1)
## [1] 0.8164205
cor(dataset.2$x2, dataset.2$y2)
## [1] 0.8162365
cor(dataset.3$x3, dataset.3$y3)
## [1] 0.8162867
cor(dataset.4$x4, dataset.4$y4)
## [1] 0.8165214
All \(r_{x_iy_i}\) are equal up to the third digit after the decimal point.
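As an aside, all of the statistics above can be reproduced in one pass; here is a sketch that loops over the four column pairs of anscombe directly:
# compute mean, variance, and correlation for each of the 4 data sets at once
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.x = mean(x), mean.y = mean(y),
    var.x = var(x), var.y = var(y),
    cor.xy = cor(x, y))
})
round(stats, 3)  # each row is nearly constant across the four columns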
So it seems that these data sets should be pretty similar, right? Now that we have done some exploratory analysis, let’s go ahead and fit a simple linear regression to each of the data sets and examine the \(\hat{\beta}\) coefficients and \(R^2\).
lm1 <- lm(y1 ~ x1, data=dataset.1)
lm2 <- lm(y2 ~ x2, data=dataset.2)
lm3 <- lm(y3 ~ x3, data=dataset.3)
lm4 <- lm(y4 ~ x4, data=dataset.4)
coef(lm1)
## (Intercept) x1
## 3.0000909 0.5000909
coef(lm2)
## (Intercept) x2
## 3.000909 0.500000
coef(lm3)
## (Intercept) x3
## 3.0024545 0.4997273
coef(lm4)
## (Intercept) x4
## 3.0017273 0.4999091
Once again the estimates are approximately equal across the data sets: every fitted line is roughly \(\hat{y} = 3 + 0.5x\).
summary(lm1)$r.squared
## [1] 0.6665425
summary(lm2)$r.squared
## [1] 0.666242
summary(lm3)$r.squared
## [1] 0.666324
summary(lm4)$r.squared
## [1] 0.6667073
And the \(R^2\) values are approximately equal once more.
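As one more summary-level check, here is a sketch comparing the residual standard errors of the four fits via summary()$sigma:
# residual standard error for each model; all four are essentially identical
sapply(list(lm1, lm2, lm3, lm4), function(m) summary(m)$sigma)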
Ok, it seems that each regression line fits its data set to approximately the same degree. Let’s make some non-extrapolative predictions, say \(\hat{y}\) at \(x = \bar{x}_1 + s_{x_1}\); since the means and variances are approximately equal across the data sets, the same input value is reasonable for all four models.
input_value <- mean(dataset.1$x1) + sd(dataset.1$x1)
pred.lm1 <- predict(lm1, newdata=data.frame(x1 = input_value))
pred.lm2 <- predict(lm2, newdata=data.frame(x2 = input_value))
pred.lm3 <- predict(lm3, newdata=data.frame(x3 = input_value))
pred.lm4 <- predict(lm4, newdata=data.frame(x4 = input_value))
input_value
## [1] 12.31662
pred.lm1
## 1
## 9.159523
pred.lm2
## 1
## 9.159221
pred.lm3
## 1
## 9.157408
pred.lm4
## 1
## 9.15892
Let’s see what this all looks like: the data sets, the fitted lines, and the predicted points.
par(mfrow = c(2, 2), # 2x2 layout
mar = c(4, 4, 2, 1), # margins: bottom, left, top, right
oma = c(1, 1, 2, 1)) # outer margins
xlim <- c(2, 20)
ylim <- c(2, 14)
plot_with_lm <- function(x, y, model, pred_y, dataset_number) {
plot(x, y, xlim = xlim, ylim = ylim, main = paste("Data Set", dataset_number),
xlab = paste0("x", dataset_number), ylab = paste0("y", dataset_number), pch = 16, col = "gray30")
abline(model, col = "steelblue", lwd = 2)
points(input_value, pred_y, col = "red", pch = 19, cex = 1.5)
legend("topleft", legend = c("Data", "Regression Line", "Prediction"),
col = c("gray30", "steelblue", "red"), pch = c(16, NA, 19),
lty = c(NA, 1, NA), pt.cex = c(1, NA, 1.5), bty = "n", cex = 0.8)
}
plot_with_lm(dataset.1$x1, dataset.1$y1, lm1, pred.lm1, "1")
plot_with_lm(dataset.2$x2, dataset.2$y2, lm2, pred.lm2, "2")
plot_with_lm(dataset.3$x3, dataset.3$y3, lm3, pred.lm3, "3")
plot_with_lm(dataset.4$x4, dataset.4$y4, lm4, pred.lm4, "4")
mtext("Anscombe's Quartet with Regression Lines & Predictions", outer = TRUE, cex = 1.5)
These are obviously very different data sets when visualized, and some of the predictions don’t make much sense (e.g. the one for data set 4, which falls in a region containing no observed data at all). I think better models could be chosen for each data set if a bit of effort went into visualization up front. Some changes that could yield better models:
Data Set 2: Fit a polynomial model of the form \(y = \beta_0 + \beta_1x + \beta_2x^2\) (see the sketch after this list).
Data Set 3: Address the outlier point (\(y = 12.74\)) and refit the regression line (also sketched below).
Data Set 4: Address the outlier point (\(x = 19\)) and note that all of the remaining x values are identical, so there is no x-variation left to regress on.
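Here is a minimal sketch of the first two fixes. The object names (lm2.quad, outlier.row, lm3.refit) are my own; the refit for data set 3 drops the row with the largest absolute residual, which is exactly the \(y = 12.74\) point:
# quadratic fit for data set 2; y2 is (almost exactly) a parabola in x2
lm2.quad <- lm(y2 ~ x2 + I(x2^2), data = dataset.2)
summary(lm2.quad)$r.squared

# refit data set 3 without the outlier (the row with the largest residual)
outlier.row <- which.max(abs(resid(lm3)))
lm3.refit <- lm(y3 ~ x3, data = dataset.3[-outlier.row, ])
coef(lm3.refit)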
If you made it to the end, thanks for reading. I hope this was interesting and ideally useful in thinking about data, statistics, predictive modelling, and data visualization. If I had to name one takeaway from this notebook, albeit perhaps an obvious one, it would be: don’t neglect data visualization.