K-Means Clustering of Geographic Data

In this project, I will illustrate the K-means clustering algorithm to produce an interesting visual. Data is sourced from the mdsr library.

Begin by loading the data

data("world_cities")
str(world_cities)
## tibble [4,428 × 9] (S3: tbl_df/tbl/data.frame)
##  $ geoname_id       : num [1:4428] 291074 292223 292672 292932 292968 ...
##  $ name             : chr [1:4428] "Ras Al Khaimah City" "Dubai" "Sharjah" "Ajman City" ...
##  $ latitude         : num [1:4428] 25.8 25.1 25.3 25.4 24.5 ...
##  $ longitude        : num [1:4428] 55.9 55.3 55.4 55.5 54.4 ...
##  $ country          : chr [1:4428] "AE" "AE" "AE" "AE" ...
##  $ country_region   : chr [1:4428] "05" "03" "06" "02" ...
##  $ population       : num [1:4428] 351943 2956587 1324473 490035 603492 ...
##  $ timezone         : chr [1:4428] "Asia/Dubai" "Asia/Dubai" "Asia/Dubai" "Asia/Dubai" ...
##  $ modification_date: Date[1:4428], format: "2019-09-09" "2019-08-28" ...
##  - attr(*, "problems")= tibble [1,001 × 5] (S3: tbl_df/tbl/data.frame)
##   ..$ row     : int [1:1001] 1154 2694 2742 7567 7567 7608 7608 7650 7650 7652 ...
##   ..$ col     : chr [1:1001] "admin3_code" "admin4_code" "admin4_code" "admin3_code" ...
##   ..$ expected: chr [1:1001] "a double" "a double" "a double" "no trailing characters" ...
##   ..$ actual  : chr [1:1001] "BLG33-34" "CN" "LN" "B2" ...
##   ..$ file    : chr [1:1001] "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" ...

Let’s look at cities with a population of \(200,000\) or more.

min_pop <- 200000

Here let’s use dplyr to filter out cities with population less than \(200,000\) and then keep only the geographic coordinates in degrees.

big_cities <- world_cities %>%
    filter(population >= min_pop) %>%
    select(longitude, latitude)

What does this look like on the map?

plot(big_cities, col = 'black', pch = 16, asp = 1,
     main = "Cities with population over 200,000",
     cex = 0.5, ylim=c(-90, 90), xlim=c(-180,180), add=TRUE)
  
# Add the map overlay
map("world", add=TRUE, col="gray", lwd=0.5)

Here are some plots of the clustering results.

# Setup plotting
par(mfrow=c(1,2))

# Define cities with lat/lon
city_names <- c("Berlin", "Sydney", "Dallas", "Cape Town", "Beijing", "São Paulo")
latitudes <- c(52.52, -33.87, 32.78, -33.92, 39.90, -23.55)  # in degrees
longitudes <- c(13.40, 151.21, -96.80, 18.42, 116.41, -46.63) # in degrees

# Repeat with k = 2, ..., 7
for(k in 2:7){
  # clustering
  km.out <- kmeans(big_cities, k, nstart=50)
  
  # viz
  plot(big_cities, col = km.out$cluster, pch = 16, asp = 1,
       main = paste("K-Means, K=",k), cex = 0.5, ylim=c(-90, 90), xlim=c(-180,180))
  points(km.out$centers, col = 1:k, pch = 3, cex = 2)
  
  # Add points to the existing plot
  points(longitudes, latitudes, pch = 19, col = "purple")
  
  # Add labels
  text(longitudes, latitudes, labels = city_names, pos = 3, cex = 0.5, col = "purple")
  
  # Add the map overlay
  map("world", add=TRUE, col="gray", lwd=0.5)
}

Now one could apply the “elbow method” but here we already know that there are 7 continents (6 if we exclude Antarctica which has no big cities).

So really the final two plots are the ones to pay attention to. That is, the plots with \(k = 6\) and \(k = 7\).

Below, I repeat the code with the minimum population at \(500,000\) and \(1,000,000\) respectively.


Population 500,000


Population 1,000,000