In this project, I will illustrate the K-means clustering algorithm to produce an interesting visual. Data is sourced from the mdsr library.
Begin by loading the data
data("world_cities")
str(world_cities)
## tibble [4,428 × 9] (S3: tbl_df/tbl/data.frame)
## $ geoname_id : num [1:4428] 291074 292223 292672 292932 292968 ...
## $ name : chr [1:4428] "Ras Al Khaimah City" "Dubai" "Sharjah" "Ajman City" ...
## $ latitude : num [1:4428] 25.8 25.1 25.3 25.4 24.5 ...
## $ longitude : num [1:4428] 55.9 55.3 55.4 55.5 54.4 ...
## $ country : chr [1:4428] "AE" "AE" "AE" "AE" ...
## $ country_region : chr [1:4428] "05" "03" "06" "02" ...
## $ population : num [1:4428] 351943 2956587 1324473 490035 603492 ...
## $ timezone : chr [1:4428] "Asia/Dubai" "Asia/Dubai" "Asia/Dubai" "Asia/Dubai" ...
## $ modification_date: Date[1:4428], format: "2019-09-09" "2019-08-28" ...
## - attr(*, "problems")= tibble [1,001 × 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:1001] 1154 2694 2742 7567 7567 7608 7608 7650 7650 7652 ...
## ..$ col : chr [1:1001] "admin3_code" "admin4_code" "admin4_code" "admin3_code" ...
## ..$ expected: chr [1:1001] "a double" "a double" "a double" "no trailing characters" ...
## ..$ actual : chr [1:1001] "BLG33-34" "CN" "LN" "B2" ...
## ..$ file : chr [1:1001] "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" "'/tmp/RtmpHweL4x/cities15000/cities15000.txt'" ...
Let’s look at cities with a population of \(200,000\) or more.
min_pop <- 200000
Here let’s use dplyr to filter out cities with population less than \(200,000\) and then keep only the geographic coordinates in degrees.
big_cities <- world_cities %>%
filter(population >= min_pop) %>%
select(longitude, latitude)
What does this look like on the map?
plot(big_cities, col = 'black', pch = 16, asp = 1,
main = "Cities with population over 200,000",
cex = 0.5, ylim=c(-90, 90), xlim=c(-180,180), add=TRUE)
# Add the map overlay
map("world", add=TRUE, col="gray", lwd=0.5)
Here are some plots of the clustering results.
# Setup plotting
par(mfrow=c(1,2))
# Define cities with lat/lon
city_names <- c("Berlin", "Sydney", "Dallas", "Cape Town", "Beijing", "São Paulo")
latitudes <- c(52.52, -33.87, 32.78, -33.92, 39.90, -23.55) # in degrees
longitudes <- c(13.40, 151.21, -96.80, 18.42, 116.41, -46.63) # in degrees
# Repeat with k = 2, ..., 7
for(k in 2:7){
# clustering
km.out <- kmeans(big_cities, k, nstart=50)
# viz
plot(big_cities, col = km.out$cluster, pch = 16, asp = 1,
main = paste("K-Means, K=",k), cex = 0.5, ylim=c(-90, 90), xlim=c(-180,180))
points(km.out$centers, col = 1:k, pch = 3, cex = 2)
# Add points to the existing plot
points(longitudes, latitudes, pch = 19, col = "purple")
# Add labels
text(longitudes, latitudes, labels = city_names, pos = 3, cex = 0.5, col = "purple")
# Add the map overlay
map("world", add=TRUE, col="gray", lwd=0.5)
}
Now one could apply the “elbow method” but here we already know that there are 7 continents (6 if we exclude Antarctica which has no big cities).
So really the final two plots are the ones to pay attention to. That is, the plots with \(k = 6\) and \(k = 7\).
Below, I repeat the code with the minimum population at \(500,000\) and \(1,000,000\) respectively.