--- title: "basic usage" author: "Aviezer Lifshitz" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Basic usage of the package. ## Basic usage ```{r, echo = FALSE, message = FALSE} library(dplyr) library(ggplot2) library(tglkmeans) theme_set(theme_classic()) set.seed(60427) ``` First, let's create 5 clusters normally distributed around 1 to 5, with sd of 0.3: ```{r} data <- simulate_data(n = 100, sd = 0.3, nclust = 5, dims = 2) data ``` This is how our data looks like: ```{r, fig.show='hold'} data %>% ggplot(aes(x = V1, y = V2, color = factor(true_clust))) + geom_point() + scale_color_discrete(name = "true cluster") ``` Now we can cluster it using kmeans++: ```{r} rownames(data) <- data$id data_for_clust <- data %>% select(starts_with("V")) km <- TGL_kmeans_tidy(data_for_clust, k = 5, metric = "euclid", verbose = TRUE ) ``` The returned list contains 3 fields: ```{r} names(km) ``` `km$centers` contains a tibble with `clust` column and the cluster centers: ```{r} km$centers ``` clusters are numbered according to `order_func` (see 'Custom cluster ordering' section). `km$cluster` contains tibble with `id` column with the observation id (`1:n` if no id column was supplied), and `clust` column with the observation assigned cluster: ```{r} km$cluster ``` `km$size` contains tibble with `clust` column and `n` column with the number of points in each cluster: ```{r} km$size ``` We can now check our clustering performance - fraction of observations that were classified correctly (Note that `match_clusters` function is internal to the package and is used only in this vignette): ```{r} d <- tglkmeans:::match_clusters(data, km, 5) sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust)) ``` And plot the results: ```{r, fig.show='hold'} d %>% ggplot(aes(x = V1, y = V2, color = factor(new_clust), shape = factor(true_clust))) + geom_point() + scale_color_discrete(name = "cluster") + scale_shape_discrete(name = "true cluster") + geom_point(data = km$centers, size = 7, color = "black", shape = "X") ``` ## Custom cluster ordering By default, the clusters where ordered using the following function: `hclust(dist(cor(t(centers))))` - hclust of the euclidean distance of the correlation matrix of the centers. We can supply our own function to order the clusters using `reorder_func` argument. The function would be applied to each center and he clusters would be ordered by the result. ```{r} km <- TGL_kmeans_tidy(data %>% select(id, starts_with("V")), k = 5, metric = "euclid", verbose = FALSE, reorder_func = median ) km$centers ``` ## Missing data tglkmeans can deal with missing data, as long as at least one dimension is not missing. 
## Missing data

tglkmeans can deal with missing data, as long as at least one dimension is not missing. For example:

```{r}
data$V1[sample(1:nrow(data), round(nrow(data) * 0.2))] <- NA
data
```

```{r}
km <- TGL_kmeans_tidy(data %>% select(id, starts_with("V")),
    k = 5,
    metric = "euclid",
    verbose = FALSE
)
d <- tglkmeans:::match_clusters(data, km, 5)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```

And plotting the results (without the NAs) we get:

```{r, fig.show='hold'}
d %>%
    ggplot(aes(x = V1, y = V2, color = factor(new_clust), shape = factor(true_clust))) +
    geom_point() +
    scale_color_discrete(name = "cluster") +
    scale_shape_discrete(name = "true cluster") +
    geom_point(data = km$centers, size = 7, color = "black", shape = "X")
```

## High dimensions

Let's move to higher dimensions (and higher noise):

```{r}
data <- simulate_data(n = 100, sd = 0.3, nclust = 30, dims = 300)
km <- TGL_kmeans_tidy(data %>% select(id, starts_with("V")),
    k = 30,
    metric = "euclid",
    verbose = FALSE,
    id_column = TRUE
)
```

Note that here we supplied `id_column = TRUE` to indicate that the first column is the id column.

```{r}
d <- tglkmeans:::match_clusters(data, km, 30)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```

## Comparison with R vanilla kmeans

Let's compare it to R's vanilla kmeans:

```{r}
km_standard <- kmeans(data %>% select(starts_with("V")), 30)
km_standard$clust <- tibble(id = 1:nrow(data), clust = km_standard$cluster)

d <- tglkmeans:::match_clusters(data, km_standard, 30)
sum(d$true_clust == d$new_clust, na.rm = TRUE) / sum(!is.na(d$new_clust))
```

We can see that kmeans++ recovers the true clusters significantly better than R's vanilla kmeans.

## Random seed

We can set the seed for reproducible results:

```{r}
km1 <- TGL_kmeans_tidy(data %>% select(starts_with("V")),
    k = 30,
    metric = "euclid",
    verbose = FALSE,
    seed = 60427
)
km2 <- TGL_kmeans_tidy(data %>% select(starts_with("V")),
    k = 30,
    metric = "euclid",
    verbose = FALSE,
    seed = 60427
)

all(km1$centers[, -1] == km2$centers[, -1])
```
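As a quick sanity check (a small sketch of ours, not from the package documentation), the per-observation cluster assignments should also agree when the same seed is used:

```{r}
# with an identical seed, the assignments should match observation by observation
all(km1$cluster$clust == km2$cluster$clust)
```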