Title: | Efficient Implementation of K-Means++ Algorithm |
---|---|
Description: | Efficient implementation of the K-Means++ algorithm. For more information see (1) "k-means++: The Advantages of Careful Seeding" by David Arthur and Sergei Vassilvitskii (2007), Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027-1035, and (2) "The Effectiveness of Lloyd-Type Methods for the k-Means Problem" by Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman and Chaitanya Swamy <doi:10.1145/2395116.2395117>. |
Authors: | Aviezer Lifshitz [aut, cre], Amos Tanay [aut], Weizmann Institute of Science [cph] |
Maintainer: | Aviezer Lifshitz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.7.9000 |
Built: | 2025-01-20 11:59:16 UTC |
Source: | https://github.com/tanaylab/tglkmeans |
This function takes a matrix and downsamples it to a target number of samples. It uses a random seed for reproducibility and allows for removing columns with small sums.
downsample_matrix( mat, target_n = NULL, target_q = NULL, seed = NULL, remove_columns = FALSE )
mat |
An integer matrix to be downsampled. Can be a matrix or sparse matrix (dgCMatrix).
If the matrix contains NAs, the function will run significantly slower. Values that are
not integers will be coerced to integers using |
target_n |
The target number of samples to downsample to. |
target_q |
A target quantile of sums to downsample to. Only one of 'target_n' or 'target_q' can be provided. |
seed |
The random seed for reproducibility (default is NULL) |
remove_columns |
Logical indicating whether to remove columns with small sums (default is FALSE) |
The downsampled matrix
    mat <- matrix(1:12, nrow = 4)
    downsample_matrix(mat, 2)

    # Remove columns with small sums
    downsample_matrix(mat, 12, remove_columns = TRUE)

    # sparse matrix
    mat_sparse <- Matrix::Matrix(mat, sparse = TRUE)
    downsample_matrix(mat_sparse, 2)

    # with a quantile
    downsample_matrix(mat, target_q = 0.5)
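To make the idea of downsampling a count matrix concrete, here is a minimal base R sketch. It is not the package's implementation: the helper `downsample_column_sketch()` is hypothetical, and the choice to return a column unchanged when its sum does not exceed the target is an assumption made only for this illustration.

```r
# Hypothetical helper (not part of tglkmeans): downsample one column of
# non-negative integer counts so that it sums to `target_n`.
downsample_column_sketch <- function(counts, target_n) {
  total <- sum(counts)
  if (total <= target_n) {
    # Assumption of this sketch: columns that are already small are kept as-is
    return(counts)
  }
  # One label per counted unit: row i appears counts[i] times
  labels <- rep(seq_along(counts), times = counts)
  # Keep `target_n` units, sampled without replacement
  kept <- labels[sample.int(length(labels), size = target_n)]
  # Count how many kept units fall in each row
  tabulate(kept, nbins = length(counts))
}

set.seed(60427)
mat <- matrix(1:12, nrow = 4)
ds <- apply(mat, 2, downsample_column_sketch, target_n = 10)
colSums(ds)
```

Every column of `ds` sums to 10 and no entry exceeds the corresponding entry of `mat`, because units are only ever removed.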
Creates nclust clusters normally distributed around 1:nclust
simulate_data( n = 100, sd = 0.3, nclust = 30, dims = 2, frac_na = NULL, add_true_clust = TRUE, id_column = TRUE )
n |
number of observations per cluster |
sd |
standard deviation of the observations around each cluster center |
nclust |
number of clusters |
dims |
number of dimensions |
frac_na |
fraction of NA in the first dimension |
add_true_clust |
add a column with the true cluster ids |
id_column |
add a column with the id |
simulated data
    simulate_data(n = 100, sd = 0.3, nclust = 5, dims = 2)

    # add 20% missing data
    simulate_data(n = 100, sd = 0.3, nclust = 5, dims = 2, frac_na = 0.2)
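The description "clusters normally distributed around 1:nclust" can be sketched in base R as follows. The function `simulate_clusters_sketch()` is a hypothetical stand-in, and its output column names (`id`, `V1`, `V2`, `true_clust`) are assumptions rather than the documented output of `simulate_data()`.

```r
# Hypothetical stand-in for simulate_data(): n points per cluster,
# with cluster i centred at the value i in every dimension.
simulate_clusters_sketch <- function(n = 100, sd = 0.3, nclust = 5, dims = 2) {
  true_clust <- rep(seq_len(nclust), each = n)
  # The matrix is filled column by column, so the vector of means is
  # repeated once per dimension.
  coords <- matrix(
    rnorm(n * nclust * dims, mean = rep(true_clust, times = dims), sd = sd),
    ncol = dims
  )
  colnames(coords) <- paste0("V", seq_len(dims))
  data.frame(id = seq_len(n * nclust), coords, true_clust = true_clust)
}

set.seed(60427)
sim <- simulate_clusters_sketch(n = 100, sd = 0.3, nclust = 5, dims = 2)
head(sim)
# Per-cluster means of the first dimension should be close to 1:5
tapply(sim$V1, sim$true_clust, mean)
```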
kmeans++ with return value similar to R kmeans
TGL_kmeans( df, k, metric = "euclid", max_iter = 40, min_delta = 0.0001, verbose = FALSE, keep_log = FALSE, id_column = FALSE, reorder_func = "hclust", hclust_intra_clusters = FALSE, seed = NULL, use_cpp_random = FALSE )
df |
a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used. |
k |
number of clusters. Note that in some cases the algorithm might return fewer clusters than k. |
metric |
distance metric for kmeans++ seeding. can be 'euclid', 'pearson' or 'spearman' |
max_iter |
maximal number of iterations |
min_delta |
minimal change in assignments (fraction out of all observations) to continue iterating |
verbose |
display algorithm messages |
keep_log |
keep algorithm messages in 'log' field |
id_column |
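logical. If TRUE, the first column of df contains the observation ids; otherwise the rownames are used. |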
reorder_func |
function to reorder the clusters. operates on each center and orders by the result. e.g. |
hclust_intra_clusters |
run hierarchical clustering within each cluster and return an ordering of the observations. |
seed |
seed for the c++ random number generator |
use_cpp_random |
use the C++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's. |
list with the following components:
A vector of integers (from '1:k') indicating the cluster to which each point is allocated.
A matrix of cluster centers.
The number of points in each cluster.
messages from the algorithm run (only if keep_log = TRUE).
A vector of integers with the new ordering of the observations (only if hclust_intra_clusters = TRUE).
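The roles of `max_iter` and `min_delta` can be illustrated with a small Lloyd-style loop in base R. This is a sketch under stated assumptions, not the package's C++ code: `lloyd_sketch()` is a hypothetical function, it uses squared Euclidean distance only, and it stops as soon as the fraction of observations that changed cluster falls below `min_delta`.

```r
# Hypothetical Lloyd-style iteration illustrating max_iter and min_delta.
lloyd_sketch <- function(x, centers, max_iter = 40, min_delta = 1e-4) {
  x <- as.matrix(x)
  k <- nrow(centers)
  assign_old <- rep(0L, nrow(x)) # 0 = not yet assigned
  for (iter in seq_len(max_iter)) {
    # n x k matrix of squared Euclidean distances to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    assign_new <- max.col(-d2, ties.method = "first") # index of nearest center
    # Fraction of observations whose assignment changed in this iteration
    delta <- mean(assign_new != assign_old)
    assign_old <- assign_new
    if (delta < min_delta) break
    # Move each non-empty cluster's center to the mean of its members
    for (j in seq_len(k)) {
      members <- assign_new == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = assign_old, centers = centers, iterations = iter)
}

set.seed(60427)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)
res <- lloyd_sketch(x, centers = x[c(1, 51), , drop = FALSE])
table(res$cluster)
res$iterations
```

Raising `min_delta` makes the loop stop earlier at the cost of a less settled assignment; `max_iter` caps the work when assignments keep changing.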
    # create 5 clusters normally distributed around 1:5
    d <- simulate_data(
        n = 100, sd = 0.3, nclust = 5, dims = 2,
        add_true_clust = FALSE, id_column = FALSE
    )
    head(d)

    # cluster
    km <- TGL_kmeans(d, k = 5, metric = "euclid", verbose = TRUE)
    names(km)
    km$centers
    head(km$cluster)
    km$size
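For readers unfamiliar with k-means++ seeding, the sketch below shows the classical D² sampling step described by Arthur and Vassilvitskii, written in base R with Euclidean distance. It is an illustration only: `kmeanspp_seed_sketch()` is a hypothetical function, and the seeding actually used inside `TGL_kmeans` may differ from it.

```r
# Hypothetical k-means++ seeding (D^2 sampling) with Euclidean distance.
kmeanspp_seed_sketch <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  chosen <- integer(k)
  # First center: one observation chosen uniformly at random
  chosen[1] <- sample.int(n, 1)
  # Squared distance from every point to its nearest chosen center
  d2 <- rowSums(sweep(x, 2, x[chosen[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Next center: sampled with probability proportional to d2
    chosen[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[chosen[j], ])^2))
  }
  x[chosen, , drop = FALSE]
}

set.seed(60427)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2)
)
seeds <- kmeanspp_seed_sketch(x, k = 2)
# The seeds can be handed to base R's kmeans() as starting centers
km_base <- stats::kmeans(x, centers = seeds)
km_base$size
```

Because far-away points are more likely to be picked, the initial centers tend to be spread across the data, which is the motivation for this seeding over uniform random starts.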
TGL kmeans with 'tidy' output
TGL_kmeans_tidy( df, k, metric = "euclid", max_iter = 40, min_delta = 0.0001, verbose = FALSE, keep_log = FALSE, id_column = FALSE, reorder_func = "hclust", add_to_data = FALSE, hclust_intra_clusters = FALSE, seed = NULL, use_cpp_random = FALSE )
df |
a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an id for each observation (if id_column is TRUE); otherwise the rownames are used. |
k |
number of clusters. Note that in some cases the algorithm might return fewer clusters than k. |
metric |
distance metric for kmeans++ seeding. can be 'euclid', 'pearson' or 'spearman' |
max_iter |
maximal number of iterations |
min_delta |
minimal change in assignments (fraction out of all observations) to continue iterating |
verbose |
display algorithm messages |
keep_log |
keep algorithm messages in 'log' field |
id_column |
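logical. If TRUE, the first column of df contains the observation ids; otherwise the rownames are used. |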
reorder_func |
function to reorder the clusters. operates on each center and orders by the result. e.g. |
add_to_data |
return also the original data frame with an extra 'clust' column with the cluster ids ('id' is the first column) |
hclust_intra_clusters |
run hierarchical clustering within each cluster and return an ordering of the observations. |
seed |
seed for the c++ random number generator |
use_cpp_random |
use the C++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's. |
list with the following components:
tibble with 'id' column with the observation id ('1:n' if no id column was supplied), and 'clust' column with the observation assigned cluster.
tibble with 'clust' column and the cluster centers.
tibble with 'clust' column and 'n' column with the number of points in each cluster.
tibble with 'clust' column and the original data frame (only if add_to_data = TRUE).
messages from the algorithm run (only if keep_log = TRUE).
tibble with 'id' column, 'clust' column, 'order' column with a new ordering of the observations and 'intra_clust_order' column with the order within each cluster (only if hclust_intra_clusters = TRUE).
    # create 5 clusters normally distributed around 1:5
    d <- simulate_data(
        n = 100, sd = 0.3, nclust = 5, dims = 2,
        add_true_clust = FALSE, id_column = FALSE
    )
    head(d)

    # cluster
    km <- TGL_kmeans_tidy(d, k = 5, metric = "euclid", verbose = TRUE)
    km
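To show what a 'tidy' layout of a clustering result looks like, the sketch below reshapes the output of base R's `stats::kmeans()` into long-form tables with `id`, `clust` and `n` columns. It uses plain data frames instead of tibbles, and the list element names are assumptions chosen for the illustration, not a statement of what `TGL_kmeans_tidy` returns.

```r
set.seed(60427)
x <- rbind(
  matrix(rnorm(40, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2)
)
km_base <- stats::kmeans(x, centers = 2)

# Hypothetical tidy layout: one table per component, keyed by 'clust'
tidy_sketch <- list(
  # one row per observation: its id and assigned cluster
  cluster = data.frame(id = seq_len(nrow(x)), clust = km_base$cluster),
  # one row per cluster: the cluster id followed by its center coordinates
  centers = data.frame(clust = seq_len(nrow(km_base$centers)), km_base$centers),
  # one row per cluster: the number of observations assigned to it
  size = data.frame(clust = seq_along(km_base$size), n = km_base$size)
)

# Tables keyed by 'clust' can be joined back to the observations
merged <- merge(tidy_sketch$cluster, tidy_sketch$size, by = "clust")
head(merged)
```

Keeping `clust` as an explicit column in every table is what makes joins like the `merge()` call above straightforward.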
Set parallel threads
tglkmeans.set_parallel(thread_num)
thread_num |
number of threads. use '1' for non parallel behavior |
None
tglkmeans.set_parallel(8)