What is the mini-batch K-means clustering algorithm? Why is it considered better than the normal K-means algorithm?

MINI-BATCH K-MEANS

The mini-batch k-means clustering algorithm is a modified version of the k-means algorithm. It uses mini-batches to reduce computation time on large datasets while still attempting to optimize the same clustering objective. To achieve this, mini-batch k-means takes as input mini-batches, which are randomly drawn subsets of the whole dataset.

Mini-batch k-means is considered faster than k-means and is normally used for large datasets. Its main idea is to use small random batches of data of a fixed size, so that they can be stored in memory. In each iteration, a new random sample is drawn from the dataset and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the prototype values and the new data, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of points assigned to a cluster so far. As the number of iterations increases, the effect of new data is reduced, so convergence can be detected when no changes in the clusters occur over several consecutive iterations.

The algorithm takes a small, randomly chosen batch of the dataset for each iteration. Each data point in the batch is assigned to a cluster, depending on the previous locations of the cluster centroids. It then updates the locations of the cluster centroids based on the new points from the batch. The update is a gradient-descent-style update, which is significantly faster than a normal k-means update.
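To make the update rule above concrete, here is a minimal NumPy sketch of mini-batch k-means under the description given in this answer; the function name mini_batch_kmeans and the parameters batch_size and n_iters are illustrative choices, not a library API.

import numpy as np

def mini_batch_kmeans(X, k, batch_size=100, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids from k random points of the dataset.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of points assigned to each centre so far

    for _ in range(n_iters):
        # Draw a small random batch that fits in memory.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest centroid.
        dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Convex-combination update with per-centre learning rate 1 / count:
        # each centroid moves toward the new point by a shrinking step.
        for x, c in zip(batch, labels):
            counts[c] += 1
            lr = 1.0 / counts[c]
            centroids[c] = (1.0 - lr) * centroids[c] + lr * x
    return centroids

# Example usage on synthetic two-cluster data:
X = np.vstack([np.random.randn(500, 2) + [5, 5],
               np.random.randn(500, 2) - [5, 5]])
print(mini_batch_kmeans(X, k=2))

Because the learning rate is the inverse of each cluster's running count, later batches move the centroids less and less, which is exactly why convergence can be detected when the centroids stop changing over several consecutive iterations.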


WHY MINI-BATCH K-MEANS IS CONSIDERED BETTER THAN K-MEANS

It is considered better than the normal K-means algorithm because mini-batch K-means uses mini-batches of samples rather than the entire dataset, which makes it faster on large datasets and perhaps more robust to statistical noise. Mini-batch K-means is faster but gives slightly different results than normal K-means. As the number of clusters and the amount of data increase, the relative saving in computational time also increases; the saving is most noticeable when the number of clusters is very large. The effect of the batch size on computational time is also more evident when the number of clusters is larger. Increasing the number of clusters decreases the similarity of the mini-batch K-means solution to the K-means solution. Although the agreement between the partitions decreases as the number of clusters increases, the objective function does not degrade at the same rate. This means that the final partitions are different but close in quality.
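A simple way to see the speed/quality trade-off is to fit both estimators on the same data and compare run time and the objective value (inertia). The sketch below assumes scikit-learn is installed and uses synthetic data; the exact timings and inertia values will vary with the dataset, batch size, and random seed.

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data with many clusters, where the saving is most noticeable.
X, _ = make_blobs(n_samples=100_000, centers=50, n_features=10, random_state=0)

models = [
    ("k-means", KMeans(n_clusters=50, n_init=3, random_state=0)),
    ("mini-batch k-means", MiniBatchKMeans(n_clusters=50, batch_size=1024,
                                           n_init=3, random_state=0)),
]

for name, model in models:
    t0 = time.perf_counter()
    model.fit(X)
    print(f"{name}: {time.perf_counter() - t0:.2f} s, inertia = {model.inertia_:.1f}")

Typically the mini-batch variant finishes much faster while its inertia is only slightly worse, which illustrates the point above: the partitions differ, but they are close in quality.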


