Explain K-medoid Clustering Algorithm with example.
K-Medoids is also a partitioning technique of clustering that divides a data set of n objects into k clusters, with k known a priori. It starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering. A popular realization of K-Medoids clustering is the PAM (Partitioning Around Medoids) algorithm.
- It is simple to understand and easy to implement. The K-Medoid algorithm is fast and converges in a finite number of steps.
- PAM is less sensitive to outliers than other partitioning algorithms.
- PAM works efficiently for small data sets but does not scale well to large data sets. It may produce different results on different runs over the same data set because the initial k medoids are chosen randomly.
- It is not suitable for discovering clusters of arbitrary shape because it relies on minimizing the distances between the non-medoid data points and their medoid.
K-medoid Clustering Algorithm
A medoid can be defined as the object of a cluster whose average dissimilarity to all the other objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. In contrast to the K-means algorithm, the K-medoid algorithm chooses actual data points as centers and works with an arbitrary matrix of distances instead of the l2 norm. K-medoid is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters, with k known a priori. It is more robust to noise and outliers than K-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
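As a small illustration of this definition (the function name and the use of Manhattan distance here are our own choices, anticipating the worked example below), the medoid is simply the point with the smallest summed distance to all the other points in the cluster:

```python
import numpy as np

def medoid(points):
    """Return the point whose total Manhattan distance to all
    other points in the cluster is minimal (the medoid)."""
    pts = np.asarray(points)
    # dist[i, j] = |x_i - x_j| + |y_i - y_j| for every pair of points
    dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)
    return pts[dist.sum(axis=1).argmin()]

print(medoid([(1, 3), (4, 5), (6, 3), (3, 4), (2, 1)]))  # -> [3 4]
```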
Algorithms
The most common realization of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which proceeds as follows (a code sketch of these steps appears after the list):
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point to the closest medoid.
3. While the cost of the configuration decreases:
- For each medoid m, for each non-medoid data point o:
• Swap m and o, re-compute the cost (sum of distances of points to their medoid)
• If the total cost of the configuration increased in the previous step, undo the swap.
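The following is a minimal, self-contained Python sketch of these three steps, assuming Manhattan (L1) distance; the function name pam and its interface are our own illustration, not a library API:

```python
import numpy as np

def pam(points, k, seed=0):
    """Minimal PAM sketch: random init, then greedy medoid/non-medoid swaps."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pre-compute the pairwise Manhattan distance matrix.
    dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))  # step 1: random init

    def cost(meds):
        # Step 2: each point joins its closest medoid; the cost is the
        # total distance of all points to their assigned medoids.
        return dist[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:  # step 3: keep swapping while the cost decreases
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o          # tentatively swap medoid i with point o
                c = cost(trial)
                if c < best:          # keep the swap only if the cost drops
                    medoids, best = trial, c
                    improved = True
    labels = dist[:, medoids].argmin(axis=1)
    return pts[medoids], labels, best
```

On the five points of the example below, pam([(1, 3), (4, 5), (6, 3), (3, 4), (2, 1)], k=2) converges to a configuration with total cost 9, matching the hand calculation.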
Example
Cluster the following data set of five objects into two clusters, i.e. k = 2. The data points are {(1,3), (4,5), (6,3), (3,4), (2,1)}.
Solution:
Let p1 = (1,3), p2 = (4,5), p3 = (6,3), p4 = (3,4), p5 = (2,1)
Initial step
Let m1 = p1 = (1,3) and m2 = p4 = (3,4) be the two initial medoids.
Iteration 1
Calculate the Manhattan distance, d(a, b) = |xa − xb| + |ya − yb|, between the medoids and each non-medoid data point:
d(m1, p2) = 3 + 2 = 5    d(m2, p2) = 1 + 1 = 2
d(m1, p3) = 5 + 0 = 5    d(m2, p3) = 3 + 1 = 4
d(m1, p5) = 1 + 2 = 3    d(m2, p5) = 1 + 3 = 4
Thus after the first iteration
Cluster 1 = {p1, p5}
Cluster 2 = {p2, p3, p4}
Total cost = d(m1, p5) + d(m2, p2) + d(m2, p3) = 3 + 2 + 4 = 9
Iteration 2
Let us swap m1 and p2.
Now m1 = p2 = (4,5) and m2 = p4 = (3,4) are the two medoids.
Calculate the distance between the new medoids and each non-medoid data point:
d(m1, p1) = 3 + 2 = 5    d(m2, p1) = 2 + 1 = 3
d(m1, p3) = 2 + 2 = 4    d(m2, p3) = 3 + 1 = 4
d(m1, p5) = 2 + 4 = 6    d(m2, p5) = 1 + 3 = 4
Thus after the second iteration
Cluster 1 = {p2, p3}
Cluster 2 = {p1, p4, p5}
(p3 is equidistant from both medoids, so it is arbitrarily assigned to m1.)
Total cost = d(m1, p3) + d(m2, p1) + d(m2, p5) = 4 + 3 + 4 = 11
Since the total cost increased (11 > 9), the swap is undone and the previous medoids m1 = (1,3) and m2 = (3,4) are kept.
Repeat this process, examining every possible medoid/non-medoid swap, until the cost of the configuration no longer decreases.
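To double-check the arithmetic above, a short Python sketch (the helper names manhattan and total_cost are our own) recomputes the cost of both configurations:

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = {"p1": (1, 3), "p2": (4, 5), "p3": (6, 3), "p4": (3, 4), "p5": (2, 1)}

def total_cost(medoids):
    # Each non-medoid point contributes its distance to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points.values() if p not in medoids)

print(total_cost([points["p1"], points["p4"]]))  # initial medoids -> 9
print(total_cost([points["p2"], points["p4"]]))  # after the swap  -> 11
```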