Explain K-medoid Clustering Algorithm with example.

K-Medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters with k known a priori. It starts from an initial set of Medoids and iteratively replaces one of the Medoids by one of the non-Medoids if it improves the total distance of the resulting clustering. A popular realization of K-Medoids clustering is PAM (Partitioning Around Medoids) algorithm.



Advantages of K-Medoids
  • It is simple to understand and easy to implement. K-Medoid Algorithm is fast and converges in a fixed number of steps.
  • PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages of K-Medoids
  • Pam works efficiently for small data sets but does not scale well for large data sets. It may obtain different results for different runs on the same dataset because the first k medoids are chosen randomly.
  • It is not suitable for clustering arbitrary-shaped clusters because it relies on minimizing the distances between the non-medoid data points and the medoid.


 K-medoid Clustering Algorithm

A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal. i.e. it is the most centrally located point in the cluster. In contrast to the K-means algorithm, the K-medoid algorithm chooses data points as centers and works with an arbitrary matrix of distances instead of l2.K-medoid is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori. It is more robust to noise and outliers compared to because it may minimize a sum of pair-wise dissimilarities instead of a sum of squared Euclidean distances.

Algorithms

The most common realization of k-medoid clustering is the Partitioning around Medoid (PAM) algorithm and is as follows:

1. Initialize: randomly select (without replacement) k of the n data points as the medoid

2. Associate each data point to the closest medoid.

3. While the cost of the configuration decreases:

- For each medoid m, for each non-medoid data point o:

• Swap m and o, re-compute the cost (sum of distances of points to their medoid)

• If the total cost of the configuration increased in the previous step, undo the swap.


Example

Cluster the following data set of ten objects into two clusters i.e. k = 2. Data Points are {(1,3), (4,2), (6,2), (3,5), (4,1)}

Solution:-

Let p1=(1,3) p2=(4,5) p3=(6,3) p4=(3,4) p5=(2,1)

Initial step

Let m1=p1=(1,3) and m2=p4=(3,4) are two initial medoid

Iteration 1

Calculate distance between clusters centers and each data point

d(m1, p2) = 3 + 2 + 5  d(m2, p2) = 1+ 0 + 1

d(m1, p3) = 5 + 0 + 5  d(m2, p3) = 3 + 1 + 4

d(m1, p5) = 1 + 2 + 3  d(m2, p5) = 1 + 3 + 4

Thus after first iteration

Cluster 1= {p1, p5}

Cluster 2={p2,p3,p4}

Total Cost= {d(m1, p5)+d(m2, p2)+d(m2, p3)}=10


Iteration 2

Lets swap m1 and p2

Now m1=p2= (4,5) and m2=p4=(3,4) are two medoid

Let p1=(1,3) p2=(4,5) p3=(6,3) p4=(3,4) p5=(2,1)

Calculate distance between new cluster centers and each data points

d(m1, p1) = 3 + 2 + 5  d(m2, p1) = 2 +1 + 3

d(m1, p3) = 2 + 1 + 3   d(m2, p3) = 3 + 1 + 4

d(m1, p5) = 2 + 4 + 6   d(m2, p5) = 1 + 3 + 4

Thus after second iteration

Cluster 1= {p2,p3}

Cluster 2={p1,p4,p5}

Total Cost= {d(m1, p3)+d(m2, p1)+d(m2, p5)}=3+3+4=10

Same as previous, thus no undo swap

Repeat this process till the cost of configuration do not decrease

Comments

Popular posts from this blog

Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. a) Draw a snowflake schema diagram for the data warehouse. b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each BigUniversity student. c) If each dimension has five levels (including all), such as “student < major < status < university < all”, how many cuboids will this cube contain (including the base and apex cuboids)?

Pure Versus Partial EC

Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where a charge is the fee that a doctor charges a patient for a visit. a) Draw a schema diagram for the above data warehouse using one of the schemas. [star, snowflake, fact constellation] b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? c) To obtain the same list, write an SQL query assuming the data are stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge)