How to generate Association Rules from Frequent Itemsets?

 Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using Eq. (6.4) for confidence, which we show again here for completeness:

confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)

The conditional probability is expressed in terms of itemset support counts, where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows:
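As a minimal sketch of this definition, the snippet below counts supports over a small made-up transaction list and divides the two counts. The transactions and the helper names `support_count` and `confidence` are illustrative assumptions, not data from Table 6.1.

```python
from typing import Iterable, List, Set


def support_count(itemset: Iterable[str], transactions: List[Set[str]]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(a: Iterable[str], b: Iterable[str], transactions: List[Set[str]]) -> float:
    """confidence(A => B) = support_count(A u B) / support_count(A)."""
    a, b = set(a), set(b)
    denom = support_count(a, transactions)
    if denom == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support_count(a | b, transactions) / denom


# Hypothetical toy data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# bread occurs in 3 transactions, bread and milk together in 2.
conf = confidence({"bread"}, {"milk"}, transactions)
```

Here `conf` is the fraction of bread-containing transactions that also contain milk, which is exactly the conditional probability P(B | A) estimated from counts.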

■ For each frequent itemset I, generate all nonempty subsets of I.

■ For every nonempty proper subset s of I, output the rule "s ⇒ (I - s)" if support_count(I) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold. (Taking s = I would leave an empty right-hand side, so it is skipped.)

Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
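The two steps above can be sketched in a few lines, assuming the frequent itemsets and their counts are already held in a Python dict (a hash table) keyed by `frozenset`. The function name `generate_rules` and the sample counts are hypothetical and chosen only for illustration; the sketch also relies on every nonempty subset of a frequent itemset being present in the table, which holds because any subset of a frequent itemset is itself frequent.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def generate_rules(counts: Dict[FrozenSet[str], int], min_conf: float) -> List[Rule]:
    """Return (antecedent, consequent, confidence) for every strong rule.

    `counts` maps each frequent itemset to its support count and is
    assumed to contain every nonempty subset of those itemsets as well.
    """
    rules: List[Rule] = []
    for itemset, count in counts.items():
        # Sizes 1 .. len-1 give the nonempty proper subsets,
        # so the consequent (itemset - s) is never empty.
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                s = frozenset(subset)
                conf = count / counts[s]  # hash-table lookup of the subset's count
                if conf >= min_conf:
                    rules.append((s, itemset - s, conf))
    return rules


# Hypothetical counts, for illustration only.
counts = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 5,
    frozenset({"a", "b"}): 3,
}
rules = generate_rules(counts, min_conf=0.7)
```

With these counts only a ⇒ b passes the threshold (3/4 = 75%), while b ⇒ a is dropped (3/5 = 60%). No support check is needed inside the loop because every itemset in the table is already frequent.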

Example: Generating association rules. Let's try an example based on the transactional data for AllElectronics shown before in Table 6.1. The data contain frequent itemset X = {I1, I2, I5}. What are the association rules that can be generated from X? The nonempty proper subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence:

{I1, I2} ⇒ I5, confidence = 2/4 = 50%

{I1, I5} ⇒ I2, confidence = 2/2 = 100%

{I2, I5} ⇒ I1, confidence = 2/2 = 100%

I1 ⇒ {I2, I5}, confidence = 2/6 = 33%

I2 ⇒ {I1, I5}, confidence = 2/7 = 29%

I5 ⇒ {I1, I2}, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong. Note that, unlike conventional classification rules, association rules can contain more than one conjunct on the right side of the rule.
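The arithmetic in this example can be checked with a short script. The support counts below are read off the numerators and denominators of the six confidences listed in the example rather than recomputed from Table 6.1, and the variable names are illustrative.

```python
from itertools import combinations

# Support counts taken from the fractions in the example.
counts = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

X = frozenset({"I1", "I2", "I5"})
min_conf = 0.70

strong = []
for size in range(1, len(X)):  # nonempty proper subsets of X only
    for subset in combinations(sorted(X), size):
        s = frozenset(subset)
        conf = counts[X] / counts[s]
        if conf >= min_conf:
            strong.append((sorted(s), sorted(X - s), conf))

for antecedent, consequent, conf in strong:
    print(f"{antecedent} => {consequent}: {conf:.0%}")
```

Only the three rules whose antecedents are {I5}, {I1, I5}, and {I2, I5} survive the 70% threshold, each with confidence 100%, matching the rules identified as strong.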
