How Does Classification Work?/LEARNING AND TESTING OF CLASSIFICATION


Data classification is a two-step process. The two steps are:

1. Building the classifier: This step is also known as the model construction, training, or learning phase. In this step, a model or classifier is constructed by applying a classification algorithm to a training set made up of database tuples and their associated class labels. The classifier constructed in this step can take the form of a decision tree, a set of if-then rules, a weight-adjusted neural network, mathematical formulae, etc. For example, consider a class-labeled training dataset of employees with Name, Rank, and Years as input attributes and Tenured as the class attribute with two possible values, no and yes. In the learning step, the classification algorithm learns from this training dataset a model or rule such as IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'.


2. Using the classifier for classification: After the model has been constructed, we can use it for classification only if it is accurate enough for the demands of our application. So, before using the model, we first need to test its accuracy. To measure the accuracy of a model we need test data. The test data are randomly selected from the overall dataset and have the same structure as the training data, i.e., the test data are also already labeled. However, the test data should be independent of the training dataset; otherwise the accuracy estimate will be overly optimistic, because the model tends to overfit the data it was trained on. To measure the accuracy of the model, the known label of each test tuple is compared with the class predicted by the model. The accuracy rate is the percentage of test samples that are correctly classified by the model.

Accuracy = Number of correct classifications / Total number of test cases

If the accuracy calculated in this way is acceptable, the model can then be used to classify new data tuples whose class labels are not known. For example:
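The following is a minimal sketch of both steps in Python, assuming scikit-learn is available. The employee tuples, their rank encoding, and the train/test split are made-up illustrations of the Rank/Years/Tenured example above (the Name attribute is left out because it is only an identifier), not data from the text.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: build the classifier from class-labeled training tuples.
# Rank is encoded numerically: 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor.
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]   # [rank, years]
y_train = ["no", "yes", "yes", "yes", "no", "no"]            # tenured?

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)   # learns a rule equivalent to: rank = Professor OR years > 6 -> yes

# Step 2: test the classifier on independent, already-labeled test tuples.
X_test = [[0, 2], [2, 5], [1, 8], [1, 6]]
y_test = ["no", "yes", "yes", "yes"]
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy = {accuracy:.2f}")   # correct classifications / total test cases = 3/4 = 0.75

# If the accuracy is acceptable, classify new tuples whose labels are unknown.
print(model.predict([[2, 8]]))        # e.g. a Professor with 8 years of service -> 'yes'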


                               OR.

Data classification is a two-step process:

(1) Model construction

Training data are analyzed by a classification algorithm, and a classifier is built describing a predetermined set of data classes or concepts. This step is also called the training phase or learning stage.




(2) Model usage

Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
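As a small illustration of the model-usage step, the sketch below applies the learned if-then rule from the earlier example (IF rank = 'Professor' OR years > 6 THEN tenured = 'yes') to held-out test tuples, computes the accuracy, and only classifies a new tuple if the accuracy is acceptable. The tuples and the 70% acceptance threshold are assumptions for illustration, not values from the text.

# Classifier expressed as the learned if-then rule.
def classify(t):
    return "yes" if t["rank"] == "Professor" or t["years"] > 6 else "no"

test_data = [  # already-labeled tuples, held out from training
    {"rank": "Assistant Prof", "years": 2, "tenured": "no"},
    {"rank": "Professor",      "years": 5, "tenured": "yes"},
    {"rank": "Associate Prof", "years": 8, "tenured": "yes"},
    {"rank": "Associate Prof", "years": 6, "tenured": "yes"},
]

correct = sum(classify(t) == t["tenured"] for t in test_data)
accuracy = correct / len(test_data)           # number correct / total test cases
print(f"accuracy = {accuracy:.2f}")           # 3/4 = 0.75 here

if accuracy >= 0.70:                          # acceptance threshold is application-specific
    new_tuple = {"rank": "Professor", "years": 8}
    print("new tuple classified as:", classify(new_tuple))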



                                                       OR,

With the help of the bank loan application example discussed earlier, let us understand how classification works. The data classification process includes two steps:

a) Building the Classifier or Model

b) Using Classifier for Classification


a) Building the Classifier or Model

• This step is the learning step or the learning phase.

• In this step, the classification algorithm builds the classifier.

• The classifier is built from the training set made up of database tuples and their associated class labels.

• Each tuple in the training set is assumed to belong to a predefined class, as determined by its class label attribute. These tuples are also referred to as samples, examples, objects, or data points.



b) Using Classifier for Classification

In this step, the classifier is used for classification. Here, the test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data tuples.
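A hedged sketch of both steps for the bank-loan scenario mentioned above, again assuming scikit-learn; the applicant features (income, credit score), the "safe"/"risky" labels, and the split ratio are invented purely for illustration.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 580], [60, 710], [45, 640], [80, 760], [30, 600],
     [55, 690], [90, 800], [20, 550], [70, 730], [40, 620]]   # [income in k$, credit score]
y = ["risky", "safe", "risky", "safe", "risky",
     "safe", "safe", "risky", "safe", "risky"]                # loan decision label

# Hold out part of the labeled data as an independent test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# a) Building the classifier from the training tuples.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# b) Using the classifier: estimate accuracy on the test tuples first.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"estimated accuracy = {accuracy:.2f}")

# If the accuracy is acceptable, classify new loan applications with unknown labels.
print(model.predict([[65, 720]]))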


