What are the stages of Knowledge Discovery in Databases (KDD)?

Knowledge Discovery in Databases (KDD)

• Dating back to 1989, the term Knowledge Discovery in Databases (KDD) refers to the overall process of collecting data and methodically refining it.

• KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

• Data Mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

Stages of Knowledge Discovery in Databases (KDD)



IN LONG FORM


1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.

• Cleaning in case of Missing values.

• Cleaning noisy data, where noise is a random error or variance in a measured variable.

• Cleaning with Data discrepancy detection and Data transformation tools.

In essence, data cleaning filters out noise and redundancy to remove irrelevant content from the records; it also exposes inconsistencies that are right in front of you but would otherwise go unrecognized.
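A minimal sketch of this cleaning step in Python with pandas, assuming a hypothetical sales.csv file with amount and region columns (all names here are illustrative, not from the original text):

```python
import pandas as pd

# Hypothetical input file and column names, used only for illustration.
df = pd.read_csv("sales.csv")

# Missing values: fill numeric gaps with the column mean,
# and drop rows whose key attribute is absent.
df["amount"] = df["amount"].fillna(df["amount"].mean())
df = df.dropna(subset=["region"])

# Noisy data: clip extreme outliers to the 1st/99th percentiles.
low, high = df["amount"].quantile([0.01, 0.99])
df["amount"] = df["amount"].clip(low, high)

# Redundancy: remove exact duplicate records.
df = df.drop_duplicates()
```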


2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common store (a data warehouse).

• Data integration using Data Migration tools.

• Data integration using Data Synchronization tools.

• Data integration using the ETL (Extract, Transform, Load) process.

Putting heterogeneous information from multiple sources in one place or warehouse is what integration means. This step carries knowledge discovery further: you conflate everything captured from primary and secondary sources.
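A minimal integration sketch with pandas, assuming two hypothetical sources: a CSV export branch_a_sales.csv and a SQLite database branch_b.db containing a sales table:

```python
import sqlite3

import pandas as pd

# Hypothetical heterogeneous sources: a CSV export and a SQLite table.
csv_part = pd.read_csv("branch_a_sales.csv")
with sqlite3.connect("branch_b.db") as conn:
    db_part = pd.read_sql_query(
        "SELECT customer_id, amount, region FROM sales", conn
    )

# Harmonize the schemas before combining (a small ETL-style transform).
csv_part = csv_part.rename(columns={"cust": "customer_id"})

# Conflate both sources into one warehouse-style table.
warehouse = pd.concat([csv_part, db_part], ignore_index=True)
warehouse.to_csv("warehouse_sales.csv", index=False)
```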


3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data collection.

• Data selection using Neural networks.

• Data selection using Decision Trees.

• Data selection using Naive Bayes.

• Data selection using Clustering, Regression, etc.
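A minimal selection sketch, assuming the hypothetical warehouse_sales.csv produced in the integration step, with customer_id, region, and amount columns:

```python
import pandas as pd

# Hypothetical integrated table produced by the previous step.
warehouse = pd.read_csv("warehouse_sales.csv")

# Keep only the attributes relevant to the analysis task.
relevant = warehouse[["customer_id", "region", "amount"]]

# Keep only the records of interest, e.g. one region and positive amounts.
selected = relevant[(relevant["region"] == "EU") & (relevant["amount"] > 0)]
selected.to_csv("selected_sales.csv", index=False)
```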


4. Data Transformation: Data Transformation is defined as the process of transforming data into the appropriate form required by the mining procedure.

Data Transformation is a two-step process:

• Data mapping: assigning elements from the source base to the destination to capture the required transformations.

• Code generation: Creation of the actual transformation program.
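A minimal transformation sketch covering mapping, normalization, and summarization, again using the hypothetical column names from the earlier sketches:

```python
import pandas as pd

# Hypothetical output of the selection step.
selected = pd.read_csv("selected_sales.csv")

# Data mapping: encode a source field into the form the miner expects.
selected["region_code"] = selected["region"].astype("category").cat.codes

# Normalization: min-max scale the numeric attribute into [0, 1].
amin, amax = selected["amount"].min(), selected["amount"].max()
selected["amount_scaled"] = (selected["amount"] - amin) / (amax - amin)

# Summarization/aggregation: reduce the data to one row per customer.
per_customer = selected.groupby("customer_id", as_index=False).agg(
    total_amount=("amount", "sum"),
    purchases=("amount", "count"),
)
print(per_customer.head())
```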


5. Data Mining: Data mining is defined as the application of intelligent techniques to extract potentially useful patterns.

• Transforms task-relevant data into patterns.

• Decides the purpose of the model, e.g., classification or characterization.
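A minimal mining sketch using scikit-learn's built-in Iris dataset, with a decision tree classifier standing in for the "intelligent method"; any other classifier or clustering algorithm could be substituted:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A built-in dataset keeps the sketch self-contained.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Apply an intelligent method (here a decision tree) to learn patterns.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```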


6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures.

• Finds an interestingness score for each pattern.

• Uses summarization and visualization to make the data understandable to the user.
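A minimal sketch of scoring a candidate association rule with two common interestingness measures, support and confidence (the toy transactions below are hypothetical):

```python
# Toy transaction data (hypothetical) for a candidate rule {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the consequent occurs in transactions containing the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # ~0.67
```

Patterns whose scores fall below a user-given threshold would be discarded at this stage.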


7. Knowledge representation: Knowledge representation is defined as a technique that utilizes visualization tools to represent data mining results.

• Generate reports.

• Generate tables.

• Generate discriminant rules, classification rules, characterization rules, etc.
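A minimal representation sketch that renders the classification rules learned by a decision tree as a readable report, using scikit-learn's export_text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small model, then render its rules as a human-readable report.
data = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(data.data, data.target)

report = export_text(model, feature_names=list(data.feature_names))
print(report)                # human-readable classification rules
with open("rules_report.txt", "w") as f:
    f.write(report)          # persist the report for the end user
```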

            OR

Data Cleaning: Data cleaning is a process of removing unnecessary and inconsistent data from the databases. The main purpose of cleaning is to improve the quality of the data by filling in missing values and configuring the data so that it is in a consistent format.

Data Integration: In this step data from various sources such as databases, data warehouses, and transactional data are combined.

Data Selection: Data that is required for the data mining process can be extracted from multiple and heterogeneous data sources such as databases, files, etc. Data selection is a process where the appropriate data required for analysis is fetched from the databases.

Data Transformation: In the transformation stage, data extracted from multiple data sources is converted into an appropriate format for the data mining process. Data reduction or summarization is used to decrease the number of possible values of the data without affecting its integrity.

Data Mining: It is the most essential step of the KDD process where intelligent methods are applied in order to extract hidden patterns from data stored in databases.

Pattern Evaluation: This step identifies the truly interesting patterns representing knowledge on the basis of some interestingness measures. Support and confidence are two widely used interestingness measures. These patterns are helpful for decision support systems.

Knowledge Presentation: In this step, visualization and knowledge representation techniques are used to present mined knowledge to users. Visualizations can be in form of graphs, charts, or tables.

IN SHORT FORM

• Data Cleaning - In this step, noise and inconsistent data are removed.

• Data Integration - In this step, multiple data sources are combined.

• Data Selection - In this step, data relevant to the analysis task are retrieved from the database.

• Data Transformation - In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

• Data Mining - In this step, intelligent methods are applied in order to extract data patterns.

• Pattern Evaluation - In this step, data patterns are evaluated; it identifies the truly interesting patterns representing knowledge based on interestingness measures.

• Knowledge Presentation - In this step, knowledge is represented, where visualization and knowledge representation techniques are used to present mined knowledge to users.

Steps 1 through 4 are different forms of data preprocessing, in which data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
