What are the key steps involved in the knowledge discovery process in database?

 Knowledge Discovery in Database (KDD)

Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data.



The steps involved in the knowledge discovery process:

Data Cleaning: data cleaning is a process of removing unnecessary and inconsistent data from the databases. The main purpose of cleaning is to improve the quality of the data by filling in the missing values, configuring the data to make sure that it is an inconsistent format.

Data Integration: In this step data from various sources such as databases, data warehouses, and transactional data are combined.

Data Selection: Data that is required for the data mining process can be extracted from multiple and heterogeneous data sources such as databases, files, etc. Data selection is a process where the appropriate data required for analysis is fetched from the databases.

Data Transformation:: In the transformation stage data extracted from multiple data sources are converted into an appropriate format for the data mining process. Data reduction or summarization is used to decrease the number of possible values of data without affecting the integrity of data.

Data Mining: It is the most essential step of the KDD process where intelligent methods are applied in order to extract hidden patterns from data stored in databases.

Pattern Evaluation: This step identifies the truly interesting patterns representing knowledge on the basis of some interestingness measures. Support and confidence are two widely used interesting measures. These patterns are helpful for decision support systems.

Knowledge Presentation: In this step, visualization and knowledge representation techniques are used to present mined knowledge to users. Visualizations can be in form of graphs, charts, or tables.

                               OR,

KDD includes multidisciplinary activities. This encompasses data storage and access, scaling algorithms to massive data sets, and interpreting results. The data cleansing and data access process included in data warehousing facilitate the KDD process. Artificial intelligence also supports KDD by discovering empirical laws from experimentation and observations. The patterns recognized in the data must be valid on new data and possess some degree of certainty. These patterns are considered new knowledge. Steps involved in the entire KDD process are:

.•  Data Cleaning - In this step, the noise and inconsistent data are removed.

• Data Integration - In this step, multiple data sources are combined.

• Data Selection - In this step, data relevant to the analysis task are retrieved from the database.

• Data Transformation - In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

• Data Mining - In this step, intelligent methods are applied in order to extract data patterns.

• Pattern Evaluation - In this step, data patterns are evaluated, or It identifies the truly interesting representing knowledge based on interesting measures. 

• Knowledge Presentation - In this step, knowledge is represented or, where visualization and knowledge representation techniques are used to present mined knowledge to users.

Steps 1 through 4 is different forms of data preprocessing, where data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.

Comments

Popular posts from this blog

Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. a) Draw a snowflake schema diagram for the data warehouse. b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each BigUniversity student. c) If each dimension has five levels (including all), such as “student < major < status < university < all”, how many cuboids will this cube contain (including the base and apex cuboids)?

Suppose that a data warehouse consists of the four dimensions; date, spectator, location, and game, and the two measures, count and charge, where charge is the fee that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate. a) Draw a star schema diagram for the data b) Starting with the base cuboid [date; spectator; location; game], what specific OLAP operations should perform in order to list the total charge paid by student spectators at GM Place in 2004?

Discuss classification or taxonomy of virtualization at different levels.