Why You Need Data Preprocessing

Incomplete, noisy, and inconsistent data are commonplace in large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest, such as customer information for sales transactions, may not always be available. Other data may have been left out because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the history of modifications to the data may not have been recorded. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred. Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is needed.

OR,

By now, you’ve surely realized why data preprocessing is so important. Since mistakes, redundancies, missing values, and inconsistencies all compromise the integrity of the data set, you need to fix all those issues for a more accurate outcome. Imagine you are training a Machine Learning algorithm on a faulty dataset to deal with your customers’ purchases. Chances are that the system will develop biases and deviations that will produce a poor user experience.

Thus, before using that data for the purpose you want, you need it to be as organized and “clean” as possible. There are several ways to do so, depending on what kind of problem you’re tackling. Ideally, you’d combine several preprocessing techniques to get a better data set.
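
Before cleaning anything, it can help to measure how widespread the problems are. The snippet below is a minimal sketch of such a check, not a complete preprocessing pipeline. It assumes the data are held as a list of Python dictionaries; the sample `records`, the field names, and the `audit` helper are all hypothetical and only serve the illustration. It counts missing values per field, exact duplicate rows, and text values that differ only by letter case.

```python
from collections import Counter

# Hypothetical sales records; None marks a value that was never recorded.
records = [
    {"customer": "Ann", "city": "Pune", "amount": 120.0},
    {"customer": None, "city": "pune", "amount": 80.0},
    {"customer": "Raj", "city": "Delhi", "amount": None},
    {"customer": "Ann", "city": "Pune", "amount": 120.0},  # repeats the first row
]


def audit(rows):
    """Return (missing counts per field, number of duplicate rows,
    text values that differ only by letter case)."""
    # Incomplete data: count None values for each field.
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1

    # Redundant data: a row counts as a duplicate if an identical row came earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda pair: pair[0]))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Inconsistent data: group strings by lowercase form to find spelling variants.
    variants = {}
    for row in rows:
        for field, value in row.items():
            if isinstance(value, str):
                variants.setdefault((field, value.lower()), set()).add(value)
    inconsistent = {k: sorted(v) for k, v in variants.items() if len(v) > 1}

    return dict(missing), duplicates, inconsistent


missing, duplicates, inconsistent = audit(records)
print("Missing values per field:", missing)
print("Duplicate rows:", duplicates)
print("Inconsistent spellings:", inconsistent)
```

A report like this only detects problems. Deciding whether to drop, correct, or infer the affected values still depends on the data and on what you want to use it for.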
