Why is data preprocessing mandatory? Justify.

Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data, and other data may simply not have been considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history of modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred before mining can produce reliable results. Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is mandatory.
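As a minimal sketch of how such a missing value might be inferred, the example below fills a gap with the column mean using pandas; the column names and values are invented purely for illustration:

import pandas as pd

# Hypothetical sales records with a missing income value
data = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "income": [52000.0, None, 61000.0],
})

# One common strategy: infer the missing value from the column mean
data["income"] = data["income"].fillna(data["income"].mean())
print(data)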

                                                  OR, 

Virtually any type of data analytics, data science, or AI development requires some type of data preprocessing to provide reliable, precise, and robust results for enterprise applications. Good preprocessing can help align the way data is fed into various algorithms for building machine learning or deep learning models.

Real-world data is messy and is often created, processed, and stored by a variety of humans, business processes, and applications. While it may be suitable for the purpose at hand, a data set may be missing individual fields, contain manual input errors, or have duplicate data or different names to describe the same thing. Although humans can often identify and rectify these problems in the line of business, this data needs to be automatically preprocessed when it is used to train machine learning or deep learning algorithms.
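For instance, the small pandas sketch below (with invented product names and prices) automates two of the fixes mentioned above: mapping different names for the same thing to one label and dropping duplicate rows.

import pandas as pd

records = pd.DataFrame({
    "product": ["TV", "tv", "Television", "Radio"],
    "price": [499, 499, 499, 59],
})

# Map different names for the same thing to a single canonical label
records["product"] = records["product"].str.lower().replace({"tv": "television"})

# Drop rows that are now exact duplicates
records = records.drop_duplicates()
print(records)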

Machine learning and deep learning algorithms work best when data is presented in a particular format that highlights the relevant aspects required to solve a problem. Feature engineering practices that involve data wrangling, data transformation, data reduction, and feature scaling help restructure raw data into a form better suited for a particular type of algorithm. This can significantly reduce the processing power and time required to train a new machine learning or AI algorithm, or to run inference against it.
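As one illustrative sketch of the feature-scaling step mentioned above, the example below uses scikit-learn's StandardScaler to rescale invented numeric features (age in years, income in dollars) so that each column has zero mean and unit variance:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented raw features on very different scales
X = np.array([[25, 40000.0],
              [35, 85000.0],
              [52, 120000.0]])

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)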

One caution that should be observed in preprocessing is the possibility of re-encoding bias into the data set. This is critical for applications that help make decisions that affect people, such as loan approvals. Although data scientists may deliberately ignore variables such as gender, race, or religion, these traits may be correlated with other variables such as ZIP codes or schools attended.

Most modern data science packages and services now include various preprocessing libraries that help to automate many of these tasks.

                          OR,

Data Preprocessing is required because:

Real-world data are generally:

Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data

Noisy: Containing errors or outliers

Inconsistent: Containing discrepancies in codes or names
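Each of these three problems can be detected automatically. The minimal pandas sketch below (the table and its values are invented) counts missing entries, flags noisy values with a simple interquartile-range rule, and exposes inconsistent codes for the same country:

import pandas as pd

# Invented table showing all three problems at once
df = pd.DataFrame({
    "age":     [23, 25, 27, 22, None, 480],           # missing value and an outlier
    "country": ["US", "USA", "US", "US", "US", "US"],  # inconsistent codes
})

print(df["age"].isna().sum())              # count incomplete entries

# Flag noisy values far outside the interquartile range
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])

print(df["country"].unique())              # reveals the "US" vs "USA" inconsistency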
