Why is data preprocessing mandatory? Justify.
Incomplete, noisy, and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data, or may not have been included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history of modifications to the data may have been overlooked. Missing data, particularly tuples with missing values for some attributes, can degrade the mining results. Therefore, to improve the quality of the data and, consequently, of the mining results, data preprocessing is mandatory.
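As a minimal sketch of how such gaps are repaired in practice, the pandas snippet below fills missing values in a small, hypothetical sales table; the column names and the mean/mode fill strategies are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with the kinds of gaps described above.
sales = pd.DataFrame({
    "customer_id": [101, 102, None, 104],
    "amount":      [250.0, np.nan, 99.5, 120.0],
    "region":      ["east", "east", None, "west"],
})

# Fill a missing numeric value with the column mean, and a missing
# categorical value with the most frequent category.
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())
sales["region"] = sales["region"].fillna(sales["region"].mode()[0])

# Tuples still missing a key attribute may simply be dropped.
sales = sales.dropna(subset=["customer_id"])
print(sales)
```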
OR,
Virtually any type of data analytics, data science, or AI development requires some type of data preprocessing to provide reliable, precise, and robust results for enterprise applications. Good preprocessing can help align the way data is fed into various algorithms for building machine learning or deep learning models.
Real-world data is messy and is often created, processed, and stored by a variety of humans, business processes, and applications. While it may be suitable for the purpose at hand, a data set may be missing individual fields, contain manual input errors, or have duplicate data or different names to describe the same thing. Although humans can often identify and rectify these problems in the line of business, this data needs to be automatically preprocessed when it is used to train machine learning or deep learning algorithms.
Machine learning and deep learning algorithms work best when data is presented in a particular format that highlights the relevant aspects required to solve a problem. Feature engineering practices that involve data wrangling, data transformation, data reduction, and feature scaling help restructure raw data into a form better suited for a particular type of algorithm. This can significantly reduce the processing power and time required to train a new machine learning or AI algorithm, or to run an inference against it.
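For instance, feature scaling restructures raw numeric columns so that no single feature dominates by sheer magnitude. The sketch below, using scikit-learn (an assumed choice of library) on made-up age/income data, shows two common scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: age in years, income in dollars.
X = np.array([[25, 40_000],
              [32, 85_000],
              [47, 120_000],
              [51, 62_000]], dtype=float)

# Standardization: rescale each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: map each feature into the [0, 1] interval.
X_minmax = MinMaxScaler().fit_transform(X)
```

Which scaler suits a problem depends on the algorithm: distance-based and gradient-trained models typically benefit from scaling, while tree-based models are largely insensitive to it.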
One caution that should be observed in preprocessing is the possibility of re-encoding bias into the data set. This is critical for applications that help make decisions that affect people, such as loan approvals. Although data scientists may deliberately ignore variables like gender, race, or religion, these traits may be correlated with other variables like zip codes or schools attended.
Most modern data science packages and services now include various preprocessing libraries that help to automate many of these tasks.
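As one concrete example of such libraries, scikit-learn lets imputation, scaling, and encoding be chained into a single reusable object. This is a sketch under assumed, placeholder column names, not a prescribed recipe.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column groups; adapt to the actual data set.
numeric_cols = ["age", "income"]
categorical_cols = ["region"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical columns: fill gaps with the mode, then one-hot encode.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])
# preprocess.fit_transform(df) then applies every step in one call.
```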
OR,
Data preprocessing is required because real-world data are generally (as the short sketch after this list illustrates):
Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data
Noisy: Containing errors or outliers
Inconsistent: Containing discrepancies in codes or names
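The pandas sketch below spots all three problems on invented records; the column values and the 0-120 plausible age range are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Invented raw records exhibiting all three problems at once.
df = pd.DataFrame({
    "age":     [23, 35, np.nan, 230],        # incomplete (NaN) and noisy (230)
    "country": ["US", "USA", "U.S.", "UK"],  # inconsistent codes for one country
})

# Incomplete: count missing values per attribute.
print(df.isna().sum())

# Noisy: flag values outside a plausible domain range for the attribute.
print(df[(df["age"] < 0) | (df["age"] > 120)])

# Inconsistent: map variant codes onto one canonical value.
df["country"] = df["country"].replace({"USA": "US", "U.S.": "US"})
```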