Why is data preprocessing important?

Virtually every data analytics, data science, or AI development effort requires some form of data preprocessing to deliver reliable, precise, and robust results for enterprise applications. Good preprocessing also helps align the way data is fed into the algorithms used to build machine learning or deep learning models.

Real-world data is messy and is often created, processed, and stored by a variety of humans, business processes, and applications. While it may be suitable for the purpose at hand, a data set may be missing individual fields, contain manual input errors, or have duplicate data or different names to describe the same thing. Although humans can often identify and rectify these problems in the line of business, this data needs to be automatically preprocessed when it is used to train machine learning or deep learning algorithms.
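
For illustration, here is a minimal Python sketch of that kind of cleanup using pandas. The file name, column names, and label mappings are hypothetical assumptions, not part of the original article; the point is simply that duplicates, inconsistent names, and missing fields can be handled programmatically before training.

import pandas as pd

# Hypothetical raw extract with missing fields, duplicate rows, and
# inconsistent labels for the same value ("New York" vs "NY").
df = pd.read_csv("customers.csv")

# Drop exact duplicate rows produced by repeated loads.
df = df.drop_duplicates()

# Normalize free-text labels so one entity has one name.
df["state"] = df["state"].str.strip().str.upper().replace({"NEW YORK": "NY"})

# Fill missing numeric fields with a simple column median;
# a production pipeline might use a model-based imputer instead.
df["income"] = df["income"].fillna(df["income"].median())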

Machine learning and deep learning algorithms work best when data is presented in a format that highlights the aspects relevant to solving a problem. Feature engineering practices such as data wrangling, data transformation, data reduction, and feature scaling help restructure raw data into a form better suited to a particular type of algorithm. This can significantly reduce the processing power and time required to train a new machine learning or AI model, or to run inference against it.
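
As a small example of feature scaling, the sketch below standardizes two numeric features that sit on very different scales. The choice of scikit-learn and the toy values are assumptions made for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two numeric features on very different scales (e.g., age and income).
X = np.array([[25, 40000], [37, 85000], [52, 120000]], dtype=float)

# Standardize each feature to zero mean and unit variance so that
# scale-sensitive algorithms (gradient descent, k-NN, SVMs) treat
# both features comparably.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)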

One caution to observe in preprocessing is the possibility of re-encoding bias into the data set. This is critical for applications that help make decisions affecting people, such as loan approvals. Although data scientists may deliberately exclude variables such as gender, race, or religion, those traits can still be correlated with other variables such as zip codes or schools attended.
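
One way to surface such proxy effects is to check how strongly a remaining variable predicts a dropped sensitive attribute. The sketch below, with entirely made-up data, shows how a simple cross-tabulation can reveal that a retained column effectively encodes an excluded one.

import pandas as pd

# Hypothetical data: "zip_code" remains in the training set even though
# the sensitive attribute "gender" was dropped before modeling.
df = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "10001", "94105"],
    "gender":   ["F", "F", "M", "M", "F", "M"],
})

# The cross-tabulation shows zip_code almost perfectly predicting the
# excluded attribute, i.e., acting as a proxy for it.
print(pd.crosstab(df["zip_code"], df["gender"], normalize="index"))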

Most modern data science packages and services now include various preprocessing libraries that help to automate many of these tasks.
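
For instance, scikit-learn can chain imputation, encoding, and scaling into a single reusable pipeline. The column names and split into numeric and categorical features below are assumptions for illustration only.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical column names.
numeric_features = ["age", "income"]
categorical_features = ["state"]

preprocess = ColumnTransformer([
    # Impute missing numbers, then scale them.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Fill missing categories and one-hot encode them.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# preprocess.fit_transform(df) would return a model-ready feature matrix.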
