What is an outlier? Why is outlier detection important? Explain with a suitable example.

  An outlier is a data object that deviates significantly from the rest of the objects as if it were generated by a different mechanism. For ease of presentation within this chapter, we may refer to data objects that are not outliers as "normal" or expected data. Similarly, we may refer to outliers as "abnormal" data.


Example 12.1 Outliers.

In Figure 12.1, most objects follow a roughly Gaussian distribution. However, the objects in region R are significantly different. It is unlikely that they follow the same distribution as the other objects in the data set. Thus, the objects in R are outliers in the data set.

Outlier detection is important because outliers are suspected of not being generated by the same mechanisms as the rest of the data, which is what makes them interesting. Therefore, in outlier detection, it is important to justify why the detected outliers are generated by some other mechanism. This is often achieved by making various assumptions about the rest of the data and showing that the detected outliers violate those assumptions significantly.
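As a minimal sketch of this idea, assume the normal data is roughly Gaussian, as in Example 12.1, and treat any object lying more than three standard deviations from the mean as violating that assumption. The function name `zscore_outliers`, the threshold of 3, and the sample values are illustrative assumptions, not part of the original example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds
    `threshold` standard deviations (assumes roughly Gaussian data)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical measurements: 19 values near 10 and one far away.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
        10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]

print(zscore_outliers(data))  # [25.0]
```

Note that the outlier itself inflates the mean and standard deviation here, so on small or heavily contaminated data sets a more robust estimate (for example, one based on the median) may be a better choice.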

Outlier detection is also related to novelty detection in evolving data sets. For example, by monitoring a social media website where new content is incoming, novelty detection may identify new topics and trends in a timely manner. Novel topics may initially appear as outliers. To this extent, outlier detection and novelty detection share some similarities in modeling and detection methods. However, a critical difference between the two is that in novelty detection, once new topics are confirmed, they are usually incorporated into the model of normal behavior so that follow-up instances are not treated as outliers anymore.
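The difference described above can be sketched with a toy model in which "normal behavior" is simply a set of known topics. The class `TopicNoveltyDetector` and the topic names are hypothetical and only illustrate the idea that a confirmed novelty is folded back into the model; a real system would use a much richer model of normal behavior.

```python
class TopicNoveltyDetector:
    """Toy novelty detector: a topic is novel if it is not yet known."""

    def __init__(self, known_topics):
        self.known_topics = set(known_topics)

    def is_novel(self, topic):
        # A topic outside the model of normal behavior looks like an outlier.
        return topic not in self.known_topics

    def confirm(self, topic):
        # Once confirmed, the topic becomes part of normal behavior.
        self.known_topics.add(topic)

detector = TopicNoveltyDetector({"sports", "weather"})
first_seen = detector.is_novel("eclipse")  # flagged as novel
detector.confirm("eclipse")
seen_again = detector.is_novel("eclipse")  # no longer flagged
print(first_seen, seen_again)  # True False
```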

Some important applications of outlier detection are as follows:

  • Fraud detection: Fraudulent applications for credit cards or state benefits are detected, as is fraudulent usage of credit cards or mobile phones.

  • Intrusion detection: Unauthorized access to computer networks is detected.

  • Network performance: The performance of computer networks is monitored to detect network bottlenecks.

  • Fault diagnosis: Faults are detected by monitoring the process.

  • Structural defect detection: Manufacturing lines are monitored to detect faulty production runs.

  • Mislabeled data detection: Mislabeled data in a training data set is detected.
