. What is outlier? Why outlier detection is important? Explain with suitable example.
An outlier is a data object that deviates significantly from the rest of the objects as if it were generated by a different mechanism. For ease of presentation within this chapter, we may refer to data objects that are not outliers as "normal" or expected data. Similarly, we may refer to outliers as "abnormal" data.
Example 12.1 Outliers.
In Figure 12.1, most objects follow a roughly Gaussian distribution. However, the objects in region R are significantly different. It is unlikely that they follow the same distribution as the other objects in the data set. Thus, the objects in Rare outliers in the data set.
outlier detection is important as Outliers are interesting because they are suspected of not being generated by the same mechanisms as the rest of the data. Therefore, in outlier detection, it is important to justify why the outliers detected are generated by some other mechanisms. This is often achieved by making various assumptions on the rest of the data and showing that the outliers detected violate those assumptions significantly.
Outlier detection is also related to novelty detection in evolving data sets. For example, by monitoring a social media website where new content is incoming, novelty detection may identify new topics and trends in a timely manner. Novel topics may initially appear as outliers. To this extent, outlier detection and novelty detection share some similarities in modeling and detection methods. However, a critical difference between the two is that in novelty detection, once new topics are confirmed, they are usually incorporated into the model of normal behavior so that follow-up instances are not treated as outliers anymore.
Te important exhaustive lists of applications in the outlier prediction are as follows:
• Fraud detection: The fraudulent applications are detected for credit cards, state benefits or detecting fraudulent usage of credit cards or mobile phones
• Intrusion detection: The unauthorized access in the computer networks is detected
• Network performance: The performance of the computer networks is monitored to detect the network bottleneck
• Fault diagnosis: The faults in the data are detected by monitoring the process
• Structural defect detection: The manufacturing lines are monitored to detect fault production runs
• Detecting mislabeled data in a training dataset
Comments
Post a Comment