Publications by IRD scientists

Berti-Equille Laure, Loh J. M., Dasu T. (2015). A masking index for quantifying hidden glitches. Knowledge and Information Systems, 44 (2), p. 253-277. ISSN 0219-1377.

Document title
A masking index for quantifying hidden glitches
Year of publication
2015
Document type
Article indexed in the Web of Science WOS:000357678800001
Authors
Berti-Equille Laure, Loh J. M., Dasu T.
Source
Knowledge and Information Systems, 2015, 44 (2), p. 253-277 ISSN 0219-1377
Abstract
Data glitches are errors in a dataset. They are complex entities that often span multiple attributes and records. When they co-occur in data, the presence of one type of glitch can hinder the detection of another type of glitch. This phenomenon is called masking. In this paper, we define two important types of masking and propose a novel, statistically rigorous indicator called masking index for quantifying the hidden glitches. We outline four cases of masking: outliers masked by missing values, outliers masked by duplicates, duplicates masked by missing values, and duplicates masked by outliers. The masking index is critical for data quality profiling and data exploration. It enables a user to measure the extent of masking and hence the confidence in the data. In this sense, it is a valuable data quality index for choosing an anomaly detection method that is best suited for the glitches that are present in any given dataset. We demonstrate the utility and effectiveness of the masking index by intensive experiments on synthetic and real-world datasets.
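As a rough illustration of one of the four cases listed in the abstract (outliers masked by duplicates), the sketch below shows how repeated copies of an extreme record can hide it from a simple z-score detector. This is not the masking index defined in the paper. The detector, the threshold of 2.5, the record ids, the values, and the helper names `zscore_outliers` and `deduplicate` are all assumptions made for this toy example.

```python
from statistics import mean, pstdev


def zscore_outliers(records, threshold=2.5):
    """Return the ids of records whose value lies more than `threshold`
    population standard deviations away from the mean of all values."""
    values = [value for _, value in records]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {rid for rid, value in records if abs(value - mu) / sigma > threshold}


def deduplicate(records):
    """Keep only the first occurrence of each record id."""
    seen = set()
    unique = []
    for rid, value in records:
        if rid not in seen:
            seen.add(rid)
            unique.append((rid, value))
    return unique


# Hypothetical data: ten ordinary records and one extreme record (r11).
base = [
    ("r1", 10.0), ("r2", 11.0), ("r3", 9.0), ("r4", 10.0), ("r5", 12.0),
    ("r6", 10.0), ("r7", 11.0), ("r8", 9.0), ("r9", 10.0), ("r10", 11.0),
    ("r11", 50.0),
]

# The extreme record is accidentally loaded three more times.
raw = base + [("r11", 50.0)] * 3

# The duplicates inflate the mean and the spread, so r11 is not flagged.
flagged_raw = zscore_outliers(raw)

# After removing the duplicates, the same detector flags r11.
flagged_clean = zscore_outliers(deduplicate(raw))

print("flagged with duplicates present:", flagged_raw)
print("flagged after deduplication:", flagged_clean)
```

In this toy data the detector flags nothing while the duplicates are present and flags only `r11` once they are removed, which is the kind of hidden glitch the abstract describes. How the paper actually quantifies the effect is not stated in this record.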
Classification scheme
Basic sciences / Analysis and research techniques [020]; Computer science [122]
Location
IRD collection [F B010064844]
IRD identifier
fdi:010064844