<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd">
  <mods>
    <titleInfo>
      <title>Discovery of complex glitch patterns : a novel approach to quantitative data cleaning</title>
    </titleInfo>
    <name type="personal">
      <namePart type="family">Berti-Equille</namePart>
      <namePart type="given">Laure</namePart>
      <role>
        <roleTerm type="text">author</roleTerm>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation>IRD</affiliation>
    </name>
    <name type="personal">
      <namePart type="family">Dasu</namePart>
      <namePart type="given">T.</namePart>
      <role>
        <roleTerm type="text">author</roleTerm>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation>IRD</affiliation>
    </name>
    <name type="personal">
      <namePart type="family">Srivastava</namePart>
      <namePart type="given">D.</namePart>
      <role>
        <roleTerm type="text">author</roleTerm>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
      <affiliation>IRD</affiliation>
    </name>
    <typeOfResource>text</typeOfResource>
    <genre authority="local">bookSection</genre>
    <language>
      <languageTerm type="code" authority="iso639-2b">eng</languageTerm>
    </language>
    <physicalDescription>
      <internetMediaType>application/pdf</internetMediaType>
      <digitalOrigin>born digital</digitalOrigin>
      <reformattingQuality>access</reformattingQuality>
    </physicalDescription>
    <abstract>Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.</abstract>
    <targetAudience authority="marctarget">specialized</targetAudience>
    <subject authority="local">
      <topic>COMPUTER NETWORK</topic>
      <topic>DATA PROCESSING</topic>
      <topic>ERROR</topic>
      <topic>ANALYSIS METHOD</topic>
      <topic>STATISTICAL ANALYSIS</topic>
    </subject>
    <classification authority="local">122APPLIC</classification>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 27th international conference on data engineering</title>
      </titleInfo>
      <name type="personal">
        <namePart type="family">Abiteboul</namePart>
        <namePart type="given">S.</namePart>
        <role>
          <roleTerm type="text">ed.</roleTerm>
          <roleTerm type="code" authority="marcrelator">edt</roleTerm>
        </role>
        <affiliation>IRD</affiliation>
      </name>
      <name type="personal">
        <namePart type="family">Böhm</namePart>
        <namePart type="given">K.</namePart>
        <role>
          <roleTerm type="text">ed.</roleTerm>
          <roleTerm type="code" authority="marcrelator">edt</roleTerm>
        </role>
        <affiliation>IRD</affiliation>
      </name>
      <name type="personal">
        <namePart type="family">Koch</namePart>
        <namePart type="given">C.</namePart>
        <role>
          <roleTerm type="text">ed.</roleTerm>
          <roleTerm type="code" authority="marcrelator">edt</roleTerm>
        </role>
        <affiliation>IRD</affiliation>
      </name>
      <name type="personal">
        <namePart type="family">Tan</namePart>
        <namePart type="given">Kian Lee</namePart>
        <role>
          <roleTerm type="text">ed.</roleTerm>
          <roleTerm type="code" authority="marcrelator">edt</roleTerm>
        </role>
        <affiliation>IRD</affiliation>
      </name>
      <part>
        <extent unit="pages">
          <list>733-744</list>
        </extent>
      </part>
      <originInfo>
        <dateIssued key="date">2011</dateIssued>
      </originInfo>
      <name type="conference">
        <namePart>ICDE. International Conference on Data Engineering, 27., Hannover (DEU), 2011/04/11-16</namePart>
      </name>
    </relatedItem>
    <relatedItem type="series">
      <titleInfo>
        <title>IEEE Conference Publication</title>
      </titleInfo>
    </relatedItem>
    <identifier type="uri">https://www.documentation.ird.fr/hor/fdi:010055317</identifier>
    <identifier type="doi">10.1109/ICDE.2011.5767864</identifier>
    <identifier type="isbn">978-1-4244-9194-0</identifier>
    <location>
      <shelfLocator>[F B010055317]</shelfLocator>
      <url usage="primary display" access="object in context">https://www.documentation.ird.fr/hor/fdi:010055317</url>
      <url access="raw object">https://www.documentation.ird.fr/intranet/publi/depot/2012-05-23/010055317.pdf</url>
    </location>
    <accessCondition type="restriction access" displayLabel="Restricted access">Restricted access (IRD intranet)</accessCondition>
    <recordInfo>
      <recordContentSource>IRD - Base Horizon / Pleins textes</recordContentSource>
      <recordCreationDate encoding="w3cdtf">2012-05-22</recordCreationDate>
      <recordChangeDate encoding="w3cdtf">2023-02-22</recordChangeDate>
      <recordIdentifier>fdi:010055317</recordIdentifier>
      <languageOfCataloging>
        <languageTerm authority="iso639-2b">fre</languageTerm>
      </languageOfCataloging>
    </recordInfo>
  </mods>
</modsCollection>
