Data Quality Assessment for Machine Learning

2nd International Workshop on

Data Quality Assessment
for Machine Learning

@ Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD)

14-18 August, 2021, Singapore

Important Notice: The safety and well-being of all workshop participants is our priority. Due to the COVID-19 situation, the workshop will be held as a virtual event in conjunction with SIGKDD 2021. More details to follow.

In the past decade, AI/ML technologies have become pervasive in academia and industry, finding utility in new and challenging applications. While there has been a focus on building better, smarter and automated ML models, little work has been done to systematically understand the challenges in the data and assess its quality issues before it is fed to an ML pipeline. Issues such as incorrect labels, synonymous categories in a categorical variable, and heterogeneity in columns, which might go undetected by the standard pre-processing modules in ML frameworks, can lead to sub-optimal model performance. Although some systems are able to generate comprehensive reports with details of the ML pipeline, a lack of insight and explainability with respect to data quality issues leads to data scientists spending ~80% of their time on data preparation before employing AutoML solutions. This is why data preparation has been called out as one of the most time-consuming steps in the AI lifecycle. Since the quality of data is not known at Step 0, when the data is acquired, data preparation becomes an iterative debugging process and more of an art that relies on the experience of the data scientist. Because the performance of an ML model is only as good as the training data it sees, a systematic analysis of data quality before building AI/ML models is of utmost importance.

Important Dates

Paper Submission : May 20th, 2021
Author Notification : June 10th, 2021
Camera-Ready Submission : To Be Announced
All deadlines are at 23:59 Anywhere on Earth (AoE).

Call for Papers

Workshop Scope

The goal of this workshop is to bring together researchers working in data acquisition, data labeling, data quality, data preparation and AutoML to understand how detecting and remediating data issues can help build better models. With a focus on different modalities such as structured data, time series data, text data and graph data, this workshop invites researchers from academia and industry to submit novel propositions for systematically identifying and mitigating data issues to make data AI-ready.

Topics of Interest

Methods of data assessment can change depending on the modality of the data. This workshop invites submissions on data quality assessment for different modalities: structured (or tabular) data, unstructured data (such as text, logs and images), graph-structured (relational, network) data, time series data, spatio-temporal data, etc. We would like to explore state-of-the-art deep learning and AI concepts such as deep reinforcement learning, graph neural networks, self-supervised learning, capsule networks and adversarial learning to address the problems of data quality assessment for ML. The following is a (non-exhaustive) list of topics that are of interest to this workshop:

  • Algorithms for assessment of data quality issues relevant to ML
  • Automatic remediation of data quality issues
  • Human-assisted data cleaning and remediation
  • Automated data cleaning workflows
  • Explainability and interpretability of quality assessment
  • Interactive debugging of data
  • Smarter data visualizations for high dimensional data
  • Evaluation techniques for data quality assessment
  • Real world use cases and applications of data quality assessment
  • Novel interfaces to assist human-in-the-loop intervention for interactive data cleaning
  • Quality-aware representations and sampling of high dimensional data
  • Representative sampling for high dimensional data
  • Detection of bias and privacy breach
  • Label noise detection, explanation and incorporating feedback
  • Noise and low-quality data robustness studies
  • Handling corrupted, missing and uncertain data
  • Outlier (or anomaly) detection and mitigation in data
  • Addressing class imbalance in data
  • Benchmarking of data preparation and cleaning systems and tools: data sets and frameworks

Submission Instructions

We solicit submission of papers of 4 to 10 pages representing reports of original research, preliminary research results, case studies, proposals for new work and position papers.

All papers will be peer reviewed in a single-blind process (i.e. author names and affiliations should be listed). If a paper is accepted, at least one of the authors must attend the workshop to present the work. Submitted papers must be written in English and formatted in double-column style according to the ACM Proceedings Template, Tighter Alternate style. Papers should be in PDF format and submitted via the EasyChair submission site. The workshop website will archive the published papers.

Submitted papers must not have been previously published anywhere and must not be under consideration by any other conference or journal during the workshop review process.

Submissions should be made via the EasyChair system through the submission page available here: https://easychair.org/my/conference?conf=datareadinesskdd2021#

Keynotes

To Be Announced

Organizing Committee

Program Committee

To Be Announced

Previous Workshop

1st International Workshop on Data Assessment and Readiness for AI @ Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2021

Contact Information

For any queries, reach out to us at data.readiness.kdd2021@gmail.com.