Paris Descartes University Seminar Series on Data Analytics
in collaboration with the diNo group

Invited Seminar Talk

Data Cleaning in the Big Data Era
Prof. Paolo Papotti, Arizona State University (USA)

when: 19 April 2016, 10am
where: room Turing Reunion, 7th floor, Paris Descartes University, 45 Rue Des Saints Peres, Paris 75006


In the "big data" era, data is often dirty for several reasons, such as typos, missing values, and duplicates. The intrinsic problem with dirty data is that it can lead to poor results in analytic tasks. Data cleaning is therefore an unavoidable step in data preparation, needed to obtain reliable data for final applications such as querying and mining. Unfortunately, data cleaning is hard in practice and requires a great amount of manual work. Several systems have been proposed to increase automation and scalability in the process. They rely on a formal, declarative approach based on first-order logic: users provide high-level specifications of their tasks, and the systems compute optimal solutions without human intervention on the generated code. However, traditional "top-down" cleaning approaches quickly become impractical when dealing with the complexity and variety found in big data. In this talk, we first describe recent results in tackling data cleaning with a declarative approach. We then discuss how this experience has pushed several groups to propose new systems that recognize the central role of users in cleaning big data.

Short Bio

Paolo Papotti is an Assistant Professor of Computer Science in the School of Computing, Informatics, and Decision Systems Engineering (CIDSE) at Arizona State University. He received his Ph.D. in Computer Science from Università degli Studi Roma Tre (2007, Italy), and before joining ASU he was a senior scientist at Qatar Computing Research Institute. His research focuses on systems that assist users in complex, necessary tasks and that scale to large datasets with efficient algorithms and distributed platforms. He has authored more than 50 publications, and his work has been recognized with two "Best of the Conference" citations (SIGMOD 2009, VLDB 2015) and with a best demo award at SIGMOD 2015. He has been a program committee member for several venues, such as SIGMOD, VLDB, ICDE, and WSDM, and is group leader for SIGMOD 2016. He is an associate editor for the ACM Journal of Data and Information Quality (JDIQ).

Hosted by: Themis Palpanas
