Mon 20 - Fri 24 October 2014 Portland, Oregon, United States

Testing and static analysis can help root out bugs in programs, but not in data. This paper introduces data debugging, an approach that combines program analysis and statistical analysis to find potential data errors. Since it is impossible to know a priori whether data are erroneous or not, data debugging locates data that has an unusual impact on the computation. Such data is either very important, or wrong. Data debugging is especially useful in the context of data-intensive program- ming environments that intertwine data with programs in the form of queries or formulas. We present the first data debugging tool, CheckCell, an add-in for Microsoft Excel. CheckCell identifies cells that have an unusually high impact on the spreadsheet’s computations. We show that CheckCell is both analytically and empirically fast and effective. We show that it successfully finds injected typographical errors produced by a generative model trained with data entry from 100,000 Mechanical Turk tasks. CheckCell also automatically identifies a key flaw in the infamous Reinhart and Rogoff spreadsheet.