Using Web Corpus Statistics for Program Analysis (SPLASH 2014 - OOPSLA)

Mon 20 - Fri 24 October 2014 Portland, Oregon, United States

Who

Chun-Hung Hsiao, Michael Cafarella, Satish Narayanasamy

Track

SPLASH 2014 OOPSLA

When

Wed 22 Oct 2014 11:37 - 12:00 at Salon E - Program Analysis and the Web Chair(s): Stephen Chong

Abstract

Several program analysis tools—such as plagiarism detection and bug finding—rely on knowing a piece of code’s relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common a line of code is. Natural language processing researchers have encountered a similar problem and addressed it using an n-gram model of text frequency, derived from statistics computed over text corpora.

We propose an n-gram model for programming languages and compute it over a corpus of 2.8 million JavaScript programs that we downloaded from the Web. In contrast to previous techniques, we define the code n-gram of a statement as the subgraph of the program dependence graph that contains all nodes and edges reachable in n steps from that statement. We can count the n-grams in a program and the frequency of each n-gram in the corpus, which lets us compute tf-idf-style measures that capture the differing importance of different lines of code. We demonstrate the power of this approach by implementing a plagiarism detector whose accuracy beats previous techniques, and a bug-finding tool that discovered over a dozen previously unknown bugs in a collection of real deployed programs.
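The idea of a graph-based n-gram scored with a tf-idf-style weight can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `code_ngram` and `tfidf_scores`, the dictionary-based graph encoding, the label-pair key (which ignores edge multiplicity and is not a true subgraph-isomorphism check), and the smoothed idf formula are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter, deque


def code_ngram(graph, labels, start, n):
    """Hashable key for the subgraph reachable from `start` in at most n steps.

    graph:  dict mapping a statement id to the ids it has dependence edges to
    labels: dict mapping a statement id to a normalized statement label
    (Both encodings are hypothetical, chosen only for this sketch.)
    """
    depth = {start: 0}
    queue = deque([start])
    edges = set()
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue  # edges leaving this node would need n + 1 steps
        for succ in graph.get(node, ()):
            edges.add((labels[node], labels[succ]))
            if succ not in depth:
                depth[succ] = depth[node] + 1
                queue.append(succ)
    # Assumption: two n-grams are "the same" if they share a root label
    # and the same set of labeled edges.
    return (labels[start], frozenset(edges))


def tfidf_scores(program_ngrams, corpus):
    """Weight each n-gram of one program by how rare it is in the corpus.

    program_ngrams: list of n-gram keys from one program
    corpus:         list of such lists, one per corpus program
    """
    tf = Counter(program_ngrams)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each n-gram once per program
    total = len(corpus)
    # Assumption: smoothed idf, so unseen n-grams do not divide by zero.
    return {
        gram: count * math.log((1 + total) / (1 + df[gram]))
        for gram, count in tf.items()
    }


if __name__ == "__main__":
    # Toy dependence graph: loop test -> increment -> call.
    graph = {0: [1], 1: [2], 2: []}
    labels = {0: "i < n", 1: "i++", 2: "send(x)"}
    grams = [code_ngram(graph, labels, s, 1) for s in graph]
    corpus = [grams, grams[:2]]
    for gram, score in tfidf_scores(grams, corpus).items():
        print(gram[0], round(score, 3))
```

Under this weighting, an n-gram that appears in every corpus program (such as a simple loop counter test) scores zero, while an n-gram seen in few programs scores higher, which matches the intuition in the abstract that distinctive code should matter more than boilerplate.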

Link to Publication

http://dl.acm.org/authorize?N80751

Chun-Hung Hsiao

University of Michigan

Michael Cafarella

University of Michigan

Satish Narayanasamy

University of Michigan

Session Program

Wed 22 Oct
Displayed time zone: (GMT-07:00) Tijuana, Baja California

10:30 - 12:00	Program Analysis and the Web (OOPSLA) at Salon E. Chair(s): Stephen Chong, Harvard University

10:30 22m Talk		Checking Correctness of TypeScript Interfaces for JavaScript Libraries (OOPSLA). Asger Feldthaus (Aarhus University), Anders Møller (Aarhus University)
10:52 22m Talk		Determinacy in Static Analysis for jQuery (OOPSLA). Esben Andreasen (Aarhus University), Anders Møller (Aarhus University)
11:15 22m Talk		EventBreak: Analyzing the Responsiveness of User Interfaces through Performance-Guided Test Generation (OOPSLA). Michael Pradel (University of California, Berkeley), Parker Schuh (University of California, Berkeley), George Necula (University of California, Berkeley), Koushik Sen (University of California, Berkeley)
11:37 22m Talk		Using Web Corpus Statistics for Program Analysis (OOPSLA). Chun-Hung Hsiao (University of Michigan), Michael Cafarella (University of Michigan), Satish Narayanasamy (University of Michigan)