Information Extraction in Social Media

Status messages written by users of Social Media websites (e.g., Facebook and Twitter) contain a great deal of timely and important information; however, there are also many irrelevant and redundant messages, which can easily lead to information overload. No person can read each of the hundreds of millions of messages produced every day, motivating the need for systems that can automatically extract and aggregate important information from these dynamically changing text streams.

Off-the-shelf tools such as Part-of-Speech Taggers and Named Entity Recognizers perform poorly when applied to Social Media text due to its noisy and unique style. To address this, I have been working toward building a set of Twitter-specific text processing tools [EMNLP 2011a].

Users of Social Media sites frequently discuss events which will occur in the future. By annotating Named Entities and resolving temporal expressions (for example "next Friday"), we are able to automatically extract a calendar of popular events occurring in the near future from Twitter [KDD 2012].
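To make the temporal resolution step concrete, here is a minimal sketch of grounding a relative expression such as "next Friday" against a tweet's timestamp. It covers only a few hand-picked patterns and is purely illustrative; it is not the resolver used in [KDD 2012].

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_temporal(expression, tweet_time):
    """Ground a few relative temporal expressions to calendar dates.

    Illustrative sketch only; a real resolver handles a far wider range
    of expressions and ambiguities.
    """
    expr = expression.lower().strip()
    if expr in ("today", "tonight"):
        return tweet_time.date()
    if expr == "tomorrow":
        return (tweet_time + timedelta(days=1)).date()
    if expr.startswith("next ") and expr[5:] in WEEKDAYS:
        target = WEEKDAYS.index(expr[5:])
        # Next occurrence of the named weekday (interpretations of "next" vary).
        days_ahead = (target - tweet_time.weekday()) % 7
        return (tweet_time + timedelta(days=days_ahead or 7)).date()
    return None  # unresolved

# A tweet posted on Tuesday, May 1, 2012 mentioning "next Friday":
print(resolve_temporal("next Friday", datetime(2012, 5, 1)))  # 2012-05-04
```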

A recent talk on this work can be viewed here.

Conversational Modeling in Social Media

In addition to discussing upcoming events, users of social networking sites are having public conversations at an unprecedented scale. This presents a unique opportunity to collect millions of naturally occurring conversations and investigate new data-driven techniques for conversational modeling.

I have worked on unsupervised modeling of dialogue acts in Twitter [NAACL 2010]. By remaining agnostic about the set of classes, we are able to learn a model which provides insight into the nature of communication in a new medium.

I have investigated the feasibility of automatically replying to status messages by adapting techniques from Statistical Machine Translation [EMNLP 2011b], using millions of naturally occurring Twitter conversations as parallel text. Although there are many differences between conversation and translation, with a few conversation-specific adaptations we are able to build response models which often generate appropriate replies to Twitter status posts. This work has several possible applications, including conversationally aware predictive text entry.
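As a toy illustration of the noisy-channel intuition behind this approach, the sketch below scores hypothetical candidate replies by combining a word-level "translation" model P(status | reply) with a reply language model. All probability tables and candidates are made up for illustration; the system described in [EMNLP 2011b] adapts full statistical MT machinery trained on millions of status/reply pairs.

```python
import math

# Toy parameters (hypothetical): p_trans[s][r] ~ P(status word s | reply word r),
# p_lm[r] ~ unigram probability of reply word r. In a real system these are
# estimated from millions of aligned status/reply pairs.
p_trans = {
    "so":     {"eat": 0.1, "pizza": 0.1, "sleep": 0.1},
    "hungry": {"eat": 0.4, "pizza": 0.3, "sleep": 0.001},
}
p_lm = {"eat": 0.04, "pizza": 0.01, "sleep": 0.02}

def score(status_words, reply_words, smooth=1e-6):
    """Noisy-channel score: log P(status | reply) + log P(reply)."""
    log_p = 0.0
    for s in status_words:
        # IBM Model 1 style: average translation probability over reply words.
        avg = sum(p_trans.get(s, {}).get(r, smooth) for r in reply_words)
        log_p += math.log(avg / len(reply_words))
    for r in reply_words:
        log_p += math.log(p_lm.get(r, smooth))
    return log_p

status = ["so", "hungry"]
candidates = [["eat", "pizza"], ["sleep"]]
print(max(candidates, key=lambda reply: score(status, reply)))  # ['eat', 'pizza']
```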

A recent talk on this work can be viewed here.

Better Models of Weakly Supervised Knowledge Extraction

Learning from large datasets has revolutionized a number of fields, including machine translation and speech recognition, where data is produced as a natural byproduct of people's activities (for example, transcribing speech to text and professional translation services). Computational semantics has suffered in this respect because people do not naturally translate text into machine-processable meaning representations. In order to apply large-scale data-driven approaches to semantic processing, we need to leverage readily available knowledge sources such as Wikipedia and Freebase as indirect supervision. These supervision sources have a weaker correspondence with text, thus requiring specialized learning methods involving latent variables.

I have explored the issue of Missing Data in Distant Supervision [TACL 2013]. Even large structured data sources such as Freebase lack complete coverage in many areas of interest. Most previous distantly supervised learning algorithms have relied on the closed world assumption: that all propositions missing from the knowledge base (KB) are false. When information is missing from either the text or the database, this assumption leads to errors in the training data. I relaxed this assumption with a novel latent variable model, which jointly models the process of information extraction together with missing information in both the text and the KB, and which provides a natural way to incorporate side information in the form of a missing data model. I designed an efficient and accurate inference method for this new model and presented results demonstrating large performance improvements from explicitly modeling missing data.
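As a concrete illustration of where the closed world assumption goes wrong, the sketch below applies the standard distant-supervision labeling heuristic to a few hypothetical facts and entity pairs: a pair is labeled positive only if the fact appears in the KB, so a true fact that merely happens to be missing becomes a false negative in the training data. The [TACL 2013] model instead treats such labels as latent.

```python
# Hypothetical KB facts and text-derived entity pairs for one relation (bornIn).
kb_facts = {
    ("Barack Obama", "Honolulu"),
    ("Wolfgang Amadeus Mozart", "Salzburg"),
}

# Entity pairs co-occurring with "born in" phrases in some text corpus.
text_pairs = [
    ("Barack Obama", "Honolulu"),
    ("Wolfgang Amadeus Mozart", "Salzburg"),
    ("Alan Turing", "London"),  # true, but suppose the KB lacks this fact
]

def closed_world_labels(pairs, kb):
    """Label each pair positive iff the fact is in the KB (closed world)."""
    return [(pair, pair in kb) for pair in pairs]

for pair, label in closed_world_labels(text_pairs, kb_facts):
    print(pair, "->", label)
# The Turing pair is labeled False even though the sentence expresses the
# relation; modeling missing data treats such labels as latent rather than
# as observed negatives.
```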

I have investigated Distant Supervision with Topic Models, which is appropriate for weakly supervised learning problems involving highly ambiguous training data. To demonstrate the feasibility of this approach we make use of entity categories from Freebase as a distant source of supervision in a weakly supervised named entity categorization task. This approach leverages the ambiguous supervision provided by Freebase in a principled way, significantly outperforming Co-Training [EMNLP 2011a].
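The sketch below illustrates what "highly ambiguous training data" means in this setting, using hypothetical Freebase-style records: a surface string can match several entities with different categories, so distant supervision yields a set of possible labels per mention rather than a single label. Constructing these label sets is the input to the topic-model approach; the model itself is not shown here.

```python
# Hypothetical fragment of Freebase-style entity categories.
freebase_types = {
    "JFK (person)":     {"PERSON"},
    "JFK (airport)":    {"FACILITY"},
    "Amazon (company)": {"COMPANY"},
    "Amazon (river)":   {"LOCATION"},
}

# Lookup from a surface form to matching entities (also hypothetical).
alias_index = {
    "jfk":    ["JFK (person)", "JFK (airport)"],
    "amazon": ["Amazon (company)", "Amazon (river)"],
}

def ambiguous_labels(mention):
    """Union of categories of every entity matching the mention string.

    The resulting label *set* (rather than a single label) is the ambiguous,
    distant supervision the topic model has to learn from.
    """
    types = set()
    for entity in alias_index.get(mention.lower(), []):
        types |= freebase_types[entity]
    return types

print(ambiguous_labels("JFK"))     # e.g. {'PERSON', 'FACILITY'}
print(ambiguous_labels("Amazon"))  # e.g. {'COMPANY', 'LOCATION'}
```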

Latent Variable Models of Lexical Semantics

I have applied a variant of Latent Dirichlet Allocation to automatically infer the argument types or Selectional Preferences of textual relations [ACL 2010]. Generative models have the advantage that they provide a principled way to perform many different kinds of probabilistic queries about the data. For example, our model of selectional preferences is useful in filtering improper applications of inference rules in context, showing a substantial improvement over a state-of-the-art rule-filtering system which makes use of a predefined set of classes. The topics discovered by our model can be browsed here. Inference and evaluation code is available for download here.
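For readers who want a feel for the setup, here is a rough sketch using off-the-shelf LDA (via gensim) on tiny hypothetical data: each relation's observed argument fillers form one pseudo-document, and the induced topics act as argument classes. This is not the released inference code, and the [ACL 2010] model is a different LDA variant tailored to relation-argument data.

```python
# Sketch: selectional preferences via vanilla LDA (gensim), treating each
# relation's observed argument fillers as one pseudo-document.
from gensim import corpora, models

relation_args = {
    "eat_obj":    ["pizza", "pasta", "sushi", "apples", "pizza", "rice"],
    "drink_obj":  ["coffee", "tea", "water", "beer", "coffee"],
    "visit_obj":  ["paris", "london", "rome", "tokyo", "paris"],
    "fly_to_obj": ["london", "tokyo", "rome", "paris"],
}

docs = list(relation_args.values())
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=50, random_state=0)

# Each topic is a distribution over argument fillers, i.e. an induced
# argument class (roughly foods/drinks vs. places in this toy data).
for topic_id in range(2):
    print(lda.print_topic(topic_id, topn=4))
```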

In addition, I have applied latent variable models to automatically induce an appropriate set of categories for events extracted from Twitter [KDD 2012]. By leveraging large quantities of unlabeled data we are able to outperform a supervised baseline at the task of categorizing extracted events using the types automatically inferred by our model.

Utilizing Implicit Feedback in Interactive File Selection

Selection tasks are common in modern computer interfaces; we are often required to select multiple files, emails, and other data entries for copying, modification, deletion, etc. Complex selection tasks can require many clicks and mouse movements on the part of the user; to aid users with these complex selections, we propose an interactive machine learning solution [IUI 2009]. In addition to making use of explicit selections and deselections, we utilize implicit features of the user's behavior, such as passing over files or proximity in the interface. Since the behavior features are task-independent, we use historical interaction traces as training data. A video demonstration of our file-selection prototype can be viewed here.
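A minimal sketch of this learning setup is shown below: a classifier is trained on explicit selections/deselections plus implicit behavioral features, then used to rank unlabeled files by predicted selection probability. The feature names and training data are hypothetical stand-ins, not the feature set from [IUI 2009].

```python
# Predict whether each file should be selected from explicit labels plus
# implicit behavioral features (all data here is hypothetical).
from sklearn.linear_model import LogisticRegression

# Features per file: [explicitly_selected, explicitly_deselected,
#                     mouse_passed_over, adjacent_to_selected]
X_train = [
    [1, 0, 1, 1],  # explicitly selected in a past trace        -> 1
    [0, 0, 1, 1],  # passed over and adjacent, later selected   -> 1
    [0, 0, 0, 1],  # adjacent only, later selected              -> 1
    [0, 1, 1, 0],  # explicitly deselected                      -> 0
    [0, 0, 0, 0],  # no signal, never selected                  -> 0
    [0, 0, 1, 0],  # passed over only, never selected           -> 0
]
y_train = [1, 1, 1, 0, 0, 0]

model = LogisticRegression().fit(X_train, y_train)

# Rank remaining, unlabeled files by predicted selection probability.
candidates = [[0, 0, 1, 1], [0, 0, 0, 0]]
print(model.predict_proba(candidates)[:, 1])
```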

Finding Contradictions in Web Text

Many textual relations map one argument to a unique value. For example, the verb "assassinated" should map each direct object to a unique subject. We investigate automatically classifying relation functionality using an unsupervised EM-style algorithm, and we evaluate performance at discovering naturally occurring contradictions within a large web corpus [EMNLP 2008]. We show that contradiction detection on the web is a difficult task for a variety of reasons, including name ambiguity (e.g., people named John Smith were born in many different locations) and synonyms and meronyms (Mozart was born in both Salzburg and Austria).
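The sketch below shows the basic bookkeeping behind this idea on hypothetical extractions: group (relation, argument) pairs and flag any that map to more than one value as candidate contradictions. The actual system additionally estimates which relations are functional with an EM-style algorithm and filters out synonyms, meronyms, and ambiguous names; none of that is shown here.

```python
from collections import defaultdict

# Hypothetical (relation, arg1, arg2) extractions from a web corpus.
extractions = [
    ("born_in", "Mozart", "Salzburg"),
    ("born_in", "Mozart", "Austria"),          # meronym, not a contradiction
    ("born_in", "John Smith", "London"),
    ("born_in", "John Smith", "Boston"),       # name ambiguity, not a contradiction
    ("assassinated_by", "Lincoln", "Booth"),
    ("assassinated_by", "Lincoln", "Oswald"),  # a genuine conflict
]

def candidate_contradictions(extractions):
    """Group extractions by (relation, arg1); any group with more than one
    arg2 value is a *candidate* contradiction for a functional relation."""
    values = defaultdict(set)
    for rel, a1, a2 in extractions:
        values[(rel, a1)].add(a2)
    return {key: vals for key, vals in values.items() if len(vals) > 1}

for (rel, a1), vals in candidate_contradictions(extractions).items():
    print(rel, a1, vals)
# A real system must then decide which relations are actually functional and
# rule out synonyms, meronyms, and ambiguous names before calling any of
# these true contradictions.
```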

Interactive Information Integration with HTML Tables & Freebase

As part of the graduate databases class, I investigated applying data-integration techniques to augment HTML tables with additional data from Freebase, in addition to enabling users to quickly verify and contribute data contained in the table. Users can choose to display columns which are not present in the original table, but for which data exists in Freebase, providing an immediate benefit. They can also easily verify our algorithmically generated mapping from table columns to Freebase attributes, allowing data contained in the table (but missing from Freebase) to be imported. Details on this project, including a prototype Firefox browser plugin, can be found here.
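As an illustration of the kind of column-to-attribute mapping involved, the sketch below scores candidate properties by how well their known values overlap with a table column's cells. The data and property names are hypothetical placeholders (not real Freebase schema identifiers), and this simple overlap heuristic is only one plausible approach, not necessarily the one used in the project.

```python
# Map an HTML table column onto a candidate property by value overlap.
table_column = {           # row entity -> cell value from the HTML table
    "Seattle":  "WA",
    "Portland": "OR",
    "Boise":    "ID",
}

candidate_properties = {   # placeholder property names -> known (entity, value) pairs
    "state_of_city":   {"Seattle": "WA", "Portland": "OR"},
    "country_of_city": {"Seattle": "USA", "Portland": "USA"},
}

def score_property(column, prop_values):
    """Fraction of table cells that agree with the candidate property."""
    matches = sum(1 for ent, val in column.items()
                  if prop_values.get(ent) == val)
    return matches / len(column)

best = max(candidate_properties,
           key=lambda p: score_property(table_column, candidate_properties[p]))
print(best)  # 'state_of_city'
```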