Status messages written by users of social media websites (e.g., Facebook and Twitter) contain a great deal of timely and important information; however, there are also many irrelevant and redundant messages, which can easily lead to information overload. No person can read each of the hundreds of millions of messages produced every day, motivating the need for systems that can automatically extract and aggregate important information from these dynamically changing text streams.
Off-the-shelf tools such as part-of-speech taggers and named entity recognizers perform poorly when applied to social media text due to its noisy and unique style. To address this, I have been working towards building a set of Twitter-specific text processing tools [EMNLP 2011a].
Users of social media sites frequently discuss events which will occur in the future. By recognizing named entities and resolving temporal expressions (for example, "next Friday"), we are able to automatically extract a calendar of popular events occurring in the near future from Twitter [KDD 2012].
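To illustrate the kind of normalization a temporal resolver performs, here is a minimal sketch that grounds a few relative date expressions against a reference date. The expressions handled and the `resolve` function are hypothetical simplifications, not the system described in the paper; real temporal taggers handle far more constructions and ambiguity (e.g., whether "next Friday" means this week's or next week's Friday).

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]

def resolve(expr, ref):
    """Resolve a relative temporal expression against a reference date.

    Toy resolver covering only a handful of expressions, for illustration.
    """
    expr = expr.lower()
    if expr == "today":
        return ref
    if expr == "tomorrow":
        return ref + timedelta(days=1)
    if expr.startswith("next "):
        target = WEEKDAYS.index(expr.split()[1])
        # Days until the next occurrence of the target weekday (1..7),
        # so "next friday" uttered on a Friday maps a full week ahead.
        delta = (target - ref.weekday() - 1) % 7 + 1
        return ref + timedelta(days=delta)
    raise ValueError("unsupported expression: " + expr)

# A tweet posted on Wednesday 2012-06-06 mentioning "next Friday"
print(resolve("next Friday", date(2012, 6, 6)))  # 2012-06-08
```

Anchoring each extracted event to a concrete calendar date like this is what makes it possible to aggregate mentions from many users into a single calendar entry.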
In addition to discussing upcoming events, users of social networking sites are having public conversations at an unprecedented scale. This presents a unique opportunity to collect millions of naturally occurring conversations and investigate new data-driven techniques for conversational modeling.
I have worked on unsupervised modeling of dialogue acts in Twitter [NAACL 2010]. By remaining agnostic about the set of classes, we are able to learn a model which provides insight into the nature of communication in a new medium.
I have investigated the feasibility of automatically replying to status messages by adapting techniques from Statistical Machine Translation [EMNLP 2011b], using millions of naturally occurring Twitter conversations as parallel text. Although there are many differences between conversation and translation, with a few conversation-specific adaptations we are able to build response models which often generate appropriate replies to Twitter status posts. This work has several possible applications, including conversationally aware predictive text entry.
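As a rough intuition for treating status-response pairs as "parallel text," the toy sketch below estimates which response words co-occur with which status words across paired messages, a crude stand-in for the word-alignment models used in statistical MT. The data and the `likely_reply_word` helper are invented for illustration; they are not the models or corpus from the paper.

```python
from collections import defaultdict

# Hypothetical status -> response pairs (real systems use millions).
pairs = [
    ("i am so tired", "get some sleep"),
    ("i am hungry", "get some food"),
    ("so tired today", "you need sleep"),
]

# Count co-occurrences of status words with response words, analogous
# to collecting alignment statistics from a parallel corpus.
cooc = defaultdict(lambda: defaultdict(int))
for status, response in pairs:
    for s in status.split():
        for r in response.split():
            cooc[s][r] += 1

def likely_reply_word(status_word):
    # Most frequent response word co-occurring with this status word.
    return max(cooc[status_word], key=cooc[status_word].get)

print(likely_reply_word("tired"))  # "sleep" co-occurs twice
```

Even this crude statistic captures the asymmetry that makes conversation unlike translation: "tired" maps to "sleep," not to a paraphrase of itself, which is one of the conversation-specific issues the adapted models must handle.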
Learning from large datasets has revolutionized a number of fields, including machine translation and speech recognition, where data is produced as a natural byproduct of people's activities (for example, transcribing speech to text and professional translation services). Computational semantics has suffered in this respect because people do not naturally translate text into machine-processable meaning representations. In order to apply large-scale data-driven approaches to semantic processing, we need to leverage readily available knowledge sources such as Wikipedia and Freebase as indirect supervision. These supervision sources have a weaker correspondence with text, thus requiring specialized learning methods involving latent variables.
I have explored the issue of missing data in distant supervision [TACL 2013]. Even large structured data sources such as Freebase lack complete coverage in many areas of interest. Most previous distantly supervised learning algorithms have relied on the closed-world assumption: that all propositions missing from the KB are false. When information is missing from either the text or the database, this assumption introduces errors into the training data. I relaxed it in a novel latent variable model, which jointly models the process of information extraction along with missing information in both the text and the KB, and which provides a natural way to incorporate side information in the form of a missing data model. I designed an efficient and accurate inference method for this new model and presented results demonstrating large performance improvements from explicitly modeling missing data.
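The following toy example, with invented KB facts and sentences, shows how closed-world labeling goes wrong when the KB is incomplete: a sentence expressing a true relation gets marked as a negative training example simply because the fact is absent from the KB.

```python
# Hypothetical, incomplete KB: it records Obama's birthplace but is
# missing the (equally true) fact (BornIn, Einstein, Ulm).
kb = {("BornIn", "Obama", "Honolulu")}

# Hypothetical sentences paired with the entities they mention.
sentences = [
    ("Obama", "Honolulu", "Obama was born in Honolulu."),
    ("Einstein", "Ulm", "Einstein was born in Ulm."),
]

def closed_world_label(e1, e2):
    # Closed-world assumption: any proposition absent from the KB is false.
    return ("BornIn", e1, e2) in kb

for e1, e2, text in sentences:
    tag = "positive" if closed_world_label(e1, e2) else "negative"
    print(text, "->", tag)
# The Einstein sentence is a genuine BornIn mention, yet the closed-world
# assumption labels it negative, injecting noise into the training data.
```

Modeling whether a fact is merely missing from the KB (rather than false) is exactly the kind of side information the latent variable model above is designed to exploit.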
I have investigated Distant Supervision with Topic Models, which is appropriate for weakly supervised learning problems involving highly ambiguous training data. To demonstrate the feasibility of this approach we make use of entity categories from Freebase as a distant source of supervision in a weakly supervised named entity categorization task. This approach leverages the ambiguous supervision provided by Freebase in a principled way, significantly outperforming Co-Training [EMNLP 2011a].
I have applied a variant of Latent Dirichlet Allocation to automatically infer the argument types, or selectional preferences, of textual relations [ACL 2010]. Generative models have the advantage that they provide a principled way to perform many different kinds of probabilistic queries about the data. For example, our model of selectional preferences is useful in filtering improper applications of inference rules in context, showing a substantial improvement over a state-of-the-art rule-filtering system which makes use of a predefined set of classes. The topics discovered by our model are available to browse, and inference and evaluation code is available for download.
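To make the LDA-over-relations idea concrete, here is a minimal collapsed Gibbs sampler in which each relation's observed argument head words form a "document" and topics play the role of induced argument classes. The data, hyperparameters, and topic count are toy placeholders, not the actual model or corpus from the paper.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy data: relation -> argument head words observed in its slot.
docs = {
    "eat":    ["pasta", "bread", "soup", "pasta", "rice"],
    "drink":  ["water", "coffee", "tea", "water", "soup"],
    "visit":  ["paris", "london", "rome", "paris"],
    "travel": ["paris", "rome", "london", "tokyo"],
}
K, alpha, beta = 2, 0.5, 0.1  # topics and symmetric Dirichlet priors
V = len({w for ws in docs.values() for w in ws})

ndk = defaultdict(lambda: [0] * K)          # relation-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # topic totals
assign = []                                 # [relation, word, topic]

# Random initialization of topic assignments.
for d, words in docs.items():
    for w in words:
        z = random.randrange(K)
        assign.append([d, w, z])
        ndk[d][z] += 1; nkw[z][w] += 1; nk[z] += 1

# Collapsed Gibbs sweeps: resample each token's topic from its
# conditional given all other assignments.
for _ in range(200):
    for item in assign:
        d, w, z = item
        ndk[d][z] -= 1; nkw[z][w] -= 1; nk[z] -= 1
        weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                   for k in range(K)]
        z = random.choices(range(K), weights)[0]
        item[2] = z
        ndk[d][z] += 1; nkw[z][w] += 1; nk[z] += 1

for k in range(K):
    top = sorted(nkw[k], key=nkw[k].get, reverse=True)[:3]
    print("argument class", k, ":", top)
```

On data like this, the sampler tends to separate food-like arguments from location-like ones; in the paper, the analogous induced classes are what allow probabilistic queries such as whether an inference rule's argument fits a relation's preferences.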
In addition, I have applied latent variable models to automatically induce an appropriate set of categories for events extracted from Twitter [KDD 2012]. By leveraging large quantities of unlabeled data we are able to outperform a supervised baseline at the task of categorizing extracted events using the types automatically inferred by our model.