CSE 5539: NLP and IE for the Social Web

The recent explosion of user-generated content in online social media presents a wealth of new opportunities for data analytics applications. Structured data is just the tip of the iceberg; most of this data is locked up as unstructured text, which is difficult for current algorithms to process. There has been growing interest in adapting natural language processing (NLP) and information extraction (IE) technology to this data, as well as in identifying new opportunities for applications on big, noisy, informal text data. Examples include computational social science, user modeling, personalization, news recommendation, event detection and more.

The course will involve reading and discussing recent papers from top conferences in the field. Students will propose and complete an open-ended course project; example projects might include anything from extracting a concert calendar from Twitter to automatically generating answers to health questions in online patient forums.

While the course will cover some technical material, emphasis will be on applications and building systems rather than mathematical details. Some prior coursework in Artificial Intelligence or Machine Learning will be very helpful.

While there are no formal prerequisites for the class, students are assumed to have some minimal background in applied machine learning or to be willing to learn the basics on their own.

Grading will be based on 2 components:

50% participation

The class will read and discuss two papers each week. Before class, each student should read the assigned paper and write a short critique (about half a page). These reviews should not be simple summaries; they should discuss the paper's strengths and limitations, suggest how the work could be improved or extended, and raise questions about parts of the paper that were difficult to understand or unclear, which we can discuss in class. The idea is that writing reviews will help to spark discussion in class. Students are allowed to skip 2 reviews throughout the semester. Reviews should be submitted to a course discussion forum for the paper in Carmen before 9am on the day of class. Please email your reviews to the instructor if there are any technical issues with submission.

Each student will also lead discussion for one paper. Email the instructor before class on 8/29 to reserve a presentation slot. The discussion leader should present the main contributions of the paper within about 15 minutes, after which the class will open up for discussion. Please email your presentation materials (slides or discussion notes) to the instructor for review 24 hours before class. Students leading the discussion of a paper do not need to write a review for that paper.
  • 25% Paper Summaries
  • 15% Paper Presentation
  • 10% Participation in Class Discussions

50% course project

Students will propose and carry out a mini research project with guidance from the instructor. The project should produce a 4-5 page report, and students will give a short in-class presentation on their project at the end of the semester. Projects should be done in small groups of about 2 or 3 students. The scope of the project should be appropriate to the size of the group, and all students in a group will receive the same grade. Students should submit an initial 1-2 page project proposal by 9/26, which the instructor will review and provide feedback on. Students are encouraged to meet with the instructor to discuss project ideas. Students are free to choose from the sample projects or propose their own. It is OK for multiple groups of students to work on the same project, though they are encouraged to communicate, share data and focus on different aspects. If you are currently working on a research project in a related area, feel free to talk with the instructor about using it as a course project. I am open to the possibility of literature reviews or other types of projects; however, empirical projects which evaluate performance using metrics such as precision and recall are preferred.
Please submit your project report, source code and data to the dropbox on Carmen by 12/12.
  • 10% Initial Proposal
  • 10% Final Presentation
  • 10% Project Code + Data
  • 20% Final Report
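
The project description above prefers empirical evaluation with metrics such as precision and recall. As a rough illustration only (not a required evaluation script), the sketch below computes both from a set of system outputs and a hand-labeled gold set. The function name `precision_recall` and the example concert data are hypothetical and are not part of any course materials.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of system outputs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold items that the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: (artist, date) pairs extracted from tweets
# compared against a hand-labeled list.
gold = {("Band A", "10/3"), ("Band B", "10/5"), ("Band C", "10/9")}
predicted = {("Band A", "10/3"), ("Band B", "10/6")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here one of the two predictions is correct and one of the three gold items is found. Exact-match comparison is an assumption; a project may need a looser matching criterion.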

Useful Resources
The following is a list of potentially useful resources for the class project. You are not required to use these.
Project Ideas
Following is a list of potential directions for course projects. You are also encouraged to come up with your own idea.
Date | Topic | Reading
8/27 | Course Overview | No Reading
8/29 | Relation Extraction (Alan will present) - Useful videos: 1 2 3 4 | Distant supervision for relation extraction without labeled data, Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, ACL 2009. Email instructor with your preferred presentation slot date before class.
9/3 | Relation Extraction | Coupled Semi-Supervised Learning for Information Extraction, A. Carlson, J. Betteridge, R.C. Wang, E.R. Hruschka Jr. and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2010. Email instructor with project groups before class.
9/5 | Event Extraction | Open Domain Event Extraction from Twitter, Alan Ritter, Mausam, Oren Etzioni, Sam Clark, KDD 2012
9/10 | Text-Driven Forecasting | Predicting a Scientific Community’s Response to an Article, Dani Yogatama, Michael Heilman, Brendan O’Connor, Chris Dyer, EMNLP 2011
9/12 | Brainstorm Project Ideas | No Reading
9/17 | Text-Driven Forecasting | Predicting the Present with Google Trends, Hyunyoung Choi, Hal Varian
9/19 | Computational Social Science | No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities, Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, Christopher Potts, WWW 2013
9/24 | Relation Extraction | Modeling Missing Data in Distant Supervision for Information Extraction, Alan Ritter, Luke Zettlemoyer, Mausam, Oren Etzioni, TACL 2013
9/26 | Geographical Modeling | Hierarchical Geographical Modeling of User Locations from Social Media Posts, Amr Ahmed, Liangjie Hong, Alex Smola, WWW 2013
10/1 | Geographical Modeling | Finding Your Friends and Following Them to Where You Are, Adam Sadilek, Henry Kautz, Jeffrey Bigham, WSDM 2012. Initial Project Proposals Due.
10/3 | NLP in Noisy Text | Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, Noah A. Smith, NAACL 2013
10/8 | Summarization | Towards Twitter Context Summarization with User Influence Models, Yi Chang, Xuanhui Wang, Qiaozhu Mei, Yan Liu, WSDM 2013
10/10 | Text-Driven Forecasting | Success with style: Using writing style to predict the success of novels, Vikas Ganjigunte Ashok, Song Feng, Yejin Choi, EMNLP 2013
10/15 | NLP in Noisy Text | Learning part-of-speech taggers with inter-annotator agreement loss, Barbara Plank, Dirk Hovy, Anders Søgaard, EACL 2014
10/17 | Event Extraction | Major Life Event Extraction from Twitter based on Congratulations/Condolences Speech Acts, Jiwei Li, Alan Ritter, Claire Cardie and Eduard Hovy, EMNLP 2014
10/22 | Entity Linking | To Link or Not to Link? A Study on End-to-End Tweet Entity Linking, Stephen Guo, Ming-Wei Chang, Emre Kiciman, NAACL 2013
10/24 | NLP in Noisy Text | Lexical Normalisation of Short Text Messages: Makn Sens a #twitter, Bo Han, Timothy Baldwin, ACL 2011 (note: the paper was changed on 10/20 - if you read the previous paper it's fine to submit a critique for that instead).
10/29 | NLP in Noisy Text | What to do about bad language on the internet, Jacob Eisenstein
10/31 | Guest Lecture: Micha Elsner | Disentangling chat with local coherence models, ACL 2011
11/5 | NLP in Noisy Text | A Dependency Parser for Tweets, Lingpeng Kong et al., EMNLP 2014
11/7 | Guest Lecture: Wei Xu (UPenn) | A Preliminary Study of Tweet Summarization using Information Extraction
11/12 |  | Distributed Representations of Words and Phrases and their Compositionality
11/14 | Sentiment | Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
11/19 | Guest Lecture: Marie-Catherine de Marneffe | Easy Victories and Uphill Battles in Coreference Resolution, Durrett and Klein, EMNLP 2013
11/21 | Event Extraction | Event Discovery in Social Media Feeds, Benson et al., ACL 2011
11/26 | Thanksgiving break |
11/28 | Columbus Day |
12/3 | Course Projects | Office hours, feel free to drop by to discuss any questions about projects. Dreese 595.
12/15 | Course Projects | Final Project Presentations @ 4pm