CSE 5539: Web Information Extraction

Encyclopedic knowledge bases (KBs) such as Wikipedia and Freebase form the underlying intelligence behind Google’s Knowledge Graph, Facebook’s Graph Search, IBM’s Watson and more. These broad-coverage databases contain facts about entities, for example a person’s employer or a city’s mayor. While today’s KBs have good coverage for popular entities, little structured information is available for less prominent entities, suggesting the need for information extraction from large unstructured and semi-structured data sources found on the Web.

The course will involve reading and discussing recent papers from top conferences in the field. Students will propose and complete an open-ended course project; example projects might include anything from extracting relations from HTML tables on the web to event extraction from news or social media text.

Previous experience with natural language processing and/or machine learning will be very helpful.

Resources
Course Details
Prerequisites
While there are no formal prerequisites for the class, students are assumed to have some background in machine learning or to be willing to learn the basics on their own.
Grading

Grading will be based on 2 components:

50% participation

The class will read and discuss two papers each week. Before class, each student should read the assigned paper and write a short critique (100-200 words). These reviews should not be simple summaries; they should discuss the paper's strengths and limitations, suggest how the work could be improved or extended, or raise questions about points that were difficult to understand so we can discuss them in class. The point of writing reviews is to help spark discussions in class. Students are allowed to skip 2 reviews throughout the semester. Reviews should be submitted to the course discussion forum for the paper in Carmen before 9am on the day of class. Please email your review to the instructor if there are any technical issues with online submission.

Each student will also lead discussion of two papers. The discussion leader should present the main contributions of the paper in about 15 minutes, after which the class will open up for discussion. Please email your presentation materials (slides or discussion notes) to the instructor for review 24 hours before class. Students leading the discussion of a paper do not need to write a review for that paper.
  • 20% Paper Summaries
  • 20% Paper Presentation
  • 10% Participation in Class Discussions

50% course project

Students will propose and carry out a mini research project with guidance from the instructor. The project should produce a 4-5 page report including experimental results comparing against some baseline. Students will give a short in-class presentation on their project at the end of the semester. Projects can be done individually or in groups of 2 or 3 students. The scope of the project should be appropriate to the size of the group, and all students in a group will receive the same grade. Students will submit an initial 1-2 page project proposal by 10/8; the instructor will review it and provide feedback. Students are encouraged to meet with the instructor to discuss project ideas. It is OK for multiple groups of students to work on similar projects, though they are encouraged to communicate, share data and focus on different aspects. If you are currently working on a research project in a related area, feel free to talk with the instructor about using it as a course project. I am also open to literature reviews or other types of projects; however, empirical projects that evaluate performance using metrics such as precision and recall and compare against an appropriate baseline system are preferred.
Please submit your project report, source code and data to the dropbox on Carmen by 12/12.
  • 10% Initial Proposal
  • 10% Final Presentation
  • 10% Project Code + Data
  • 20% Final Report
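
As a rough illustration of the kind of evaluation preferred for empirical projects, the sketch below scores a set of extracted facts against a gold-standard set using precision, recall and F1, and does the same for a baseline. The function name and the placeholder triples are hypothetical, made up for illustration; a real project would substitute its own extractions and gold annotations.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted facts against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical (entity, relation, value) triples, for illustration only.
gold = {
    ("e1", "employer", "o1"),
    ("e2", "employer", "o2"),
    ("c1", "mayor", "p1"),
    ("c2", "mayor", "p2"),
}
system = {("e1", "employer", "o1"), ("c1", "mayor", "p1"), ("c2", "mayor", "p9")}
baseline = {("e1", "employer", "o1")}

for name, output in [("baseline", baseline), ("system", system)]:
    p, r, f = precision_recall_f1(output, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting the same metrics for both the system and the baseline on the same gold set is what makes the comparison meaningful.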

Paper List

System Overview Papers

  • Dong, Xin, et al. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.
  • Wu, Sen, et al. "Incremental Knowledge Base Construction Using DeepDive." arXiv preprint arXiv:1502.00731 (2015).
  • Mitchell, Tom M., et al. "Never-Ending Learning." Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.

Relation Extraction

  • Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. "Learning syntactic patterns for automatic hypernym discovery." Advances in Neural Information Processing Systems 17
  • Mintz, Mike, et al. "Distant supervision for relation extraction without labeled data." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics, 2009.
  • Cafarella, Michael J., et al. "Webtables: exploring the power of tables on the web." Proceedings of the VLDB Endowment 1.1 (2008): 538-549.
  • Suchanek, Fabian M., Gjergji Kasneci, and Gerhard Weikum. "Yago: a core of semantic knowledge." Proceedings of the 16th international conference on World Wide Web. ACM, 2007.
  • Surdeanu, Mihai, et al. "Multi-instance multi-label learning for relation extraction." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
  • Parikh, Ankur P., Hoifung Poon, and Kristina Toutanova. "Grounded Semantic Parsing for Complex Knowledge Extraction."
  • Fader, Anthony, Stephen Soderland, and Oren Etzioni. "Identifying relations for open information extraction." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
  • Angeli, Gabor, Melvin Johnson Premkumar, and Chris Manning. "Leveraging Linguistic Structure For Open Domain Information Extraction." Association for Computational Linguistics (ACL). 2015.

Event Extraction

  • Benson, Edward, Aria Haghighi, and Regina Barzilay. "Event discovery in social media feeds." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011.
  • Tim Althoff, Xin Luna Dong, Kevin Murphy, Safa Alai, Van Dang, and Wei Zhang. 2015. TimeMachine: Timeline Generation for Knowledge-Base Entities. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15)
  • Ander Intxaurrondo, Eneko Agirre, Oier Lopez de Lacalle, and Mihai Surdeanu. Diamonds in the Rough: Event Extraction from Imperfect Microblog Data. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT), 2015.
  • Marchetti-Bowick, Micol, and Nathanael Chambers. "Learning for microblogs with distant supervision: Political forecasting with twitter." Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012.
  • Emre Kıcıman and Matthew Richardson. 2015. Towards Decision Support and Goal Achievement: Identifying Action-Outcome Relationships From Social Media. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15).
  • Chambers, Nathanael, and Dan Jurafsky. "Template-based information extraction without the templates." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011.

Taxonomy Induction

  • Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. "Semantic taxonomy induction from heterogenous evidence." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
  • Bansal, Mohit, David Burkett, Gerard de Melo, and Dan Klein. "Structured Learning for Taxonomy Induction with Belief Propagation." Proceedings of ACL 2014.

Entity Extraction and Linking

  • Ratinov, Lev, and Dan Roth. "Design challenges and misconceptions in named entity recognition." Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2009.
  • Colin Cherry and Hongyu Guo, The Unreasonable Effectiveness of Word Representations for Twitter Named Entity Recognition, In Proceedings of NAACL, June 2015
  • Durrett, Greg, and Dan Klein. "A joint model for entity analysis: Coreference, typing, and linking." Transactions of the Association for Computational Linguistics 2 (2014): 477-490.

Inference

  • Lao, Ni, Tom Mitchell, and William W. Cohen. "Random walk inference and learning in a large scale knowledge base." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
  • Incorporating Vector Space Similarity in Random Walk Inference over Knowledge Bases. Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, and Tom Mitchell. EMNLP 2014
  • Relation Extraction with Matrix Factorization and Universal Schemas, Sebastian Riedel, Limin Yao, Benjamin M. Marlin and Andrew McCallum, Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '13) 2013
  • Representing Text for Joint Embedding of Text and Knowledge Bases Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury and Michael Gamon, EMNLP 2015
  • "Joint Information Extraction and Reasoning: A Scalable Statistical Relational Learning Approach", to appear in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015)
Schedule
Date | Topic | Reading
8/25 | Course Overview | No Reading
8/27 (please fill out the Paper Selection Form before class) | Relation Extraction (Alan will present) - Useful videos: 1 2 3 4 | Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations, Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, Daniel S. Weld, ACL 2011
Please see the course schedule spreadsheet for further reading and presentation assignments. Feel free to email the instructor for the link if you did not receive it by email.