Keyphrases that efficiently summarize a document’s content are used in various document processing and retrieval tasks. Current state-of-the-art techniques for keyphrase extraction operate at the phrase level and score candidate phrases based on features of their component words. In this paper, we apply Conditional Random Fields to keyphrase extraction. We use token-based features incorporating linguistic, surface-form, and document-structure information to learn sequence taggers that predict keyphrases for research papers. We show experimentally that, using within-document features alone, our tagger performs on par with existing state-of-the-art systems that rely on information from Wikipedia and citation networks. In addition, we harness recent work on feature labeling in discriminative models to seamlessly incorporate expert knowledge and predictions from existing systems into the keyphrase extraction process. We highlight the modeling advantages of our keyphrase taggers and show significant performance improvements on two recently compiled datasets of Computer Science research papers.
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, California, USA
Sujatha Das Gollapalli, Xiao-Li Li, and Peng Yang
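
The token-level formulation described in the abstract can be illustrated with a minimal sketch. It assumes a B/I/O labeling scheme and a small, purely illustrative feature set; the function names `bio_labels` and `token_features`, the specific features, and the `in_title` flag are hypothetical and are not taken from the paper. Linguistic features such as part-of-speech tags would require an external tagger and are omitted here; a CRF toolkit would then be trained on the resulting feature and label sequences.

```python
from typing import Dict, List, Sequence, Union

Feature = Union[str, bool, float]


def bio_labels(tokens: Sequence[str],
               keyphrases: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token as B (begins a keyphrase), I (inside one), or O (outside).

    Matching is case-insensitive and exact; this is an assumed scheme,
    not necessarily the one used in the paper.
    """
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    for phrase in keyphrases:
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        # Slide a window of the phrase length over the token sequence.
        for start in range(len(tokens) - n + 1):
            if lowered[start:start + n] == target:
                labels[start] = "B"
                for k in range(start + 1, start + n):
                    labels[k] = "I"
    return labels


def token_features(tokens: Sequence[str], i: int,
                   in_title: bool) -> Dict[str, Feature]:
    """Build an illustrative feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        # Surface-form features (hypothetical choices).
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        # Neighboring tokens give the tagger local context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Document-structure features (hypothetical choices).
        "in_title": in_title,
        "relative_position": i / len(tokens),
    }


if __name__ == "__main__":
    sentence = "We apply Conditional Random Fields to keyphrase extraction".split()
    gold = [["conditional", "random", "fields"], ["keyphrase", "extraction"]]
    tags = bio_labels(sentence, gold)
    for idx, (word, tag) in enumerate(zip(sentence, tags)):
        print(word, tag, token_features(sentence, idx, in_title=False)["suffix3"])
```

Casting extraction as per-token labeling lets the tagger score each word in context, in contrast to the phrase-level approaches the abstract mentions, which score whole candidate phrases.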