2014
Doctoral thesis  Unknown

Beyond linear chain: a journey through conditional random fields for information extraction from text

Marcheggiani D.

Machine learning, Sequence labeling, Conditional random fields, Natural language processing, Artificial intelligence

Natural language, spoken and written, is the most important way for humans to communicate information to each other. In recent decades, natural language processing (NLP) researchers have studied methods aimed at making computers "understand" the information contained in human language. Information Extraction (IE) is a field of NLP that studies methods for extracting information from text so that it can be used to populate a structured information repository, such as a relational database. IE is divided into several subtasks, each of which aims to extract different structures from text, such as entities, relations, or more complex structures such as ontologies. In this thesis the term "information extraction" is (somewhat arbitrarily) used to identify only the subtasks that are formulated as sequence labeling tasks. Recently, the main approaches to IE have relied on supervised machine learning, which needs human-labeled examples in order to train the systems that extract information from previously unseen data. When IE is tackled as a sequence labeling task (as in, e.g., named-entity recognition, concept extraction, and in some cases opinion mining), probabilistic graphical models, and specifically Conditional Random Fields (CRFs), are among the best-performing supervised machine learning methods. In this thesis we investigate two major aspects of information extraction from text via CRFs: the creation of CRF models that outperform the commonly adopted, state-of-the-art "linear-chain" CRFs, and the impact of the quality of training data on the accuracy of CRF systems for IE. In the first part of the thesis we use the capabilities of the CRF framework to create new kinds of CRFs (i.e., two-stage, ensemble, multi-label, hierarchical) that, unlike the commonly adopted linear-chain CRFs, have a customized structure that fits the task at hand.
We exemplify this approach on two different tasks, i.e., IE from medical documents and opinion mining from product reviews. CRFs, like any machine learning-based approach, may suffer if the quality of the training data is low. Therefore, the second part of the thesis is devoted to (1) the study of how the quality of the training data affects the accuracy of a CRF system for IE; and (2) the production of human-annotated training data via semi-supervised active learning (AL).



BibTeX entry
@phdthesis{oai:it.cnr:prodotti:354616,
	title = {Beyond linear chain: a journey through conditional random fields for information extraction from text},
	author = {Marcheggiani D.},
	year = {2014}
}