2014 Conference article Open Access

An Experimental Comparison of Active Learning Strategies for Partially Labeled Sequences
Marcheggiani D., Artières T.
Active learning (AL) consists of asking human annotators to annotate automatically selected data that are assumed to bring the most benefit in the creation of a classifier. AL makes it possible to learn accurate systems with much less annotated data than pure supervised learning algorithms require, thus limiting the tedious effort of annotating a large collection of data. We experimentally investigate the behavior of several AL strategies for sequence labeling in a partially-labeled scenario, tailored to Partially-Labeled Conditional Random Fields, on four sequence labeling tasks: phrase chunking, part-of-speech tagging, named-entity recognition, and bio-entity recognition. Source: Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 898–906, Doha, Qatar, 25-29 October 2014

See at: aclweb.org Open Access | CNR ExploRA

2014 Other Unknown

Beyond linear chain: a journey through conditional random fields for information extraction from text
Marcheggiani D.
Natural language, spoken and written, is the most important way for humans to communicate information to each other. In recent decades, natural language processing (NLP) researchers have studied methods aimed at making computers "understand" the information enclosed in human language. Information Extraction (IE) is a field of NLP that studies methods for extracting information from text so that it can be used to populate a structured information repository, such as a relational database. IE is divided into several subtasks, each of which aims to extract different structures from text, such as entities, relations, or more complex structures such as ontologies. In this thesis the term "information extraction" is (somewhat arbitrarily) used to identify only the subtasks that are formulated as sequence labeling tasks. Recently, the main approaches to IE have relied on supervised machine learning, which needs human-labeled examples in order to train the systems that extract information from yet unseen data. When IE is tackled as a sequence labeling task (as in, e.g., named-entity recognition, concept extraction, and in some cases opinion mining), probabilistic graphical models, and specifically Conditional Random Fields (CRFs), are among the best-performing supervised machine learning methods. In this thesis we investigate two major aspects of information extraction from text via CRFs: the creation of CRF models that outperform the commonly adopted, state-of-the-art "linear-chain" CRFs, and the impact of the quality of training data on the accuracy of CRF systems for IE. In the first part of the thesis we use the capabilities of the CRF framework to create new kinds of CRFs (i.e., two-stage, ensemble, multi-label, hierarchical) that, unlike the commonly adopted linear-chain CRFs, have a customized structure that fits the task under consideration.
We exemplify this approach on two different tasks, i.e., IE from medical documents and opinion mining from product reviews. CRFs, like any machine learning-based approach, may suffer if the quality of the training data is low. Therefore, the second part of the thesis is devoted to (1) the study of how the quality of the training data affects the accuracy of a CRF system for IE; and (2) the production of human-annotated training data via semi-supervised active learning (AL).

See at: CNR ExploRA