2013 Contribution to book Restricted

The Tanl tagger for named entity recognition on transcribed broadcast news at Evalita 2011
Berardi G., Attardi G., Dei Rossi S., Simi M.
The Tanl tagger is a configurable tagger based on a Maximum Entropy classifier, which uses dynamic programming to select the best sequences of tags. We applied it to the NER tagging task, customizing the set of features to use, and including features deriving from dictionaries extracted from the training corpus. The final accuracy of the tagger is further improved by applying simple heuristic rules.
Source: Evaluation of Natural Language and Speech Tools for Italian. International Workshop. Revised selected papers, edited by Bernardo Magnini, Francesco Cutugno, Mauro Falcone, Emanuele Pianta, pp. 116–125. Berlin: Springer, 2013
DOI: 10.1007/978-3-642-35828-9_13

See at: doi.org Restricted | link.springer.com | CNR ExploRA

2014 Other Unknown

Semi-automated text classification
Berardi G.
There is currently a high demand for information systems that automatically analyze textual data, since many organizations, both private and public, need to process large amounts of such data as part of their daily routine, an activity that cannot be performed by means of human work only. One of the answers to this need is text classification (TC), the task of automatically labelling textual documents from a domain D with thematic categories from a predefined set C. Modern text classification systems have reached high efficiency standards, but cannot always guarantee the labelling accuracy that applications demand. When the level of accuracy that can be obtained is insufficient, one may revert to processes in which classification is performed via a combination of automated activity and human effort. One such process is semi-automated text classification (SATC), which we define as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected such increase is maximized. An obvious strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this dissertation we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose new effectiveness measures for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a ranked list generated by a given ranking method.
We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measures, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error. We therefore explore the task of SATC and the potential of our methods in multiple text classification contexts. This dissertation is, to the best of our knowledge, the first to systematically address the task of semi-automated text classification.
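
The contrast between the two ranking strategies described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's actual method: it assumes a single binary category, F1 as the effectiveness measure, and that estimates of the true positive, false positive and false negative counts (`tp`, `fp`, `fn`) are already available. The names `Doc`, `rank_by_low_confidence` and `rank_by_expected_gain` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    confidence: float         # assumed estimate of P(assigned label is correct)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 computed from contingency-table counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def rank_by_low_confidence(docs):
    """Baseline: documents labelled with the lowest confidence come first."""
    return sorted(docs, key=lambda d: d.confidence)


def rank_by_expected_gain(docs, tp: int, fp: int, fn: int):
    """Rank by error probability times the F1 improvement from a correction.

    Assumption: tp, fp, fn are estimates for the automatically labelled set.
    """
    base = f1(tp, fp, fn)
    # Correcting a false positive turns it into a true negative.
    gain_fix_fp = f1(tp, fp - 1, fn) - base if fp > 0 else 0.0
    # Correcting a false negative turns it into a true positive.
    gain_fix_fn = f1(tp + 1, fp, fn - 1) - base if fn > 0 else 0.0

    def expected_gain(d: Doc) -> float:
        gain = gain_fix_fp if d.predicted_positive else gain_fix_fn
        return (1.0 - d.confidence) * gain

    return sorted(docs, key=expected_gain, reverse=True)


docs = [
    Doc("a", predicted_positive=True, confidence=0.60),
    Doc("b", predicted_positive=False, confidence=0.70),
    Doc("c", predicted_positive=True, confidence=0.90),
]
baseline_order = [d.doc_id for d in rank_by_low_confidence(docs)]
utility_order = [d.doc_id for d in rank_by_expected_gain(docs, tp=10, fp=10, fn=1)]
```

With these illustrative counts, correcting a false negative improves F1 roughly twice as much as correcting a false positive, so document "b" overtakes the less confidently labelled "a" in the gain-based ranking. This is the kind of case in which ranking by confidence alone can be suboptimal.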

See at: CNR ExploRA

2012 Conference article Open Access

Blog distillation via sentiment-sensitive link analysis
Berardi G., Esuli A., Sebastiani F., Silvestri F.
In this paper we approach blog distillation by adding a link analysis phase to the standard retrieval-by-topicality phase, where we also check whether a given hyperlink is a citation with a positive or a negative nature. This allows us to test the hypothesis that distinguishing approval from disapproval brings about benefits in blog distillation.
DOI: 10.1007/978-3-642-31178-9_26