
2010 Conference article Restricted

PP-Index: using permutation prefixes for efficient and scalable similarity search (Extended Abstract)
Esuli A
The Permutation Prefix Index (PP-Index) is a data structure that supports efficient approximate similarity search. It is a permutation-based index, which is based on representing any indexed object with "its view of the surrounding world", i.e., a list of the elements of a set of reference objects sorted by their distance from the indexed object. In its basic formulation, the PP-Index is biased toward efficiency. We show how the effectiveness can reach optimal levels just by adopting two "boosting" strategies: multiple index search and multiple query search, which both have nice parallelization properties. We study both the efficiency and the effectiveness properties of the PP-Index, experimenting with collections of sizes up to one hundred million objects, represented in a very high-dimensional similarity space.

See at: CNR IRIS Restricted | CNR IRIS
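
The permutation-prefix idea described in the abstract above can be illustrated with a minimal Python sketch. The function names (`permutation_prefix`, `build_index`, `search`), the flat dictionary used in place of a prefix tree, and the fallback to shorter prefixes when too few candidates are found are all illustrative assumptions, not the PP-Index implementation.

```python
import math
from collections import defaultdict


def permutation_prefix(obj, references, distance, prefix_len):
    """Return the indexes of the prefix_len reference objects closest to obj."""
    order = sorted(range(len(references)),
                   key=lambda i: (distance(obj, references[i]), i))
    return tuple(order[:prefix_len])


def build_index(objects, references, distance, prefix_len):
    """Group objects by permutation prefix (a dict stands in for a prefix tree)."""
    index = defaultdict(list)
    for obj in objects:
        key = permutation_prefix(obj, references, distance, prefix_len)
        index[key].append(obj)
    return index


def search(query, index, references, distance, prefix_len, k):
    """Approximate k-NN: gather objects sharing the longest possible prefix
    with the query, then rank those candidates by their true distance."""
    prefix = permutation_prefix(query, references, distance, prefix_len)
    for length in range(prefix_len, -1, -1):
        candidates = [obj for key, objs in index.items()
                      if key[:length] == prefix[:length] for obj in objs]
        if len(candidates) >= k:
            break  # enough candidates; stop shortening the prefix
    return sorted(candidates, key=lambda obj: distance(query, obj))[:k]


# Hypothetical 2-D data; real collections live in high-dimensional spaces.
references = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
objects = [(1.0, 0.0), (9.0, 1.0), (1.0, 9.0), (2.0, 1.0)]
index = build_index(objects, references, math.dist, prefix_len=2)
nearest = search((1.2, 0.1), index, references, math.dist, prefix_len=2, k=2)
```

Only the candidates that share a prefix with the query are compared using the true distance, which is where the efficiency bias mentioned in the abstract comes from; objects with a different view of the reference set are never examined, so results are approximate.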

2009 Conference article Open Access

PP-Index: using permutation prefixes for efficient and scalable approximate similarity search
Esuli A
We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search. The PP-Index belongs to the family of the permutation-based indexes, which are based on representing any indexed object with "its view of the surrounding world", i.e., a list of the elements of a set of reference objects sorted by their distance from the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency, treating effectiveness as a secondary aspect. We show how the effectiveness can easily reach optimal levels just by adopting two "boosting" strategies: multiple index search and multiple query search. Such strategies have nice parallelization properties that allow the search process to be distributed in order to keep high efficiency levels. We study both the efficiency and the effectiveness properties of the PP-Index. We report experiments on collections of sizes up to one hundred million images, represented in a very high-dimensional similarity space based on the combination of five MPEG-7 visual descriptors.
Source: CEUR WORKSHOP PROCEEDINGS, pp. 17-24. Boston, USA, 23 July 2009

See at: CNR IRIS Open Access | CNR IRIS Restricted

2008 Contribution to conference Open Access

Annotating WordNet synsets by sentiment-related information: issues and potential solutions
Esuli A.
Many works in sentiment analysis have focused on the problem of subjectivity detection, at various levels: from terms (or term senses), as in the automatic annotation of lexical resources, to fragments of text, as in opinion extraction, to entire documents, as in sentiment classification. At all these levels, the two dimensions that have been investigated most actively are polarity ("positive/negative") and force ("strong/mild/weak" expression of positivity or negativity). In the SentiWordNet project we made a first attempt at automatically adding information concerning these two dimensions to WordNet. In more recent research we have explored a further dimension of subjective language, i.e., attitude type, which distinguishes, for example, between moral appreciation ("honest") and aesthetic appreciation ("beautiful"). We think that endowing WordNet with annotations pertaining to these three dimensions (polarity + force + attitude type) would make WordNet an even more valuable resource for sentiment analysis. Adding this information to WordNet would not be an easy task, for at least two reasons. One is the sheer size of the resource; this might call, at least initially, for a semi-automatic approach, along the lines of the SentiWordNet or "WordNet Evocation" projects. The other is the choice of the taxonomy of sentiment types, which needs to compromise between conceptual subtlety and real-world applicability. For our recent work on attitude type we have adopted a taxonomy of attitude types originally defined in Martin and White's Appraisal Theory; however, other potentially interesting alternatives have been developed, e.g. in the EU-funded Simple project. We also conjecture that even this three-dimensional specification of the sentiment-related properties of synsets might not be sufficient for application purposes, at least for some parts of speech.
For example, it is conceivable that a verb's polarity should not be characterized as positive or negative tout court, but that a distinction should be made as to which semantic role of the verb such polarity is bestowed upon. For instance, the verbs "torture" and "discard" both have a negative slant; however, while "torture" casts a negative character on the subject of the action (and on the action itself), "discard" typically casts a negative character on the direct object of the action. Such distinctions should be accounted for in a lexicon, especially in order to make it useful for opinion extraction applications.
Source: Fourth Global WordNet Conference, Szeged, Hungary, 22-25 January 2008

See at: ISTI Repository Open Access | www.inf.u-szeged.hu | CNR ExploRA

2008 Contribution to journal Open Access

See at: CNR IRIS Open Access | www.sigir.org | CNR IRIS Restricted

2009 Conference article Restricted

MiPai: using the PP-Index to build an efficient and scalable similarity search system
Esuli A
MiPai is an image search system that provides visual similarity search and text-based search functionalities. The similarity search functionality is implemented by means of the Permutation Prefix Index (PP-Index), a novel data structure for approximate similarity search. The text-based search functionality is based on a traditional inverted list index data structure. MiPai also provides a combined visual similarity/text search function.
DOI: 10.1109/sisap.2009.14

See at: doi.org Restricted | CNR IRIS | ieeexplore.ieee.org | CNR IRIS

2010 Software Metadata Only Access

MP-Boost++
Esuli A
MPBoost++ is a C++ implementation of MPBoost, a variant of the multi-label AdaBoost.MH algorithm that improves its efficacy and efficiency by performing a multiple pivot selection at each boosting iteration.

See at: CNR IRIS Restricted | www.esuli.it

2008 Other Open Access

Automatic generation of lexical resources for opinion mining: models, algorithms and applications
Esuli A
Opinion mining is a recent discipline at the crossroads of Information Retrieval and of Computational Linguistics which is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users' opinions about products or about political candidates as expressed in online forums, to customer relationship management. Functional to the extraction of opinions from text is the determination of the relevant entities of the language that are used to express opinions, and their opinion-related properties. An example is determining that the term beautiful casts a positive connotation on its subject. In this thesis we investigate the automatic recognition of opinion-related properties of terms. This results in building opinion-related lexical resources, which can be used in opinion mining applications. We start from the (relatively) simple problem of determining the orientation of subjective terms. We propose an original semi-supervised term classification model that is based on the quantitative analysis of the glosses of such terms, i.e. the definitions that these terms are given in on-line dictionaries. This method outperforms all known methods when tested on the recognized standard benchmarks for this task. We show how our method is capable of producing good results on more complex tasks, such as discriminating subjective terms (e.g., good) from objective ones (e.g., green), or classifying terms on a fine-grained attitude taxonomy. We then propose a relevant refinement of the task, i.e., distinguishing the opinion-related properties of distinct term senses. We present SentiWordNet, a novel high-quality, high-coverage lexical resource, where each one of the 115,424 senses contained in WordNet has been automatically evaluated on the three dimensions of positivity, negativity, and objectivity.
We also propose an original and effective use of random-walk models to rank term senses by their positivity or negativity. The random-walk algorithms we present also have great application potential outside the opinion mining area, for example in word sense disambiguation tasks. A result of this experience is the generation of an improved version of SentiWordNet. We finally evaluate and compare the various versions of SentiWordNet we present here with other opinion-related lexical resources well known in the literature, experimenting with their use in an Opinion Extraction application. We show that the use of SentiWordNet produces a significant improvement with respect to the baseline system, which does not use any specialized lexical resource, and also with respect to the use of other opinion-related lexical resources.

See at: etd.adm.unipi.it Open Access | CNR IRIS | CNR IRIS Restricted

2010 Other Restricted

Use of permutation prefixes for efficient and scalable approximate similarity search
Esuli A
We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search. The PP-Index belongs to the family of the permutation-based indexes, which are based on representing any indexed object with "its view of the surrounding world", i.e., a list of the elements of a set of reference objects sorted by their distance from the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency. We show how the effectiveness can easily reach optimal levels just by adopting two "boosting" strategies: multiple index search and multiple query search, which both have nice parallelization properties. We study both the efficiency and the effectiveness properties of the PP-Index, experimenting with collections of sizes up to one hundred million objects, represented in a very high-dimensional similarity space.

See at: CNR IRIS Restricted | CNR IRIS

2012 Journal article Open Access

Use of permutation prefixes for efficient and scalable approximate similarity search
Esuli A
We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search (this work is a revised and extended version of Esuli (2009b), presented at the 2009 LSDS-IR Workshop, held in Boston). The PP-Index belongs to the family of the permutation-based indexes, which are based on representing any indexed object with "its view of the surrounding world", i.e., a list of the elements of a set of reference objects sorted by their distance from the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency. We show how the effectiveness can easily reach optimal levels just by adopting two "boosting" strategies: multiple index search and multiple query search, which both have nice parallelization properties. We study both the efficiency and the effectiveness properties of the PP-Index, experimenting with collections of sizes up to one hundred million objects, represented in a very high-dimensional similarity space.
Source: INFORMATION PROCESSING & MANAGEMENT, vol. 48 (issue 5), pp. 889-902
DOI: 10.1016/j.ipm.2010.11.011

2013 Other Open Access

The User Feedback on SentiWordNet
Esuli A
With the release of SentiWordNet 3.0 the related Web interface has been restyled and improved in order to allow users to submit feedback on the SentiWordNet entries, in the form of the suggestion of alternative triplets of values for an entry. This paper reports on the release of the user feedback collected so far and on the plans for the future.

See at: CNR IRIS Open Access | ISTI Repository | swn.isti.cnr.it | CNR IRIS Restricted

2014 Software Metadata Only Access

MiPai
Esuli A
This is the repository for the MiPai project, which provides a reference implementation of the Permutation Prefix Index (PP-Index), along with index and search example programs for various data types.

See at: CNR IRIS Restricted

2014 Software Metadata Only Access

TreeBoost
Esuli A
TreeBoost is a Java implementation of TreeBoost.MH, a variant of the multi-label AdaBoost.MH algorithm that exploits the hierarchical relations among categories to improve both the efficacy and efficiency of the classifier.

See at: CNR IRIS Restricted

2012 Conference article Open Access

ISTI@ TREC Microblog track 2012: real-time filtering through supervised learning
Berardi G, Esuli A, Marcheggiani D
Our approach to the microblog filtering task is based on learning a relevance classifier from an initial training set of relevant and non-relevant tweets, generated by using a simple retrieval method. The classifier is then retrained using the (simulated) user feedback collected during the training process, in order to improve its accuracy as the filtering process goes on. In the official runs the system scored low effectiveness values, suffering from a strong imbalance toward recall.

See at: CNR IRIS Open Access | trec.nist.gov | CNR IRIS Restricted

2015 Conference article Restricted

Word embeddings go to Italy: A comparison of models and training datasets
Berardi G, Esuli A, Marcheggiani D
In this paper we present some preliminary results on the generation of word embeddings for the Italian language. We compare two popular word representation models, word2vec and GloVe, and train them on two datasets with different stylistic properties. We test the generated word embeddings on a word analogy test derived from the one originally proposed for word2vec, adapted to capture some of the linguistic aspects that are specific to Italian. Results show that the tested models are able to create syntactically and semantically meaningful word embeddings despite the higher morphological complexity of Italian with respect to English. Moreover, we have found that the stylistic properties of the training dataset play a relevant role in the type of information captured by the produced vectors.
Source: CEUR WORKSHOP PROCEEDINGS. Cagliari, 25-26/05/2015

See at: ceur-ws.org Restricted | CNR IRIS | CNR IRIS
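
A word analogy test of the kind mentioned in the abstract above is commonly solved with vector arithmetic: for "a is to b as c is to ?", the answer is the word whose vector is closest to b - a + c. The sketch below assumes this standard formulation; the function names and the two-dimensional toy vectors are hypothetical and hand-picked for illustration, not taken from the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the nearest neighbour of b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "uomo": [1.0, 0.0],
    "donna": [1.0, 1.0],
    "re": [3.0, 0.0],
    "regina": [3.0, 1.0],
    "mela": [0.0, 3.0],
}
answer = solve_analogy("uomo", "donna", "re", toy_vectors)
```

With real embeddings the candidate set is the whole vocabulary, and accuracy is the fraction of analogy questions whose top-ranked word matches the expected answer.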

2015 Conference article Open Access

On the impact of Entity Linking in microblog real-time filtering
Berardi G, Ceccarelli D, Esuli A, Marcheggiani D
Microblogging is a model of content sharing in which the temporal locality of posts with respect to important events, either of foreseeable or unforeseeable nature, makes applications of real-time filtering of great practical interest. We propose the use of Entity Linking (EL) in order to improve the retrieval effectiveness, by enriching the representation of microblog posts and filtering queries. EL is the process of recognizing in an unstructured text the mention of relevant entities described in a knowledge base. EL of short pieces of text is a difficult task, but it is also a scenario in which the information EL adds to the text can have a substantial impact on the retrieval process. We implement a state-of-the-art filtering method, based on the best systems from the TREC Microblog track real-time adhoc retrieval and filtering tasks, and extend it with a Wikipedia-based EL method. Results show that the use of EL significantly improves over non-EL based versions of the filtering methods.
DOI: 10.1145/2695664.2695761
DOI: 10.48550/arxiv.1611.03350

2016 Conference article Open Access

ISTI-CNR at SemEval-2016 Task 4: quantification on an ordinal scale
Esuli A
This paper details the participation of ISTI-CNR in Task 4 of SemEval 2016. Among the five subtasks, special attention has been paid to the five-point scale quantification subtask. The quantification method we propose is based on the observation that a standard document-by-document regression method usually has a bias towards assigning high-prevalence labels. Our method models such bias with a linear model, in order to compensate for it and to produce the quantification estimates.
DOI: 10.18653/v1/s16-1011

See at: aclanthology.org Open Access | CNR IRIS | www.aclweb.org | doi.org Restricted | CNR IRIS
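
One possible reading of the linear bias correction described in the abstract above is sketched below in Python. It is an assumption-laden illustration, not the procedure from the paper: for each label, a line is fitted that maps the prevalence obtained by counting per-document predictions to the true prevalence on samples where the latter is known, and the fitted lines are then applied to new estimates. The function names and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def correct_prevalences(raw, models):
    """Apply one fitted line per label, clip negatives to zero, and
    renormalize so that the corrected prevalences sum to one.
    Assumes at least one corrected value is positive."""
    adjusted = [max(0.0, slope * p + intercept)
                for p, (slope, intercept) in zip(raw, models)]
    total = sum(adjusted)
    return [p / total for p in adjusted]


# Hypothetical (counted prevalence, true prevalence) pairs for two labels:
# the rarer label is underestimated, the more frequent one overestimated.
models = [
    fit_line([0.2, 0.4, 0.6], [0.3, 0.5, 0.7]),
    fit_line([0.8, 0.6, 0.4], [0.7, 0.5, 0.3]),
]
corrected = correct_prevalences([0.3, 0.7], models)
```

The five-point subtask would use five such lines, one per point of the scale.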

2021 Software Open Access

TwiGet
Esuli A
TwiGet is a Python package for managing queries on the filtered stream of the Twitter API and collecting tweets from it. It can be used as a command line tool (twiget-cli) or as a Python class (TwiGet).
Project(s): AI4Media via OpenAIRE

See at: github.com Open Access | CNR IRIS | CNR IRIS Restricted

2022 Journal article Open Access

ICS: total freedom in manual text classification supported by unobtrusive machine learning
Esuli A
We present the Interactive Classification System (ICS), a web-based application that supports the activity of manual text classification. The application uses machine learning to continuously fit automatic classification models that are in turn used to actively support its users with classification suggestions. The key requirement we have established for the development of ICS is to give its users total freedom of action: they can at any time modify any classification schema and any label assignment, possibly reusing any relevant information from previous activities. We investigate how this requirement challenges the typical scenarios faced in machine learning research, which instead give no active role to humans or place them into very constrained roles, e.g., on-demand labeling in active learning processes, and always assume some degree of batch processing of data. We satisfy the "total freedom" requirement by designing an unobtrusive machine learning model, in which the machine learning component of ICS acts as an unobtrusive observer of the users that never interrupts them, continuously adapts and updates its models in response to their actions, and is always available to perform automatic classifications. Our efficient implementation of the unobtrusive machine learning model combines various machine learning methods and technologies, such as hash-based feature mapping, random indexing, online learning, active learning, and asynchronous processing.
Source: IEEE ACCESS, vol. 10, pp. 64741-64760
DOI: 10.1109/access.2022.3184009
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE
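
Two of the techniques named in the ICS abstract above, hash-based feature mapping and online learning, can be combined as in the following minimal Python sketch. It is a generic illustration under stated assumptions, not the ICS code: the `hash_features` function, the `OnlinePerceptron` class, the choice of CRC32 as the hash, and the example texts are all hypothetical.

```python
import zlib


def hash_features(text, n_buckets=2 ** 18):
    """Map each token to a bucket through a stable hash, so that no
    vocabulary has to be stored or rebuilt when new documents arrive."""
    features = {}
    for token in text.lower().split():
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        features[bucket] = features.get(bucket, 0.0) + 1.0
    return features


class OnlinePerceptron:
    """Binary classifier updated one labeled document at a time."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(self.weights.get(i, 0.0) * v
                               for i, v in features.items())

    def predict(self, features):
        return self.score(features) > 0.0

    def update(self, features, label):
        """Adjust the weights only when the current prediction is wrong."""
        if self.predict(features) != label:
            sign = 1.0 if label else -1.0
            for i, v in features.items():
                self.weights[i] = self.weights.get(i, 0.0) + sign * v
            self.bias += sign


# Each user action is treated as one labeled example in a stream.
model = OnlinePerceptron()
stream = [("great useful paper", True), ("spam offer click", False),
          ("useful great survey", True), ("click spam now", False)]
for text, label in stream:
    model.update(hash_features(text), label)
suggestion = model.predict(hash_features("great paper"))
```

Because each update touches only the buckets of a single document, the model can follow label changes as they happen instead of waiting for a batch retraining step.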

2023 Journal article Open Access

The interactive classification system
Esuli A
ISTI-CNR released a new web application for the manual and automatic classification of documents. Human annotators collaboratively label documents with machine learning algorithms that learn from annotators' actions and support the activity with classification suggestions. The platform supports the early stages of document labelling, with the ability to change the classification scheme on the go and to reuse and adapt existing classifiers.
Source: ERCIM NEWS, pp. 34-35
Project(s): AI4Media via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: ercim-news.ercim.eu Open Access | CNR IRIS | ISTI Repository | CNR IRIS Restricted

2022 Software Open Access

Interactive Classification System
Esuli A.
The Interactive Classification System (ICS) is a web-based application that supports the activity of manual text classification, i.e., labeling documents according to their content. The system is designed to give total freedom of action to its users: they can at any time modify any classification schema and any label assignment, possibly reusing any relevant information from previous activities. The application uses machine learning to actively support its users with classification suggestions. The machine learning component of the system is an unobtrusive observer of the users' activities, never interrupting them, constantly adapting and updating its models in response to their actions, and always available to perform automatic classifications.
DOI: 10.5281/zenodo.6586244
Project(s): AI4Media via OpenAIRE, ARIADNEplus via OpenAIRE, SoBigData-PlusPlus via OpenAIRE

See at: github.com Open Access | CNR IRIS | CNR IRIS Restricted