2007 Conference article Restricted

Multilingual search for cultural heritage archives via combining multiple translation resources
Jones G, Zhang Y, Newman E, Fantino F, Debole F
The linguistic features of material in Cultural Heritage (CH) archives may be in various languages requiring a facility for effective multilingual search. The specialised language often associated with CH content introduces problems for automatic translation to support search applications. The MultiMatch project is focused on enabling users to interact with CH content across different media types and languages. We present results from a MultiMatch study exploring various translation techniques for the CH domain. Our experiments examine translation techniques for the English language CLEF 2006 Cross-Language Speech Retrieval (CL-SR) task using Spanish, French and German queries. Results compare effectiveness of our query translation against a monolingual baseline and show improvement when combining a domain-specific translation lexicon with a standard machine translation system.

See at: CNR IRIS Restricted | CNR IRIS

2008 Conference article Open Access

The MultiMatch project: multilingual/multimedia access to cultural heritage on the web
Marlow J, Clough P, Ireson N, Cigarrán Recuero J, Artiles J, Debole F
The EU-funded MultiMatch project aims to overcome language barriers, and media and distribution problems currently affecting access to on-line cultural heritage material. Partners are developing a vertical search engine able to harvest heterogeneous information from distributed sources and present it in a synthesized manner. To design such a system, user requirements were initially gathered and then translated into specific design features to ensure that the search engine developed was consistent with user needs. This paper presents these user requirements, the initial design of the MultiMatch system, and technical discussion of the system architecture and components used to turn these design implications into a working interactive prototype. Following this, we discuss user evaluation and present results from an initial user study. These are being used, in addition to other input, to drive the functionality and design of the final system.

See at: CNR IRIS Open Access | www.archimuse.com | CNR IRIS Restricted

2019 Other Metadata Only Access

SEBD 2019 web site
Debole F
Website of the twenty-seventh edition of the Italian Symposium on Advanced Database Systems (SEBD - Sistemi Evoluti per Basi di Dati)

See at: CNR IRIS Restricted | sebd2019.isti.cnr.it

2006 Other Metadata Only Access

An Efficient XML Search Engine Supporting Approximate Search
Franca Debole
In this thesis we discuss the design and the realization of a novel XML search engine (XMLSE), which merges the exact-match search and the approximate match search paradigms. Our research focuses on developing innovative techniques, in terms of indexing structures and query processing methods, to efficiently and effectively support, beyond the exact-match search, the structure search and the similarity access also when huge XML data repositories are involved. This includes the construction of access methods especially conceived for XML, the development of efficient algorithms for the evaluation of queries and the extension of the standard query language XQuery. As for traditional databases, we have realized indexes that are efficient, i.e. the time needed to process a query is as short as possible, and complete, i.e. all the objects satisfying the query appear in the result set. Most important, we have conceived a scalable search engine suitable to important multimedia applications. The essential steps of this research activity are: study and realization of special indexes for efficient XML query execution; extension of the XQuery syntax to support the approximate search; realization of a specific query processor for the new indexes; study and realization of some query optimization techniques.

See at: CNR IRIS Restricted

2005 Journal article Restricted

An Analysis of the relative hardness of reuters-21578 subsets
Debole F, Sebastiani F
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have 'carved' different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have, or will be, tested on these different subsets. Source: JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, vol. 56 (issue 6), pp. 584-596

See at: CNR IRIS Restricted | CNR IRIS | onlinelibrary.wiley.com

2006 Conference article Open Access

The DELOS testbed for choosing a digital preservation strategy
Strodl S, Rauber A, Rauch C, Hofman H, Debole F, Amato G
With the rapid technological changes, digital preservation, i.e. the endeavor to provide long-term access to digital objects, is turning into one of the most pressing challenges to ensure the survival of our digital artefacts. A set of strategies has been proposed, with a range of tools supporting parts of digital preservation actions. Yet, with requirements on which strategy to follow and which tools to employ being different for each setting, depending e.g. on object characteristics or institutional requirements, deciding which solution to implement has turned into a crucial decision. This paper presents the DELOS Digital Preservation Testbed. It provides an approach to make informed and accountable decisions on which solution to implement in order to preserve digital objects for a given purpose. It is based on Utility Analysis to evaluate the performance of various solutions against well-defined objectives, and facilitates repeatable experiments in a standardized laboratory setting.

See at: CNR IRIS Open Access | link.springer.com | ISTI Repository | CNR IRIS Restricted | CNR IRIS

2007 Conference article Restricted

Evaluating preservation strategies for electronic theses and dissertations
Strodl S, Becker C, Neumayer R, Rauber A, Nicchiarelli Bettelli E, Kaiser M, Hofman H, Neuroth H, Strathmann S, Debole F, Amato G
Digital preservation has turned into a pressing challenge for institutions having the obligation to preserve digital objects over years. A range of tools exist today to support the variety of preservation strategies such as migration or emulation. Yet, different preservation requirements across institutions and settings make the decision on which solution to implement very difficult. The Austrian National Library will have to preserve electronic theses and dissertations provided as PDF files and is thus investigating potential preservation solutions. The DELOS Digital Preservation Testbed is used to evaluate various alternatives with respect to specific requirements. It provides an approach to make informed and accountable decisions on which solution to implement in order to preserve digital objects for a given purpose. We analyse the performance of various preservation strategies with respect to the specified requirements for the preservation of master theses and present the results. Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 4877, pp. 238-247

See at: CNR IRIS Restricted | CNR IRIS | www.springerlink.com

2004 Journal article Open Access

Supervised term weighting for automated text categorization
Debole F, Sebastiani F
Researchers from ISTI-CNR, Pisa, aim at producing better text classification methods through the use of supervised learning techniques in the generation of the internal representations of the texts. Source: ERCIM NEWS, vol. 56, pp. 55-56

See at: CNR IRIS Open Access | www.ercim.org | CNR IRIS Restricted

2003 Conference article Unknown

Supervised term weighting for automated text categorization
Debole F., Sebastiani F.
The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from training data should also affect phase (ii), i.e. that information on the membership of training documents to categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. We present experimental results obtained on the standard Reuters-21578 benchmark with one classifier learning method (support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting. Source: SAC-03, 18th ACM Symposium on Applied Computing, pp. 784–788, Melbourne, US, March 9-12, 2003

See at: CNR ExploRA

2004 Conference article Open Access

An analysis of the relative hardness of reuters-21578 subsets
Debole F, Sebastiani F
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, since they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last ten years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have 'carved' different subsets out of this collection, and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this paper we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative difficulty of these subsets, thus establishing an indirect means for comparing TC systems that have, or will be, tested on these different subsets.

See at: CNR IRIS Open Access | ISTI Repository | www.lrec-conf.org | CNR IRIS Restricted

2004 Contribution to book Restricted

Supervised term weighting for automated text categorization
Debole F, Sebastiani F
The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from the training data should also affect phase (ii), i.e. that information on the membership of training documents to categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example of STW, we propose a number of supervised variants of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. The use of STW allows the terms that are distributed most differently in the positive and negative examples of the categories of interest to be weighted highest. We present experimental results obtained on the standard Reuters-21578 benchmark with three classifier learning methods (Rocchio, k-NN, and support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.

See at: CNR IRIS Restricted | CNR IRIS | www.isti.cnr.it
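
The supervised term weighting idea described in the abstract above (replace the idf factor of tfidf with the term selection function) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single binary category, chi-square as the term selection function, a log-scaled term frequency, and cosine normalization. The function names `chi_square` and `stw_vector` are hypothetical.

```python
import math
from collections import Counter


def chi_square(train_docs, labels, term):
    """Chi-square statistic between presence of `term` and the positive category.

    `train_docs` is a list of token sets; `labels` is a parallel list of booleans.
    """
    n = len(train_docs)
    a = sum(1 for doc, pos in zip(train_docs, labels) if term in doc and pos)
    b = sum(1 for doc, pos in zip(train_docs, labels) if term in doc and not pos)
    c = sum(1 for doc, pos in zip(train_docs, labels) if term not in doc and pos)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def stw_vector(doc_tokens, train_docs, labels):
    """Weight each term of a document by tf * chi-square instead of tf * idf."""
    tf = Counter(doc_tokens)
    # Assumption: log-scaled term frequency, one common tf variant.
    weights = {
        term: (1.0 + math.log(count)) * chi_square(train_docs, labels, term)
        for term, count in tf.items()
    }
    # Assumption: cosine normalization so documents of different length compare.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


# Toy training set (invented for illustration): two positive, two negative documents.
train = [{"wheat", "price"}, {"wheat", "farm"}, {"oil", "price"}, {"oil", "rig"}]
labels = [True, True, False, False]
vec = stw_vector(["wheat", "price"], train, labels)
```

In this toy data "wheat" occurs only in positive documents while "price" is spread evenly across both classes, so the supervised weight concentrates on "wheat". Plain idf would weight the two terms equally, since each occurs in two of the four training documents.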

2009 Conference article Open Access

Searching and browsing film archives. The European Film Gateway Approach
Debole F, Savino P, Eckes G
Metadata describing items in European film archives are very different so that it is difficult to have a uniform access to videos coming from many different archives. These and other relevant issues regarding interoperability among different archives are addressed within the EFG (European Film Gateway) Best Practices Network funded by the European Commission, which aims at enabling Europe's Film Archives and cinémathèques to contribute their rich and valuable collections to the EUROPEANA digital library. Project(s): EFG1914

See at: CNR IRIS Open Access | CNR IRIS Restricted

2010 Software Metadata Only Access

Metadata Editor
Debole F, Savino P, Caruso E M
The Metadata Editor (ME) is a tool providing an easy way to create XML metadata by using a simple web interface. The ME is fully customizable in terms of the metadata schema used, the DBMS adopted to store metadata instances and to execute content-based searches, and the end user interface, since the forms used for the creation of new metadata are configurable just by editing an XML configuration file. The ME allows users to create, search and edit metadata records. It also supports the use of controlled vocabularies for the different metadata elements. Currently, the ME has been used in an EU-funded project and in an Italian regional project.

See at: CNR IRIS Restricted | multimatch01.isti.cnr.it

2003 Other Open Access

An Analysis of the Relative Hardness of Reuters-21578 Subsets
Debole F, Sebastiani F
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, since they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last ten years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have 'carved' different subsets out of this collection, and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this paper we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have, or will be, tested on these different subsets.

See at: CNR IRIS Open Access | ISTI Repository | CNR IRIS Restricted

2002 Other Open Access

Supervised term weighting for automated text categorization
Debole F, Sebastiani F
The construction of a text classifier usually involves (i) a phase of term selection, in which the most relevant terms for the classification task are identified, (ii) a phase of term weighting, in which document weights for the selected terms are computed, and (iii) a phase of classifier learning, in which a classifier is generated from the weighted representations of the training documents. This process involves an activity of supervised learning, in which information on the membership of training documents in categories is used. Traditionally, supervised learning enters only phases (i) and (iii). In this paper we propose instead that learning from the training data should also affect phase (ii), i.e. that information on the membership of training documents to categories be used to determine term weights. We call this idea supervised term weighting (STW). As an example of STW, we propose a number of "supervised variants" of tfidf weighting, obtained by replacing the idf function with the function that has been used in phase (i) for term selection. The use of STW allows the terms that are distributed most differently in the positive and negative examples of the categories of interest to be weighted highest. We present experimental results obtained on the standard Reuters-21578 benchmark with three classifier learning methods (Rocchio, k-NN, and support vector machines), three term selection functions (information gain, chi-square, and gain ratio), and both local and global term selection and weighting.

See at: CNR IRIS Open Access | CNR IRIS Restricted

2005 Conference article Open Access

A native XML database supporting approximate match search
Amato G, Debole F
XML is becoming the standard representation format for metadata. Metadata for multimedia documents, as for instance MPEG-7, require approximate match search functionalities to be supported in addition to exact match search. As an example, consider image search performed by using MPEG-7 visual descriptors. It does not make sense to search for images that are exactly equal to a query image. Rather, images similar to a query image are more likely to be searched. We present the architecture of an XML search engine where special techniques are used to integrate approximate and exact match search functionalities.

See at: CNR IRIS Open Access | link.springer.com | ISTI Repository | CNR IRIS Restricted | CNR IRIS

2017 Journal article Open Access

Mapping the ARIADNE catalogue data model to CIDOC CRM: Bridging resource discovery and item-level access
Aloia N, Debole F, Felicetti A, Galluccio I, Theodoridou M
ARIADNE is a European project aiming to integrate existing archaeological research infrastructures, services and distributed datasets, and to develop new technologies and tools to improve archaeological research methodology. The ARIADNE registry contains information about resources available among the various partners of the project and the metadata repository, which contains item level information of these resources. In order to provide an advanced discovery mechanism combining both item level and registry level information we propose a mapping from the ARIADNE Catalog Data Model, the model of the ARIADNE registry, to the CIDOC CRM, the underlying model of the metadata repository. The paper will present the requirements that led to the choice of different models for the registry and the metadata repository, will elaborate on the mapping, and will propose an integrated interface for information discovery and presentation. Source: SCIRES-IT, vol. 7 (issue 1), pp. 1-8

See at: CNR IRIS Open Access | ISTI Repository | www.sciresit.it | CNR IRIS Restricted

2024 Conference article Open Access

Italian word embeddings for the medical domain
Cardillo F. A., Debole F.
Neural word embeddings have proven valuable in the development of medical applications. However, for the Italian language, there are no publicly available corpora, embeddings, or evaluation resources tailored to this domain. In this paper, we introduce an Italian corpus for the medical domain, that includes texts from Wikipedia, medical journals, drug leaflets, and specialized websites. Using this corpus, we generate neural word embeddings from scratch. These embeddings are then evaluated using standard evaluation resources, that we translated into Italian exploiting the concept graph in the UMLS Metathesaurus. Despite the relatively small size of the corpus, our experimental results indicate that the new embeddings correlate well with human judgments regarding the similarity and the relatedness of medical concepts. Moreover, these medical-specific embeddings outperform a baseline model trained on the full Wikipedia corpus, which includes the medical pages we used. We believe that our embeddings and the newly introduced textual resources will foster further advancements in the field of Italian medical Natural Language Processing. Project(s): DeepHealth via OpenAIRE, TAILOR, STARWARS via OpenAIRE

See at: aclanthology.org Open Access | CNR IRIS | CNR IRIS Restricted

2005 Journal article Restricted

A native XML database supporting approximate match search
Amato G, Debole F
The digital library field is recently broadening its scope of applicability and it is also continuously adapting to the frequent changes occurring in the internet society. Accordingly, digital libraries are slightly moving from a controlled environment accessible only to professionals and domain-experts, to environments accessible to casual users that want to exploit the potentialities offered by the digital library technology. These new trends require, for instance, new search paradigms to be offered, new media content to be managed, and new description extraction techniques to be used. Building digital library applications, and effectively adapting them to new emerging trends, requires developing a platform that offers standard and powerful building blocks to support application developers. In this paper we discuss our experience of using MILOS, a multimedia content management system oriented to the construction of digital libraries, to build a demanding application dedicated to non-professional users. Specifically, we discuss the design and implementation of an on-line photo album (PhotoBook), which is a digital library application that allows people to manage their own photos, to share them with friends, and to make them publicly available and searchable. PhotoBook uses a complex internal metadata schema (MPEG-7) and allows users to simply express complex queries (combining similarity search and fielded search), enabling them to retrieve material of interest even if metadata are imprecise or missing.

See at: CNR IRIS Restricted | CNR IRIS | www.springerlink.com

2004 Conference article Open Access

A signature-based Approach for efficient relationship search on XML data collections
Amato G, Debole F, Rabitti F, Savino P, Zezula P
We study the problem of finding relevant relationships among user defined nodes of XML documents. We define a language that determines the nodes as results of XPath expressions. The expressions are structured in a conjunctive normal form and the relationships among nodes qualifying in different conjuncts are determined as tree twigs of the searched XML documents. The query execution is supported by an auxiliary index structure called the tree signature. We have implemented a prototype system that supports this kind of searching and we have conducted numerous experiments on XML data collections. We have found the query execution very efficient, thus suitable for on-line processing. We also demonstrate the superiority of our system with respect to a previous, rather restricted, approach of finding the lowest common ancestor of pairs of XML nodes. Source: LECTURE NOTES IN COMPUTER SCIENCE, vol. 3186, pp. 82-96

See at: CNR IRIS Open Access | www.springerlink.com | CNR IRIS Restricted | CNR IRIS