2012
Contribution to book
Restricted
Exploring the meaning behind Twitter hashtags through clustering.
Muntean C. I., Morar G. A., Moldovan D.
Social networks generate large amounts of data produced by users, who are not limited with respect to the content of the information they exchange. The data generated can be a good indicator of trends and topic preferences among users. In our paper we focus on analyzing and representing hashtags by the corpus in which they appear. We cluster a large set of hashtags using K-means on MapReduce in order to process data in a distributed manner. Our intention is to retrieve connections that might exist between different hashtags and their textual representation, and grasp their semantics through the main topics they occur with.
Source: BIS 2012 - Business Information Systems Workshops. Revised papers, edited by Witold Abramowicz, John Domingue, Krzysztof Węcel, pp. 231–242. London: Springer, 2012
DOI: 10.1007/978-3-642-34228-8_22
See at:
doi.org
| gateway.webofknowledge.com
| link.springer.com
| CNR ExploRA
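The clustering step described in the abstract above (K-means expressed as map and reduce phases) can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation: the function names, the toy 2-D vectors standing in for hashtag representations, and the fixed iteration count are all assumptions made for the example.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (centroid index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for cid, point in pairs:
        groups[cid].append(point)
    updated = list(centroids)  # centroids with no points keep their position
    for cid, members in groups.items():
        updated[cid] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (hypothetical stopping rule)."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids
```

In a distributed setting the map phase would run in parallel over partitions of the data and the reduce phase would be keyed by centroid index; here both run in one process for readability.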
2020
Journal article
Open Access
Crime and its fear in social media
Prieto Curiel R., Cresci S., Muntean C., Bishop S. R.
Social media posts incorporate real-time information that has, elsewhere, been exploited to predict social trends. This paper considers whether such information can be useful in relation to crime and fear of crime. A large number of tweets were collected from the 18 largest Spanish-speaking countries in Latin America, over a period of 70 days. These tweets are then classified as being crime-related or not and additional information is extracted, including the type of crime and, where possible, any geo-location at a city level. From the analysis of collected data, it is established that around 15 out of every 1000 tweets have text related to a crime, or fear of crime. The frequency of tweets related to crime is then compared against the number of murders, the murder rate, or the level of fear of crime as recorded in surveys. Results show that, like mass media, such as newspapers, social media suffer from a strong bias towards violent or sexual crimes. Furthermore, social media messages are not highly correlated with crime. Thus, social media are shown not to be highly useful for detecting trends in crime itself; what they do reflect is the level of fear of crime.
Source: PALGRAVE COMMUNICATIONS, vol. 6 (issue 1)
DOI: 10.1057/s41599-020-0430-7
Project(s): CIMPLEX, SoBigData
See at:
Palgrave Communications
| CNR IRIS
| ISTI Repository
| www.nature.com
2020
Conference article
Open Access
High-quality prediction of tourist movements using temporal trajectories in graphs
Moghtasedi S., Muntean C., Nardini F. M., Grossi R., Marino A.
In this paper, we study the problem of predicting the next position of a tourist given their history. In particular, we propose a model to identify the next point of interest that a tourist will visit, by making use of similarity between trajectories on a graph and taking into account the spatial-temporal aspect of trajectories. We compare our method with a well-known machine learning-based technique, as well as with a popularity baseline, using three public real-world datasets. Our experimental results show that our technique outperforms state-of-the-art machine learning-based methods, providing results that are at least twice as accurate.
DOI: 10.1109/asonam49781.2020.9381450
See at:
CNR IRIS
| ieeexplore.ieee.org
| xplorestaging.ieee.org
2023
Conference article
Open Access
A spatial approach to predict performance of conversational search systems
Faggioli G, Ferro N, Muntean C, Perego R, Tonellotto N
Recent advancements in Information Retrieval and Natural Language Processing have led to significant developments in the way users interact with search engines, with traditional one-shot textual queries being replaced by multi-turn conversations. As a highly interactive search scenario, Conversational Search (CS) can significantly benefit from Query Performance Prediction (QPP) techniques. However, the application of QPP in the CS domain is a relatively new field and requires proper framing. This study proposes a set of spatial-based QPP models, designed to work effectively in the conversational search domain, where dense neural retrieval models are the most common approach and query cutoffs are small. The proposed QPP approaches are shown to improve the predictive performance over the state-of-the-art in different scenarios and collections, highlighting the utility of QPP in the CS domain.
Source: CEUR WORKSHOP PROCEEDINGS, pp. 41-46. Pisa, Italy, 8-9/06/2023
See at:
ceur-ws.org
| CNR IRIS
2020
Conference article
Open Access
Topic propagation in conversational search
Mele I, Muntean Ci, Nardini Fm, Perego R, Tonellotto N, Frieder O
In a conversational context, a user expresses her multi-faceted information need as a sequence of natural-language questions, i.e., utterances. Starting from a given topic, the conversation evolves through user utterances and system replies. The retrieval of documents relevant to a given utterance in a conversation is challenging due to the ambiguity of natural language and to the difficulty of detecting possible topic shifts and semantic relationships among utterances. We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing: (i) topic-aware utterance rewriting, (ii) retrieval of candidate passages for the rewritten utterances, and (iii) neural-based re-ranking of candidate passages. We present a comprehensive experimental evaluation of the architecture assessed in terms of traditional IR metrics at small cutoffs. Experimental results show the effectiveness of our techniques that achieve an improvement of up to 0.28 (+93%) for P@1 and 0.19 (+89.9%) for nDCG@3 w.r.t. the CAsT baseline.
DOI: 10.1145/3397271.3401268
DOI: 10.48550/arxiv.2004.14054
Project(s): BigDataGrapes
See at:
arXiv.org e-Print Archive
| arxiv.org
| dl.acm.org
| doi.org
| CNR IRIS
2020
Journal article
Restricted
Weighting passages enhances accuracy
Muntean C., Nardini F. M., Perego R., Tonellotto N., Frieder O.
We observe that in curated documents the distribution of the occurrences of salient terms, e.g., terms with a high Inverse Document Frequency, is not uniform, and such terms are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of the document. We study a multiplicity of partitioning schemes of document content into passages and compute the collection-dependent weights associated with them on the basis of the distribution of occurrences of salient terms in documents. Moreover, we tune BM25P hyperparameters and investigate their impact on ad hoc document retrieval through fully reproducible experiments conducted using four publicly available datasets. Our findings demonstrate that our BM25P weighting model markedly and consistently outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1, and up to 21% in MRR.
Source: ACM TRANSACTIONS ON INFORMATION SYSTEMS, vol. 39 (issue 2)
DOI: 10.1145/3428687
See at:
ACM Transactions on Information Systems
| CNR IRIS
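The idea in the abstract above, scoring with a linear combination of per-passage term statistics, can be illustrated with a small sketch. It assumes one plausible reading: the term frequency fed into the usual BM25 saturation formula is replaced by a weighted sum of the term's counts in each passage. The function names, the equal-size partitioning, the IDF variant, and the default `k1` and `b` values are assumptions for the example, not details taken from the paper.

```python
import math


def split_passages(tokens, n_passages):
    """Split a token list into n_passages contiguous, nearly equal chunks."""
    size = math.ceil(len(tokens) / n_passages) or 1
    return [tokens[i * size:(i + 1) * size] for i in range(n_passages)]


def bm25p_score(query, doc_tokens, weights, doc_freq, n_docs, avg_len,
                k1=1.2, b=0.75):
    """BM25-style score using a passage-weighted term frequency.

    weights  : one weight per passage (hypothetical, collection-dependent)
    doc_freq : dict mapping a term to the number of documents containing it
    """
    passages = split_passages(doc_tokens, len(weights))
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    score = 0.0
    for term in query:
        # Linear combination of the term's counts in each passage.
        weighted_tf = sum(w * p.count(term) for w, p in zip(weights, passages))
        if weighted_tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * weighted_tf * (k1 + 1) / (weighted_tf + length_norm)
    return score
```

With all weights equal to 1 the weighted frequency is the ordinary term frequency, so the sketch reduces to a plain BM25 score; raising the weights of the first and last passages rewards documents whose query terms appear there.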
2021
Journal article
Restricted
Adaptive utterance rewriting for conversational search
Mele I, Muntean Ci, Nardini Fm, Perego R, Tonellotto N, Frieder O
In a conversational context, a user converses with a system through a sequence of natural-language questions, i.e., utterances. Starting from a given subject, the conversation evolves through sequences of user utterances and system replies. The retrieval of documents relevant to an utterance is difficult due to informal use of natural language in speech and the complexity of understanding the semantic context coming from previous utterances. We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing in order: (i) automatic utterance understanding and rewriting, (ii) first-stage retrieval of candidate passages for the rewritten utterances, and (iii) neural re-ranking of candidate passages. By understanding the conversational context, we propose adaptive utterance rewriting strategies based on the current utterance and the dialogue evolution of the user with the system. A classifier identifies those utterances lacking context information as well as the dependencies on the previous utterances. Experimentally, we evaluate the proposed architecture in terms of traditional information retrieval metrics at small cutoffs. Results demonstrate the effectiveness of our techniques, achieving an improvement of up to 0.6512 for P@1 and 0.4484 for nDCG@3 w.r.t. the CAsT baseline.
Source: INFORMATION PROCESSING & MANAGEMENT, vol. 58 (issue 6)
DOI: 10.1016/j.ipm.2021.102682
Project(s): BigDataGrapes
See at:
Information Processing & Management
| CNR IRIS
2023
Conference article
Restricted
A geometric framework for query performance prediction in conversational search
Faggioli G., Ferro N., Muntean C., Perego R., Tonellotto N.
Thanks to recent advances in IR and NLP, the way users interact with search engines is evolving rapidly, with multi-turn conversations replacing traditional one-shot textual queries. Given its interactive nature, Conversational Search (CS) is one of the scenarios that can benefit the most from Query Performance Prediction (QPP) techniques. QPP for the CS domain is a relatively new field and lacks proper framing. In this study, we address this gap by proposing a framework for the application of QPP in the CS domain and use it to evaluate the performance of predictors. We characterize what it means to predict the performance in the CS scenario, where information needs are not independent queries but a series of closely related utterances. We identify three main ways to use QPP models in the CS domain: as a diagnostic tool, as a way to adjust the system's behaviour during a conversation, or as a way to predict the system's performance on the next utterance. Due to the lack of established evaluation procedures for QPP in the CS domain, we propose a protocol to evaluate QPPs for each of the use cases. Additionally, we introduce a set of spatial-based QPP models designed to work best in the conversational search domain, where dense neural retrieval models are the most common approaches and query cutoffs are typically small. We show how the proposed QPP approaches significantly improve the predictive performance over the state-of-the-art in different scenarios and collections.
DOI: 10.1145/3539618.3591625
Project(s): SoBigData-PlusPlus
See at:
dl.acm.org
| CNR IRIS
2023
Conference article
Open Access
Rewriting conversational utterances with instructed large language models
Galimzhanova E, Muntean Ci, Nardini Fm, Perego R, Rocchietti G
Many recent studies have shown the ability of large language models (LLMs) to achieve state-of-the-art performance on many NLP tasks, such as question answering, text summarization, coding, and translation. In some cases, the results provided by LLMs are on par with those of human experts. These models' most disruptive innovation is their ability to perform tasks via zero-shot or few-shot prompting. This capability has been successfully exploited to train instructed LLMs, where reinforcement learning with human feedback is used to guide the model to follow the user's requests directly. In this paper, we investigate the ability of instructed LLMs to improve conversational search effectiveness by rewriting user questions in a conversational setting. We study which prompts provide the most informative rewritten utterances that lead to the best retrieval performance. Reproducible experiments are conducted on publicly available TREC CAsT datasets. The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.
DOI: 10.1109/wi-iat59888.2023.00014
Project(s): EFRA
See at:
CNR IRIS
| ieeexplore.ieee.org
| ISTI Repository
2022
Journal article
Open Access
Caching historical embeddings in conversational search
Frieder O., Mele I., Muntean C., Nardini F. M., Perego R., Tonellotto N.
Rapid response, namely low latency, is fundamental in search applications; it is particularly so in interactive search sessions, such as those encountered in conversational settings. An observation with a potential to reduce latency asserts that conversational queries exhibit a temporal locality in the lists of documents retrieved. Motivated by this observation, we propose and evaluate a client-side document embedding cache, improving the responsiveness of conversational search systems. By leveraging state-of-the-art dense retrieval models to abstract document and query semantics, we cache the embeddings of documents retrieved for a topic introduced in the conversation, as they are likely relevant to successive queries. Our document embedding cache implements an efficient metric index, answering nearest-neighbor similarity queries by estimating the approximate result sets returned. We demonstrate the efficiency achieved using our cache via reproducible experiments based on TREC CAsT datasets, achieving a hit rate of up to 75% without degrading answer quality. Our achieved high cache hit rates significantly improve the responsiveness of conversational systems while likewise reducing the number of queries managed on the search back-end.
Source: ACM TRANSACTIONS ON THE WEB, vol. 18 (issue 4)
DOI: 10.1145/3578519
DOI: 10.48550/arxiv.2211.14155
See at:
arXiv.org e-Print Archive
| IRIS Cnr
| ACM Transactions on the Web
| doi.org
| CNR IRIS
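The client-side embedding cache described in the abstract above can be sketched in a few lines. The abstract does not spell out how the cache decides whether it can answer a query, so the quality test used here is an assumption: a triangle-inequality check that the query's k-th nearest cached document lies inside the region covered by the cache. The class name, the in-memory `backend_docs` dictionary standing in for the search back-end, and the use of Euclidean distance are likewise hypothetical.

```python
import math


def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


class EmbeddingCache:
    def __init__(self, backend_docs, cache_cutoff):
        self.backend_docs = backend_docs  # doc_id -> embedding (stand-in back-end)
        self.cache_cutoff = cache_cutoff  # number of documents fetched per miss
        self.cached = {}                  # doc_id -> embedding held client-side
        self.anchor = None                # query that last filled the cache
        self.radius = 0.0                 # distance from anchor to farthest cached doc
        self.backend_calls = 0

    @staticmethod
    def _top(query, docs, k):
        return sorted(docs, key=lambda d: dist(query, docs[d]))[:k]

    def _fill(self, query):
        """Cache miss: fetch the cache_cutoff nearest documents from the back-end."""
        self.backend_calls += 1
        ids = self._top(query, self.backend_docs, self.cache_cutoff)
        self.cached = {d: self.backend_docs[d] for d in ids}
        self.anchor = query
        self.radius = max(dist(query, e) for e in self.cached.values())

    def search(self, query, k):
        if self.anchor is None:
            self._fill(query)
        else:
            local = self._top(query, self.cached, k)
            r_k = dist(query, self.cached[local[-1]])
            # Assumed quality test: the ball of radius r_k around the query
            # must fit inside the ball the cache was built from.
            if dist(self.anchor, query) + r_k > self.radius:
                self._fill(query)
        return self._top(query, self.cached, k)
```

A follow-up query close to the one that filled the cache is answered locally, while a query on a distant topic fails the test and triggers a new back-end request, which is the temporal-locality behaviour the abstract relies on.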
2019
Conference article
Restricted
Enhanced news retrieval: passages lead the way!
Catena M, Nardini Fm, Frieder O, Perego R, Muntean Ci, Tonellotto N
We observe that most relevant terms in unstructured news articles are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of news articles. Our experimentation, conducted using three publicly available news datasets, demonstrates that BM25P markedly outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1.
DOI: 10.1145/3331184.3331373
See at:
dl.acm.org
| doi.org
| CNR IRIS
2020
Journal article
Open Access
RankEval: Evaluation and investigation of ranking models
Lucchese C., Muntean C., Nardini F. M., Perego R., Trani S.
RankEval is a Python open-source tool for the analysis and evaluation of ranking models based on ensembles of decision trees. Learning-to-Rank (LtR) approaches that generate tree-ensembles are considered the most effective solution for difficult ranking tasks and several impactful LtR libraries have been developed aimed at improving ranking quality and training efficiency. However, these libraries are not very helpful in terms of hyper-parameter tuning and in-depth analysis of the learned models, and even the implementations of the most popular Information Retrieval (IR) metrics differ among them, thus making it difficult to compare different models. RankEval overcomes these limitations by providing a unified environment in which to perform an easy, comprehensive inspection and assessment of ranking models trained using different machine learning libraries. The tool focuses on ensuring efficiency, flexibility and extensibility and is fully interoperable with most popular LtR libraries.
Source: SOFTWAREX, vol. 12
DOI: 10.1016/j.softx.2020.100614
Project(s): BigDataGrapes
See at:
SoftwareX
| CNR IRIS
| ISTI Repository
| www.sciencedirect.com
2024
Patent
Restricted
Caching historical embeddings in conversational search
Frieder O., Mele I., Muntean C., Nardini F. M., Perego R., Tonellotto N.
A method and system are described for improving the speed and efficiency of obtaining conversational search results. A user may speak a phrase to perform a conversational search or a series of phrases to perform a series of searches. These spoken phrases may be enriched by context and then converted into a query embedding. A similarity between the query embedding and document embeddings is used to determine the search results including a query cutoff number of documents and a cache cutoff number of documents. A second search phrase may use the cache of documents along with comparisons of the returned documents and the first query embedding to determine the quality of the cache for responding to the second search query. If the results are high-quality then the search may proceed much more rapidly by applying the second query only to the cached documents rather than to the server.
See at:
CNR IRIS
2020
Journal article
Open Access
Human migration: the big data perspective
Sîrbu A, Andrienko G, Andrienko N, Boldrini C, Conti M, Giannotti F, Guidotti R, Bertoli S, Kim J, Muntean Ci, Pappalardo L, Passarella A, Pedreschi D, Pollacci L, Pratesi F, Sharma R
How can big data help to understand the migration phenomenon? In this paper, we try to answer this question through an analysis of various phases of migration, comparing traditional and novel data sources and models at each phase. We concentrate on three phases of migration, at each phase describing the state of the art and recent developments and ideas. The first phase includes the journey, and we study migration flows and stocks, providing examples where big data can have an impact. The second phase discusses the stay, i.e. migrant integration in the destination country. We explore various data sets and models that can be used to quantify and understand migrant integration, with the final aim of providing the basis for the construction of a novel multi-level integration index. The last phase is related to the effects of migration on the source countries and the return of migrants.
Source: INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, vol. 11, pp. 341-360
DOI: 10.1007/s41060-020-00213-5
Project(s): SoBigData
See at:
International Journal of Data Science and Analytics
| CNR IRIS
| link.springer.com
| ISTI Repository
| HAL Clermont Université
| Fraunhofer-ePrints
2024
Conference article
Open Access
LongDoc summarization using instruction-tuned large language models for food safety regulations
Rocchietti G., Rulli C., Randl K., Muntean C., Nardini F. M., Perego R., Trani S., Karvounis M., Janostik J.
We design and implement a summarization pipeline for regulatory documents, focusing on two main objectives: creating two silver standard datasets using instruction-tuned large language models (LLMs) and finetuning smaller LLMs to perform summarization of regulatory text. In the first task, we employ state-of-the-art models, Cohere C4AI Command-R-4bit and Llama-3-8B, to generate summaries of regulatory documents. These generated summaries serve as ground-truth data for the second task, where we finetune three general-purpose LLMs to specialize in high-quality summary generation for specific documents while reducing the computational requirements. Specifically, we finetune two Google Flan-T5 models using datasets generated by Llama-3-8B and Cohere C4AI, and we create a quantized (4-bit) version of Google Gemma 2-B based on summaries from Cohere C4AI. Additionally, we initiated a pilot activity involving legal experts from SGS-Digicomply to validate the effectiveness of our summarization pipeline.
Source: CEUR WORKSHOP PROCEEDINGS, vol. 3802, pp. 33-42. Udine, Italy, 5-6/09/2024
Project(s): EFRA
See at:
ceur-ws.org
| CNR IRIS