2020 Journal article Restricted

Leveraging feature selection to detect potential tax fraudsters
Matos T., Macedo J. A., Lettich F., Monteiro J. M., Renso C., Perego R., Nardini F. M.
Tax evasion is any act that knowingly or unknowingly, legally or unlawfully, leads to non-payment or underpayment of tax due. Enforcing the correct payment of taxes by taxpayers is fundamental in maintaining investments that are necessary and benefit society as a whole. Indeed, without taxes it is not possible to guarantee basic services such as health-care, education, sanitation, transportation, infrastructure, among other services essential to the population. This issue is especially relevant in developing countries such as Brazil. In this work we consider a real-world case study involving the Treasury Office of the State of Ceará (SEFAZ-CE, Brazil), the agency in charge of supervising more than 300,000 active taxpayer companies. SEFAZ-CE maintains a very large database containing vast amounts of information concerning such companies. Its enforcement team struggles to perform thorough inspections of taxpayers' accounts as the underlying traditional human-based inspection processes involve the evaluation of countless fraud indicators (i.e., binary features), thus requiring burdensome amounts of time and being potentially prone to human errors. On the other hand, the vast amount of taxpayer information collected by fiscal agencies opens up the possibility of devising novel techniques able to tackle fiscal evasion much more effectively than traditional approaches. In this work we address the problem of using feature selection to select the most relevant binary features to improve the classification of potential tax fraudsters. Finding out possible fraudsters from taxpayer data with binary features presents several challenges. First, taxpayer data typically have features with low linear correlation between themselves. Also, tax frauds may originate from intricate illicit tactics, which in turn requires uncovering non-linear relationships between multiple features. Finally, few features may be correlated with the targeted class.
In this work we propose Alicia, a new feature selection method based on association rules and propositional logic with a carefully crafted graph centrality measure that attempts to tackle the above challenges while, at the same time, being agnostic to specific classification techniques. Alicia is structured in three phases: first, it generates a set of relevant association rules from a set of fraud indicators (features). Subsequently, from such association rules Alicia builds a graph, whose structure is then used to determine the most relevant features. To achieve this, Alicia applies a novel centrality measure we call the Feature Topological Importance. We perform an extensive experimental evaluation to assess the validity of our proposal on four different real-world datasets, where we compare our solution with eight other feature selection methods. The results show that Alicia achieves F-measure scores up to 76.88%, and consistently outperforms its competitors.
Source: Expert systems with applications 145 (2020). doi:10.1016/j.eswa.2019.113128
DOI: 10.1016/j.eswa.2019.113128

2020 Conference article Open Access

Dynamic Wi-Fi RSSI normalization in unmapped locations
Kavalionak H., Tosato M., Barsocchi P., Nardini F. M.
With the growing availability of open-access WLAN networks, we have witnessed an increase in marketing services based on the data collected from WLAN access points. Identifying the visitors of a commercial venue from WLAN data is one of the key issues in creating successful marketing products. One way to separate visitors is to analyse the RSSI of the mobile device signals reaching the various access points at the venue. Nevertheless, indoor signal distortion makes RSSI-based methods unreliable. In this work we propose an algorithm for WLAN-based RSSI normalization in uncontrolled environments. Our approach consists of two steps: first, based on the collected data, we detect the devices whose RSSI can be taken as a baseline; second, the algorithm uses the previously detected baseline RSSI to normalize the signal received from mobile devices. We provide an analysis of a real dataset of WLAN probes collected in several real commercial venues in Italy.
Source: EDBT/ICDT 2020 Joint Conference, Copenhagen, Denmark, 30th March - 2nd April, 2020

See at: ceur-ws.org | CNR ExploRA

2020 Report Open Access

Dynamic hard pruning of neural networks at the edge of the internet
Valerio L., Nardini F. M., Passarella A., Perego R.
Neural Networks (NN), although successfully applied to several Artificial Intelligence tasks, are often unnecessarily over-parametrized. In fog/edge computing, this might make their training prohibitive on resource-constrained devices, contrasting with the current trend of decentralising intelligence from remote data-centres to local constrained devices. Therefore, we investigate the problem of training effective NN models on constrained devices having a fixed, potentially small, memory budget. We target techniques that are both resource-efficient and performance effective while enabling significant network compression. Our technique, called Dynamic Hard Pruning (DynHP), incrementally prunes the network during training, identifying neurons that marginally contribute to the model accuracy. DynHP enables a tunable size reduction of the final neural network and reduces the NN memory occupancy during training. Freed memory is reused by a dynamic batch sizing approach to counterbalance the accuracy degradation caused by the hard pruning strategy, improving its convergence and effectiveness. We assess the performance of DynHP through reproducible experiments on two public datasets, comparing them against reference competitors. Results show that DynHP compresses a NN up to times without significant performance drops (up to relative error wrt competitors), reducing up to the training memory occupancy.
Source: IIT TR-21/2020 and ISTI Technical Reports 2020/016, 2020
DOI: 10.32079/isti-tr-2020/016
Project(s): BigDataGrapes

See at: ISTI Repository | CNR ExploRA

2020 Journal article Restricted

Weighting passages enhances accuracy
Muntean C. I., Nardini F. M., Perego R., Tonellotto N., Frieder O.
We observe that in curated documents the distribution of the occurrences of salient terms, e.g., terms with a high Inverse Document Frequency, is not uniform, and such terms are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of the document. We study a multiplicity of partitioning schemes of document content into passages and compute the collection-dependent weights associated with them on the basis of the distribution of occurrences of salient terms in documents. Moreover, we tune BM25P hyperparameters and investigate their impact on ad hoc document retrieval through fully reproducible experiments conducted using four publicly available datasets. Our findings demonstrate that our BM25P weighting model markedly and consistently outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1, and up to 21% in MRR.
Source: ACM transactions on information systems 39 (2020). doi:10.1145/3428687
DOI: 10.1145/3428687
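The "linear combination of term statistics in the different portions of the document" described in the abstract above can be illustrated with a minimal sketch. It assumes that the combined statistic is a weighted sum of per-passage term frequencies fed into the standard BM25 saturation formula; the equal-size splitting scheme, the function names, and the default `k1` and `b` values are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def split_passages(tokens, n_passages):
    """Split a token list into n_passages contiguous, near-equal parts (assumed scheme)."""
    size = math.ceil(len(tokens) / n_passages) if tokens else 1
    return [tokens[i * size:(i + 1) * size] for i in range(n_passages)]


def bm25p_term_score(term, tokens, weights, idf, avg_doc_len, k1=1.2, b=0.75):
    """Score one query term against one document.

    The per-passage term frequencies are linearly combined with the passage
    weights, and the result replaces the plain term frequency in BM25.
    """
    passages = split_passages(tokens, len(weights))
    weighted_tf = sum(w * Counter(p)[term] for w, p in zip(weights, passages))
    norm = 1.0 - b + b * len(tokens) / avg_doc_len  # document length normalization
    return idf * weighted_tf * (k1 + 1.0) / (weighted_tf + k1 * norm)
```

With all passage weights set to 1 the sketch reduces to plain BM25, which makes it easy to compare the two models on the same document.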

2020 Conference article Open Access

Efficient document re-ranking for transformers by precomputing term representations
Macavaney S., Nardini F. M., Perego R., Tonellotto N., Goharian N., Frieder O.
Deep pretrained transformer networks are effective at various ranking tasks, such as question answering and ad-hoc document ranking. However, their computational expense renders them cost-prohibitive in practice. Our proposed approach, called PreTTR (Precomputing Transformer Term Representations), considerably reduces the query-time latency of deep transformer networks (up to a 42x speedup on web document ranking), making these networks more practical to use in a real-time ranking scenario. Specifically, we precompute part of the document term representations at indexing time (without a query), and merge them with the query representation at query time to compute the final ranking score. Due to the large size of the token representations, we also propose an effective approach to reduce the storage requirement by training a compression layer to match attention scores. Our compression technique reduces the storage required by up to 95% and it can be applied without a substantial degradation in ranking performance.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–58, online, 25-30 July, 2020
DOI: 10.1145/3397271.3401093
Project(s): BigDataGrapes

2020 Conference article Open Access

Training curricula for open domain answer re-ranking
Macavaney S., Nardini F. M., Perego R., Tonellotto N., Goharian N., Frieder O.
DOI: 10.1145/3397271.3401094
Project(s): BigDataGrapes

2020 Conference article Open Access

Expansion via prediction of importance with contextualization
Macavaney S., Nardini F. M., Perego R., Tonellotto N., Goharian N., Frieder O.
The identification of relevance with little textual context is a primary challenge in passage retrieval. We address this problem with a representation-based ranking approach that: (1) explicitly models the importance of each term using a contextualized language model; (2) performs passage expansion by propagating the importance to similar terms; and (3) grounds the representations in the lexicon, making them interpretable. Passage representations can be pre-computed at index time to reduce query-time latency. We call our approach EPIC (Expansion via Prediction of Importance with Contextualization). We show that EPIC significantly outperforms prior importance-modeling and document expansion approaches. We also observe that the performance is additive with the current leading first-stage retrieval methods, further narrowing the gap between inexpensive and cost-prohibitive passage ranking approaches. Specifically, EPIC achieves an MRR@10 of 0.304 on the MS-MARCO passage ranking dataset with 78ms average query latency on commodity hardware. We also find that the latency is further reduced to 68ms by pruning document representations, with virtually no difference in effectiveness.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1573–1576, online, 25-30 July, 2020
DOI: 10.1145/3397271.3401262
Project(s): BigDataGrapes

2020 Conference article Open Access

Query-level early exit for additive learning-to-rank ensembles
Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
Search engine ranking pipelines are commonly based on large ensembles of machine-learned decision trees. The tight constraints on query response time recently motivated researchers to investigate algorithms that speed up the traversal of the additive ensemble or terminate early the evaluation of documents that are unlikely to be ranked among the top-k. In this paper, we investigate the novel problem of query-level early exiting, aimed at deciding the profitability of early stopping the traversal of the ranking ensemble for all the candidate documents to be scored for a query, by simply returning a ranking based on the additive scores computed by a limited portion of the ensemble. Besides the obvious advantage on query latency and throughput, we address the possible positive impact on ranking effectiveness. To this end, we study the actual contribution of incremental portions of the tree ensemble to the ranking of the top-k documents scored for a given query. Our main finding is that queries exhibit different behaviors as scores are accumulated during the traversal of the ensemble and that query-level early stopping can remarkably improve ranking quality. We present a reproducible and comprehensive experimental evaluation, conducted on two public datasets, showing that query-level early exiting achieves an overall gain of up to 7.5% in terms of NDCG@10 with a speedup of the scoring process of up to 2.2x.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2033–2036, online, 25-30 July, 2020
DOI: 10.1145/3397271.3401256
Project(s): BigDataGrapes
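The idea of stopping the traversal of an additive ensemble for all candidate documents of a query can be sketched as follows. This is only an illustration under stated assumptions: trees are modelled as plain callables, the exit is checked at a single position called `sentinel`, and the decision rule `should_exit` is a hypothetical placeholder; the abstract does not specify how the exit decision is made.

```python
def score_with_early_exit(trees, docs, sentinel, should_exit):
    """Accumulate additive scores tree by tree for all documents of one query.

    After `sentinel` trees have been traversed, the hypothetical predicate
    `should_exit(scores)` decides whether to return the partial scores for
    the whole query instead of traversing the rest of the ensemble.
    Returns the scores and the number of trees actually traversed.
    """
    scores = [0.0] * len(docs)
    for used, tree in enumerate(trees, start=1):
        for j, doc in enumerate(docs):
            scores[j] += tree(doc)
        if used == sentinel and should_exit(scores):
            return scores, used  # query-level exit: same cutoff for every document
    return scores, len(trees)
```

The returned tree count makes the saving explicit: exiting at the sentinel means every document of the query skips the remaining trees.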

2020 Conference article Open Access

Predicting and explaining privacy risk exposure in mobility data
Naretto F., Pellungrini R., Monreale A., Nardini F. M., Musolesi M.
Mobility data is a proxy of different social dynamics and its analysis enables a wide range of user services. Unfortunately, mobility data are very sensitive because the sharing of people's whereabouts may raise serious privacy concerns. Existing frameworks for privacy risk assessment provide tools to identify and measure privacy risks, but they often (i) have high computational complexity; and (ii) are not able to provide users with a justification of the reported risks. In this paper, we propose expert, a new framework for the prediction and explanation of privacy risk on mobility data. We empirically evaluate privacy risk on real data, simulating a privacy attack with a state-of-the-art privacy risk assessment framework. We then extract individual mobility profiles from the data for predicting their risk. We compare the performance of several machine learning algorithms in order to identify the best approach for our task. Finally, we show how it is possible to explain privacy risk prediction on real data, using two algorithms: Shap, a feature importance-based method and Lore, a rule-based method. Overall, expert is able to provide a user with the privacy risk and an explanation of the risk itself. The experiments show excellent performance for the prediction task.
Source: DS 2020 - International Conference on Discovery Science, pp. 403–418, Thessaloniki, Greece, October 19-21, 2020
DOI: 10.1007/978-3-030-61527-7_27
Project(s): XAI , SoBigData-PlusPlus

2020 Journal article Restricted

A novel approach to define the local region of dynamic selection techniques in imbalanced credit scoring problems
Melo Junior L., Nardini F. M., Renso C., Trani R., Macedo J. A.
Lenders, such as banks and credit card companies, use credit scoring models to evaluate the potential risk posed by lending money to customers, and therefore to mitigate losses due to bad credit. The profitability of the banks thus highly depends on the models used to decide on the customer's loans. State-of-the-art credit scoring models are based on machine learning and statistical methods. One of the major problems of this field is that lenders often deal with imbalanced datasets that usually contain many paid loans but very few not paid ones (called defaults). Recently, dynamic selection methods combined with ensemble methods and preprocessing techniques have been evaluated to improve classification models in imbalanced datasets presenting advantages over the static machine learning methods. In a dynamic selection technique, samples in the neighborhood of each query sample are used to compute the local competence of each base classifier. Then, the technique selects only competent classifiers to predict the query sample. In this paper, we evaluate the suitability of dynamic selection techniques for the credit scoring problem, and we present Reduced Minority k-Nearest Neighbors (RMkNN), an approach that enhances the state of the art in defining the local region of dynamic selection techniques for imbalanced credit scoring datasets. This proposed technique has a superior prediction performance in imbalanced credit scoring datasets compared to the state of the art. Furthermore, RMkNN does not need any preprocessing or sampling method to generate the dynamic selection dataset (called DSEL). Additionally, we observe an equivalence between dynamic selection and static selection classification. We conduct a comprehensive evaluation of the proposed technique against state-of-the-art competitors on six real-world public datasets and one private one.
Experiments show that RMkNN improves the classification performance of the evaluated datasets regarding AUC, balanced accuracy, H-measure, G-mean, F-measure, and Recall.
Source: Expert systems with applications 152 (2020). doi:10.1016/j.eswa.2020.113351
DOI: 10.1016/j.eswa.2020.113351
Project(s): MC2020 , BigDataGrapes , MASTER

2020 Conference article Open Access

Topic propagation in conversational search
Mele I., Muntean C. I., Nardini F. M., Perego R., Tonellotto N., Frieder O.
In a conversational context, a user expresses her multi-faceted information need as a sequence of natural-language questions, i.e., utterances. Starting from a given topic, the conversation evolves through user utterances and system replies. The retrieval of documents relevant to a given utterance in a conversation is challenging due to ambiguity of natural language and to the difficulty of detecting possible topic shifts and semantic relationships among utterances. We adopt the 2019 TREC Conversational Assistant Track (CAsT) framework to experiment with a modular architecture performing: (i) topic-aware utterance rewriting, (ii) retrieval of candidate passages for the rewritten utterances, and (iii) neural-based re-ranking of candidate passages. We present a comprehensive experimental evaluation of the architecture assessed in terms of traditional IR metrics at small cutoffs. Experimental results show the effectiveness of our techniques that achieve an improvement of up to 0.28 (+93%) for P@1 and 0.19 (+89.9%) for nDCG@3 w.r.t. the CAsT baseline.
Source: SIGIR 2020 - 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2057–2060, Online Conference, July 25-30, 2020
DOI: 10.1145/3397271.3401268
Project(s): BigDataGrapes

2020 Journal article Open Access

RankEval: Evaluation and investigation of ranking models
Lucchese C., Muntean C. I., Nardini F. M., Perego R., Trani S.
RankEval is an open-source Python tool for the analysis and evaluation of ranking models based on ensembles of decision trees. Learning-to-Rank (LtR) approaches that generate tree-ensembles are considered the most effective solution for difficult ranking tasks and several impactful LtR libraries have been developed aimed at improving ranking quality and training efficiency. However, these libraries are not very helpful in terms of hyper-parameter tuning and in-depth analysis of the learned models, and even the implementations of the most popular Information Retrieval (IR) metrics differ among them, thus making it difficult to compare different models. RankEval overcomes these limitations by providing a unified environment in which to perform an easy, comprehensive inspection and assessment of ranking models trained using different machine learning libraries. The tool focuses on ensuring efficiency, flexibility and extensibility and is fully interoperable with the most popular LtR libraries.
Source: Softwarex (Amsterdam) 12 (2020). doi:10.1016/j.softx.2020.100614
DOI: 10.1016/j.softx.2020.100614
Project(s): BigDataGrapes
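The remark in the abstract above that implementations of popular IR metrics differ across libraries can be made concrete with NDCG. The sketch below is plain Python and does not use or imitate the RankEval API; it contrasts two common gain conventions, exponential gain (2^rel - 1) and linear gain (rel), which produce different NDCG values for the same ranking. The function names are illustrative.

```python
import math


def dcg(labels, k, exponential=True):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    total = 0.0
    for rank, rel in enumerate(labels[:k], start=1):
        gain = (2 ** rel - 1) if exponential else rel
        total += gain / math.log2(rank + 1)
    return total


def ndcg(labels, k, exponential=True):
    """DCG normalized by the DCG of the ideal (descending) ordering of the labels."""
    ideal = dcg(sorted(labels, reverse=True), k, exponential)
    return dcg(labels, k, exponential) / ideal if ideal > 0 else 0.0
```

Calling `ndcg([1, 3], 2)` with and without `exponential=True` gives two different scores for the same ranked list, which is one way a model comparison can become inconsistent when metrics come from different libraries.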

2019 Journal article Open Access

Parallel Traversal of Large Ensembles of Decision Trees
Lettich F., Lucchese C., Nardini F. M., Orlando S., Perego R., Tonellotto N., Venturini R.
Machine-learnt models based on additive ensembles of regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. The deployment of such models is computationally demanding: to compute the final prediction, the whole ensemble must be traversed by accumulating the contributions of all its trees. In particular, traversal cost impacts applications where the number of candidate items is large, the time budget available to apply the learnt model to them is limited, and the users' expectations in terms of quality-of-service are high. Document ranking in web search, where sub-optimal ranking models are deployed to find a proper trade-off between efficiency and effectiveness of query answering, is probably the most typical example of this challenging issue. This paper investigates multi/many-core parallelization strategies for speeding up the traversal of large ensembles of regression trees, thus obtaining machine-learnt models that are, at the same time, effective, fast, and scalable. Our best results are obtained by the GPU-based parallelization of the state-of-the-art algorithm, with speedups of up to 102.6x.
Source: IEEE transactions on parallel and distributed systems (Print) 30 (2019): 2075–2089. doi:10.1109/TPDS.2018.2860982
DOI: 10.1109/tpds.2018.2860982
DOI: 10.5281/zenodo.2668379
DOI: 10.5281/zenodo.2668378
Project(s): BigDataGrapes

2019 Report Open Access

ISTI Young Researcher Award "Matteo Dellepiane" - Edition 2019
Barsocchi P., Candela L., Crivello A., Esuli A., Ferrari A., Girardi M., Guidotti R., Lonetti F., Malomo L., Moroni D., Nardini F. M., Pappalardo L., Rinzivillo S., Rossetti G., Robol L.
The ISTI Young Researcher Award (YRA) selects yearly the best young staff members working at Institute of Information Science and Technologies (ISTI). This award focuses on quality and quantity of the scientific production. In particular, the award is granted to the best young staff members (less than 35 years old) by assessing their scientific production in the year preceding the award. This report documents the selection procedure and the results of the 2019 YRA edition. From the 2019 edition on, the award is named "Matteo Dellepiane", being dedicated to a bright ISTI researcher who prematurely left us and who contributed a lot to the YRA initiative from its early start.
Source: ISTI Technical reports, 2019

See at: ISTI Repository | CNR ExploRA

2019 Conference article Open Access

Fast Approximate Filtering of Search Results Sorted by Attribute
Nardini F. M., Trani R., Venturini R.
Several Web search services offer their users the possibility of sorting the list of results by a specific attribute, e.g., sort "by price" in e-commerce. However, sorting the results by attribute could bring marginally relevant results in the top positions, thus leading to a poor user experience. This motivates the definition of the relevance-aware filtering problem. This problem asks to remove results from the attribute-sorted list to maximize its final overall relevance. Recently, an optimal solution to this problem has been proposed. However, it has strong limitations in the Web scenario due to its high computational cost. In this paper, we propose ε-Filtering: an efficient approximate algorithm with strong approximation guarantees on the relevance of the final list. More precisely, given an allowed approximation error ε, the proposed algorithm finds a (1-ε)-optimal filtering, i.e., the relevance of its solution is at least (1-ε) times the optimum. We conduct a comprehensive evaluation of ε-Filtering against state-of-the-art competitors on two real-world public datasets. Experiments show that ε-Filtering achieves the desired levels of effectiveness with a speedup of up to two orders of magnitude with respect to the optimal solution while guaranteeing very small approximation errors.
Source: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 815–824, Paris, France, 21-25 July 2019
DOI: 10.1145/3331184.3331227
Project(s): BigDataGrapes

2019 Conference article Open Access

Learning to Rank in Theory and Practice: From Gradient Boosting to Neural Networks and Unbiased Learning
Lucchese C., Nardini F. M., Pasumarthi R. K., Bruch S., Bendersky M., Wang X., Oosterhuis H., Jagerman R., De Rijke M.
This tutorial aims to weave together diverse strands of modern Learning to Rank (LtR) research, and present them in a unified full-day tutorial. First, we will introduce the fundamentals of LtR, and an overview of its various sub-fields. Then, we will discuss some recent advances in gradient boosting methods such as LambdaMART by focusing on their efficiency/effectiveness trade-offs and optimizations. Subsequently, we will present TF-Ranking, a new open source TensorFlow package for neural LtR models, and how it can be used for modeling sparse textual features. Finally, we will conclude the tutorial by covering unbiased LtR, a new research field aiming at learning from biased implicit user feedback. The tutorial will consist of three two-hour sessions, each focusing on one of the topics described above. It will provide a mix of theoretical and hands-on sessions, and should benefit both academics interested in learning more about the current state-of-the-art in LtR, as well as practitioners who want to use LtR techniques in their applications.
Source: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1419–1420, Paris, France, 21-25 July 2019
DOI: 10.1145/3331184.3334824

2019 Conference article Open Access

An Optimal Algorithm to Find Champions of Tournament Graphs
Beretta L., Nardini F. M., Trani R., Venturini R.
A tournament graph T=(V,E) is an oriented complete graph, which can be used to model a round-robin tournament between n players. In this short paper, we address the problem of finding a champion of the tournament, also known as Copeland winner, which is a player that wins the highest number of matches. Our goal is to solve the problem by minimizing the number of arc lookups, i.e., the number of matches played. We prove that finding a champion requires Ω(ℓn) comparisons, where ℓ is the number of matches lost by the champion, and we present a deterministic algorithm matching this lower bound without knowing ℓ. Solving this problem has important implications on several Information Retrieval applications including Web search, conversational AI, machine translation, question answering, recommender systems, etc.
Source: SPIRE 2019: International Symposium on String Processing and Information Retrieval, pp. 267–273, Segovia, Spain, 7-9 October 2019
DOI: 10.1007/978-3-030-32686-9_19
Project(s): BigDataGrapes
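To make the notions of champion (Copeland winner) and arc lookup concrete, here is a brute-force baseline. It is explicitly not the algorithm of the paper: it plays every match, so it always performs n(n-1)/2 lookups, which is the cost the paper's algorithm aims to avoid. The `beats(u, v)` oracle and the function name are assumptions made for illustration.

```python
def find_champion_bruteforce(n, beats):
    """Return (champion, lookups) for a tournament on players 0..n-1.

    `beats(u, v)` is an assumed oracle that returns True when u wins against v.
    Every pair is looked up exactly once, and a player with the highest number
    of wins is returned (ties are broken in favor of the lowest index).
    """
    lookups = 0
    wins = [0] * n
    for u in range(n):
        for v in range(u + 1, n):
            lookups += 1  # one arc lookup, i.e., one match played
            if beats(u, v):
                wins[u] += 1
            else:
                wins[v] += 1
    champion = max(range(n), key=wins.__getitem__)
    return champion, lookups
```

Counting lookups separately from wins gives a reference point against which a smarter strategy, one that stops playing matches for players that can no longer be champions, could be measured.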

2019 Conference article Open Access

On combining dynamic selection, sampling, and pool generators for credit scoring
Melo Junior L., Nardini F. M., Renso C., Fernandes De Macedo J. A.
The profitability of the banks highly depends on the models used to decide on the customer's loans. State-of-the-art credit scoring models are based on machine learning methods. These methods need to cope with the problem of imbalanced classes since credit scoring datasets usually contain many paid loans and few not paid ones (defaults). Recently, dynamic selection approaches combined with pre-processing techniques have been evaluated for imbalanced datasets. However, previous works only evaluate oversampling techniques combined with bagging pool generator ensembles. For this reason, we propose to combine different dynamic selection, preprocessing and pool generation techniques. We assess the prediction performance by using four public real-world credit scoring datasets with different levels of imbalance ratio and four evaluation measures. Experimental results show that the KNORA-Union dynamic selection technique combined with Balanced Random Forest improves the classification performance with respect to the static ensemble for all levels of imbalance ratio.
Source: Machine Learning and Data Mining in Pattern Recognition, 15th International Conference on Machine Learning and Data Mining, MLDM, pp. 443–457, New York, USA, 18-23 July 2019

See at: ISTI Repository | CNR ExploRA

2019 Journal article Embargo

Speed prediction in large and dynamic traffic sensor networks
Magalhaes R. P., Lettich F., Macedo J. A., Nardini F. M., Perego R., Renso C., Trani R.
Smart cities are nowadays equipped with pervasive networks of sensors that monitor traffic in real-time and record huge volumes of traffic data. These datasets constitute a rich source of information that can be used to extract knowledge useful for municipalities and citizens. In this paper we are interested in exploiting such data to estimate future speed in traffic sensor networks, as accurate predictions have the potential to enhance decision making capabilities of traffic management systems. Building effective speed prediction models in large cities poses important challenges that stem from the complexity of traffic patterns, the number of traffic sensors typically deployed, and the evolving nature of sensor networks. Indeed, sensors are frequently added to monitor new road segments or replaced/removed due to different reasons (e.g., maintenance). Exploiting a large number of sensors for effective speed prediction thus requires smart solutions to collect vast volumes of data and train effective prediction models. Furthermore, the dynamic nature of real-world sensor networks calls for solutions that are resilient not only to changes in traffic behavior, but also to changes in the network structure, where the cold start problem represents an important challenge. We study three different approaches in the context of large and dynamic sensor networks: local, global, and cluster-based. The local approach builds a specific prediction model for each sensor of the network. Conversely, the global approach builds a single prediction model for the whole sensor network. Finally, the cluster-based approach groups sensors into homogeneous clusters and generates a model for each cluster. We provide a large dataset, generated from ~1.3 billion records collected by up to 272 sensors deployed in Fortaleza, Brazil, and use it to experimentally assess the effectiveness and resilience of prediction models built according to the three aforementioned approaches.
The results show that the global and cluster-based approaches provide very accurate prediction models that prove to be robust to changes in traffic behavior and in the structure of sensor networks.
Source: Information systems (Oxf.) 98 (2019). doi:10.1016/j.is.2019.101444
DOI: 10.1016/j.is.2019.101444
Project(s): BigDataGrapes , MASTER

2019 Conference article Open Access

Enhanced news retrieval: passages lead the way!
Catena M., Nardini F. M., Frieder O., Perego R., Muntean C. I., Tonellotto N.
We observe that the most relevant terms in unstructured news articles are primarily concentrated towards the beginning and the end of the document. Exploiting this observation, we propose a novel version of the classical BM25 weighting model, called BM25 Passage (BM25P), which scores query results by computing a linear combination of term statistics in the different portions of news articles. Our experimentation, conducted using three publicly available news datasets, demonstrates that BM25P markedly outperforms BM25 in terms of effectiveness by up to 17.44% in NDCG@5 and 85% in NDCG@1.
Source: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1269–1272, Paris, France, 21-25 July 2019
DOI: 10.1145/3331184.3331373