2020

Journal article, Open Access

A critical reassessment of the Saerens-Latinne-Decaestecker algorithm for posterior probability adjustment

Esuli A., Molinari A., Sebastiani F.

Prior probabilities; Dataset shift; Computer Science Applications; Distribution shift; General Business, Management and Accounting; Information Systems; Probabilistic classifiers; Posterior probabilities; Text classification

We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities ("priors") and adjusting posterior probabilities ("posteriors") in scenarios characterized by distribution shift, i.e., difference in the distribution of the priors between the training and the unlabelled documents. Given a machine-learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as "quantification"). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.
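
The iterative, mutually recursive update mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual expectation-maximization formulation of SLD: each posterior is rescaled by the ratio between the current prior estimate and the training prior and then renormalized, and the prior estimate is then recomputed as the mean of the adjusted posteriors. The function name `sld`, its parameters, and the stopping rule are illustrative choices, not taken from the article, and the training priors are assumed to be strictly positive.

```python
def sld(posteriors, train_priors, max_iter=1000, tol=1e-6):
    """Sketch of an SLD-style EM update (illustrative, not the authors' code).

    posteriors   -- list of rows, one per unlabelled document; each row holds
                    the classifier's posterior for every class and sums to 1
    train_priors -- class priors observed on the training set (all > 0)
    Returns (estimated priors, adjusted posteriors, converged flag).
    """
    n_classes = len(train_priors)
    priors = list(train_priors)              # start from the training priors
    adjusted = [list(row) for row in posteriors]
    converged = False

    for _ in range(max_iter):
        # E-step: rescale each posterior by (current prior / training prior)
        # and renormalize so that every row sums to 1 again.
        ratios = [priors[c] / train_priors[c] for c in range(n_classes)]
        adjusted = []
        for row in posteriors:
            weighted = [row[c] * ratios[c] for c in range(n_classes)]
            total = sum(weighted)
            adjusted.append([w / total for w in weighted])

        # M-step: the new prior estimate is the mean adjusted posterior.
        new_priors = [
            sum(row[c] for row in adjusted) / len(adjusted)
            for c in range(n_classes)
        ]

        # Stop when the prior estimates no longer change appreciably.
        delta = max(abs(new_priors[c] - priors[c]) for c in range(n_classes))
        priors = new_priors
        if delta < tol:
            converged = True
            break

    return priors, adjusted, converged


# Hypothetical toy input: four documents, two classes, balanced training set.
toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.7, 0.3]]
est_priors, adj_posteriors, ok = sld(toy_posteriors, [0.5, 0.5])
print(est_priors, ok)
```

The `converged` flag is returned explicitly because, as the abstract notes, the algorithm may converge slowly or not at all; a caller can then fall back to the unadjusted posteriors when the iteration limit is reached.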

Source: ACM Transactions on Information Systems 39 (2020). doi:10.1145/3433164

Publisher: Association for Computing Machinery, New York, NY, United States of America

Cite as

BibTeX entry

@article{oai:it.cnr:prodotti:440890,
	title = {A critical reassessment of the Saerens-Latinne-Decaestecker algorithm for posterior probability adjustment},
	author = {Esuli A. and Molinari A. and Sebastiani F.},
	publisher = {Association for Computing Machinery, New York, NY, United States of America},
	doi = {10.1145/3433164},
	journal = {ACM Transactions on Information Systems},
	volume = {39},
	year = {2020}
}

CNR authors and affiliations

CNR authors

Esuli, Andrea
0000-0002-5725-4322
Molinari, Alessio
0000-0002-8791-3245
Sebastiani, Fabrizio
0000-0003-4221-6427

Laboratories

Networked Multimedia Information System (2002-2020)
Artificial Intelligence for Media and Humanities (2021-ongoing)

Download

CNR ExploRA

Bibliographic record

ISTI Repository

Postprint version

DOI

10.1145/3433164

Also available from

ZENODO
ACM Transactions on Information Systems
dl.acm.org

Projects (via OpenAIRE)

AI4Media
A European Excellence Centre for Media, Society and Democracy
ARIADNEplus
Advanced Research Infrastructure for Archaeological Data Networking in Europe - plus
SoBigData-PlusPlus
SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics