Journal article  Open Access

Utility-theoretic ranking for semiautomated text classification

Berardi G., Esuli A., Sebastiani F.

Computer Science - Machine Learning  FOS: Computer and information sciences  Utility theory  Semi-automated text classification  I.2.6 Learning  General Computer Science  Semiautomated classification  Machine Learning (cs.LG) 

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would result from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
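The baseline strategy described in the abstract (lowest-confidence documents ranked first) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy data, and the simplification that "classification error" is the count of mislabelled documents are all assumptions made for illustration. The measure proposed in the article is an expected reduction in error, which this sketch only approximates by evaluating a single validation depth.

```python
from typing import List, Sequence


def rank_by_lowest_confidence(confidences: Sequence[float]) -> List[int]:
    """Baseline ranking: document indices ordered from least to most confident."""
    return sorted(range(len(confidences)), key=lambda i: confidences[i])


def error_reduction_at_depth(
    ranking: Sequence[int],
    predicted: Sequence[int],
    true: Sequence[int],
    depth: int,
) -> float:
    """Fraction of labelling errors removed when the annotator validates
    the top `depth` documents of `ranking`.

    Assumption: error is the number of mislabelled documents, and a
    validated document always ends up with its correct label.
    """
    errors_before = sum(p != t for p, t in zip(predicted, true))
    if errors_before == 0:
        return 0.0
    corrected = sum(predicted[i] != true[i] for i in ranking[:depth])
    return corrected / errors_before


# Hypothetical toy data: classifier confidences, predicted and true labels.
confidences = [0.90, 0.55, 0.60, 0.95, 0.70]
predicted = [1, 0, 1, 0, 1]
true = [1, 1, 1, 0, 0]

ranking = rank_by_lowest_confidence(confidences)
reduction_top2 = error_reduction_at_depth(ranking, predicted, true, depth=2)
reduction_top3 = error_reduction_at_depth(ranking, predicted, true, depth=3)
```

In this toy case the second-ranked document is already correct, so validating it yields no gain. That is the kind of inefficiency the abstract attributes to confidence-only ranking; the article's utility-theoretic methods instead rank by validation gain, which this sketch does not attempt to reproduce.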

Source: ACM Transactions on Knowledge Discovery from Data 10 (2015): 6. doi:10.1145/2742548

Publisher: Association for Computing Machinery, New York, NY, United States of America

