2015

Conference article Open Access

Semi-automated text classification for sensitivity identification

Berardi G., Esuli A., Macdonald C., Ounis I., Sebastiani F.

Sensitive information

Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
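
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document comes with a calibrated probability of being sensitive, and that the classifier's contingency counts (`tp`, `fp`, `fn`) have been estimated beforehand. The names `Doc`, `expected_gain`, and `rank_for_review` are hypothetical. Each document is scored by the probability that its automatic label is wrong, multiplied by the improvement in F2 that correcting it would bring, and documents are ranked by that score.

```python
from dataclasses import dataclass


def f_beta(tp: float, fp: float, fn: float, beta: float = 2.0) -> float:
    """F_beta from contingency counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: F_beta is 1.0 when there is nothing to find.
    return (1 + b2) * tp / denom if denom > 0 else 1.0


@dataclass
class Doc:
    doc_id: str
    predicted_sensitive: bool  # label assigned by the automatic classifier
    p_sensitive: float         # assumed calibrated P(sensitive | document)


def expected_gain(doc: Doc, tp: float, fp: float, fn: float,
                  beta: float = 2.0) -> float:
    """Expected F_beta improvement from having a reviewer check this document."""
    base = f_beta(tp, fp, fn, beta)
    if doc.predicted_sensitive:
        # The label is wrong if the document is not sensitive (a false positive).
        p_error = 1.0 - doc.p_sensitive
        corrected = f_beta(tp, max(fp - 1, 0), fn, beta)
    else:
        # The label is wrong if the document is sensitive (a false negative).
        p_error = doc.p_sensitive
        corrected = f_beta(tp + 1, fp, max(fn - 1, 0), beta)
    return p_error * (corrected - base)


def rank_for_review(docs: list[Doc], tp: float, fp: float, fn: float,
                    beta: float = 2.0) -> list[Doc]:
    """Order documents so the most valuable ones to check come first."""
    return sorted(docs, key=lambda d: expected_gain(d, tp, fp, fn, beta),
                  reverse=True)


# Hypothetical estimated counts and documents, for illustration only.
docs = [
    Doc("A", predicted_sensitive=True, p_sensitive=0.60),
    Doc("B", predicted_sensitive=False, p_sensitive=0.30),
    Doc("C", predicted_sensitive=False, p_sensitive=0.05),
]
ranking = [d.doc_id for d in rank_for_review(docs, tp=8, fp=2, fn=4)]
```

Because F2 penalizes a missed sensitive document more heavily than a false alarm, a document labelled non-sensitive with moderate uncertainty can outrank a more uncertain document labelled sensitive, which is the recall-oriented behaviour the abstract describes.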

Source: 24th ACM International Conference on Information and Knowledge Management, pp. 1711–1714, Melbourne, AU, 19-23/10/2015

Cite as

BibTeX entry

@inproceedings{oai:it.cnr:prodotti:344510,
	title = {Semi-automated text classification for sensitivity identification},
	author = {Berardi, G. and Esuli, A. and Macdonald, C. and Ounis, I. and Sebastiani, F.},
	doi = {10.1145/2806416.2806597},
	booktitle = {24th ACM International Conference on Information and Knowledge Management},
	pages = {1711--1714},
	address = {Melbourne, AU},
	note = {19-23 October 2015},
	year = {2015}
}