Conference article  Open Access

Cross-forgery analysis of vision transformers and CNNs for deepfake image detection

Coccomini D. A., Caldelli R., Falchi F., Gennaro C., Amato G.

Deepfake Detection  Computer vision  Computer Vision and Pattern Recognition (cs.CV)  FOS: Computer and information sciences  Convolutional neural network  Deepfake  Vision Transformers  Deep Learning

Deepfake generation techniques are evolving at a rapid pace, making it possible to create realistic manipulated images and videos and endangering the serenity of modern society. The continual emergence of new and varied techniques raises a further problem: deepfake detection models must be updated promptly in order to identify manipulations carried out with even the most recent methods. This is an extremely complex problem to solve, as training a model requires large amounts of data, which are difficult to obtain if the deepfake generation method is too recent. Moreover, continuously retraining a network would be unfeasible. In this paper, we ask whether, among the various deep learning techniques, there is one that is able to generalize the concept of deepfake to such an extent that it does not remain tied to the specific deepfake generation methods used in the training set. We compared a Vision Transformer with an EfficientNetV2 in a cross-forgery context based on the ForgeryNet dataset. Our experiments show that EfficientNetV2 has a greater tendency to specialize, often obtaining better results on the methods seen during training, while Vision Transformers exhibit a superior generalization ability that makes them more competent even on images generated with new methodologies.
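As an illustration of what a cross-forgery evaluation can look like in practice, the following is a minimal Python sketch, not the authors' implementation. It assumes each image is tagged with the forgery method that produced it, separates fakes into methods seen and unseen during training, and reports accuracy per method. The names `Sample`, `cross_forgery_split`, `accuracy_per_method`, and the method labels `method_a` and `method_b` are hypothetical, and the toy detector stands in for a trained model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional, Set, Tuple


class Sample(NamedTuple):
    image_id: str
    is_fake: bool
    method: Optional[str]  # hypothetical forgery-method label; None for pristine images


def cross_forgery_split(
    samples: Iterable[Sample], train_methods: Set[str]
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Separate fakes made with training methods from fakes made with other methods.

    Pristine images have no method and are returned in their own list.
    """
    seen, unseen, pristine = [], [], []
    for s in samples:
        if not s.is_fake:
            pristine.append(s)
        elif s.method in train_methods:
            seen.append(s)
        else:
            unseen.append(s)
    return seen, unseen, pristine


def accuracy_per_method(
    samples: Iterable[Sample], predict: Callable[[Sample], bool]
) -> Dict[str, float]:
    """Accuracy of predict(sample) -> bool (True means fake), grouped by method."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for s in samples:
        key = s.method if s.is_fake else "pristine"
        totals[key] += 1
        hits[key] += int(predict(s) == s.is_fake)
    return {k: hits[k] / totals[k] for k in totals}


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    data = [
        Sample("img0", True, "method_a"),
        Sample("img1", True, "method_b"),
        Sample("img2", False, None),
    ]
    seen, unseen, pristine = cross_forgery_split(data, {"method_a"})
    # A toy detector that only recognises the method it was "trained" on.
    detector = lambda s: s.method == "method_a"
    print(accuracy_per_method(data, detector))
```

Reporting accuracy separately for seen and unseen methods is what makes the specialization versus generalization trade-off described above visible: a detector can score well on the training methods while failing on fakes from a method it never saw.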

Source: MAD '22 - 1st International workshop on Multimedia AI against Disinformation, pp. 52–58, Newark, NY, USA, 27/06/2022

BibTeX entry
	title = {Cross-forgery analysis of vision transformers and CNNs for deepfake image detection},
	author = {Coccomini D. A. and Caldelli R. and Falchi F. and Gennaro C. and Amato G.},
	doi = {10.1145/3512732.3533582 and 10.48550/arxiv.2206.13829},
	booktitle = {MAD '22 - 1st International workshop on Multimedia AI against Disinformation, pp. 52–58, Newark, NY, USA, 27/06/2022},
	year = {2022}
