05.02.2025
The rise of deepfakes has raised concerns about the potential misuse of this technology for disinformation, fraud, and other malicious purposes. To counter this threat, researchers have developed deepfake detectors, with widely published claims of 99%+ accuracy. Yet this space evolves extremely quickly, with new and more realistic generation algorithms released on a weekly basis.
To keep up with the evolving threat landscape, detectors need to be generalisable across different datasets and deepfake generation methods. With the release of a new deepfake dataset - DeepAction (Bohacek & Farid, 2024) - we took the opportunity to assess the generalisability of open-source deepfake detectors across the dataset and deepfake generation methods.
While much of our work involves evaluating closed-source models, there is still a lot to be learned from open-source models, and we hope the lessons from our assessment will be useful to the wider community evaluating deepfake detectors and to those building generalisable models.
The DeepAction dataset is a comprehensive collection of human action videos consisting of 3,100 AI-generated clips from seven text-to-video models - including BDAnimateDiff (Lin & Yang, 2024), CogVideoX-5B (Yang et al., 2024), RunwayML Gen3 (RunwayML, n.d.), Stable Diffusion (Blattmann et al., 2023), Veo (Veo, n.d.), and VideoPoet (Kondratyuk et al., 2024) - and 100 matching real videos from Pexels.
The dataset encompasses 100 distinct human actions, with videos generated using prompts like "a person walking through a park" or "a person vacuuming the living room". While existing datasets commonly used to benchmark deepfake detectors like FaceForensics++ (Rössler et al., 2019) and Celeb-DF (Li et al., 2020) focus on facial manipulations, DeepAction enables evaluation of deepfake detectors on full-body actions and scenes - providing a new lens on detector generalisability.
We preprocessed the dataset using the following steps:
Note, this requires a modification to DeepfakeBench as the original implementation crops faces from the images, which is not suitable for action videos.
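As a rough illustration of this step, the sketch below samples frames directly from each clip without any face detection or cropping; the sampling interval, output size, and file layout are our own assumptions rather than the exact preprocessing used.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: Path, out_dir: Path, every_n: int = 10, size: int = 256) -> None:
    """Sample every n-th frame, resize it, and save it as a PNG.

    Unlike DeepfakeBench's default preprocessing, no face detector is run:
    the full frame is kept because DeepAction clips show whole scenes.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frame = cv2.resize(frame, (size, size))
            cv2.imwrite(str(out_dir / f"{video_path.stem}_{saved:05d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()

# e.g. extract_frames(Path("videos/real/clip_001.mp4"), Path("frames/real"))
```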
We evaluated three open-source deepfake detectors, each representing different approaches to detection:
This selection spans traditional specialist models (Xception), modern generalization-focused approaches (UCF), and large-scale representation learning (CLIP). All three models were pre-trained on FF++ to enable fair comparison on the DeepAction dataset.
To evaluate how well these algorithms can adapt from the facial manipulation domain to action videos, we fine-tuned them on the DeepAction data before evaluating them on the test set. The frame-level results (N = 10,564) are summarised in Table 1 below.
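For reference, a minimal fine-tuning loop might look like the sketch below. It assumes a generic PyTorch `detector` that outputs one logit per frame and a frame-level dataset with binary labels; it is not DeepfakeBench's actual training code, and the hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def finetune(detector: nn.Module, train_set, epochs: int = 5, lr: float = 2e-4,
             device: str = "cuda") -> nn.Module:
    """Fine-tune a pre-trained detector on DeepAction frames (binary real/fake labels)."""
    detector = detector.to(device).train()
    loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
    optimiser = torch.optim.AdamW(detector.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()  # assumes the detector outputs a single logit per frame
    for _ in range(epochs):
        for frames, labels in loader:
            frames, labels = frames.to(device), labels.float().to(device)
            logits = detector(frames).squeeze(1)
            loss = criterion(logits, labels)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return detector
```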
Table 1: Performance of deepfake detectors trained and tested on the DeepAction dataset at the frame level
Our evaluation uses multiple metrics:
All models show strong performance, with the main differentiator being their trade-off between sensitivity and specificity.
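For concreteness, the sketch below shows how such frame-level metrics can be computed from binary labels and predicted fake probabilities using scikit-learn; the exact metric set reported in Table 1 may differ.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def frame_metrics(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute headline metrics from binary labels (1 = fake) and predicted fake probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),  # true-positive rate: fakes correctly flagged
        "specificity": tn / (tn + fp),  # true-negative rate: real videos correctly passed
    }
```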
Video-level (N = 399) results mirror the frame-level findings, confirming that these architectures can effectively adapt to non-facial content when properly fine-tuned. This suggests that the fundamental feature extraction capabilities of these models extend beyond their original training domain.
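One simple way to obtain video-level scores from frame-level predictions is to average the predicted probabilities per video, as in the sketch below; the actual aggregation used may differ.

```python
import pandas as pd

def video_level_scores(frame_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate frame predictions to one score per video by averaging fake probabilities.

    Expects columns: video_id, label (1 = fake), prob (predicted fake probability).
    """
    return (
        frame_df.groupby("video_id")
        .agg(label=("label", "first"), prob=("prob", "mean"))
        .reset_index()
    )
```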
Table 2: Performance of deepfake detectors trained and tested on the DeepAction dataset at the video level
While surveillance of emerging deepfake threats and continuous improvement of detector models by fine-tuning on the latest techniques is effective, it is not always possible to know the exact nature of such threats in advance. To evaluate the generalisability of pre-trained models, we take our three models pre-trained on the FF++ dataset and evaluate them on DeepAction without further fine-tuning. The results are summarised in Table 3 below.
Table 3: Performance of deepfake detectors trained on FF++ at the frame level
All three models perform significantly worse when evaluated on the DeepAction dataset without fine-tuning. Xception and CLIP do a poorer job of identifying deepfakes, while UCF appears to prioritise sensitivity over specificity. This suggests that models trained on a different data distribution end up optimised for different aspects of performance, making direct comparison of headline metrics misleading.
Figure 1: ROC Curve of pre-trained models
In cases where we are able to obtain the prediction probabilities, we can plot the ROC curve to better understand the trade-off between sensitivity and specificity. The ROC curve for the pre-trained models is shown in Figure 1. This provides a more useful comparison of the models and would allow us to pick the model that best suits our needs - if we value a low false positive rate (high specificity) we would choose the Xception model, but if we value a high sensitivity (true positive rate) we would choose the CLIP model. This analysis suggests that while no single pre-trained model generalises perfectly, we could combine models with complementary strengths to offer a more performant solution for detecting across deepfake types.
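A minimal sketch of producing a plot like Figure 1 from each model's predicted probabilities is shown below, assuming the ground-truth labels and per-frame probabilities have already been collected.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc(model_probs: dict, y_true) -> None:
    """Overlay ROC curves for several models given their predicted fake probabilities."""
    for name, probs in model_probs.items():
        fpr, tpr, _ = roc_curve(y_true, probs)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_true, probs):.2f})")
    plt.plot([0, 1], [0, 1], "k--", label="chance")  # diagonal reference line
    plt.xlabel("False positive rate (1 - specificity)")
    plt.ylabel("True positive rate (sensitivity)")
    plt.legend()
    plt.show()

# e.g. plot_roc({"Xception": xcep_probs, "UCF": ucf_probs, "CLIP": clip_probs}, labels)
```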
In a typical evaluation of a black-box model, one is left to speculate about why it is not as generalisable as hoped. The benefit of using open-source models is that we can dive one level deeper, into the feature embedding space, to understand what might be going on.
Figure 2: Visualising feature embeddings
The figure shows t-SNE visualisations of feature embeddings for each model before and after fine-tuning. The top row shows embeddings from models trained only on FF++, while the bottom row shows embeddings after fine-tuning on DeepAction. Blue points represent real videos, while other colors indicate different deepfake generation methods. Each model's embedding structure reveals distinct patterns in how it learns to distinguish real from fake content.
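For readers who want to reproduce this kind of plot, the sketch below projects pre-extracted features with scikit-learn's t-SNE and colours them by source; the feature dimensionality, perplexity, and labelling scheme are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, sources: np.ndarray, title: str) -> None:
    """Project detector features to 2-D and colour points by video source.

    `features` is (n_frames, d) taken from the model's penultimate layer;
    `sources` holds strings such as "real", "Veo", "CogVideoX-5B", etc.
    """
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for src in np.unique(sources):
        mask = sources == src
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=src)
    plt.title(title)
    plt.legend(markerscale=3)
    plt.show()
```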
The Xception model reveals characteristics of a specialist architecture that excels with training data but struggles to generalize. Examining the t-SNE embeddings shows that before fine-tuning, the model fails to meaningfully separate real from fake videos - all data points appear randomly distributed. However, after fine-tuning on DeepAction, a clear separation emerges between real and fake content, suggesting the model can learn the task but requires explicit training examples.
Figure 3: Leave-one-out analysis by testing data subset
To further test the generalisability of the Xception model, we conducted a leave-one-out analysis: we trained the model on videos from all but one of the DeepAction generation methods and evaluated it on videos from all methods (including the held-out one). The matrix of results is shown in Figure 3.
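Conceptually, the analysis looks like the loop sketched below; `train_detector` and `evaluate` are hypothetical helpers standing in for the actual training and evaluation code, and the generator list reflects only the methods named above.

```python
# Illustrative only: train_detector() and evaluate() are assumed helpers, not DeepfakeBench's API.
GENERATORS = ["BDAnimateDiff", "CogVideoX-5B", "RunwayML", "StableDiffusion", "Veo", "VideoPoet"]

def leave_one_out(real_frames, fake_frames_by_gen, train_detector, evaluate):
    """Hold out one generator at a time, train on the rest, and evaluate on every generator."""
    results = {}
    for held_out in GENERATORS:
        train_fakes = [f for g, frames in fake_frames_by_gen.items()
                       if g != held_out for f in frames]
        model = train_detector(real_frames, train_fakes)
        # Evaluate on every generator, including the one the model never saw during training.
        results[held_out] = {g: evaluate(model, real_frames, fake_frames_by_gen[g])
                             for g in GENERATORS}
    return results
```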
Performance drops significantly when the model is evaluated on unseen data, especially for BDAnimateDiff, RunwayML and Veo. Bohacek & Farid (2024) noted that the Veo videos came from a pre-release model, which might explain why the detector performs so poorly without any Veo samples to train on. Interestingly, the performance drop was negligible for StableDiffusion.
The UCF model shows some structure in its t-SNE embeddings when trained on FF++, though without a clear separation between real and fake data points. This structure likely emerges from UCF's contrastive regularisation technique, which explicitly separates content-specific from forgery-specific features (Figure 4).
Figure 4: UCF architecture (Yan, Zhang, Fan, et al., 2023) - separating an input into content-specific and forgery-specific features
While the forgery features learned on FF++ may generalise across other facial manipulation and forgery techniques, they may not translate as effectively to the DeepAction dataset, where portrait-like faces are rare and backgrounds are more complex.
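To make the idea concrete (and only the idea - this is a deliberately simplified toy, not UCF's actual architecture), the sketch below splits a backbone feature into a content branch and a forgery branch and classifies using only the latter.

```python
import torch
import torch.nn as nn

class DisentangledHead(nn.Module):
    """Toy illustration of splitting a backbone feature into content and forgery parts.

    This is not UCF's implementation; it only sketches the idea that the real/fake
    classifier should rely on the forgery-specific half of the representation.
    """

    def __init__(self, feat_dim: int = 2048, split_dim: int = 512):
        super().__init__()
        self.content_proj = nn.Linear(feat_dim, split_dim)  # content-specific features
        self.forgery_proj = nn.Linear(feat_dim, split_dim)  # forgery-specific features
        self.classifier = nn.Linear(split_dim, 1)           # decision uses forgery features only

    def forward(self, backbone_feat: torch.Tensor):
        content = self.content_proj(backbone_feat)
        forgery = self.forgery_proj(backbone_feat)
        return self.classifier(forgery), content, forgery
```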
Table 4: AUC scores of UCF model trained on FF++ and evaluated on different datasets. First three results taken from Yan, Zhang, Yuan, et al. (2023)
Table 4 shows the AUC scores of the UCF model trained on FF++ and evaluated on different datasets. The model performs well on FF++, but the AUC scores drop significantly as the test dataset diverges further from the training dataset. The progression from celebrity faces in Celeb-DF, to the more diverse methods and perturbations of DFDC, to the entirely different domain of DeepAction exposes a limitation even in a model that was designed to be more generalisable.
The t-SNE visualisation of CLIP's embeddings presents an intriguing puzzle: despite showing clear separation between real and fake data points, the model achieves only marginally better classification performance than Xception or UCF (AUC 0.71 vs 0.67).
This disconnect highlights a fundamental insight about deep learning models: the ability to learn discriminative features (visible in the t-SNE plot) does not automatically translate to strong classification performance. The classification head must still learn to correctly weigh and combine these features for the specific task at hand.
Fortunately, for our pre-trained model there is a simple remedy. By freezing CLIP's pre-trained parameters and training only a new classification layer on the DeepAction dataset, we achieved:
This dramatic improvement mirrors Bohacek & Farid's (2024) findings with their SVM approach (0.84 accuracy) and suggests that while CLIP learns robust, transferable features, adapting these features to new domains may require targeted fine-tuning of the classification layer.
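A minimal sketch of this linear-probe setup is shown below: the CLIP image encoder is frozen and only a new linear head is trained. The feature dimension, optimiser, and training schedule are assumptions, and `clip_backbone` stands in for whichever frozen encoder is used.

```python
import torch
import torch.nn as nn

def train_linear_probe(clip_backbone: nn.Module, train_loader, feat_dim: int = 768,
                       epochs: int = 5, lr: float = 1e-3, device: str = "cuda") -> nn.Module:
    """Freeze the CLIP image encoder and train only a new linear real/fake head."""
    clip_backbone = clip_backbone.to(device).eval()
    for p in clip_backbone.parameters():
        p.requires_grad = False                   # backbone stays fixed
    head = nn.Linear(feat_dim, 1).to(device)      # only these weights are learned
    optimiser = torch.optim.AdamW(head.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for frames, labels in train_loader:
            frames, labels = frames.to(device), labels.float().to(device)
            with torch.no_grad():
                feats = clip_backbone(frames)     # (batch, feat_dim) image embeddings
            loss = criterion(head(feats).squeeze(1), labels)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return head
```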
Our evaluation of three open-source deepfake detectors reveals both the potential and limitations of generalisable synthetic media detection.
When fine-tuned on new domains, deepfake detectors show remarkable adaptability - even models originally trained on facial deepfakes achieve 0.95+ AUC scores on synthetic action videos. However, without fine-tuning, performance degrades significantly as test data diverges from the training distribution, even for newer architectures like UCF and CLIP that were designed for better generalization.
Our analysis also reveals a critical insight: strong feature learning doesn't guarantee strong classification. While models can learn rich, discriminative features (evident in their latent space), translating these features into accurate predictions often requires targeted fine-tuning of the classification layer.
We have evaluated the generalisability of three open-source deepfake detectors across different datasets and deepfake generation methods, and found that fine-tuning on the DeepAction dataset significantly improved their performance: deepfake detectors, even those trained on facial images, can readily adapt to other types of imagery.
However, generalising pre-trained models remains a challenge. While newer architectures like UCF and CLIP show promise in being more generalisable, it is important to validate their performance on unseen datasets that mirror the distribution of the data they are likely to be deployed on. Understanding the distinction between feature learning and classification performance provides a path forward: future detectors may benefit more from focusing on robust feature extraction and making those features available to lightweight, task-specific classifiers than from trying to build one-size-fits-all classifiers.
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., & Rombach, R. (2023). Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. https://arxiv.org/abs/2311.15127
Bohacek, M., & Farid, H. (2024). Human Action CLIPS: Detecting AI-generated Human Motion. https://arxiv.org/abs/2412.00526
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1251–1258. http://openaccess.thecvf.com/content_cvpr_2017/html/Chollet_Xception_Deep_Learning_CVPR_2017_paper.html
Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.-C., Somandepalli, K., Akbari, H., Alon, Y., Cheng, Y., Dillon, J., Gupta, A., Hahn, M., Hauth, A., Hendon, D., … Jiang, L. (2024). VideoPoet: A Large Language Model for Zero-Shot Video Generation. https://arxiv.org/abs/2312.14125
Li, Y., Yang, X., Sun, P., Qi, H., & Lyu, S. (2020). Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. https://arxiv.org/abs/1909.12962
Lin, S., & Yang, X. (2024). AnimateDiff-Lightning: Cross-Model Diffusion Distillation. https://arxiv.org/abs/2403.12706
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020
Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2019). FaceForensics++: Learning to Detect Manipulated Facial Images. https://arxiv.org/abs/1901.08971
RunwayML. (n.d.). Introducing Gen-3 Alpha. https://runwayml.com/research/introducing-gen-3-alpha
Veo. (n.d.). https://deepmind.google/technologies/veo
Yan, Z., Yao, T., Chen, S., Zhao, Y., Fu, X., Zhu, J., Luo, D., Yuan, L., Wang, C., Ding, S., & others. (2024). DF40: Toward Next-Generation Deepfake Detection. arXiv Preprint arXiv:2406.13495.
Yan, Z., Zhang, Y., Fan, Y., & Wu, B. (2023). UCF: Uncovering Common Features for Generalizable Deepfake Detection. https://arxiv.org/abs/2304.13949
Yan, Z., Zhang, Y., Yuan, X., Lyu, S., & Wu, B. (2023). DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems (Vol. 36, pp. 4534–4565). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/0e735e4b4f07de483cbe250130992726-Paper-Datasets_and_Benchmarks.pdf
Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Gu, X., Zhang, Y., Wang, W., Cheng, Y., Liu, T., Xu, B., Dong, Y., & Tang, J. (2024). CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. https://arxiv.org/abs/2408.06072