
05.02.2025

Article Hero Image

Introduction

The rise of deepfakes has raised concerns about the potential misuse of this technology for disinformation, fraud, and other malicious purposes. To counter this threat, researchers have developed deepfake detectors, with widely published claims of 99%+ accuracy. Yet this space evolves extremely quickly, with new and more realistic generation algorithms released on a weekly basis.

To keep up with the evolving threat landscape, detectors need to be generalisable across different datasets and deepfake generation methods. With the release of a new deepfake dataset - DeepAction (Bohacek & Farid, 2024) - we took the opportunity to assess the generalisability of open-source deepfake detectors across the dataset and deepfake generation methods.

While much of our work involves evaluating closed-source models, there is much to be learned from open-source models, and we hope the lessons from our assessment will be useful to the wider community evaluating deepfake detectors and to those building generalisable models.

Key Takeaways

  1. When fine-tuned on new data, deepfake detectors can perform well even outside their original training domain (faces) - achieving 95%+ AUC scores on action videos. This suggests the underlying architectures are capable of adapting to new types of deepfakes when properly trained.
  2. However, pre-trained models perform poorly when tested on new types of deepfakes without fine-tuning - even for models designed to be more generalisable and pre-trained on large datasets, AUC scores drop to 67-71%. This highlights the challenge of building truly generalisable detectors.
  3. Generalisability exists on a spectrum - while newer architectures like UCF and CLIP show promise, their effectiveness must be validated on unseen datasets that mirror real-world deployment conditions.
  4. Feature separation in the latent space does not automatically translate to strong classification performance. However, when a model learns meaningful representations, adapting it to new domains can be as simple as retraining the final classification layer - as demonstrated by CLIP's rapid improvement from 71% to 94% AUC by freezing the underlying parameters and training a new classifier.

DeepAction Dataset

The DeepAction dataset is a comprehensive collection of human action videos consisting of 3,100 AI-generated clips from seven text-to-video models, including BD AnimateDiff (Lin & Yang, 2024), CogVideoX-5B (Yang et al., 2024), RunwayML Gen3 (RunwayML, n.d.), Stable Diffusion (Blattmann et al., 2023), Veo (Veo, n.d.), and VideoPoet (Kondratyuk et al., 2024), plus 100 matching real videos from Pexels.

Article Image 1

The dataset encompasses 100 distinct human actions, with videos generated using prompts like "a person walking through a park" or "a person vacuuming the living room". While existing datasets commonly used to benchmark deepfake detectors like FaceForensics++ (Rössler et al., 2019) and Celeb-DF (Li et al., 2020) focus on facial manipulations, DeepAction enables evaluation of deepfake detectors on full-body actions and scenes - providing a new lens on detector generalisability.

We preprocessed the dataset using the following steps:

  1. Split the videos into 60/20/20 train/validation/test sets based on video IDs to prevent overlap¹
  2. Sampled up to 32 frames per video using DeepfakeBench (Yan et al., 2023)
  3. Applied a 224x224 pixel center crop to each frame

Note that this requires a modification to DeepfakeBench, as the original implementation crops faces from the frames, which is not suitable for action videos.
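
For illustration, below is a minimal sketch of this kind of preprocessing using OpenCV: evenly sampling up to 32 frames per video and taking a 224x224 centre crop. The resize-before-crop step and the function name are our own assumptions rather than the exact DeepfakeBench modification.

```python
import cv2
import numpy as np

def sample_and_crop_frames(video_path: str, max_frames: int = 32, crop_size: int = 224) -> np.ndarray:
    """Sample up to `max_frames` evenly spaced frames and apply a centre crop to each."""
    cap = cv2.VideoCapture(video_path)
    total = max(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)), 0)
    indices = np.linspace(0, max(total - 1, 0), num=min(max_frames, total), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        # Assumption: resize the short side to crop_size first so the centre crop always fits
        scale = crop_size / min(h, w)
        frame = cv2.resize(frame, (max(int(round(w * scale)), crop_size), max(int(round(h * scale)), crop_size)))
        h, w = frame.shape[:2]
        top, left = (h - crop_size) // 2, (w - crop_size) // 2
        frames.append(frame[top:top + crop_size, left:left + crop_size])
    cap.release()
    return np.stack(frames) if frames else np.empty((0, crop_size, crop_size, 3), dtype=np.uint8)
```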

Deepfake Detector Candidates

We evaluated three open-source deepfake detectors, each representing different approaches to detection:

  1. Xception (Rössler et al., 2019) serves as our baseline candidate. Built on the XceptionNet architecture (Chollet, 2017), it's widely adopted in deepfake detection research. We use the pre-trained model from Yan et al. (2024), fine-tuned on the FaceForensics++ (FF++) dataset.
  2. UCF (Yan, Zhang, Fan, et al., 2023) represents recent advances in generalisable detection. Released in 2023, it employs contrastive regularization to learn both common and specific forgery features. Our evaluation uses the pre-trained model from Yan, Zhang, Yuan, et al. (2023), fine-tuned on FF++.
  3. CLIP (Radford et al., 2021) demonstrates the potential of large-scale training. Though designed for general visual-linguistic tasks, its rich representations make it effective for deepfake detection. Recent benchmarks in Yan et al. (2024) show it outperforming other state-of-the-art detectors. On the DeepAction dataset, Bohacek & Farid (2024) achieved 85% frame accuracy and 97% video accuracy using just an SVM trained on CLIP embeddings. We use the ViT-B/16 variant from Yan et al. (2024), fine-tuned on FF++.

This selection spans traditional specialist models (Xception), modern generalization-focused approaches (UCF), and large-scale representation learning (CLIP). All three models were pre-trained on FF++ to enable fair comparison on the DeepAction dataset.
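
As a point of reference for the CLIP-based approach above, the sketch below shows one way to extract CLIP ViT-B/16 image embeddings with the Hugging Face transformers library; the checkpoint name and preprocessing are assumptions rather than the exact pipeline used by Bohacek & Farid (2024) or Yan et al. (2024).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the papers' exact weights and preprocessing may differ.
CHECKPOINT = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(CHECKPOINT).eval()
processor = CLIPProcessor.from_pretrained(CHECKPOINT)

@torch.no_grad()
def embed_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Return one L2-normalised CLIP image embedding per frame."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # shape (num_frames, 512) for ViT-B/16
    return feats / feats.norm(dim=-1, keepdim=True)   # unit vectors for downstream classifiers
```

A lightweight classifier (for example, scikit-learn's LinearSVC) fit on such embeddings is a close analogue of the SVM baseline reported by Bohacek & Farid (2024).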

Adaptability of Deepfake Detectors to New Datasets

To evaluate how well these models can adapt from the facial deepfake domain to action videos, we fine-tuned them on the DeepAction data before evaluating them on the test set. The frame-level results (N = 10,564) are summarised in Table 1 below.

Table 1: Performance of deepfake detectors trained and tested on the DeepAction dataset at the frame level

Article Image 2

Our evaluation uses multiple metrics:

  • Area under the ROC curve (AUC): The primary metric used in DeepfakeBench (Yan, Zhang, Yuan, et al., 2023)
  • Macro-average accuracy: Adopted from Bohacek & Farid (2024) to handle class imbalance
  • Sensitivity, specificity, and F1 score: To provide a complete performance picture
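
For concreteness, a minimal sketch of how these frame-level metrics might be computed with scikit-learn is shown below. We read macro-average accuracy as balanced accuracy (the mean of per-class recalls); that mapping, the 0.5 threshold, and the label convention (1 = fake) are our assumptions.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score, roc_auc_score

def frame_level_metrics(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute frame-level metrics from fake-class probabilities (label convention: 1 = fake)."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),                      # threshold-free ranking quality
        "macro_accuracy": balanced_accuracy_score(y_true, y_pred),  # mean of per-class recalls
        "sensitivity": tp / (tp + fn),                              # true positive rate on fakes
        "specificity": tn / (tn + fp),                              # true negative rate on real videos
        "f1": f1_score(y_true, y_pred),
    }
```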

All models show strong performance, with the main differentiator being their trade-off between sensitivity and specificity.

Video-level² (N = 399) results mirror the frame-level findings, confirming that these architectures can effectively adapt to non-facial content when properly fine-tuned. This suggests that the fundamental feature extraction capabilities of these models extend beyond their original training domain.
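
A minimal sketch of the video-level aggregation described in footnote 2 is shown below; the column names are illustrative, and ties are broken towards the fake class by assumption.

```python
import pandas as pd

def video_level_predictions(frame_df: pd.DataFrame) -> pd.DataFrame:
    """Majority-vote independent frame predictions into a single label per video.

    Expects illustrative columns `video_id` and `frame_pred` (0 = real, 1 = fake).
    """
    return (
        frame_df.groupby("video_id")["frame_pred"]
        .mean()              # proportion of frames flagged as fake
        .ge(0.5)             # majority voting; ties count as fake by assumption
        .astype(int)
        .rename("video_pred")
        .reset_index()
    )
```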

Table 2: Performance of deepfake detectors trained and tested on the DeepAction dataset at the video level

Article Image 3

Generalisability of Pre-trained Detectors

While monitoring emerging deepfake threats and continuously improving detectors by fine-tuning on the latest techniques is effective, it is not always possible to know the exact nature of such threats in advance. To evaluate the generalisability of pre-trained models, we take our three models pre-trained on the FF++ dataset and evaluate them on DeepAction without further fine-tuning. The results are summarised in Table 3 below.

Table 3: Performance of deepfake detectors trained on FF++ at the frame level

Article Image 4

All three models perform significantly worse when evaluated on the DeepAction dataset without fine-tuning. Xception and CLIP do a poorer job of identifying deepfakes, while UCF prioritises sensitivity over specificity. This suggests that models trained on a different distribution of data end up optimised for different aspects of performance, making direct comparison misleading.

Figure 1: ROC Curve of pre-trained models

Article Image 5

In cases where we can obtain the prediction probabilities, we can plot the ROC curve to better understand the trade-off between sensitivity and specificity. The ROC curves for the pre-trained models are shown in Figure 1. This provides a more useful comparison of the models and allows us to pick the one that best suits our needs: if we value a low false positive rate (high specificity) we would choose the Xception model, but if we value a high true positive rate (sensitivity) we would choose the CLIP model. This analysis suggests that while no single pre-trained model generalises perfectly, models with complementary strengths could be combined into a more performant solution for detection across deepfake types.
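
A sketch of how such ROC curves can be produced from each model's fake-class probabilities is given below; the dictionary-of-scores interface is our own convenience, not DeepfakeBench's API.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc_curves(scores_by_model: dict, y_true) -> None:
    """Overlay ROC curves for several detectors, given fake-class probabilities per model."""
    fig, ax = plt.subplots(figsize=(5, 5))
    for name, y_score in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, y_score)
        ax.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_true, y_score):.2f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
    ax.set_xlabel("False positive rate (1 - specificity)")
    ax.set_ylabel("True positive rate (sensitivity)")
    ax.legend(loc="lower right")
    fig.tight_layout()
    plt.show()
```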

Why do Pre-trained Models Perform Poorly?

In a typical evaluation of a black-box model, one is left to speculate about why it does not generalise as well as hoped. The benefit of using open-source models is that we can dive one level deeper, into the feature embedding space, to understand what might be going on.

Figure 2: Visualising feature embeddings

Article Image 6

The figure shows t-SNE visualisations of feature embeddings for each model before and after fine-tuning. The top row shows embeddings from models trained only on FF++, while the bottom row shows embeddings after fine-tuning on DeepAction. Blue points represent real videos, while other colors indicate different deepfake generation methods. Each model's embedding structure reveals distinct patterns in how it learns to distinguish real from fake content.
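
The sketch below illustrates how such a projection can be produced from a detector's feature embeddings with scikit-learn; the perplexity and other t-SNE settings are arbitrary choices, not those used to generate Figure 2.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, sources: np.ndarray, title: str) -> None:
    """Project feature embeddings to 2-D with t-SNE and colour points by their source."""
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    fig, ax = plt.subplots(figsize=(5, 5))
    for source in np.unique(sources):        # e.g. "real", "Veo", "CogVideoX", ...
        mask = sources == source
        ax.scatter(coords[mask, 0], coords[mask, 1], s=4, label=str(source))
    ax.set_title(title)
    ax.legend(markerscale=3, fontsize="small")
    fig.tight_layout()
    plt.show()
```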

Xception - A Specialist Model

The Xception model reveals characteristics of a specialist architecture that excels with training data but struggles to generalize. Examining the t-SNE embeddings shows that before fine-tuning, the model fails to meaningfully separate real from fake videos - all data points appear randomly distributed. However, after fine-tuning on DeepAction, a clear separation emerges between real and fake content, suggesting the model can learn the task but requires explicit training examples.

Figure 3: Leave-one-out analysis by testing data subset

Article Image 7

To further test the generalisability of the Xception model, we conducted a leave-one-out analysis, training the model on videos from all but one of the DeepAction generation methods and evaluating it on all videos (including those from the held-out method). The matrix of results is shown in Figure 3.

Performance drops significantly when the model is evaluated on unseen data, especially for BDAnimateDiff, RunwayML and Veo. Bohacek & Farid (2024) noted that the Veo videos came from a pre-release version of the model, which might explain why the detector performs so poorly without any Veo samples to train on. Interestingly, the performance drop was negligible for StableDiffusion.
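
The leave-one-out protocol itself is straightforward to express in code. In the sketch below, a logistic-regression head on precomputed frame features stands in for retraining the full Xception detector, purely to keep the example self-contained; the actual experiment fine-tunes the detector for each held-out method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def leave_one_out_auc(feats: np.ndarray, is_fake: np.ndarray, source: np.ndarray) -> dict:
    """Leave-one-generator-out protocol on precomputed frame features.

    `source` holds the generation method per frame, with "real" marking Pexels clips.
    A logistic-regression head is a stand-in for retraining the full detector.
    """
    methods = [m for m in np.unique(source) if m != "real"]
    results = {}
    for held_out in methods:
        train_mask = source != held_out                  # drop the held-out generator from training
        clf = LogisticRegression(max_iter=1000).fit(feats[train_mask], is_fake[train_mask])
        scores = clf.predict_proba(feats)[:, 1]
        results[held_out] = {}
        for m in methods:                                # evaluate on every generator, seen or unseen
            eval_mask = (source == m) | (source == "real")
            results[held_out][m] = roc_auc_score(is_fake[eval_mask], scores[eval_mask])
    return results
```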

UCF and Spectrums of Generalisability

The UCF model shows some structure in its t-SNE embeddings when trained on FF++, though without a clear separation between real and fake data points. This structure likely emerges from UCF's contrastive regularization technique, which explicitly separates content-specific from forgery-specific features (Figure 4).

Figure 4: UCF architecture (Yan, Zhang, Fan, et al., 2023) - separating an input into content-specific and forgery-specific features

Article Image 8

While the forgery features learned on FF++ may generalise across other facial manipulation and forgery techniques, they do not appear to translate as effectively to the DeepAction dataset, where portrait-like faces are rare and backgrounds are more complex.

Table 4: AUC scores of UCF model trained on FF++ and evaluated on different datasets. First three results taken from Yan, Zhang, Yuan, et al. (2023)

Article Image 9

Table 4 shows the AUC scores of the UCF model trained on FF++ and evaluated on different datasets. The model performs well on FF++, but the AUC scores drop steadily as the test data diverges from the training distribution: from celebrity faces in Celeb-DF, to the more diverse methods and perturbations of DFDC, to the entirely different domain of DeepAction. This shows a limitation even in a model that was designed to be more generalisable.

CLIP - Learning Good Features ≠ Good Classification

The t-SNE visualisation of CLIP's embeddings presents an intriguing puzzle: despite showing clear separation between real and fake data points, the model achieves only marginally better classification performance than Xception or UCF (AUC 0.71 vs 0.67).

This disconnect highlights a fundamental insight about deep learning models: the ability to learn discriminative features (visible in the t-SNE plot) does not automatically translate to strong classification performance. The classification head must still learn to correctly weigh and combine these features for the specific task at hand.

Fortunately, for our pre-trained model, there is a simple remedy. By freezing CLIP's pre-trained parameters and training only a new classification layer on the DeepAction dataset, we achieved:

  • AUC score: 0.94 (up from 0.71)
  • Accuracy: 0.85 (up from 0.62)

This dramatic improvement mirrors Bohacek & Farid's (2024) findings with their SVM approach (0.84 accuracy) and suggests that while CLIP learns robust, transferable features, adapting them to new domains may require targeted fine-tuning of the classification layer.
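
A minimal sketch of that remedy, freezing the CLIP image encoder and training only a new classification head, is shown below. The Hugging Face checkpoint name, the two-class head, and the optimiser settings are assumptions for illustration rather than our exact training configuration.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection

class FrozenCLIPProbe(nn.Module):
    """Frozen CLIP ViT-B/16 image encoder with a small trainable classification head."""

    def __init__(self, checkpoint: str = "openai/clip-vit-base-patch16"):
        super().__init__()
        self.encoder = CLIPVisionModelWithProjection.from_pretrained(checkpoint)
        for p in self.encoder.parameters():      # freeze the pre-trained backbone
            p.requires_grad = False
        self.head = nn.Linear(self.encoder.config.projection_dim, 2)  # only these weights are trained

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                    # the backbone stays fixed during training
            feats = self.encoder(pixel_values=pixel_values).image_embeds
        return self.head(feats)                  # real / fake logits

model = FrozenCLIPProbe()
optimiser = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# A standard training loop over DeepAction frames then updates only `model.head`.
```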

Conclusion

Our evaluation of three open-source deepfake detectors reveals both the potential and limitations of generalisable synthetic media detection.

When fine-tuned on new domains, deepfake detectors show remarkable adaptability - even models originally trained on facial deepfakes achieve 0.95+ AUC scores on synthetic action videos. However, without fine-tuning, performance degrades significantly as test data diverges from the training distribution, even for newer architectures like UCF and CLIP that were designed for better generalization.

Our analysis also reveals a critical insight: strong feature learning doesn't guarantee strong classification. While models can learn rich, discriminative features (evident in their latent space), translating these features into accurate predictions often requires targeted fine-tuning of the classification layer.

We have evaluated the generalisability of three open-source deepfake detectors across datasets and deepfake generation methods, and found that fine-tuning on the DeepAction dataset significantly improved their performance: deepfake detectors, even those trained on facial images, can readily adapt to other types of content.

However, generalising pre-trained models remains a challenge. While newer architectures like UCF and CLIP show promise in being more generalisable, it is important to validate their performance on unseen datasets that mirror the distribution of data they are likely to encounter in deployment. Understanding the distinction between feature learning and classification performance also provides a path forward: future detectors may benefit more from focusing on robust, reusable feature extraction than from trying to build one-size-fits-all classifiers.

References

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., & Rombach, R. (2023). Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. https://arxiv.org/abs/2311.15127

Bohacek, M., & Farid, H. (2024). Human Action CLIPS: Detecting AI-generated Human Motion. https://arxiv.org/abs/2412.00526

Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1251–1258. http://openaccess.thecvf.com/content_cvpr_2017/html/Chollet_Xception_Deep_Learning_CVPR_2017_paper.html

Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.-C., Somandepalli, K., Akbari, H., Alon, Y., Cheng, Y., Dillon, J., Gupta, A., Hahn, M., Hauth, A., Hendon, D., … Jiang, L. (2024). VideoPoet: A Large Language Model for Zero-Shot Video Generation. https://arxiv.org/abs/2312.14125

Li, Y., Yang, X., Sun, P., Qi, H., & Lyu, S. (2020). Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. https://arxiv.org/abs/1909.12962

Lin, S., & Yang, X. (2024). AnimateDiff-Lightning: Cross-Model Diffusion Distillation. https://arxiv.org/abs/2403.12706

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020

Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2019). FaceForensics++: Learning to Detect Manipulated Facial Images. https://arxiv.org/abs/1901.08971

RunwayML. (n.d.). Introducing Gen-3 Alpha. https://runwayml.com/research/introducing-gen-3-alpha

Veo. (n.d.). https://deepmind.google/technologies/veo

Yan, Z., Yao, T., Chen, S., Zhao, Y., Fu, X., Zhu, J., Luo, D., Yuan, L., Wang, C., Ding, S., & others. (2024). DF40: Toward Next-Generation Deepfake Detection. arXiv Preprint arXiv:2406.13495.

Yan, Z., Zhang, Y., Fan, Y., & Wu, B. (2023). UCF: Uncovering Common Features for Generalizable Deepfake Detection. https://arxiv.org/abs/2304.13949

Yan, Z., Zhang, Y., Yuan, X., Lyu, S., & Wu, B. (2023). DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems (Vol. 36, pp. 4534–4565). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/0e735e4b4f07de483cbe250130992726-Paper-Datasets_and_Benchmarks.pdf

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Gu, X., Zhang, Y., Wang, W., Cheng, Y., Liu, T., Xu, B., Dong, Y., & Tang, J. (2024). CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. https://arxiv.org/abs/2408.06072

Footnotes

  1. While this is similar to Bohacek & Farid's (2024) train/test split, it also means that the number of real videos (19) evaluated in the test set is relatively limited and standard errors may be high. This is a limitation of the dataset and should be taken into account when interpreting the results.
  2. Since none of our models consider temporal features, each frame is scored and evaluated independently. Results at the video level are then decided by the class with the highest proportion of frames (i.e. majority voting).