Synaptiks

Paper review - Can multi-modal (reasoning) LLMs work as deepfake detectors?

a) Context and Problem to Solve

Understanding Deepfakes and Their Impact

Imagine you're watching a video of a famous person saying something shocking. Later, you find out that the video was fake, created using advanced computer techniques. This kind of fake media is called a "deepfake." Deepfakes are synthetic media where a person's likeness is replaced with someone else's, making it appear that they said or did something they never actually did. These can be in the form of videos, images, or audio recordings.

Why Are Deepfakes a Concern?

Deepfakes have become a significant concern because they can be used to spread false information, commit fraud, or damage someone's reputation. As the technology to create deepfakes becomes more accessible and sophisticated, it's harder to tell what's real and what's fake. This poses challenges in areas like news reporting, social media, and even personal relationships.

Traditional Methods of Detecting Deepfakes

To combat the spread of deepfakes, researchers have developed various methods to detect them. Traditionally, these methods rely on computer programs that analyze images or videos to find inconsistencies or signs of manipulation. These programs often use convolutional neural networks (CNNs), a type of artificial intelligence (AI) model designed to process visual information. While these methods have been somewhat effective, they struggle when faced with new, more advanced deepfakes that differ from the ones they were trained to detect.
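To make the CNN idea concrete, here is a minimal, purely illustrative sketch of the core operation these detectors rely on: sliding a small kernel over an image to produce local responses. This is not the paper's detector; the kernel and inputs are toy examples chosen to show how a convolution reacts to abrupt local intensity changes, the kind of low-level artifact a trained CNN can learn to flag.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image and sum the element-wise
    products at each position (a 'valid' 2-D convolution)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Laplacian-style kernel responds strongly to abrupt intensity
# changes and is zero on uniform regions.
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

smooth = np.ones((5, 5))              # uniform patch: no artifacts
response = conv2d(smooth, laplacian)  # all zeros: nothing to flag
```

A real detector stacks many such learned filters and trains them end-to-end on labeled real/fake images, which is also why it can fail on manipulation types absent from its training data.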

Introducing Multi-modal Large Language Models (LLMs)

Recently, a new type of AI model called multi-modal large language models (LLMs) has emerged. These models can process and understand multiple types of data, such as text and images, simultaneously. Examples include OpenAI's GPT-4V and Google's Gemini. These models have shown impressive abilities in understanding and reasoning across different forms of information.

The Central Question

Given the advanced capabilities of multi-modal LLMs, researchers are curious: Can these models be used to detect deepfakes more effectively than traditional methods? This is the main question the paper aims to answer.

b) Methods Used in the Study

Evaluating Multi-modal LLMs

The researchers selected 12 state-of-the-art multi-modal LLMs to test their ability to detect deepfakes. These models were chosen based on their advanced capabilities and recent development. The models include:

  • OpenAI's GPT-4V

  • Google's Gemini Flash 2

  • Deepseek's Janus

  • Grok 3

  • Llama 3.2

  • Qwen 2/2.5 VL

  • Mistral's Pixtral

  • Claude 3.5/3.7 Sonnet

Datasets Used for Testing

To evaluate the models, the researchers used several datasets containing both real and deepfake images. These datasets included:

  1. FaceForensics++: A collection of videos with both real and manipulated content.

  2. DFDC (DeepFake Detection Challenge) Dataset: A large-scale dataset with diverse deepfake videos.

  3. Celeb-DF: A dataset with deepfake videos of celebrities.

  4. WildDeepfake: A dataset containing deepfakes collected from the internet, representing real-world scenarios.

Testing Procedure

The models were presented with images from these datasets and asked to determine whether each image was real or a deepfake. To enhance the models' performance, the researchers used a technique called "prompt tuning." This involves crafting specific instructions or questions to guide the model's responses more effectively.
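A rough sketch of what such a pipeline might look like is below. The exact prompts and model APIs used in the paper are not reproduced here; `build_prompt` is a hypothetical prompt in the spirit of the paper's prompt tuning (constraining the answer format), and the model call itself is omitted since it depends on the provider.

```python
def build_prompt():
    """A hypothetical prompt that constrains the answer format,
    in the spirit of the paper's 'prompt tuning'."""
    return (
        "You are a forensic image analyst. Examine the attached face "
        "image for signs of manipulation (blending edges, inconsistent "
        "lighting, texture artifacts). Answer with exactly one word: "
        "REAL or FAKE."
    )

def parse_verdict(reply):
    """Map a free-form model reply onto a binary label;
    anything that mentions neither keyword is flagged for review."""
    text = reply.strip().upper()
    if "FAKE" in text:
        return "fake"
    if "REAL" in text:
        return "real"
    return "uncertain"
```

Constraining the output format matters in practice: without it, multi-modal LLMs tend to hedge in free text, which makes their verdicts hard to score automatically.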

Analyzing the Models' Reasoning

Beyond just checking if the models could correctly identify deepfakes, the researchers wanted to understand how the models arrived at their decisions. They conducted an in-depth analysis of the models' reasoning pathways to identify key factors that influenced their judgments.

c) Key Results of the Study

Performance of Multi-modal LLMs

The study found that some of the multi-modal LLMs performed competitively with traditional deepfake detection methods. Notably, the best-performing models demonstrated strong generalization abilities, meaning they could accurately detect deepfakes even in datasets they hadn't encountered before. In some cases, these models outperformed traditional detection pipelines, especially on out-of-distribution datasets (datasets that differ significantly from the training data).

Variability Among Models

However, not all multi-modal LLMs performed well. Some models performed worse than random guessing, highlighting significant variability in their effectiveness for deepfake detection.
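"Worse than random guessing" has a precise meaning here: on a balanced dataset, a coin flip achieves 50% accuracy in expectation, so any model below that threshold is actively miscalibrated. A tiny sketch with made-up predictions (not the paper's numbers) shows the comparison:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical results on a balanced set of 8 images (1 = fake, 0 = real).
labels       = [1, 0, 1, 0, 1, 0, 1, 0]
strong_model = [1, 0, 1, 0, 1, 1, 1, 0]   # 7/8 correct
weak_model   = [0, 1, 1, 1, 0, 1, 0, 1]   # 1/8 correct

random_baseline = 0.5  # expected accuracy of a coin flip on balanced data
```

Here the strong model scores 0.875, well above the baseline, while the weak model scores 0.125, below chance, which is roughly the spread the study reports across the 12 models.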

Factors Influencing Performance

Interestingly, the study found that newer versions of models or those with enhanced reasoning capabilities did not necessarily perform better in deepfake detection tasks. However, model size appeared to play a role, with larger models sometimes exhibiting improved performance.

d) Conclusions and Main Implications

Potential of Multi-modal LLMs in Deepfake Detection

The findings suggest that certain multi-modal LLMs have the potential to be effective tools for detecting deepfakes, offering competitive performance compared to traditional methods. Their ability to process and reason across multiple data types allows them to adapt to various forms of deepfake manipulations.

Need for Further Research

The variability in performance among different models indicates that not all multi-modal LLMs are suitable for deepfake detection. Further research is needed to understand the specific characteristics that make some models more effective than others.

Implications for Future Detection Frameworks

Integrating multi-modal reasoning into deepfake detection frameworks could enhance their robustness and adaptability to real-world scenarios. This approach may lead to more reliable detection methods that can keep pace with the rapidly evolving techniques used to create deepfakes.
