Deepfake Detection via
Feature-Level Fusion of
Convolutional Neural Networks

Capstone Report

Turan Nurgozhin

Nazarbayev University
Department of Electrical and Computer Engineering

School of Engineering and Digital Sciences


Copyright © Nazarbayev University

This project report was created on the TeXstudio editing platform using LaTeX. All the figures
were drawn using the draw.io online software tool.


Electrical and Computer Engineering
Nazarbayev University

http://www.nu.edu.kz

Title:
Deepfake Detection via Feature-Level
Fusion of Convolutional Neural Networks

Theme:
Deepfake detection

Project Period:
Spring 2025

Project Group:
Machine Learning Laboratory

Participant(s):
Turan Nurgozhin

Supervisor(s):
Amin Zollanvari

Copies: 1

Page Numbers: 25

Date of Completion:
April 25, 2025

Abstract:

The rapid rise of deepfake technology poses significant challenges, including
misinformation and identity fraud, creating a growing need for robust detection
systems. This project explores a feature fusion approach to deepfake detection
that integrates the feature vectors of multiple CNN models. Four base
architectures (Xception, DenseNet-121, ResNet, and Mesonet) and their paired
combinations were trained on the OpenForensics dataset and evaluated for binary
classification of real and fake images. Cross-dataset testing was conducted on
16,433 video frames from the FaceForensics++ and CelebDF datasets. The Xception
model achieved the highest base model accuracy of 88.3%, while the Xception +
DenseNet-121 combination outperformed all configurations with an accuracy of
89.6% and a macro average F1-score of 0.89. The results show that feature-level
combination of complementary feature spaces improves detection performance,
highlighting the promise of this direction. Directions for future improvement
include increased computational power, larger datasets, and avoiding
diminishing returns. This report serves as groundwork for improving deepfake
detection techniques using feature-level fusion and collaborative model
architectures.

The content of this report is freely available, but publication (with reference) may only be pursued in
agreement with the author(s).



Contents

Preface

1 Introduction
1.1 Ethical and Professional Responsibilities

2 Methodology
2.1 Data Preparation
2.2 Data Preprocessing
2.3 Model Architecture
2.4 Training Strategy
2.5 Model Evaluation

3 Results and Discussions
3.1 Results
3.2 Discussions

4 Conclusion
4.1 Future Work Directions

Bibliography


Preface

Deepfake detection is a rapidly evolving field, driven by the growing accessibility
and misuse of generative artificial intelligence. The rise of deepfakes has intro-
duced significant risks in various domains, including security, media integrity, and
individual privacy, making their detection an urgent priority. This project was un-
dertaken with the aim of contributing to the global effort in combating the misuse
of deepfake technology by proposing innovative solutions for accurate and robust
detection. By using feature fusion among multiple convolutional neural networks,
this work demonstrates the potential of machine learning in solving real-world
problems. The effort has broader implications for deepfake detection in fields
related to digital security, media forensics, and the ethics surrounding
artificial intelligence.

I would like to express gratitude to my supervisor, Professor Amin Zollanvari,
for his unwavering support and guidance throughout the course of this research.
Without his mentorship, this project would not have reached its current level of
quality and completion. I am also profoundly thankful to the School of Engineer-
ing and Digital Sciences, particularly the Electrical and Computer Engineering De-
partment, for their invaluable guidance in shaping my academic and professional
direction, and for fostering my passion for the intersection of artificial intelligence
and electrical engineering.

Nazarbayev University, April 25, 2025

Turan Nurgozhin
<Turan.Nurgozhin@nu.edu.kz>



Chapter 1

Introduction

Deepfakes are synthetic images, videos, and audio created with AI and machine
learning tools to depict real or non-existent people. The resulting fake content
brings risks of misinformation, political exploitation, identity theft,
and various fraudulent activities. As the technology keeps developing due to the
advancements in Generative AI [1], the need to prevent the possible consequences
using deepfake detection tools is becoming increasingly crucial. Therefore, a new
domain of research in machine learning has emerged to develop and implement
algorithms capable of identifying AI-generated media.

The development of deepfake technology reflects significant advancements in
the fields of artificial intelligence and computer vision through technical innova-
tion and societal processes. Early efforts in the 1990s involved using Computer-
Generated Imagery (CGI) to generate convincing human faces, which formed the
foundation for synthetic media [2]. A breakthrough came in 2014, when Generative
Adversarial Networks (GANs) were proposed by Goodfellow et al. [3], making it
possible to generate highly realistic synthetic images and videos. The term
"deepfake" was coined in 2017 by a Reddit user who established a subreddit for
face-swapping in pornography, marking a rise in public interest and misuse [2].
By 2018, growing concerns among experts prompted tech platforms to introduce
moderation policies, and legal and regulatory actions were taken worldwide to mitigate
associated risks to privacy, security, and democratic processes. In the US, the 2019
National Defense Authorization Act mandates reporting on deepfake threats ev-
ery year, while introduced bills, such as the DEEPFAKES Accountability Act (H.R.
5586), aim to provide legal recourse to victims of malicious deepfakes [4]. State
laws, such as California’s A.B. 602 passed in 2019, target non-consensual deepfake
pornography and criminalize unauthorized use [5]. Internationally, China’s Deep
Synthesis Provisions, effective from 2023, prohibit deepfake creation without ex-
plicit consent, emphasizing user protection [5]. Similarly, the UK’s Online Safety
Bill amendments target non-consensual deepfake material [4]. These regulatory
approaches trace the rapid development of deepfakes from technical experiments
to an international challenge requiring effective detection mechanisms and associ-
ated research in the machine learning field.

According to Rana et al. [6], research into detecting deepfake media began
following the rise of deepfake creation services online in late 2017. The authors
outline four primary methods for detecting deepfakes:

• Machine learning-based techniques

• Deep learning-based techniques

• Statistical measurement techniques

• Blockchain-based techniques

Among them, deep learning methods have been the most widely studied, accounting for
77% of the research in the domain. They have also shown better accuracy
and efficiency, with an AUC of 0.917 and a mean accuracy of 89.73%.

According to Ranout et al. [7], deepfake detection consists of several steps
including data gathering, face detection, feature extraction, feature selection, model
selection, and model evaluation. As this research focuses mainly on the detection
of video deepfakes, these steps are applied to video content, using deep learning
models trained on video datasets.
Deep learning architectures, particularly Convolutional Neural Networks (CNNs),
are highly effective for processing and analyzing images and videos by learning
patterns and features within the data. This capability enables CNNs to handle
tasks such as image classification, object detection, and deepfake detection. As
described by Tang [8], CNNs operate through several key stages:

• Convolution: The input image is processed using a set of convolution layers
that extract features like edges, shapes, and textures. The neural network
captures important representations of the image data.

• Activation: The output passes through a non-linear activation function, com-
monly ReLU, to allow the network to learn more complex patterns in the
data.

• Pooling: The activation maps are then downsampled using pooling opera-
tions, to make the CNN more efficient and enhance its robustness.

• Fully Connected Layers: The output from the previous layers is passed to
fully connected layers, which use the previously set features to perform clas-
sification or prediction tasks.
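
The four stages above can be sketched on a toy example. The following pure-Python snippet is only an illustration and not the implementation of any model discussed in this report: the 5x5 image, the 2x2 edge kernel, and the fully connected weights are hypothetical values chosen so that the arithmetic is easy to follow.

```python
# Toy walk-through of convolution -> activation -> pooling -> fully connected.
# All values below are hypothetical and chosen only for illustration.

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Element-wise ReLU: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (stride 2)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]

def fully_connected(features, weights, bias):
    """Single output neuron: weighted sum of the flattened features."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# 5x5 image with a vertical edge between the second and third columns
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Kernel that responds to a left-to-right increase in intensity
kernel = [[-1.0, 1.0], [-1.0, 1.0]]

feature_map = relu(conv2d(image, kernel))   # 4x4 map, strong response at the edge
pooled = max_pool2x2(feature_map)           # 2x2 map after downsampling
flat = [v for row in pooled for v in row]   # flatten for the dense layer
score = fully_connected(flat, [0.5, -0.5, 0.5, -0.5], bias=-1.0)
```

In a real CNN the kernel and dense-layer weights are learned during training, and many kernels are applied in parallel at each layer.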


CNNs are popular for deepfake detection models due to their minimal
preprocessing requirements and their ability to efficiently process the 2D grid
structure of images. One of the earliest such models was Mesonet, introduced by
Afchar et al. in 2018 [9]. This model processes 256x256-pixel image frames and
comes in two similar architectures:

• Meso-4 includes four consecutive layers that perform convolution, batch
normalization, and pooling, with a final dropout layer added to prevent
overfitting and enhance model robustness. This architecture contains a total of
27,977 trainable parameters.

• MesoInception-4 is an adaptation of Meso-4 in which the first two
convolutional layers are replaced with inception modules, originally introduced
by Szegedy et al. [10]. Afchar et al. reported that the Mesonet model
achieved a detection rate of over 98% for deepfake content and 95% for
Face2Face manipulations when it was initially published.

While the previous algorithms focused on the mesoscopic properties of images,
detecting AI-generated deepfake videos requires different feature extraction
approaches. One of the most intuitive approaches is eye blinking analysis. Li et
al. [11] combined the VGG16 CNN model with Long-term Recurrent Convolutional
Networks (LRCNs) to capture the temporal dynamics of eye blinking, a natural
physiological function missing in many synthesized videos. Their method was
tested on the CEW dataset, which includes 1,232 open-eye images and 1,193 closed-eye
images, as well as the custom-made Eye Blinking Videos (EBV) dataset, which
consists of 50 videos of 30 seconds each, with results reaching up to 99% at the
time of the model evaluation. Another technique for detecting deepfake videos is to
look for occasional inconsistencies in the visemes (mouth shapes) associated
with the pronounced sounds “M”, “B”, or “P”. Agarwal et al. [12] used
Google’s Speech-to-Text API to convert the video subjects’ speech to text, and then
compared the mentioned visemes with the pronounced sounds at specific times on
the preprocessed images of 256x256 pixels. The model used the Xception
CNN architecture to classify open and closed mouth patterns, and reached
up to 97.0% on the A2V dataset.

The rapid advancement of deepfake generation tools has significantly dimin-
ished the effectiveness of earlier detection models. As deepfake techniques evolve,
more sophisticated detection algorithms are necessary to stay ahead. One such
model is EfficientNetB7, a cutting-edge convolutional neural network (CNN) in-
troduced in 2019, which has demonstrated strong performance in deepfake detec-
tion. In comparison to older models such as ResNet-152 and MobileNetV3, Effi-
cientNetB7 achieved the highest testing accuracy of 75% on the FaceForensics++
dataset, as reported by Ritter et al. [13].


Another notable deepfake detection approach is the facial action unit-based
algorithm proposed by Jaleel et al. [14], which operates in two phases. Initially,
it captures the distinct facial features and expressions of an individual to create
a profile for the "Person of Interest" (POI). In the second phase, it classifies test
subjects based on their facial action units to determine the authenticity of the data.
The authors reported an accuracy rate of 95.75% for this method when published
in 2022.

Liu et al. [15] addressed several shortcomings of existing detection algorithms
in their research, including their lack of effectiveness in cross-dataset experiments,
the limitations of single detection methods, and the need for improved robustness
through training on diverse datasets with adversarial techniques. To overcome
these challenges, they proposed a hybrid approach combining deep neural net-
works with fine-grained artifact feature analysis. This method enhances the ability
to detect complex deepfake manipulations by analyzing subtle details, textures,
and other intricate features that may be missed by traditional techniques. The
proposed model achieved an accuracy of 98.20%, surpassing some other detection
algorithms.

In addition, several advanced strategies have recently been adopted for deepfake
detection models. For example, Yang et al. [16] introduced a
deepfake detection approach that frames the task as a graph classification problem. In
this model, a spatiotemporal attention module was used to capture attention features
across facial regions represented as vertices. Meanwhile, Zhao et al. [17] suggested
increasing the robustness of deepfake detection by using the Interpretable
Spatial-Temporal Video Transformer (ISTVT), which incorporates a decomposed
spatial-temporal self-attention mechanism along with a self-subtraction method.
In 2023, Yu et al. [18] introduced the Augmented Multi-scale Spatiotemporal
Inconsistency Magnifier (AMSIM), which uses a dual-view strategy, the Global
Inconsistency View (GIV) and the Multi-timescale Local Inconsistency View (MLIV), to
detect subtle spatiotemporal inconsistencies in videos. They further proposed the
Predictive Visual-audio Alignment Self-supervision for Multimodal DeepFake
Detection (PVASS-MDD) [19], which incorporates both visual-audio alignment and
a multimodal technique. Model evaluation on FaceForensics++, DFDC,
FakeAVCeleb, and other datasets showed an average accuracy of 99.83%, which was
higher than state-of-the-art approaches at the time of publication.

Some recent research in deepfake detection was also focused on taking into ac-
count both visual and audio manipulations. For example, Hashmi et al. [20] sug-
gested the Audio-Visual Transformer-based Ensemble Network (AVTENet), which
is designed to leverage both audio and visual modalities. AVTENet integrates
three distinct transformer-based networks and incorporates pre-trained models,
utilizing both supervised and self-supervised learning techniques to extract key
features from audio, video, and their combined modalities for effective deepfake
identification. Another approach by Mongelli et al. [21] introduced a multimodal
two-stream CNN model, known as CMDD, designed to integrate both audio and
visual cues to improve detection accuracy. Evaluated on the FakeAVCeleb dataset,
this model achieved 98.9% accuracy, outperforming several baseline models by
leveraging key features from both modalities.

It should also be mentioned that CNN-based deepfake detection models
depend heavily on datasets that are clearly separated into sets of real
and manipulated images and videos. The datasets listed below offer a wide range of
artifacts and manipulations and are widely used for training and evaluating
the deepfake detection models mentioned above.

• FaceForensics++, introduced by Rössler et al. [22] in 2019, consists of 1,000
manipulated videos generated from 977 YouTube videos that feature identi-
fiable faces. The dataset was collected using various face manipulation tech-
niques, such as DeepFakes, Face2Face, FaceSwap, and NeuralTextures.

• The CelebDF dataset, proposed by Li et al. [23] in 2020, contains 5,639 deep-
fake videos of celebrities. The dataset features videos averaging 13 seconds at
30 frames per second (FPS). It introduced several adversarial challenges, in-
cluding low resolution, color mismatches, temporal flickering, and inaccurate
facial masks.

• ForgeryNet, introduced in 2021 by He et al. [24], is the largest publicly
available deepfake media dataset, having 2.9 million images and over 221,000
videos. ForgeryNet incorporates various manipulations and perturbations
across CREMAD, RAVDESS, VoxCeleb2, and AVSpeech datasets.

• DeepFake Detection Challenge (DFDC) dataset, developed by Facebook AI
[25], comprises over 100,000 face-swapped videos sourced from 3,426 paid
actors. The dataset includes videos recorded under different lighting condi-
tions and manipulated using various techniques, such as GAN-based face-
swapping methods.

As datasets continue to evolve in scale and complexity, newer deepfake datasets,
such as the DFDC, highlight the growing need for detection models to be adaptive
and scalable. For instance, models like EfficientNet-B0 and ResNet-18 have shown
declining performance on these larger, more diverse datasets [24]. Therefore, it is
important to continuously develop both the detection algorithms and datasets to
include more complex perturbations and adversarial techniques.


1.1 Ethical and Professional Responsibilities

• Ethical Responsibility:
In developing a deepfake detection system, it is necessary to consider ethical
issues such as privacy of the datasets as well as the possible consequences of
inaccurate responses of the system. First, the risk of false positive or nega-
tive classifications is significant, as incorrect results from the system, when
used on real-world cases, could lead to individuals wrongly accused or, on
the other hand, unprevented spread of deepfakes. To address this, I plan to
conduct thorough testing of the model on diverse datasets to ensure accuracy
and fairness. It is also crucial not to promise perfect results when
offering deepfake detection services, and to obtain the users' acknowledgement
of this limitation. Another ethical concern involves the privacy of the data used
during training. Some deepfake datasets include real personal data, so it is
essential to ensure all data used complies with privacy laws in Kazakhstan. I
will avoid using any personal data that could harm the privacy rights of
individuals, and will instead use open-source data that was collected with the
consent of the authors and the real people depicted. Additionally, there is a
risk that the technology could be modified and reused for malicious purposes,
such as suppressing real media content. Unrestricted access to modify the code
and misrepresent the results could again lead to real content being wrongly
labeled as fake. Therefore, access to the system will be carefully controlled
to prevent such risks and to adhere to the ethical responsibilities.

• Informed Judgments:
To make sure that the decision-making process during the capstone project
development is well-informed, I will rely on feedback from experts in the
machine learning field and keep in mind the potential damage to
society when the system gives wrong answers. On the technical side, I plan to
review relevant and authoritative scientific papers from the field of machine
learning and deepfake detection. Then, I plan to seek advice from Professor
Amin Zollanvari from the Electrical and Computer Engineering department
at SEDS, Nazarbayev University, on whether the solutions I implement are valid
and supported by solid evidence. The models created during the project development
will be tested and evaluated on diverse datasets to ensure they meet the
project’s technical goals. For the societal aspects, I will keep in mind the
potential impacts my system could have on users when producing incorrect
results. Misidentifying real content as fake, or failing to detect harmful
deepfakes, could reduce societal trust in deepfake detection systems,
which would undermine efforts against the spread of deepfake misinformation.
Therefore, I will prioritize increasing the efficiency and robustness of the
detection system to ensure that it gives answers that are as correct as possible and
does not unintentionally harm certain groups of people.

• Global Context:
A deepfake detection system is generally universal in the global context, as
the model is trained to detect the fake content based on a variety of techni-
cal, rather than cultural features, such as movement of eyes or spectrographic
properties of an image. However, some implementations of the model may
require diverse datasets that feature people of different nationalities, genders,
and languages to produce reliable and unbiased responses. This is especially
applicable to deepfake detection algorithms based on speech recognition
or the assessment of people’s appearance. For example, a
model trained on a dataset that only features people from the USA or China
speaking English or Chinese may not give accurate results
when tested and used on videos of Kazakh people. Another implication
is the access to computational resources in different parts of the world.
Certain models require heavy computation, which could be an obstacle
to their use in regions with less computational power.
Therefore, I will make sure that the capstone project implementation
features a model that is well pre-trained and optimized for use on different
computers.

• Economic Impact:
In the short term, the economic effects of the deepfake detection project could be
seen in the technology market, where companies and platforms that
offer social media services could integrate the system into their fake
information detectors. This would help them avoid the economic damage caused by the
spread of misinformation and the loss of user trust in these platforms, which
could lead to lost profits. A potential challenge, however, is
the increased computational resources required for deepfake detection
model processing. As for the long-term economic effects of the project,
deepfake detection may become more relevant year after year due
to the massive use of open-source generative AI tools. This will lead to the
creation of more jobs in the field of AI-content verification tools, somewhat
similar to the development of the identity verification and cybersecurity markets.

• Environmental Impact:
The environmental impact of my project, which focuses on deepfake detection
software, lies primarily in promoting digital sustainability.
By improving the detection of deepfakes, the project indirectly supports the
responsible use of digital resources and minimizes the misuse of technologies
like AI for deceptive purposes. For example, preventing the spread of
misinformation for malicious purposes could also reduce the effect of fake
information in social media on environmental safety, such as climate change
denial. The detection system itself, being software-based, requires relatively
low computational power compared to hardware-intensive solutions.
Although training deep learning models can be energy-intensive,
I have taken steps to mitigate this by pre-training models on cloud-based
infrastructure with servers that provide GPU resources and draw electricity
from sustainable sources, thereby lowering the overall carbon footprint.
Additionally, I try to optimize model efficiency so that
the system runs smoothly on standard hardware, avoiding the need for
high-performance, energy-consuming servers. The project promotes sustainability
by contributing to the ethical application of technology, reducing the potential
harm caused by disinformation or manipulation of digital content. While
the environmental impact in terms of energy consumption is minimal, the
benefits of promoting digital integrity offer long-term support for
sustainable digital practices. Overall, these efforts aim to find a balance between
technological advancement and environmental responsibility.

• Societal Impact:
The project aims to benefit society by reducing the spread of misleading
deepfake media content. Deepfakes can have serious consequences,
from damaging individual reputations by making society believe that
the subject of a video did something controversial, to disrupting democratic
processes when, for instance, politicians appear to give speeches they never gave
in real life, which discredits their public image and reduces public support.
A recent example could be the spread of fake face-swapped videos of
presidential candidates during the 2016 and 2020 elections in the United States, which could
have had a profound effect on the election process. By providing a tool to prevent
the spread of such videos and images, I contribute to maintaining trust in
social media, which is vital for social stability and proper communication
of information in society. The direct impact on society includes helping
individuals, businesses, and governments to distinguish between real and
fake content. For example, the early detection of a deepfake could prevent
the manipulation of public opinion. Indirectly, the project could contribute
to a broader societal awareness of digital literacy. As the public becomes
more aware of the existence and risks of deepfakes, there will be a growing
demand for reliable tools to verify content.


Chapter 2

Methodology

2.1 Data Preparation

To train and test the feature concatenation method on the deepfake classification
and detection task, the OpenForensics dataset [26] was utilized. OpenForensics
is a large-scale in-the-wild multi-face forgery detection and segmentation dataset,
which includes diverse real and forged images. The dataset includes 140,000 train-
ing images (70,000 real and 70,000 fake), 39,200 validation images (19,600 real and
19,600 fake), and 10,905 test images (5,413 real and 5,492 fake). Each image contains
a detectable human face at a resolution of 256x256 pixels, which makes the dataset
highly appropriate for deepfake detection. The near-balanced split of the dataset into
training, validation, and test sets supports an unbiased assessment of the
performance of the models at distinguishing between real and fake content.
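
As a quick sanity check of the figures above, the split sizes and class balance can be computed directly. The snippet below is a small sketch that uses only the image counts quoted in this section; the dictionary layout and variable names are illustrative choices and are not taken from the dataset's own tooling.

```python
# Image counts per split, as quoted in the text above
splits = {
    "train": {"real": 70_000, "fake": 70_000},
    "validation": {"real": 19_600, "fake": 19_600},
    "test": {"real": 5_413, "fake": 5_492},
}

# Total number of images across all splits
total_images = sum(sum(counts.values()) for counts in splits.values())

summary = {}
for name, counts in splits.items():
    n = sum(counts.values())
    summary[name] = {
        "images": n,
        "share_of_dataset": n / total_images,   # fraction of all images
        "fake_fraction": counts["fake"] / n,    # class balance within the split
    }

for name, stats in summary.items():
    print(f"{name:<10} {stats['images']:>7} images, "
          f"{stats['share_of_dataset']:.1%} of data, "
          f"{stats['fake_fraction']:.1%} fake")
```

The training and validation splits are exactly balanced, while the test split contains slightly more fake than real images.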

2.2 Data Preprocessing

After preparing the dataset, the next important step in creating a robust and
efficient deepfake detection model is data augmentation. This involves applying
randomized transformations to images so that the model learns variations and
manipulations that help it generalize to real-world data and successfully classify
deepfake images from previously unseen datasets. The augmentations include
rotation, zooming, brightness adjustment, channel shifting, and horizontal flipping.
These transformations not only provide new image contexts to the detection
models but also increase the effective size of the dataset, giving the
CNN more examples to learn from and reducing overfitting to the limited training data.

The data processing parameters for our model training were based on the
benchmark Mesonet deepfake detection model research by Afchar et al. so that
the performance of individually trained base models (including
Mesonet) could later be compared with the performance of the models obtained by
feature concatenation of fused base model architectures. The augmentation
parameters are as follows:

• Random rotation: Up to 15 degrees.

• Random zoom: Transformations within a 20% range.

• Brightness adjustment: Variations by ±20%.

• Channel shifting: Random RGB channel value shifts for color augmentation.

• Horizontal flipping: Random mirroring of images.

Below are examples of image transformations and augmentations for both real
and fake data (Figure 2.1).

The data preprocessing stage also included normalization of image pixel values
to the range [0, 1] to provide a consistent input scale and numerically stable
weight updates during the training process. Additionally, the training data
was shuffled so that the model would not learn order-specific patterns in the data.
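
A few of the preprocessing steps described in this section can be sketched in plain Python. This is a simplified illustration rather than the pipeline used in the project: it covers only horizontal flipping, brightness adjustment within ±20%, channel shifting, normalization to [0, 1], and shuffling. Rotation and zoom are omitted because they need interpolation, and the channel shift magnitude (`max_shift=10`), the tiny 2x2 image, and the file names are hypothetical values.

```python
import random

def horizontal_flip(img):
    """Mirror each row of an image stored as rows of [R, G, B] pixels."""
    return [row[::-1] for row in img]

def adjust_brightness(img, factor):
    """Scale every channel by `factor` and clamp to the 8-bit range."""
    return [[[min(255, max(0, round(c * factor))) for c in px] for px in row]
            for row in img]

def channel_shift(img, shifts):
    """Add a per-channel offset (one value each for R, G, B) and clamp."""
    return [[[min(255, max(0, c + s)) for c, s in zip(px, shifts)] for px in row]
            for row in img]

def normalize(img):
    """Map 8-bit pixel values to the range [0, 1]."""
    return [[[c / 255.0 for c in px] for px in row] for row in img]

def augment(img, rng, max_shift=10):
    """Apply a random subset of the augmentations, then normalize.

    max_shift is an assumed value; the text does not state the shift range.
    """
    if rng.random() < 0.5:                                  # random mirroring
        img = horizontal_flip(img)
    img = adjust_brightness(img, rng.uniform(0.8, 1.2))     # brightness +/-20%
    shifts = [rng.randint(-max_shift, max_shift) for _ in range(3)]
    img = channel_shift(img, shifts)
    return normalize(img)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical 2x2 RGB image
image = [[[10, 20, 30], [200, 210, 220]],
         [[40, 50, 60], [250, 250, 250]]]

flipped = horizontal_flip(image)
brightened = adjust_brightness(image, 1.2)
sample = augment(image, rng)

# Shuffle (file name, label) pairs so that batches do not follow file order
dataset = [("img_%d.png" % i, i % 2) for i in range(6)]
shuffled = dataset[:]
rng.shuffle(shuffled)
```

In practice these operations are usually delegated to the data-loading utilities of the deep learning framework, which apply them on the fly to each batch.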

Figure 2.1: Examples of augmented images for real (left) and fake (right) data. Transformations
include rotation, zoom, brightness adjustment, channel shifting, and horizontal flipping.

2.3 Model Architecture

To develop a robust and efficient deepfake detection model, and as a preliminary
test of the research hypothesis, four deep learning CNN models were
selected as base models: Mesonet, DenseNet-121, XceptionNet, and ResNet. No
changes were applied to the fundamental architecture of each base model, other
than removing its last fully connected (dense) layer. This allows each selected
model to act as a feature extractor whose output is concatenated with the others
and fed into a shared dense layer, followed by a dropout layer and a final dense
layer that gives the classification decision.

Two- and three-model combinations are experimented with in this research. In the
proposed pipeline, two or three model feature vectors are concatenated into one
vector using a Concatenate layer in the Keras deep learning library in Python. For
two-model combinations, the architectures tried include Mesonet + XceptionNet and
DenseNet-121 + ResNet. For three-model combinations, the research tries
combinations like Mesonet + DenseNet-121 + XceptionNet. The concatenated vector is
passed through a fully connected layer of 256 neurons with ReLU activation,
followed by a dropout layer of rate 0.5 to prevent overfitting. The classification is then
done using a dense layer with one neuron and sigmoid activation, which produces
the probability of an image being real or synthetic.

To ensure a fair comparison between the fused models and the
base models, all the models were trained from scratch on the dataset described in
section 2.1. By comparing the performance of various model combinations, the
research seeks to establish the most effective architecture for the detection of
deepfakes, balancing computational viability and performance.

To illustrate the implementation of the feature fusion approach, Listing 2.1 contains a PyTorch
code snippet that shows loading the base models, freezing their weights, removing
their top classification layers, extracting and concatenating features, and defining a new classifier.
The code specifically loads pre-trained Xception and DenseNet-121 models, freezes
their backbone weights, extracts feature vectors, concatenates the
vectors, and constructs a new trainable classification head. Figure 2.2 shows the proposed
architecture with the convolutional layers feeding into feature concatenation and
eventually into the final classifier.
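
The candidate fusion configurations can be enumerated programmatically. The sketch below lists every possible pair and triple of the four base models; it is only an illustration of the search space, and it assumes nothing about which of these combinations were actually trained beyond the examples named above.

```python
from itertools import combinations

# The four base models named in this section
BASE_MODELS = ["Mesonet", "DenseNet-121", "XceptionNet", "ResNet"]

# All unordered two-model and three-model combinations
pairs = list(combinations(BASE_MODELS, 2))
triples = list(combinations(BASE_MODELS, 3))

for combo in pairs + triples:
    print(" + ".join(combo))
```

With four base models this yields six pairs and four triples, so an exhaustive comparison would require ten fused models in addition to the four base models.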

Listing 2.1: PyTorch implementation of feature-level fusion for deepfake detection

import timm  # third-party package; torchvision does not ship an Xception model
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained base models as feature extractors.
# Assumption: Xception is taken from timm ("legacy_xception"; older timm
# releases name it "xception"). num_classes=0 drops the classifier so the
# model returns pooled 2048-d features.
xception_fc = timm.create_model("legacy_xception", pretrained=True, num_classes=0)
densenet_fc = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
# Replace the last FC layer so DenseNet-121 returns pooled 1024-d features
densenet_fc.classifier = nn.Identity()

# Freeze the base models' weights
for param in xception_fc.parameters():
    param.requires_grad = False
for param in densenet_fc.parameters():
    param.requires_grad = False


# Define the feature fusion model
class FeatureFusionModel(nn.Module):
    def __init__(self, xception, densenet):
        super().__init__()
        self.xception = xception
        self.densenet = densenet
        # Xception outputs 2048 features, DenseNet-121 outputs 1024 features
        self.fc1 = nn.Linear(2048 + 1024, 256)  # Concatenated features
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, 1)  # Binary classification (real vs. fake)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        xception_features = self.xception(x).flatten(1)  # (N, 2048)
        densenet_features = self.densenet(x).flatten(1)  # (N, 1024)
        combined = torch.cat((xception_features, densenet_features), dim=1)
        x = self.fc1(combined)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return self.sigmoid(x)


# Instantiate the model
model = FeatureFusionModel(xception_fc, densenet_fc)

Figure 2.2: Proposed architecture for deepfake detection, illustrating the feature-level fusion of
convolutional neural networks. The architecture includes multiple convolutional layers (conv1 to
conv_n), max pooling, feature concatenation, and fully connected layers (fc_1 to fc_k + fc_l) leading
to a softmax output.
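As a rough check on the size of the fusion head described in this section, its number of trainable parameters can be worked out from the layer widths alone. The sketch below assumes the 2048- and 1024-dimensional feature vectors stated in Listing 2.1 for Xception and DenseNet-121; the helper name `dense_params` is illustrative and not part of the project code.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer: weight matrix plus one bias per output."""
    return n_in * n_out + n_out


# Feature widths as stated in Listing 2.1 (taken as given, not measured here)
XCEPTION_DIM = 2048
DENSENET_DIM = 1024

fused_dim = XCEPTION_DIM + DENSENET_DIM   # width of the concatenated vector
fc1 = dense_params(fused_dim, 256)        # 256-neuron hidden layer
fc2 = dense_params(256, 1)                # single sigmoid output neuron
head_total = fc1 + fc2

print(fused_dim, fc1, fc2, head_total)
```

ReLU and dropout add no parameters, so if the backbones are kept frozen as in the listing, only these head weights are updated during training.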



2.4 Training Strategy

The training approach aims to improve the model’s generalization through data augmentation, early stopping, and learning rate reduction. As stated in section 2.2, the data is augmented dynamically in real time, which helps the model detect deepfake artifacts under a variety of image conditions. Training is regulated by callbacks. Early stopping halts training once validation performance stops improving, which saves computation, and the best model is stored for later use. In addition, the learning rate used for gradient descent decays step-wise during training to refine the model’s learning in later epochs. Training was carried out on Kaggle’s cloud infrastructure, using an A100 GPU to accelerate computation. The entire pipeline, from data augmentation and model training to hyperparameter tuning, took approximately 3 days. This duration covers training all the base models and their combinations, each trained from scratch on the OpenForensics dataset comprising 140,000 images. Kaggle’s environment provided an accessible and scalable platform that met the computational needs of deepfake detection without requiring local high-performance hardware.
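The two callbacks described in this section can be sketched in plain Python, independent of any deep learning framework. The report does not state the patience, step size, or decay factor, so the values below are assumptions chosen only for illustration, as are the names `EarlyStopping` and `step_decay` and the validation losses.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch   # a real loop would save a checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def step_decay(base_lr: float, epoch: int, step_size: int = 5, gamma: float = 0.1) -> float:
    """Step-wise schedule: multiply the rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Made-up validation losses, used only to exercise the logic
val_losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    lr = step_decay(1e-3, epoch)
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In this toy run the loss stops improving after epoch 2, so training halts three epochs later and the checkpoint from epoch 2 would be the one kept.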

2.5 Model Evaluation

The performance of the proposed binary classification model for deepfake detec-
tion is evaluated using multiple metrics in two phases: first on the primary dataset
(OpenForensics), and then through a cross-dataset evaluation to assess generaliz-
ability on the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset. The
evaluation process begins with a confusion matrix, which compares the predicted
outputs to the actual results, generating the following key values:

• True Positive (TP): Correct predictions of the model that the class is positive
(fake).

• True Negative (TN): Correct predictions of the model where the class is neg-
ative (real).

• False Positive (FP): Incorrect predictions of the model where the class is
positive (fake), when it is actually negative (real).

• False Negative (FN): Incorrect predictions of the model where the class is
negative (real), when it is actually positive (fake).
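The four counts defined in the list above can be tallied directly from paired labels. The sketch below assumes labels are encoded as 1 for fake (positive) and 0 for real (negative); that encoding, the function name `confusion_counts`, and the toy labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with label 1 = fake (positive), 0 = real (negative)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # fake correctly flagged as fake
        elif predicted == 0 and actual == 0:
            tn += 1   # real correctly accepted as real
        elif predicted == 1 and actual == 0:
            fp += 1   # real wrongly flagged as fake
        else:
            fn += 1   # fake wrongly accepted as real
    return tp, tn, fp, fn


# Toy labels, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
```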



Using these values, the model’s performance is assessed through the following
metrics:

• Accuracy: Measures the overall correctness of the model’s predictions.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

• Precision: Indicates the proportion of positive predictions that are correct.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

• Recall: Represents the model’s ability to identify positive cases.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

• F1-Score: Combines precision and recall into a single metric, reflecting their
harmonic mean.

\[ \text{F1-Score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]
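The four metrics listed above follow mechanically from the confusion-matrix counts. The sketch below is one way to compute them; the function name `classification_metrics`, the zero-division guards, and the example counts are illustrative assumptions rather than values from this project.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class, from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts for a 200-image test set
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```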

To evaluate the generalizability of the models, a cross-dataset evaluation was
conducted using the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset [27].
This dataset consists of face-cropped images derived from 1,000 videos, combining
samples from FaceForensics++ and Celeb-DF, with a total of 16,433 images across
real and fake classes. For the cross-dataset evaluation, the test set of 2,400 images
(1,200 real and 1,200 fake) was used. Unlike the primary dataset, OpenForensics,
which was used for training and initial evaluation, the Face Forensic++ & Celeb-
DF dataset provides a variety of manipulations and a balanced mix of real and
fake frames, introducing challenges such as differing manipulation techniques and
data distributions. The models trained on OpenForensics and their combinations were applied
directly to this test set without fine-tuning or domain adaptation. For this evalua-
tion, a subset of metrics—Accuracy, Macro Average F1-Score, and ROC AUC—was
used to maintain consistency with the primary evaluation while focusing on key
indicators of performance and robustness. This cross-dataset evaluation comple-
ments the primary evaluation by assessing the models’ ability to handle domain
shifts, a critical factor for real-world deepfake detection applications.
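For reference, the macro-average F1-score and ROC AUC used in the cross-dataset evaluation can be computed as in the sketch below. It assumes labels of 1 for fake and 0 for real and a 0.5 decision threshold, neither of which is stated in the report; the function names and toy scores are likewise illustrative. ROC AUC is computed here through its rank interpretation, the probability that a randomly chosen fake image receives a higher score than a randomly chosen real one.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def roc_auc(y_true, scores):
    """Fraction of (fake, real) pairs ranked correctly; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy model outputs, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(round(macro_f1(y_true, y_pred), 4), round(roc_auc(y_true, scores), 4))
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a 2,400-image test set but would normally be replaced by a sorted-rank implementation for larger sets.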


Chapter 3

Results and Discussions

3.1 Results

Table 3.1: Performance metrics for different model architectures and combinations: Accuracy, Preci-
sion, Recall, and ROC AUC.

Model / Combination   Acc.    Prec. (0)  Prec. (1)  Rec. (0)  Rec. (1)  ROC AUC
Xception              0.8833  0.89       0.87       0.87      0.89      0.92
DenseNet-121          0.8346  0.88       0.78       0.76      0.90      0.89
ResNet                0.7742  0.82       0.73       0.68      0.86      0.83
Mesonet               0.8096  0.84       0.78       0.76      0.85      0.85
Mesonet + ResNet      0.8225  0.85       0.79       0.74      0.87      0.87
Mesonet + DenseNet    0.8404  0.86       0.80       0.77      0.88      0.90
Mesonet + Xception    0.8756  0.89       0.86       0.86      0.89      0.91
Xception + ResNet     0.8947  0.90       0.88       0.88      0.90      0.93
Xception + DenseNet   0.8963  0.91       0.88       0.88      0.91      0.93
ResNet + DenseNet     0.8563  0.87       0.82       0.79      0.88      0.90


Table 3.2: Performance metrics for different model architectures and combinations: F1-Scores and
Macro Averages.

Model / Combination   F1 (0)  F1 (1)  Macro Avg Prec.  Macro Avg Rec.  Macro Avg F1
Xception              0.88    0.88    0.88             0.88            0.88
DenseNet-121          0.82    0.84    0.83             0.83            0.83
ResNet                0.74    0.79    0.77             0.77            0.77
Mesonet               0.80    0.81    0.81             0.81            0.81
Mesonet + ResNet      0.79    0.83    0.82             0.81            0.81
Mesonet + DenseNet    0.81    0.84    0.83             0.83            0.83
Mesonet + Xception    0.87    0.87    0.88             0.88            0.87
Xception + ResNet     0.89    0.89    0.89             0.89            0.89
Xception + DenseNet   0.89    0.90    0.90             0.89            0.89
ResNet + DenseNet     0.83    0.85    0.84             0.84            0.84

Table 3.3: Cross-dataset evaluation on the Face Forensic++ & Celeb-DF Combined Deepfake Data
dataset: Accuracy, Precision, Recall, and ROC AUC.

Model / Combination   Acc.    Prec. (0)  Prec. (1)  Rec. (0)  Rec. (1)  ROC AUC
Xception              0.8452  0.86       0.83       0.83      0.86      0.89
DenseNet-121          0.7928  0.83       0.75       0.73      0.85      0.85
ResNet                0.7325  0.77       0.69       0.65      0.80      0.80
Mesonet               0.7714  0.80       0.74       0.72      0.82      0.82
Mesonet + ResNet      0.7842  0.81       0.75       0.71      0.83      0.84
Mesonet + DenseNet    0.8019  0.82       0.77       0.74      0.84      0.86
Mesonet + Xception    0.8365  0.85       0.82       0.82      0.85      0.88
Xception + ResNet     0.8553  0.87       0.84       0.84      0.87      0.90
Xception + DenseNet   0.8578  0.88       0.84       0.84      0.88      0.90
ResNet + DenseNet     0.8192  0.84       0.79       0.76      0.85      0.87



Table 3.4: Cross-dataset evaluation on the Face Forensic++ & Celeb-DF Combined Deepfake Data
dataset: F1-Scores and Macro Averages.

Model / Combination   F1 (0)  F1 (1)  Macro Avg Prec.  Macro Avg Rec.  Macro Avg F1
Xception              0.84    0.84    0.85             0.85            0.84
DenseNet-121          0.78    0.80    0.79             0.79            0.79
ResNet                0.71    0.74    0.73             0.73            0.73
Mesonet               0.76    0.78    0.77             0.77            0.77
Mesonet + ResNet      0.76    0.79    0.78             0.77            0.78
Mesonet + DenseNet    0.78    0.80    0.80             0.79            0.79
Mesonet + Xception    0.83    0.83    0.84             0.84            0.83
Xception + ResNet     0.85    0.85    0.86             0.86            0.85
Xception + DenseNet   0.86    0.86    0.86             0.86            0.86
ResNet + DenseNet     0.80    0.82    0.82             0.81            0.81

The base models and their combinations differed markedly in performance, measured by accuracy and macro-average F1-score, as shown in Tables 3.1 and 3.2. Among the base models, Xception achieved the highest accuracy of 88.3% and a macro-average F1-score of 0.88, indicating its strength in extracting features for deepfake detection. DenseNet-121 obtained an accuracy of 83.5% and a macro-average F1-score of 0.83, trailing Xception by about five percentage points. Mesonet and ResNet achieved accuracies of 81.0% and 77.4%, with macro-average F1-scores of 0.81 and 0.77, respectively. These models are relatively less effective at detecting deepfake artifacts.

When models were combined, improvements were observed in specific configurations such as Xception + ResNet and Xception + DenseNet-121. The combination of Xception and ResNet reached an accuracy of 89.5% and a macro-average F1-score of 0.89, outperforming Xception alone. Similarly, the Xception and DenseNet-121 combination produced the highest accuracy of 89.6% with a macro-average F1-score of 0.89, the best among all models. This suggests that combining the feature spaces of these architectures improves classification accuracy. The combination of ResNet with DenseNet (85.6% accuracy, 0.84 F1-score) also improved on both of its components. Other combinations showed less synergy: Mesonet + ResNet (82.3% accuracy, 0.81 F1-score) remained close to Mesonet alone (81.0%), and Mesonet + Xception (87.6% accuracy, 0.87 F1-score) improved on Mesonet but fell slightly below Xception alone (88.3%). In general, the results show that feature fusion can be beneficial for deepfake detection, provided the combined architectures are chosen carefully.
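The benefit of each fusion can be made explicit by comparing its accuracy with that of its stronger component. The short sketch below does this using the accuracies from Table 3.1; the gain measure itself (difference to the better base model, in percentage points) and the function name `fusion_gain` are illustrative choices, not metrics reported elsewhere in this work.

```python
# Accuracies copied from Table 3.1
base_acc = {"Xception": 0.8833, "DenseNet-121": 0.8346, "ResNet": 0.7742, "Mesonet": 0.8096}
fused_acc = {
    ("Mesonet", "ResNet"): 0.8225,
    ("Mesonet", "DenseNet-121"): 0.8404,
    ("Mesonet", "Xception"): 0.8756,
    ("Xception", "ResNet"): 0.8947,
    ("Xception", "DenseNet-121"): 0.8963,
    ("ResNet", "DenseNet-121"): 0.8563,
}


def fusion_gain(pair):
    """Accuracy gain in percentage points over the better of the two components."""
    best_single = max(base_acc[name] for name in pair)
    return round((fused_acc[pair] - best_single) * 100, 2)


gains = {pair: fusion_gain(pair) for pair in fused_acc}
for pair, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{' + '.join(pair):26s} {gain:+.2f}")
```

Under this measure, Mesonet + Xception is the only pair that falls below its stronger component, while the remaining five pairs gain between roughly 0.6 and 2.2 percentage points.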



Cross-dataset evaluation on the combined FaceForensics++ and CelebDF dataset, presented in Tables 3.3 and 3.4, revealed the impact of domain shift on model performance. All models performed worse because they were trained on OpenForensics data but tested on FaceForensics++ and CelebDF data, which follow a different distribution. The accuracy of Xception fell by 3.8 percentage points, from 88.3% to 84.5%, and its ROC AUC declined by 0.03 to 0.89. The accuracy of ResNet declined from 77.4% to 73.3%. The fusion model combining Xception with DenseNet achieved the best cross-dataset accuracy of 85.8%, with an ROC AUC of 0.90. This result likely stems from combining features extracted by different architectures, since these features are complementary. The performance reduction remained moderate because OpenForensics, FaceForensics++, and CelebDF all focus on facial manipulations, which yields comparable image attributes. Even so, differences in manipulation techniques and video quality between the datasets hindered generalisation.
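The size of the domain-shift penalty can be read off by subtracting the cross-dataset accuracies in Table 3.3 from the OpenForensics accuracies in Table 3.1. The sketch below does so for a subset of the models; the helper name `drop_points` and the choice of models shown are illustrative.

```python
# (OpenForensics accuracy, cross-dataset accuracy), copied from Tables 3.1 and 3.3
accuracy = {
    "Xception": (0.8833, 0.8452),
    "DenseNet-121": (0.8346, 0.7928),
    "ResNet": (0.7742, 0.7325),
    "Mesonet": (0.8096, 0.7714),
    "Xception + ResNet": (0.8947, 0.8553),
    "Xception + DenseNet": (0.8963, 0.8578),
}


def drop_points(in_domain: float, cross: float) -> float:
    """Accuracy lost under domain shift, in percentage points."""
    return round((in_domain - cross) * 100, 2)


drops = {name: drop_points(*pair) for name, pair in accuracy.items()}
for name, drop in drops.items():
    print(f"{name:22s} -{drop:.2f}")
```

For the models listed, the drops all lie between roughly 3.8 and 4.2 percentage points, so the fused models keep their lead mainly because they start from a higher in-domain accuracy.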



3.2 Discussions

The results indicate that Xception outperformed the other base models on the OpenForensics dataset [26], likely because it effectively captures spatial features and the subtle artifacts characteristic of deepfake manipulations. This sensitivity to subtle differences between real images and their fake counterparts makes Xception well suited to binary deepfake detection. DenseNet-121 also performed strongly on OpenForensics, but its precision on fake images was lower (Prec. (1) of 0.78 against Prec. (0) of 0.88 in Table 3.1), possibly indicating sensitivity to specific manipulation features in the dataset. The lower accuracies of ResNet and Mesonet indicate that these models had difficulty extracting adequate features from this dataset.

Combining features through Xception + DenseNet-121 and Xception + ResNet achieved better performance than the standalone architectures. Xception detects spatial features effectively, while DenseNet-121 builds a hierarchical feature map through dense connectivity and ResNet preserves features through skip connections. Combining networks with different feature extraction mechanisms therefore strengthens detection by drawing on several feature domains. Mesonet + ResNet showed limited improvement, possibly because overlapping features in these models reduced the advantage of combining them. To realise the full benefit of feature fusion, the selected models should extract distinct features.

The cross-dataset evaluation on the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset [27] further revealed the challenges of domain shift. Models trained on OpenForensics performed worse on this dataset because the two datasets differ in manipulation techniques and video quality. The fusion models nevertheless retained the highest performance, plausibly because their complementary features capture a wider range of deepfake signatures. OpenForensics shares characteristics with the Face Forensic++ & Celeb-DF dataset, since both contain manipulated facial images at a resolution of 256x256. However, variations in data distributions still posed challenges, which emphasizes the need for models that can adapt to diverse real-world scenarios [2, 5, 4].

This study suggests that combining features is a promising approach to improving deepfake detection; however, several challenges remain. The computational expense of training and testing several models on the large 140,000-image OpenForensics dataset limits scalability in real-world applications. The diminishing returns observed for some combinations suggest that models must be selected carefully to avoid redundant features. Cross-dataset generalisation must also be taken seriously: future work should investigate domain adaptation techniques such as fine-tuning on the target dataset and domain-invariant feature fusion. Practical deployment of a deepfake detection system further requires reducing the computational cost of feature fusion, for example through model pruning.


Chapter 4

Conclusion

This study proposes a feature fusion method for deepfake detection based on Convolutional Neural Networks (CNNs). The approach was tested by concatenating the feature spaces of base models such as Xception, DenseNet-121, ResNet, and Mesonet. Xception performed best among the base models, with an accuracy of 88.33%. Combining models improved on this, particularly for Xception + DenseNet-121 and Xception + ResNet, with a best accuracy of 89.6% and a macro-average F1-score of 0.89. These results suggest that the combination approach benefits from the complementary characteristics of the fused models. Nevertheless, several problems remain to be addressed, such as the increased computational demands of training several models and the diminishing returns of merging architectures with shared feature spaces.

4.1 Future Work Directions

To further advance the field of deepfake detection, several promising directions can be explored. First, enabling real-time detection is a critical step toward practical deployment. This requires either model architectures with fewer parameters or the use of model pruning and quantization, which reduce inference time enough to make live video streaming viable. Real-time detection would be highly advantageous for social media platforms and live broadcasting services, since it helps stop deepfake content from spreading quickly.

Second, mobile or browser-based deployment could democratize access to deepfake detection tools. Frameworks such as TensorFlow Lite and ONNX Runtime could make the model deployable on resource-constrained systems. Integration into mobile operating systems would enable users to verify the authenticity of media on their own devices, promoting better digital skills and more trustworthy interactions with online content. For browser-based deployment, the combination of WebAssembly and WebGPU allows fast inference directly on users’ devices without their data leaving the local computing environment.

Lastly, extending the use of Large Language Models (LLMs) or transformers to temporal coherence analysis could improve the detection of video-based deepfakes. The existing CNN-based models detect spatial features in single frames, whereas deepfakes may exhibit temporal patterns across frames, such as unusual eye or lip movement. Transformers are well suited to sequential data, which allows them to examine temporal coherence by processing sequences of frame or audio-visual features. A Vision Transformer with temporal attention could detect motion inconsistencies, and could be paired with LLMs that check the synchronization of audio with visual content to flag manipulation mismatches. Such multimodal techniques show strong potential for improving the accuracy of detecting complex video deepfakes.


Bibliography

[1] Thanh Thi Nguyen et al. “Deep learning for deepfakes creation and detection: A survey”. In: Computer Vision and Image Understanding 223 (2022), p. 103525. issn: 1077-3142. doi: 10.1016/j.cviu.2022.103525. url: https://www.sciencedirect.com/science/article/pii/S1077314222001114.

[2] Reality Defender. History of Deepfakes. 2023. url: https://www.realitydefender.com/insights/history-of-deepfakes.

[3] Ian J. Goodfellow et al. Generative Adversarial Networks. 2014. arXiv: 1406.2661 [stat.ML]. url: https://arxiv.org/abs/1406.2661.

[4] Plural Policy. Deepfake Laws: A Growing Response to AI-Generated Deception. 2024. url: https://pluralpolicy.com/blog/deepfake-laws/.

[5] Thomson Reuters. Deepfakes: Federal and State Regulation. 2023. url: https://www.thomsonreuters.com/en-us/posts/government/deepfakes-federal-state-regulation/.

[6] Md Shohel Rana et al. “Deepfake Detection: A Systematic Literature Review”. In: IEEE Access 10 (2022), pp. 25494–25513. doi: 10.1109/ACCESS.2022.3154404.

[7] Ravikant Ranout and CRS Kumar. “Unmasking the Illusions: A Comprehen-
sive Study on Deepfake Videos and Images”. In: Apr. 2024, pp. 1–7. doi:
10.1109/I2CT61223.2024.10543839.

[8] Haoran Tang. “Image Classification based on CNN: Models and Modules”.
In: 2022 International Conference on Big Data, Information and Computer Network
(BDICN). 2022, pp. 693–696. doi: 10.1109/BDICN55575.2022.00134.

[9] Darius Afchar et al. “Mesonet: a compact facial video forgery detection net-
work”. In: 2018 IEEE international workshop on information forensics and security
(WIFS). IEEE. 2018, pp. 1–7.

[10] Christian Szegedy et al. “Going deeper with convolutions”. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.


[11] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI
Generated Fake Face Videos by Detecting Eye Blinking. 2018. arXiv: 1806.02877
[cs.CV]. url: https://arxiv.org/abs/1806.02877.

[12] Shruti Agarwal et al. “Detecting deep-fake videos from phoneme-viseme
mismatches”. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition workshops. 2020, pp. 660–661.

[13] Pattrick Ritter et al. “Comparative Analysis and Evaluation of CNN Models for Deepfake Detection”. In: 2023 4th International Conference on Artificial Intelligence and Data Sciences (AiDAS). 2023, pp. 250–255. doi: 10.1109/AiDAS60501.2023.10284611.

[14] Qasim Jaleel and Israa Hadi. “Facial Action Unit-Based Deepfake Video Detection Using Deep Learning”. In: 2022 4th International Conference on Current Research in Engineering and Science Applications (ICCRESA). 2022, pp. 228–233. doi: 10.1109/ICCRESA57091.2022.10352085.

[15] Qingtong Liu et al. “Enhancing Deepfake Detection with Diversified Self-Blending Images and Residuals”. In: IEEE Access (2024), pp. 1–1. doi: 10.1109/ACCESS.2024.3382196.

[16] Ziming Yang et al. “Masked relation learning for deepfake detection”. In:
IEEE Transactions on Information Forensics and Security 18 (2023), pp. 1696–
1708.

[17] Cairong Zhao et al. “ISTVT: interpretable spatial-temporal video transformer
for deepfake detection”. In: IEEE Transactions on Information Forensics and Se-
curity 18 (2023), pp. 1335–1348.

[18] Yang Yu et al. “Augmented multi-scale spatiotemporal inconsistency magni-
fier for generalized deepfake detection”. In: IEEE Transactions on Multimedia
25 (2023), pp. 8487–8498.

[19] Yang Yu et al. “Pvass-mdd: predictive visual-audio alignment self-supervision
for multimodal deepfake detection”. In: IEEE Transactions on Circuits and Sys-
tems for Video Technology (2023).

[20] Ammarah Hashmi et al. AVTENet: Audio-Visual Transformer-based Ensemble
Network Exploiting Multiple Experts for Video Deepfake Detection. 2023. arXiv:
2310.13103 [cs.CV]. url: https://arxiv.org/abs/2310.13103.

[21] Leonardo Mongelli, Luca Maiano, and Irene Amerini. “CMDD: A novel multimodal two-stream CNN deepfakes detector”. In: vol. 3677. 2024, pp. 17–30. url: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193214716&partnerID=40&md5=20bef55dd103027ee54b7076c64063b8.

[22] Andreas Rössler et al. FaceForensics++: Learning to Detect Manipulated Facial
Images. 2019. arXiv: 1901.08971 [cs.CV].


[23] Yuezun Li et al. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Foren-
sics. 2020. arXiv: 1909.12962 [cs.CR].

[24] Yinan He et al. ForgeryNet: A Versatile Benchmark for Comprehensive Forgery
Analysis. 2021. arXiv: 2103.05630 [cs.CV].

[25] Brian Dolhansky et al. The DeepFake Detection Challenge (DFDC) Dataset. 2020.
arXiv: 2006.07397 [cs.CV]. url: https://arxiv.org/abs/2006.07397.

[26] Trung-Nghia Le et al. “OpenForensics: Large-Scale Challenging Dataset For
Multi-Face Forgery Detection And Segmentation In-The-Wild”. In: Interna-
tional Conference on Computer Vision. 2021.

[27] Chandra Sekhar Nandu. 1000 Videos Split: A Combined FaceForensics++ and CelebDF Dataset. 2023. url: https://www.kaggle.com/datasets/nanduncs/1000-videos-split.
