Deepfake Detection via Feature-Level Fusion of Convolutional Neural Networks
Capstone Report
Turan Nurgozhin
Nazarbayev University
Department of Electrical and Computer Engineering
School of Engineering and Digital Sciences

Copyright © Nazarbayev University
This project report was created on the TeXstudio editing platform using LaTeX. All the figures were drawn using the draw.io online software tool.

Electrical and Computer Engineering
Nazarbayev University
http://www.nu.edu.kz

Title: Deepfake Detection via Feature-Level Fusion of Convolutional Neural Networks
Theme: Deepfake detection
Project Period: Spring 2025
Project Group: Machine Learning Laboratory
Participant(s): Turan Nurgozhin
Supervisor(s): Amin Zollanvari
Copies: 1
Page Numbers: 25
Date of Completion: April 25, 2025

Abstract: The rapid rise of deepfake technology poses significant challenges, including misinformation and identity fraud. This creates a growing need for robust detection systems. This project explores a feature fusion approach to deepfake detection by integrating the feature vectors of multiple CNN models. Four base architectures (Xception, DenseNet-121, ResNet, and Mesonet) and their paired combinations were trained on the OpenForensics dataset and evaluated for binary classification of real and fake images. Cross-dataset testing was conducted on 16,433 video frames from the FaceForensics++ and CelebDF datasets. The Xception model achieved the highest base model accuracy of 88.3%, while the Xception + DenseNet-121 combination outperformed all configurations with an accuracy of 89.6% and a macro average F1-score of 0.89. The results show that feature-level combination of complementary feature spaces improves detection performance, highlighting the promise of this direction. Directions for future improvement include increased computation power, larger dataset sizes, and preventing diminishing returns.
This article serves as groundwork for improving deepfake detection techniques using feature-level fusion and collaborative model architectures.

The content of this report is freely available, but publication (with reference) may only be pursued by agreement with the author(s).

Contents
Preface
1 Introduction
1.1 Ethical and Professional Responsibilities
2 Methodology
2.1 Data Preparation
2.2 Data Preprocessing
2.3 Model Architecture
2.4 Training Strategy
2.5 Model Evaluation
3 Results and Discussions
3.1 Results
3.2 Discussions
4 Conclusion
4.1 Future Work Directions
Bibliography

Preface

Deepfake detection is a rapidly evolving field, driven by the growing accessibility and misuse of generative artificial intelligence. The rise of deepfakes has introduced significant risks in various domains, including security, media integrity, and individual privacy, making their detection an urgent priority. This project was undertaken with the aim of contributing to the global effort to combat the misuse of deepfake technology by proposing innovative solutions for accurate and robust detection. Using feature fusion among multiple convolutional neural networks, this work demonstrates the potential of machine learning in solving real-world problems.
The effort has broader implications for deepfake detection in fields related to digital security, media forensics, and the ethics surrounding artificial intelligence.

I would like to express gratitude to my supervisor, Professor Amin Zollanvari, for his unwavering support and guidance throughout the course of this research. Without his mentorship, this project would not have reached its current level of quality and completion. I am also profoundly thankful to the School of Engineering and Digital Sciences, particularly the Electrical and Computer Engineering Department, for their invaluable guidance in shaping my academic and professional direction, and for fostering my passion for the intersection of artificial intelligence and electrical engineering.

Nazarbayev University, April 25, 2025
Turan Nurgozhin

Chapter 1
Introduction

Deepfakes are synthetic images, videos, and audio content created with AI and machine learning tools to depict real or non-existent people. The resulting fake content brings risks of misinformation, political exploitation, identity theft, and various fraudulent activities. As the technology keeps developing due to advancements in Generative AI [1], the need to prevent the possible consequences using deepfake detection tools is becoming increasingly crucial. Therefore, a new domain of research in machine learning has emerged to develop and implement algorithms capable of identifying AI-generated media.

The development of deepfake technology reflects significant advancements in the fields of artificial intelligence and computer vision through technical innovation and societal processes. Early efforts in the 1990s involved using Computer-Generated Imagery (CGI) to generate convincing human faces, which formed the foundation for synthetic media [2]. A breakthrough came in 2014 when Generative Adversarial Networks (GANs) were proposed by Goodfellow et al.
[3], which made it possible to generate highly realistic synthetic images and videos. The term "deepfake" was coined in 2017 by a Reddit user who established a subreddit for face-swapping in pornography, marking a rise in public interest and misuse [2]. By 2018, growing concerns among experts prompted tech platforms to introduce moderation policies, and legal and regulatory actions were taken globally to mitigate the associated risks to privacy, security, and democratic processes. In the US, the 2019 National Defense Authorization Act mandates annual reporting on deepfake threats, while introduced bills, such as the DEEPFAKES Accountability Act (H.R. 5586), aim to provide legal recourse to victims of malicious deepfakes [4]. State laws, such as California's A.B. 602 passed in 2019, target non-consensual deepfake pornography and criminalize unauthorized use [5]. Internationally, China's Deep Synthesis Provisions, effective from 2023, prohibit deepfake creation without explicit consent, emphasizing user protection [5]. Similarly, the UK's Online Safety Bill amendments target non-consensual deepfake material [4]. These regulatory approaches trace the rapid development of deepfakes from technical experiments to an international challenge requiring effective detection mechanisms and associated research in the machine learning field.

According to Rana et al. [6], research into detecting deepfake media began following the rise of deepfake creation services online in late 2017. The authors outline four primary methods for detecting deepfakes:
• Machine learning-based techniques
• Deep learning-based techniques
• Statistical measurement techniques
• Blockchain-based techniques

Among them, deep learning methods have been the most widely studied, accounting for 77% of the research in the domain. Deep learning methods have also shown better accuracy and efficiency, with an AUC of 0.917 and a mean accuracy of 89.73%. According to Ranout et al.
[7], deepfake detection consists of several steps, including data gathering, face detection, feature extraction, feature selection, model selection, and model evaluation. Since this research focuses mainly on detecting video deepfakes, these steps are applied to video content, and the deep learning models selected are those trained on video datasets.

Deep learning architectures, particularly Convolutional Neural Networks (CNNs), are highly effective for processing and analyzing images and videos by learning patterns and features within the data. This capability enables CNNs to handle tasks such as image classification, object detection, and deepfake detection. As described by Tang [8], CNNs operate through several key stages:
• Convolution: The input image is processed using a set of convolution layers that extract features like edges, shapes, and textures. The neural network captures important representations of the image data.
• Activation: The output passes through a non-linear activation function, commonly ReLU, to allow the network to learn more complex patterns in the data.
• Pooling: The activation maps are then downsampled using pooling operations to make the CNN more efficient and enhance its robustness.
• Fully Connected Layers: The output from the previous layers is passed to fully connected layers, which use the previously extracted features to perform classification or prediction tasks.

CNNs are popular for deepfake detection models due to their minimal preprocessing requirements and their ability to efficiently process the 2D grid structure of images. One of the earliest such models was Mesonet, introduced by Afchar et al. in 2018 [9]. This model processes image frames of 256x256 pixels and consists of two similar architectures:
• Meso-4 includes four consecutive layers that perform convolution, batch normalization, and pooling, with a final dropout layer added to prevent overfitting and enhance model robustness.
This architecture contains a total of 27,977 trainable parameters.
• MesoInception-4 is an adaptation of Meso-4, where the first two convolutional layers are replaced with inception modules, originally introduced by Szegedy et al. [10].

Afchar et al. reported that, when it was initially published, the Mesonet model achieved a detection rate of over 98% on deepfake content and 95% accuracy on content produced with Face2Face manipulation tools.

While the previous algorithms focused on the mesoscopic properties of images, detecting AI-generated deepfake videos requires different feature extraction approaches. One of the most obvious approaches is eye blinking analysis. Li et al. [11] combined the VGG16 CNN model with Long-term Recurrent Convolutional Networks (LRCNs) to capture the temporal dynamics of eye blinking, a natural physiological function missing in many synthesized videos. Their method was tested on the CEW dataset, which includes 1,232 open-eye images and 1,193 closed-eye images, as well as on the custom-made Eye Blinking Videos (EBV) dataset, which consisted of 50 videos of 30 seconds each, with reported results reaching up to 99% at the time of the model evaluation. Another technique for detecting deepfake videos looks at occasional inconsistencies in the visemes associated with the pronounced letters "M", "B", or "P". Agarwal et al. [12] used Google's Speech-to-Text API to convert the video subjects' speech to text, and then compared the mentioned visemes with the letters pronounced at specific times on preprocessed images of 256x256 pixels. The model was built on the Xception CNN architecture to classify open and closed mouth patterns, and achieved a reported performance of up to 97.0% on the A2V dataset.

The rapid advancement of deepfake generation tools has significantly diminished the effectiveness of earlier detection models. As deepfake techniques evolve, more sophisticated detection algorithms are necessary to stay ahead.
One such model is EfficientNetB7, a convolutional neural network introduced in 2019 that has demonstrated strong performance in deepfake detection. In comparison to older models such as ResNet-152 and MobileNetV3, EfficientNetB7 achieved the highest testing accuracy of 75% on the FaceForensics++ dataset, as reported by Ritter et al. [13].

Another notable deepfake detection approach is the facial action unit-based algorithm proposed by Jaleel et al. [14], which operates in two phases. Initially, it captures the distinct facial features and expressions of an individual to create a profile for the "Person of Interest" (POI). In the second phase, it classifies test subjects based on their facial action units to determine the authenticity of the data. The authors reported an accuracy rate of 95.75% for this method when it was published in 2022.

Liu et al. [15] addressed several shortcomings of existing detection algorithms in their research, including their lack of effectiveness in cross-dataset experiments, the limitations of single detection methods, and the need for improved robustness through training on diverse datasets with adversarial techniques. To overcome these challenges, they proposed a hybrid approach combining deep neural networks with fine-grained artifact feature analysis. This method enhances the ability to detect complex deepfake manipulations by analyzing subtle details, textures, and other intricate features that may be missed by traditional techniques. The proposed model achieved an accuracy of 98.20%, surpassing some other detection algorithms.

In addition, several advanced strategies have recently been adopted for deepfake detection models. For example, a paper by Yang et al. [16] introduced a deepfake detection approach that frames the task as a graph classification problem.
In this model, a spatiotemporal attention module was used to capture attention features across facial regions represented as vertices. Meanwhile, Zhao et al. [17] suggested increasing the robustness of deepfake detection by using the Interpretable Spatial-Temporal Video Transformer (ISTVT), which incorporated a decomposed spatial-temporal self-attention mechanism along with a self-subtraction method. In 2023, Yu et al. [18] introduced the Augmented Multi-scale Spatiotemporal Inconsistency Magnifier (AMSIM), which utilized a dual-view strategy, consisting of a Global Inconsistency View (GIV) and a Multi-timescale Local Inconsistency View (MLIV), to detect subtle spatiotemporal inconsistencies in videos. They further proposed the Predictive Visual-audio Alignment Self-supervision for Multimodal DeepFake Detection (PVASS-MDD) [19], which incorporated both visual-audio alignment and multimodal techniques. As a result, model evaluation on FaceForensics++, DFDC, FakeAVCeleb, and other datasets showed an average accuracy of 99.83%, which was higher than the state-of-the-art approaches at the time of publication.

Some recent research in deepfake detection has also focused on taking into account both visual and audio manipulations. For example, Hashmi et al. [20] suggested the Audio-Visual Transformer-based Ensemble Network (AVTENet), which is designed to leverage both audio and visual modalities. AVTENet integrates three distinct transformer-based networks and incorporates pre-trained models, utilizing both supervised and self-supervised learning techniques to extract key features from audio, video, and their combined modalities for effective deepfake identification. Another approach by Mongelli et al. [21] introduced a multimodal two-stream CNN model, known as CMDD, designed to integrate both audio and visual cues to improve detection accuracy.
Evaluated on the FakeAVCeleb dataset, this model achieved 98.9% accuracy, outperforming several baseline models by leveraging key features from both modalities.

It should also be mentioned that CNN-based deepfake detection models depend heavily on datasets that are cleanly separated into sets of real and manipulated images and videos. The datasets listed below offer a wide range of artifacts and manipulations and are widely used by the mentioned deepfake detection models for training and evaluation.
• FaceForensics++, introduced by Rössler et al. [22] in 2019, consists of 1,000 manipulated videos generated from 977 YouTube videos that feature identifiable faces. The dataset was created using various face manipulation techniques, such as DeepFakes, Face2Face, FaceSwap, and NeuralTextures.
• The CelebDF dataset, proposed by Li et al. [23] in 2020, contains 5,639 deepfake videos of celebrities. The dataset features videos averaging 13 seconds at 30 frames per second (FPS). It introduced several adversarial challenges, including low resolution, color mismatches, temporal flickering, and inaccurate facial masks.
• ForgeryNet, introduced in 2021 by He et al. [24], is the largest publicly available deepfake media dataset, with 2.9 million images and over 221,000 videos. ForgeryNet incorporates various manipulations and perturbations across the CREMAD, RAVDESS, VoxCeleb2, and AVSpeech datasets.
• The DeepFake Detection Challenge (DFDC) dataset, developed by Facebook AI [25], comprises over 100,000 face-swapped videos sourced from 3,426 paid actors. The dataset includes videos recorded under different lighting conditions and manipulated using various techniques, such as GAN-based face-swapping methods.

As datasets continue to evolve in scale and complexity, newer deepfake datasets, such as the DFDC, highlight the growing need for detection models to be adaptive and scalable.
For instance, models like EfficientNet-B0 and ResNet-18 have shown declining performance on these larger, more diverse datasets [24]. Therefore, it is important to continuously develop both the detection algorithms and the datasets to include more complex perturbations and adversarial techniques.

1.1 Ethical and Professional Responsibilities

• Ethical Responsibility: In developing a deepfake detection system, it is necessary to consider ethical issues such as the privacy of the datasets as well as the possible consequences of inaccurate responses of the system. First, the risk of false positive or false negative classifications is significant, as incorrect results in real-world cases could lead to individuals being wrongly accused or, conversely, to the unchecked spread of deepfakes. To address this, I plan to conduct thorough testing of the model on diverse datasets to ensure accuracy and fairness. It is also crucial not to promise perfect results when offering deepfake detection services, and to obtain users' consent to this limitation. Another ethical concern involves the privacy of the data used during training. Some deepfake datasets include real personal data, so it is essential to ensure that all data used complies with privacy laws in Kazakhstan. I will avoid using any personal data that could harm the privacy rights of individuals, and instead use open-source data that was collected with the consent of the authors and the real people depicted. Additionally, there is a risk that the technology could be modified and reused for malicious purposes, such as suppressing real media content. Allowing others to modify the code and misrepresent the results could again lead to real content being wrongly labeled as fake. Therefore, access to the system will be carefully controlled to prevent such risks and to adhere to the ethical responsibilities.
• Informed Judgments: To make sure that the decision-making process during the capstone project development is well-informed, I will rely on feedback from experts in the machine learning field, and keep in mind the potential damage to society when the system gives wrong answers. On the technical side, I plan to review relevant and authoritative scientific papers from the field of machine learning and deepfake detection. Then, I plan to seek advice from Professor Amin Zollanvari of the Electrical and Computer Engineering department at SEDS, Nazarbayev University, on whether the solutions I implement are valid and supported by solid evidence. The models created during the project development will be tested and evaluated on diverse datasets to ensure they meet the project's technical goals. For the societal aspects, I will keep in mind the potential impacts my system could have on users when producing incorrect results. Misidentifying real content as fake, or failing to detect harmful deepfakes, could reduce societal trust in deepfake detection systems, which would not help solve the problem of deepfake misinformation. Therefore, I will prioritize increasing the efficiency and robustness of the detection system to ensure that it gives answers that are as correct as possible and does not unintentionally harm certain groups of people.

• Global Context: A deepfake detection system is generally universal in the global context, as the model is trained to detect fake content based on a variety of technical, rather than cultural, features, such as the movement of eyes or the spectrographic properties of an image. However, some implementations of the model may require diverse datasets that feature people of different nationalities, genders, and languages to produce reliable and unbiased responses.
This is especially applicable to deepfake detection algorithms based on speech recognition or the assessment of people's appearance. For example, a model trained on a dataset that only features people from the USA or China speaking English or Chinese may not give accurate results when tested and used on videos of Kazakh people. Another implication is the access to computational resources in different parts of the world. Certain models require heavy computation, which could be an obstacle to their use in regions with less computational power. Therefore, I will make sure that the capstone project implementation features a model that is well pre-trained and optimized for use on different computers.

• Economic Impact: The short-term economic effects of the deepfake detection project could be seen in the technology market, where companies and platforms that offer social media services could integrate the system into their fake information detectors. This would help them avoid the economic damage related to the spread of misinformation and the loss of users' trust in these platforms, which could lead to a loss in profits. However, a potential challenge lies in the increased computational resources required for running deepfake detection models. As for the long-term economic effects of the project, the deepfake detection challenge may become more relevant year after year due to the massive use of open-source generative AI tools. This will lead to the creation of more jobs in the field of AI-content verification tools, somewhat similar to the development of the identity verification and cybersecurity markets.

• Environmental Impact: The environmental impact of my project, which focuses on deepfake detection software, lies primarily in promoting digital sustainability.
By improving the detection of deepfakes, the project indirectly supports the responsible use of digital resources and minimizes the misuse of technologies like AI for deceptive purposes. For example, preventing the spread of misinformation for malicious purposes could also reduce the impact of fake information in social media on environmental safety, such as climate change denial. The detection system itself, being software-based, requires relatively low computational power compared to hardware-intensive solutions. Although training deep learning models can be energy-intensive, I have taken steps to mitigate this by pre-training models on cloud-based infrastructure with servers that provide GPU resources and draw electricity from sustainable sources, thereby lowering the overall carbon footprint. Additionally, I try to introduce optimizations to the model efficiency, so that the system runs smoothly on standard hardware, avoiding the need for high-performance, energy-consuming servers. The project promotes sustainability by contributing to the ethical application of technology, reducing the potential harm caused by disinformation or manipulation of digital content. While the environmental impact in terms of energy consumption is minimal, the benefits of promoting digital integrity offer long-term support for sustainable digital practices. Overall, these efforts aim to find a balance between technological advancement and environmental responsibility.

• Societal Impact: The project aims to benefit society by reducing the spread of misleading deepfake media content.
Deepfakes can have serious consequences, from damaging individual reputations by making society believe that the subject of a video engaged in controversial actions, to disrupting democratic processes when, for instance, politicians are shown giving speeches they never gave in real life, which discredits their public image and reduces public support. A recent example could be the spread of fake face-swapped videos of presidential candidates in the United States during the 2016 and 2020 elections, which could have had a profound effect on the election process. By providing a tool to prevent the spread of such videos and images, I contribute to maintaining trust in social media, which is vital for social stability and the proper communication of information in society. The direct impact on society includes helping individuals, businesses, and governments to distinguish between real and fake content. For example, the early detection of a deepfake could prevent the manipulation of public opinion. Indirectly, the project could contribute to a broader societal awareness of digital literacy. As the public becomes more aware of the existence and risks of deepfakes, there will be a growing demand for reliable tools to verify content.

Chapter 2
Methodology

2.1 Data Preparation

To train and test the feature concatenation method on the deepfake classification and detection task, the OpenForensics dataset [26] was utilized. OpenForensics is a large-scale in-the-wild multi-face forgery detection and segmentation dataset, which includes diverse real and forged images. The dataset includes 140,000 training images (70,000 real and 70,000 fake), 39,200 validation images (19,600 real and 19,600 fake), and 10,905 test images (5,413 real and 5,492 fake). Each image contains a detectable human face at a resolution of 256x256 pixels, which makes the dataset highly appropriate for deepfake detection.
The balanced split of the dataset into training, validation, and test sets supports an unbiased assessment of the models' performance at distinguishing between real and fake content.

2.2 Data Preprocessing

After preparing the dataset, the next important step in creating a robust and efficient deepfake detection model is data augmentation. This involves adding adversarial features to images so that the model can learn variations and manipulations that will help it generalize to real-world data and successfully classify deepfake images on previously unseen datasets. The augmentations include rotation, zooming, brightness adjustment, channel shifting, and horizontal flipping. These transformations not only provide new image contexts to the detection models but also increase the effective size of the dataset, giving the CNN algorithms more examples to learn from and helping them avoid overfitting to the limited training data.

The data processing parameters for our model training were based on the benchmark Mesonet deepfake detection model research by Afchar et al. so that, in the future, the performance of individually trained base models (including Mesonet) could be compared with that of the fused models obtained by concatenating the features of the base architectures. The augmentation parameters are as follows:
• Random rotation: Up to 15 degrees.
• Random zoom: Transformations within a 20% range.
• Brightness adjustment: Variations by ±20%.
• Channel shifting: Random RGB channel value shifts for color augmentation.
• Horizontal flipping: Random mirroring of images.

Examples of image transformations and augmentations for both real and fake data are shown in Figure 2.1. The data preprocessing stage also included normalization of image pixel values to the range [0, 1] to provide consistent representation and efficient calculation of probability weights during the training process.
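The flipping, brightness, channel-shift, and normalization steps listed above can be sketched with NumPy alone. This is a minimal illustration rather than the pipeline used in the project: the function name `augment`, the channel-shift magnitude `max_shift`, and the fixed seed are assumptions, and rotation and zoom are left out because they need an interpolation library.

```python
import numpy as np


def augment(image, rng, brightness=0.2, max_shift=20.0):
    """Apply a random flip, brightness change, and channel shift to one image.

    image: uint8 array of shape (H, W, 3) with values in [0, 255].
    Returns a float32 array of the same shape, normalized to [0, 1].
    max_shift is a hypothetical value; the report does not state the shift range.
    """
    out = image.astype(np.float32)

    # Horizontal flipping: mirror the width axis with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Brightness adjustment: scale all pixels by a factor in [0.8, 1.2].
    out = out * rng.uniform(1.0 - brightness, 1.0 + brightness)

    # Channel shifting: add one random offset per RGB channel.
    out = out + rng.uniform(-max_shift, max_shift, size=(1, 1, 3))

    # Clip back to the valid pixel range, then normalize to [0, 1].
    return (np.clip(out, 0.0, 255.0) / 255.0).astype(np.float32)


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
dummy = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
augmented = augment(dummy, rng)
```

Clipping before normalization keeps brightness and channel shifts from pushing values outside [0, 1], which matches the normalization range stated in the text.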
Additionally, the training data was shuffled so that the model would not learn order-specific patterns in the data.

Figure 2.1: Examples of augmented images for real (left) and fake (right) data. Transformations include rotation, zoom, brightness adjustment, channel shifting, and horizontal flipping.

2.3 Model Architecture

To develop a robust and efficient deepfake detection model, four deep learning CNN models were selected as base models for the preliminary study and test of the research hypothesis: Mesonet, DenseNet-121, XceptionNet, and ResNet. No changes were applied to the fundamental architecture of each base model, other than removing its last fully connected (dense) layer. This is done so that the selected models can be used for feature extraction and their outputs concatenated in a shared dense layer. The combined features are then passed through the final dense, activation, and dropout layers to give the classification decision.

Two- and three-model combinations are experimented with in this research. In the proposed pipeline, two or three model feature vectors are concatenated into one vector using a Concatenate layer in the Keras deep learning library in Python. For two-model combinations, the architectures tried include Mesonet + XceptionNet and DenseNet-121 + ResNet. For three-model combinations, the research tries combinations like Mesonet + DenseNet-121 + XceptionNet. The concatenated vector is passed through a fully connected layer of 256 neurons with ReLU activation, followed by a dropout layer of rate 0.5 to prevent overfitting. The classification is then done using a dense layer with one neuron and sigmoid activation, which produces the probability of an image being real or synthetic. To ensure a fair comparison among the models of the proposed method, along with the performance of the base models, all the models were trained from scratch on the dataset described in section 2.1.
By comparing the performance of the various model combinations, the research seeks to establish the most effective architecture for deepfake detection, balancing computational cost against performance. To illustrate the feature fusion approach, Listing 2.1 contains a PyTorch code snippet that loads the base models, removes their top classification layers, extracts and concatenates the feature vectors, and defines a new classifier. The code specifically loads pre-trained Xception and DenseNet-121 backbones, freezes their weights, extracts the pooled feature vectors, concatenates them, and constructs a new trainable classification head. Figure 2.2 shows the proposed architecture, with the convolutional layers feeding into feature concatenation and eventually into the final classifier.

Listing 2.1: PyTorch implementation of feature-level fusion for deepfake detection

    import timm  # third-party package, used here because torchvision has no Xception
    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Load pre-trained base models as feature extractors.
    # num_classes=0 makes timm return pooled features instead of class logits.
    xception = timm.create_model("xception", pretrained=True, num_classes=0)
    densenet = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
    # Replace DenseNet's classifier so the model returns its pooled features
    densenet.classifier = nn.Identity()

    # Freeze the base models' weights
    for param in xception.parameters():
        param.requires_grad = False
    for param in densenet.parameters():
        param.requires_grad = False


    # Define the feature fusion model
    class FeatureFusionModel(nn.Module):
        def __init__(self, xception, densenet):
            super().__init__()
            self.xception = xception
            self.densenet = densenet
            # Xception outputs 2048 features, DenseNet-121 outputs 1024 features
            self.fc1 = nn.Linear(2048 + 1024, 256)  # Concatenated features
            self.relu = nn.ReLU()
            self.dropout = nn.Dropout(0.5)
            self.fc2 = nn.Linear(256, 1)  # Binary classification (real vs. fake)
            self.sigmoid = nn.Sigmoid()

        def forward(self, x):
            xception_features = self.xception(x).flatten(1)  # shape (N, 2048)
            densenet_features = self.densenet(x).flatten(1)  # shape (N, 1024)
            combined = torch.cat((xception_features, densenet_features), dim=1)
            x = self.fc1(combined)
            x = self.relu(x)
            x = self.dropout(x)
            x = self.fc2(x)
            return self.sigmoid(x)


    # Instantiate the model
    model = FeatureFusionModel(xception, densenet)

Figure 2.2: Proposed architecture for deepfake detection, illustrating the feature-level fusion of convolutional neural networks. The architecture includes multiple convolutional layers (conv1 to conv_n), max pooling, feature concatenation, and fully connected layers (fc_1 to fc_k + fc_l) leading to a softmax output.

2.4 Training Strategy

The training approach aims to improve the model's generalization through data augmentation, early stopping callbacks, and learning rate reduction. As stated in Section 2.2, data augmented dynamically in real time helps the model detect deepfake artifacts under a variety of image conditions. The learning procedure is regulated by callbacks that keep training efficient. For example, early stopping halts training, and thereby saves computation, once validation performance stabilizes, and the best model is stored for future use. Moreover, the learning rate used by the gradient-descent optimizer decays step-wise during training in order to fine-tune the model's learning in later epochs. Training was carried out on Kaggle's cloud infrastructure, using an A100 GPU to accelerate computation.
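The callback behaviour described in this section can be sketched in plain Python. This is a minimal illustration, not the configuration used in the project: the class `EarlyStopping`, the functions `step_decay` and `run_training`, and all numeric settings (a patience of 3 epochs, halving the rate every 5 epochs, the validation losses) are hypothetical.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.wait = 0

    def update(self, epoch, val_loss):
        """Record one epoch and return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch  # a real callback would save a checkpoint here
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=5):
    """Step-wise schedule: multiply the rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))


def run_training(val_losses, initial_lr=1e-3, patience=3):
    """Replay a list of per-epoch validation losses (hypothetical values)."""
    stopper = EarlyStopping(patience=patience)
    history = []
    for epoch, val_loss in enumerate(val_losses):
        lr = step_decay(initial_lr, epoch)
        history.append((epoch, lr, val_loss))
        if stopper.update(epoch, val_loss):
            break
    return stopper, history


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.60, 0.48, 0.41, 0.39, 0.40, 0.39, 0.41, 0.38, 0.37]
stopper, history = run_training(losses)
```

With these hypothetical losses, training stops at epoch 6, epoch 3 is kept as the best model, and the learning rate has been halved once by then. In a Keras pipeline the same roles are usually played by built-in callbacks rather than hand-written classes.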
The entire training pipeline, from data augmentation and model training to hyperparameter tuning, took approximately 3 days. This duration covers training all the base models and their combinations, each trained from scratch on the OpenForensics dataset of 140,000 images. Kaggle's environment provided an accessible and scalable platform that met the computational needs of deepfake detection without requiring local high-performance hardware.

2.5 Model Evaluation

The performance of the proposed binary classification model for deepfake detection is evaluated using multiple metrics in two phases: first on the primary dataset (OpenForensics), and then through a cross-dataset evaluation that assesses generalizability on the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset. The evaluation begins with a confusion matrix, which compares the predicted outputs to the actual labels and yields the following key values:

• True Positive (TP): The model correctly predicts the positive class (fake).
• True Negative (TN): The model correctly predicts the negative class (real).
• False Positive (FP): The model predicts the positive class (fake) when the image is actually negative (real).
• False Negative (FN): The model predicts the negative class (real) when the image is actually positive (fake).

Using these values, the model's performance is assessed through the following metrics:

• Accuracy: Measures the overall correctness of the model's predictions.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
• Precision: Indicates the proportion of positive predictions that are correct.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
• Recall: Represents the model's ability to identify positive cases.
\[ \text{Recall} = \frac{TP}{TP + FN} \]
• F1-Score: Combines precision and recall into a single metric, reflecting their harmonic mean.
\[ \text{F1-Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \]

To evaluate the generalizability of the models, a cross-dataset evaluation was conducted using the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset [27]. This dataset consists of face-cropped images derived from 1,000 videos, combining samples from FaceForensics++ and Celeb-DF, with a total of 16,433 images across the real and fake classes. For the cross-dataset evaluation, the test set of 2,400 images (1,200 real and 1,200 fake) was used. Unlike the primary dataset, OpenForensics, which was used for training and initial evaluation, the Face Forensic++ & Celeb-DF dataset provides a variety of manipulations and a balanced mix of real and fake frames, introducing challenges such as differing manipulation techniques and data distributions. The trained models and their combinations were applied directly to this test set without fine-tuning or domain adaptation. For this evaluation, a subset of metrics (Accuracy, Macro Average F1-Score, and ROC AUC) was used to maintain consistency with the primary evaluation while focusing on key indicators of performance and robustness. This cross-dataset evaluation complements the primary evaluation by assessing the models' ability to handle domain shifts, a critical factor for real-world deepfake detection applications.

Chapter 3 Results and Discussions

3.1 Results

Table 3.1: Performance metrics for different model architectures and combinations: Accuracy, Precision, Recall, and ROC AUC.

| Model / Combination | Acc. | Prec. (0) | Prec. (1) | Rec. (0) | Rec. (1) | ROC AUC |
|---|---|---|---|---|---|---|
| Xception | 0.8833 | 0.89 | 0.87 | 0.87 | 0.89 | 0.92 |
| DenseNet-121 | 0.8346 | 0.88 | 0.78 | 0.76 | 0.90 | 0.89 |
| ResNet | 0.7742 | 0.82 | 0.73 | 0.68 | 0.86 | 0.83 |
| Mesonet | 0.8096 | 0.84 | 0.78 | 0.76 | 0.85 | 0.85 |
| Mesonet + ResNet | 0.8225 | 0.85 | 0.79 | 0.74 | 0.87 | 0.87 |
| Mesonet + DenseNet | 0.8404 | 0.86 | 0.80 | 0.77 | 0.88 | 0.90 |
| Mesonet + Xception | 0.8756 | 0.89 | 0.86 | 0.86 | 0.89 | 0.91 |
| Xception + ResNet | 0.8947 | 0.90 | 0.88 | 0.88 | 0.90 | 0.93 |
| Xception + DenseNet | 0.8963 | 0.91 | 0.88 | 0.88 | 0.91 | 0.93 |
| ResNet + DenseNet | 0.8563 | 0.87 | 0.82 | 0.79 | 0.88 | 0.90 |

Table 3.2: Performance metrics for different model architectures and combinations: F1-Scores and Macro Averages.

| Model / Combination | F1 (0) | F1 (1) | Macro Avg Prec. | Macro Avg Rec. | Macro Avg F1 |
|---|---|---|---|---|---|
| Xception | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 |
| DenseNet-121 | 0.82 | 0.84 | 0.83 | 0.83 | 0.83 |
| ResNet | 0.74 | 0.79 | 0.77 | 0.77 | 0.77 |
| Mesonet | 0.80 | 0.81 | 0.81 | 0.81 | 0.81 |
| Mesonet + ResNet | 0.79 | 0.83 | 0.82 | 0.81 | 0.81 |
| Mesonet + DenseNet | 0.81 | 0.84 | 0.83 | 0.83 | 0.83 |
| Mesonet + Xception | 0.87 | 0.87 | 0.88 | 0.88 | 0.87 |
| Xception + ResNet | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 |
| Xception + DenseNet | 0.89 | 0.90 | 0.90 | 0.89 | 0.89 |
| ResNet + DenseNet | 0.83 | 0.85 | 0.84 | 0.84 | 0.84 |

Table 3.3: Cross-dataset evaluation on the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset: Accuracy, Precision, Recall, and ROC AUC.

| Model / Combination | Acc. | Prec. (0) | Prec. (1) | Rec. (0) | Rec. (1) | ROC AUC |
|---|---|---|---|---|---|---|
| Xception | 0.8452 | 0.86 | 0.83 | 0.83 | 0.86 | 0.89 |
| DenseNet-121 | 0.7928 | 0.83 | 0.75 | 0.73 | 0.85 | 0.85 |
| ResNet | 0.7325 | 0.77 | 0.69 | 0.65 | 0.80 | 0.80 |
| Mesonet | 0.7714 | 0.80 | 0.74 | 0.72 | 0.82 | 0.82 |
| Mesonet + ResNet | 0.7842 | 0.81 | 0.75 | 0.71 | 0.83 | 0.84 |
| Mesonet + DenseNet | 0.8019 | 0.82 | 0.77 | 0.74 | 0.84 | 0.86 |
| Mesonet + Xception | 0.8365 | 0.85 | 0.82 | 0.82 | 0.85 | 0.88 |
| Xception + ResNet | 0.8553 | 0.87 | 0.84 | 0.84 | 0.87 | 0.90 |
| Xception + DenseNet | 0.8578 | 0.88 | 0.84 | 0.84 | 0.88 | 0.90 |
| ResNet + DenseNet | 0.8192 | 0.84 | 0.79 | 0.76 | 0.85 | 0.87 |

Table 3.4: Cross-dataset evaluation on the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset: F1-Scores and Macro Averages.

| Model / Combination | F1 (0) | F1 (1) | Macro Avg Prec. | Macro Avg Rec. | Macro Avg F1 |
|---|---|---|---|---|---|
| Xception | 0.84 | 0.84 | 0.85 | 0.85 | 0.84 |
| DenseNet-121 | 0.78 | 0.80 | 0.79 | 0.79 | 0.79 |
| ResNet | 0.71 | 0.74 | 0.73 | 0.73 | 0.73 |
| Mesonet | 0.76 | 0.78 | 0.77 | 0.77 | 0.77 |
| Mesonet + ResNet | 0.76 | 0.79 | 0.78 | 0.77 | 0.78 |
| Mesonet + DenseNet | 0.78 | 0.80 | 0.80 | 0.79 | 0.79 |
| Mesonet + Xception | 0.83 | 0.83 | 0.84 | 0.84 | 0.83 |
| Xception + ResNet | 0.85 | 0.85 | 0.86 | 0.86 | 0.85 |
| Xception + DenseNet | 0.86 | 0.86 | 0.86 | 0.86 | 0.86 |
| ResNet + DenseNet | 0.80 | 0.82 | 0.82 | 0.81 | 0.81 |

Base models and their combinations differed substantially in performance, measured by accuracy and macro-average F1-score, as shown in Tables 3.1 and 3.2. Among the base models, Xception achieved the highest accuracy of 88.3% and a macro-average F1-score of 0.88, indicating its strength in extracting features for deepfake detection. DenseNet-121 obtained an accuracy of 83.5% and a macro-average F1-score of 0.83, a reasonable balance between recall and precision but below Xception. Mesonet and ResNet achieved accuracies of 81.0% and 77.4%, with macro-average F1-scores of 0.81 and 0.77, respectively, making them relatively less effective at detecting deepfake artifacts.

When combining models, improvements were observed in specific configurations such as Xception + ResNet and Xception + DenseNet-121. The combination of Xception and ResNet reached an accuracy of 89.5% and a macro-average F1-score of 0.89, outperforming Xception alone. Similarly, the Xception + DenseNet-121 combination produced the highest accuracy, 89.6%, with a macro-average F1-score of 0.89, the best among all models. This suggests that combining the feature spaces of these architectures improves classification accuracy. ResNet + DenseNet (85.6% accuracy, 0.84 F1-score) also improved on both of its individual components, whereas Mesonet + Xception (87.6% accuracy, 0.87 F1-score) improved on Mesonet but remained slightly below Xception alone (88.3%).
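To make these comparisons concrete, the following sketch recomputes, from the accuracies reported in Table 3.1, how far each pair sits above or below the stronger of its two base models. The variable and function names are illustrative and not part of the project code.

```python
# Accuracies copied from Table 3.1 (OpenForensics test set).
BASE_ACC = {
    "Xception": 0.8833,
    "DenseNet": 0.8346,
    "ResNet": 0.7742,
    "Mesonet": 0.8096,
}
FUSION_ACC = {
    ("Mesonet", "ResNet"): 0.8225,
    ("Mesonet", "DenseNet"): 0.8404,
    ("Mesonet", "Xception"): 0.8756,
    ("Xception", "ResNet"): 0.8947,
    ("Xception", "DenseNet"): 0.8963,
    ("ResNet", "DenseNet"): 0.8563,
}


def gain_over_best_component(pair, fused_acc):
    """Accuracy gain in percentage points over the stronger base model of the pair."""
    best_single = max(BASE_ACC[name] for name in pair)
    return round((fused_acc - best_single) * 100, 2)


gains = {pair: gain_over_best_component(pair, acc) for pair, acc in FUSION_ACC.items()}

# Print the pairs from largest to smallest gain.
for pair, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{' + '.join(pair):22s} {gain:+.2f} pp")
```

Under this measure, Xception + DenseNet gains 1.30 percentage points over Xception alone, and Mesonet + Xception is the only pair that falls below its stronger component (by 0.77 points).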
Some combinations, however, showed little synergy between the two architectures: Mesonet + ResNet (82.3% accuracy, 0.81 F1-score) stayed close to Mesonet alone, whose macro-average F1-score is also 0.81. In general, the results show that feature fusion can be beneficial for deepfake detection.

Cross-dataset evaluation on the combined FaceForensics++ and CelebDF dataset, presented in Tables 3.3 and 3.4, revealed the impact of domain shift on model performance. All models performed worse because they were trained on OpenForensics data but tested on FaceForensics++ and CelebDF data, which follow a different distribution. The accuracy of the Xception model fell by 3.8 percentage points, from 88.3% to 84.5%, and its ROC AUC declined by 0.03 to 0.89. The accuracy of ResNet declined from 77.4% to 73.3%. The Xception + DenseNet fusion model achieved the best cross-dataset accuracy, 85.8%, with an ROC AUC of 0.90. This relative robustness to domain shift plausibly results from combining feature extractors with different architectures, whose features behave in complementary ways. The performance reduction remained moderate, likely because OpenForensics, FaceForensics++, and CelebDF all focus on facial manipulations and therefore share comparable image attributes. Differences between the datasets in manipulation techniques and video quality nonetheless hindered generalisation.

3.2 Discussions

The results indicate that Xception outperformed the other base models on the OpenForensics dataset [26], likely due to its ability to capture spatial features and subtle artifacts characteristic of deepfake manipulations.
Xception is well suited to binary deepfake detection because it picks up subtle differences between real images and their fake counterparts. DenseNet-121 also performed well on the OpenForensics dataset, but its precision on fake images was lower, possibly indicating sensitivity to specific features of the manipulations in this dataset. ResNet and Mesonet scored lowest among the base models, which indicates that they had difficulty extracting adequate features from this specific dataset.

Combining features through Xception + DenseNet-121 and Xception + ResNet achieved better performance than the corresponding standalone architectures. A plausible explanation is that the architectures are complementary: Xception captures spatial features effectively, DenseNet-121 builds hierarchical feature maps through dense connectivity, and ResNet preserves features through skip connections. Combining networks with different feature extraction mechanisms thus strengthens deepfake detection by drawing on several feature domains. Mesonet + ResNet showed limited improvement, possibly because overlapping features in these models reduced the advantage of combining them. To get the most out of feature fusion, the selected models should differ in how they extract features.

The cross-dataset evaluation on the Face Forensic++ & Celeb-DF Combined Deepfake Data dataset [27] further revealed the challenges of domain shift. Models trained on OpenForensics performed worse on this dataset because the two datasets differ in manipulation techniques and video quality. The fusion models nonetheless retained the highest accuracy, plausibly because their complementary features capture a wider range of deepfake signatures.
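The size of the domain-shift effect can be read directly from Tables 3.1 and 3.3. The short sketch below, with illustrative names that are not part of the project code, computes the accuracy drop in percentage points for the base models and the best fusion model.

```python
# (in-domain accuracy from Table 3.1, cross-dataset accuracy from Table 3.3)
ACCURACY = {
    "Xception": (0.8833, 0.8452),
    "DenseNet-121": (0.8346, 0.7928),
    "ResNet": (0.7742, 0.7325),
    "Mesonet": (0.8096, 0.7714),
    "Xception + DenseNet": (0.8963, 0.8578),
}


def accuracy_drop(in_domain, cross_dataset):
    """Absolute accuracy drop in percentage points, rounded to two decimals."""
    return round((in_domain - cross_dataset) * 100, 2)


drops = {model: accuracy_drop(*accs) for model, accs in ACCURACY.items()}
best_cross = max(ACCURACY, key=lambda model: ACCURACY[model][1])

for model, drop in drops.items():
    print(f"{model:20s} -{drop:.2f} pp")
print("Highest cross-dataset accuracy:", best_cross)
```

On these numbers the fusion model's drop (3.85 points) is close to that of Xception alone (3.81 points), so its cross-dataset advantage appears to come from a higher starting accuracy rather than from a smaller loss under domain shift.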
OpenForensics shares characteristics with the Face Forensic++ & Celeb-DF dataset, since both consist of manipulated facial images at a resolution of 256x256. However, variations in data distributions still posed challenges, which emphasizes the need for models that can adapt to diverse real-world scenarios [2, 4, 5].

This study shows that combining features is a promising approach to improving deepfake detection; however, numerous challenges remain. The computational expense of training and testing several models on the large OpenForensics dataset of 140,000 images limits scalability in real-world applications. The diminishing returns observed for some combinations suggest that researchers need to select models carefully to avoid redundant features. Cross-dataset generalisation must also be taken seriously: future work should investigate domain adaptation techniques such as fine-tuning on the target dataset and domain-invariant feature fusion. Practical deployment of a deepfake detection system further requires reducing the computational cost of feature fusion, for example through model pruning.

Chapter 4 Conclusion

This study proposes a feature fusion method for deepfake detection based on training and testing Convolutional Neural Networks (CNNs). The approach was tested by concatenating the feature spaces of the base models Xception, DenseNet-121, ResNet, and Mesonet. Among the base models, Xception performed best, with an accuracy of 88.33%. Combining models improved on this, particularly for Xception + DenseNet-121 and Xception + ResNet, with a best accuracy of 89.6% and a macro-average F1-score of 0.89. Overall, these results suggest that the combination approach benefits from the complementary characteristics of the models.
Nevertheless, several problems remain to be addressed, such as the increased computational demands of training several models and the diminishing returns of merging architectures with shared feature spaces.

4.1 Future Work Directions

To further advance the field of deepfake detection, several promising directions can be explored. First, enabling real-time detection is a critical step toward practical deployment. This requires either model architectures with fewer parameters or model pruning and quantization, which reduce inference time enough to make live video streaming viable. Real-time detection would be highly valuable for social media platforms and live broadcasting services, since it helps stop deepfake content from spreading quickly.

Second, mobile or browser-based deployment could democratize access to deepfake detection tools. Frameworks such as TensorFlow Lite and ONNX Runtime can make the model deployable on resource-constrained systems. Integration into mobile operating systems would let users verify the authenticity of media on their own devices, promoting better digital literacy and more trustworthy interactions with online content. For browser-based deployment, WebAssembly and WebGPU allow fast inference directly on users' devices without their data leaving the device.

Lastly, using transformers or Large Language Models (LLMs) for temporal coherence analysis could improve detection of video-based deepfakes. The existing CNN-based models detect spatial features in single frames, whereas deepfakes may exhibit temporal patterns across frames, such as unusual eye or lip movement.
Transformers are well suited to sequential data, which lets them examine temporal coherence by processing sequences of frame or audio-visual features. For example, a Vision Transformer with temporal attention could detect motion inconsistencies, working alongside LLMs that align audio with visual content to flag mismatches that signal manipulation. Such multimodal techniques show strong potential to improve the accuracy of identifying complex video deepfakes.

Bibliography

[1] Thanh Thi Nguyen et al. “Deep learning for deepfakes creation and detection: A survey”. In: Computer Vision and Image Understanding 223 (2022), p. 103525. issn: 1077-3142. doi: 10.1016/j.cviu.2022.103525. url: https://www.sciencedirect.com/science/article/pii/S1077314222001114.
[2] Reality Defender. History of Deepfakes. 2023. url: https://www.realitydefender.com/insights/history-of-deepfakes.
[3] Ian J. Goodfellow et al. Generative Adversarial Networks. 2014. arXiv: 1406.2661 [stat.ML]. url: https://arxiv.org/abs/1406.2661.
[4] Plural Policy. Deepfake Laws: A Growing Response to AI-Generated Deception. 2024. url: https://pluralpolicy.com/blog/deepfake-laws/.
[5] Thomson Reuters. Deepfakes: Federal and State Regulation. 2023. url: https://www.thomsonreuters.com/en-us/posts/government/deepfakes-federal-state-regulation/.
[6] Md Shohel Rana et al. “Deepfake Detection: A Systematic Literature Review”. In: IEEE Access 10 (2022), pp. 25494–25513. doi: 10.1109/ACCESS.2022.3154404.
[7] Ravikant Ranout and CRS Kumar. “Unmasking the Illusions: A Comprehensive Study on Deepfake Videos and Images”. In: Apr. 2024, pp. 1–7. doi: 10.1109/I2CT61223.2024.10543839.
[8] Haoran Tang. “Image Classification based on CNN: Models and Modules”. In: 2022 International Conference on Big Data, Information and Computer Network (BDICN). 2022, pp. 693–696. doi: 10.1109/BDICN55575.2022.00134.
[9] Darius Afchar et al. “Mesonet: a compact facial video forgery detection network”. In: 2018 IEEE international workshop on information forensics and security (WIFS). IEEE. 2018, pp. 1–7.
[10] Christian Szegedy et al. “Going deeper with convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.
[11] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. 2018. arXiv: 1806.02877 [cs.CV]. url: https://arxiv.org/abs/1806.02877.
[12] Shruti Agarwal et al. “Detecting deep-fake videos from phoneme-viseme mismatches”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2020, pp. 660–661.
[13] Pattrick Ritter et al. “Comparative Analysis and Evaluation of CNN Models for Deepfake Detection”. In: 2023 4th International Conference on Artificial Intelligence and Data Sciences (AiDAS). 2023, pp. 250–255. doi: 10.1109/AiDAS60501.2023.10284611.
[14] Qasim Jaleel and Israa Hadi. “Facial Action Unit-Based Deepfake Video Detection Using Deep Learning”. In: 2022 4th International Conference on Current Research in Engineering and Science Applications (ICCRESA). 2022, pp. 228–233. doi: 10.1109/ICCRESA57091.2022.10352085.
[15] Qingtong Liu et al. “Enhancing Deepfake Detection with Diversified Self-Blending Images and Residuals”. In: IEEE Access (2024), pp. 1–1. doi: 10.1109/ACCESS.2024.3382196.
[16] Ziming Yang et al. “Masked relation learning for deepfake detection”. In: IEEE Transactions on Information Forensics and Security 18 (2023), pp. 1696–1708.
[17] Cairong Zhao et al. “ISTVT: interpretable spatial-temporal video transformer for deepfake detection”. In: IEEE Transactions on Information Forensics and Security 18 (2023), pp. 1335–1348.
[18] Yang Yu et al. “Augmented multi-scale spatiotemporal inconsistency magnifier for generalized deepfake detection”. In: IEEE Transactions on Multimedia 25 (2023), pp. 8487–8498.
[19] Yang Yu et al. “Pvass-mdd: predictive visual-audio alignment self-supervision for multimodal deepfake detection”. In: IEEE Transactions on Circuits and Systems for Video Technology (2023).
[20] Ammarah Hashmi et al. AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection. 2023. arXiv: 2310.13103 [cs.CV]. url: https://arxiv.org/abs/2310.13103.
[21] Leonardo Mongelli, Luca Maiano, and Irene Amerini. “CMDD: A novel multimodal two-stream CNN deepfakes detector”. In: vol. 3677. 2024, pp. 17–30. url: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85193214716&partnerID=40&md5=20bef55dd103027ee54b7076c64063b8.
[22] Andreas Rössler et al. FaceForensics++: Learning to Detect Manipulated Facial Images. 2019. arXiv: 1901.08971 [cs.CV].
[23] Yuezun Li et al. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. 2020. arXiv: 1909.12962 [cs.CR].
[24] Yinan He et al. ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis. 2021. arXiv: 2103.05630 [cs.CV].
[25] Brian Dolhansky et al. The DeepFake Detection Challenge (DFDC) Dataset. 2020. arXiv: 2006.07397 [cs.CV]. url: https://arxiv.org/abs/2006.07397.
[26] Trung-Nghia Le et al. “OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild”. In: International Conference on Computer Vision. 2021.
[27] Chandra Sekhar Nandu. 1000 Videos Split: A Combined FaceForensics++ and CelebDF Dataset. 2023. url: https://www.kaggle.com/datasets/nanduncs/1000-videos-split.