ORIGINAL ARTICLE
Arabic Review Dataset for Deepfake Text Detection: Collection and Generation
1
Information and Computer Science, King Fahd University of Petroleum and Minerals
2
Interdisciplinary Research Center for Intelligent Secure Systems, King Fahd University of Petroleum and Minerals, Dhahran, 31261, Saudi Arabia
Submission date: 2025-11-23
Final revision date: 2026-02-16
Acceptance date: 2026-03-15
Publication date: 2026-04-07
Corresponding author
Tarek Helmy
Information and Computer Science, King Fahd University of Petroleum and Minerals
Journal of Undergraduate Research International 2026;2(1):76-83
ABSTRACT
Distinguishing authentic text from deepfake text has become increasingly difficult, particularly for low-resource languages such as Arabic. This study constructs an Arabic dataset for deepfake text detection by collecting authentic YouTube comments and generating synthetic text with OpenAI GPT-4.0 Mini. The collected comments span four thematic domains (entertainment, religion, health, and sports) to capture common discussion topics and linguistic variation in Arabic online communities. Synthetic samples were generated with a structured, prompt-based methodology that applies predefined deception techniques to simulate realistic misleading content. To validate the dataset, a Bidirectional Encoder Representations from Transformers (BERT)-based model was fine-tuned for binary classification of authentic versus synthetic text. The fine-tuned model achieved an accuracy of 91.43%, indicating that the dataset supports effective deepfake text detection, although it remains limited in size and dialectal diversity. The dataset and methodology are expected to support future research in Arabic natural language processing and improve the reliability of automated deepfake detection approaches.
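The data layout implied by the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Sample` record, the 0/1 label encoding, the 80/20 split, and the function names `build_dataset`, `stratified_split`, and `accuracy` are all assumptions made for illustration. Only the four domain names and the use of accuracy as the reported metric come from the abstract. The sketch labels authentic and synthetic comments, splits them so that every domain and class appears in both partitions, and computes accuracy for a binary classifier's predictions.

```python
import random
from dataclasses import dataclass

# The four thematic domains named in the abstract.
DOMAINS = ("entertainment", "religion", "health", "sports")


@dataclass(frozen=True)
class Sample:
    text: str
    domain: str
    label: int  # assumed encoding: 0 = authentic comment, 1 = synthetic text


def build_dataset(real, synthetic):
    """Merge two {domain: [texts]} mappings into one labeled list."""
    samples = []
    for label, source in ((0, real), (1, synthetic)):
        for domain in DOMAINS:
            for text in source.get(domain, []):
                samples.append(Sample(text, domain, label))
    return samples


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split so each (domain, label) group contributes to train and test."""
    rng = random.Random(seed)
    groups = {}
    for sample in samples:
        groups.setdefault((sample.domain, sample.label), []).append(sample)
    train, test = [], []
    for key in sorted(groups):
        items = list(groups[key])
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In practice the placeholder texts would be replaced by the collected comments and generated samples, and the predictions passed to `accuracy` would come from the fine-tuned classifier. Stratifying by both domain and label is one design choice among several; it keeps a small dataset from producing a test set that lacks a domain entirely.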