

{
  "service": "elmekki-site-api",
  "resource": "search",
  "generated_at": "2026-05-03T22:58:18-07:00",
  "items": [
    {
      "kind": "page",
      "title": "Home",
      "url": "https://elmekki.me/",
      "summary": "UBC postdoctoral researcher in NLP and Multilingual AI. I build inclusive LLMs across text, audio, and images for low-resource languages. 10+ publications (ACL, EMNLP, NAACL).",
      "content": "Postdoctoral researcher in NLP and Multilingual AI. Publications, blog posts, CV, and contact details."
    },
    {
      "kind": "page",
      "title": "Publications",
      "url": "https://elmekki.me/articles/",
      "summary": "Peer-reviewed publications in NLP and Multilingual AI with links, abstracts, and BibTeX.",
      "content": "Journal and conference publications, abstracts, and citation metadata."
    },
    {
      "kind": "page",
      "title": "Blog",
      "url": "https://elmekki.me/blog/",
      "summary": "Research notes, project updates, and writing on multilingual AI and NLP.",
      "content": "Blog posts about multilingual AI, LLMs, speech, and evaluation."
    },


    {
      "kind": "publication",
      "title": "LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation",
      "url": "https://elmekki.me/articles/#lqm-linguistically-motivated-multidimensional-quality-metrics-for-machine-translation",
      "paper_url": "https://arxiv.org/abs/2604.18490",
      "summary": "The 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026 (Long Papers Findings) · 2026-07-05",
      "authors": "Samar M. Magdy, Fakhraddin Alwajih, Abdellah El Mekki, Wesam El-Sayed, Muhammad Abdul-Mageed",
      "venue": "The 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026 (Long Papers Findings)",
      "year": 2026,
      "content": "Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages such as Arabic, where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics. We construct a bidirectional parallel corpus of 3,850 sentences spanning seven Arabic dialects, derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages."
    },


    {
      "kind": "publication",
      "title": "Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs",
      "url": "https://elmekki.me/articles/#alexandria-a-multi-domain-dialectal-arabic-machine-translation-dataset-for-culturally-inclusive-and-linguistically-diverse-llms",
      "paper_url": "https://arxiv.org/abs/2601.13099",
      "summary": "The 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026 (Long Papers Main Conference) · 2026-07-05",
      "authors": "Abdellah El Mekki, Samar M. Magdy, Houdaifa Atou, Ruwa AbuHweidi, Baraah Qawasmeh, Omer Nacar, Thikra Al-hibiri, Razan Saadie, Hamzah Alsayadi, Nadia Ghezaiel Hammouda, Alshima Alkhazimi, Aya Hamod, Al-Yas Al-Ghafri, Wesam El-Sayed, Asila Al sharji, Mohamad Ballout, Anas Belfathi, Karim Ghaddar, Serry Sibaee, Alaa Aoun, Areej Asiri, Lina Abureesh, Ahlam Bashiti, Majdal Yousef, Abdulaziz Hafiz, Yehdih Mohamed, Emira Hamedtou, Brakehe Brahim, Rahaf Alhamouri, Youssef Nafea, Aya El Aatar, Walid Al-Dhabyani, Emhemed Hamed, Sara Shatnawi, Fakhraddin Alwajih, Khalid Elkhidir, Ashwag Alasmari, Abdurrahman Gerrio, Omar Alshahri, AbdelRahim A. Elmadany, Ismail Berrada, Amir Azad Adli Alkathiri, Fadi A Zaraket, Mustafa Jarrar, Yahya Mohamed El Hadj, Hassan Alhuzali, Muhammad Abdul-Mageed",
      "venue": "The 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026 (Long Papers Main Conference)",
      "year": 2026,
      "content": "Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves as both a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available."
    },


    {
      "kind": "publication",
      "title": "Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset",
      "url": "https://elmekki.me/articles/#pearl-a-multimodal-culturally-aware-arabic-instruction-dataset",
      "paper_url": "https://aclanthology.org/2025.findings-emnlp.1254.pdf",
      "summary": "Findings of the Association for Computational Linguistics: EMNLP 2025 · 2025-11-01",
      "authors": "Fakhraddin Alwajih, Samar M. Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Mohammed Anwar AL-Ghrawi, Aminetou Yacoub, Ruwa AbuHweidi, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou Cheikh Tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed",
      "venue": "Findings of the Association for Computational Linguistics: EMNLP 2025",
      "year": 2025,
      "content": "Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available."
    },


    {
      "kind": "publication",
      "title": "EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs",
      "url": "https://elmekki.me/articles/#eduadapt-a-question-answer-benchmark-dataset-for-evaluating-grade-level-adaptability-in-llms",
      "paper_url": "https://aclanthology.org/2025.emnlp-main.1736.pdf",
      "summary": "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: EMNLP 2025 · 2025-11-01",
      "authors": "Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed",
      "venue": "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: EMNLP 2025",
      "year": 2025,
      "content": "Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students’ grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies."
    },


    {
      "kind": "publication",
      "title": "PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture",
      "url": "https://elmekki.me/articles/#palmx-2025-the-first-shared-task-on-benchmarking-llms-on-arabic-and-islamic-culture",
      "paper_url": "https://aclanthology.org/2025.arabicnlp-sharedtasks.107.pdf",
      "summary": "Proceedings of The Third Arabic Natural Language Processing Conference: ArabicNLP 2025 · 2025-11-01",
      "authors": "Fakhraddin Alwajih, Abdellah El Mekki, Hamdy Mubarak, Majd Hawasly, Abubakr Mohamed, Muhammad Abdul-Mageed",
      "venue": "Proceedings of The Third Arabic Natural Language Processing Conference: ArabicNLP 2025",
      "year": 2025,
      "content": "Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent. Ultimately, this benchmark provides a crucial, standardized framework to guide the development of more culturally grounded and competent Arabic LLMs. Results of the shared task demonstrate that general cultural and general religious knowledge remain challenging to LLMs, motivating us to continue to offer the shared task in the future."
    },


    {
      "kind": "publication",
      "title": "NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities",
      "url": "https://elmekki.me/articles/#nilechat-towards-linguistically-diverse-and-culturally-aware-llms-for-local-communities",
      "paper_url": "https://aclanthology.org/2025.emnlp-main.556.pdf",
      "summary": "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: EMNLP 2025 · 2025-11-01",
      "authors": "Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed",
      "venue": "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: EMNLP 2025",
      "year": 2025,
      "content": "Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialect in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities.We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat."
    },


    {
      "kind": "publication",
      "title": "Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs",
      "url": "https://elmekki.me/articles/#palm-a-culturally-inclusive-and-linguistically-diverse-dataset-for-arabic-llms",
      "paper_url": "https://aclanthology.org/2025.acl-long.1579.pdf",
      "summary": "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers): ACL 2025 · 2025-07-01",
      "authors": "Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, AbdelRahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar AL-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Anfel Rouabhia, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed",
      "venue": "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers): ACL 2025",
      "year": 2025,
      "content": "As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset contains instruction–response pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world—each an author of this paper—PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform strongly, they still exhibit flaws, and smaller open-source models face greater challenges. Furthermore, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility. More information about PALM is available on our project page: https://github.com/UBC-NLP/palm."
    },


    {
      "kind": "publication",
      "title": "Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs",
      "url": "https://elmekki.me/articles/#effective-self-mining-of-in-context-examples-for-unsupervised-machine-translation-with-llms",
      "paper_url": "https://aclanthology.org/2025.findings-naacl.238.pdf",
      "summary": "Findings of the Association for Computational Linguistics: NAACL 2025 · 2025-05-01",
      "authors": "Abdellah El Mekki, Muhammad Abdul-Mageed",
      "venue": "Findings of the Association for Computational Linguistics: NAACL 2025",
      "year": 2025,
      "content": "Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task such that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations that are then used to perform sentence-level mining. As the quality of mined parallel pairs may not be optimal due to noise or mistakes, we introduce a filtering criterion to select the optimal in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset (CITATION) and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT, leading to better or comparable translation performance as translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points."
    },


    {
      "kind": "publication",
      "title": "Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks",
      "url": "https://elmekki.me/articles/#swan-and-arabicmteb-dialect-aware-arabic-centric-cross-lingual-and-cross-cultural-embedding-models-and-benchmarks",
      "paper_url": "https://aclanthology.org/2025.findings-naacl.263.pdf",
      "summary": "Findings of the Association for Computational Linguistics: NAACL 2025 · 2025-05-01",
      "authors": "Gagan Bhatia, El Moatez Billah Nagoudi, Abdellah El Mekki, Fakhraddin Alwajih, Muhammad Abdul-Mageed",
      "venue": "Findings of the Association for Computational Linguistics: NAACL 2025",
      "year": 2025,
      "content": "In this paper, we introduce Swan, a family of embedding models centred around the Arabic language, addressing both small-scale and large-scale use cases. Swan includes two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model. To evaluate these models, we propose ArabicMTEB, a comprehensive benchmark suite that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance, covering eight diverse tasks and spanning 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks, while the Swan-Small consistently surpasses Multilingual-E5-base. Our extensive evaluations demonstrate that Swan models are dialectally and culturally aware, excelling across various Arabic domains while offering significant monetary efficiency. This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic natural language processing. Our models and benchmarks will be made publicly accessible for research."
    },


    {
      "kind": "publication",
      "title": "Casablanca: Data and Models for Multidialectal Arabic Speech Recognition",
      "url": "https://elmekki.me/articles/#casablanca-data-and-models-for-multidialectal-arabic-speech-recognition",
      "paper_url": "https://aclanthology.org/2024.emnlp-main.1211.pdf",
      "summary": "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing · 2024-11-04",
      "authors": "Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, Hiba Zayed, Mohamedou Cheikh Tourad, Rahaf Alhamouri, Rwaa Assi, Aisha Alraeesi, Hour Mohamed, Fakhraddin Alwajih, Abdelrahman Mohamed, Abdellah El Mekki, El Moatez Billah Nagoudi, Benelhadj Djelloul Mama Saadia, Hamzah A. Alsayadi, Walid Al-Dhabyani, Sara Shatnawi, Yasir Ech-chammakhy, Amal Makouar, Yousra Berrachedi, Mustafa Jarrar, Shady Shehata, Ismail Berrada, Muhammad Abdul-Mageed",
      "venue": "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
      "year": 2024,
      "content": "In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: www.dlnlp.ai/speech/casablanca."
    },


    {
      "kind": "publication",
      "title": "ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting",
      "url": "https://elmekki.me/articles/#promap-effective-bilingual-lexicon-induction-via-language-model-prompting",
      "paper_url": "https://arxiv.org/abs/2310.18778",
      "summary": "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics · 2023-11-04",
      "authors": "Abdellah El Mekki, Muhammad Abdul-Mageed, ElMoatez Billah Nagoudi, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
      "year": 2023,
      "content": "Bilingual Lexicon Induction (BLI), where words are translated between two languages, is an important NLP task. While noticeable progress on BLI in rich resource languages using static word embeddings has been achieved. The word translation performance can be further improved by incorporating information from contextualized word embeddings. In this paper, we introduce ProMap, a novel approach for BLI that leverages the power of prompting pretrained multilingual and multidialectal language models to address these challenges. To overcome the employment of subword tokens in these models, ProMap relies on an effective padded prompting of language models with a seed dictionary that achieves good performance when used independently. We also demonstrate the effectiveness of ProMap in re-ranking results from other BLI methods such as with aligned static word embeddings. When evaluated on both rich-resource and low-resource languages, ProMap consistently achieves stateof-the-art results. Furthermore, ProMap enables strong performance in few-shot scenarios (even with less than 10 training examples), making it a valuable tool for low-resource language translation. Overall, we believe our method offers both exciting and promising direction for BLI in general and low-resource languages in particular."
    },


    {
      "kind": "publication",
      "title": "Fed-ANIDS: Federated learning for anomaly-based network intrusion detection systems",
      "url": "https://elmekki.me/articles/#fed-anids-federated-learning-for-anomaly-based-network-intrusion-detection-systems",
      "paper_url": "https://www.sciencedirect.com/science/article/pii/S0957417423015026",
      "summary": "Expert Systems with Applications · 2023-08-30",
      "authors": "Meryem Janati Idrissi, Hamza Alami, Abdelkader El Mahdaouy, Abdellah El Mekki, Soufiane Oualil, Zakaria Yartaoui, Ismail Berrada",
      "venue": "Expert Systems with Applications",
      "year": 2023,
      "content": "As computer networks and interconnected systems continue to gain widespread adoption, ensuring cybersecurity has become a prominent concern for organizations, regardless of their scale or size. Meanwhile, centralized machine learning-based Anomaly Detection (AD) methods have shown promising results in improving the accuracy and efficiency of Network Intrusion Detection Systems (NIDS). However, new challenges arise such as privacy concerns and regulatory restrictions that must be tackled. Federated Learning (FL) has emerged as a solution that allows distributed clients to collaboratively train a shared model while preserving the privacy of their local data. In this paper, we propose Fed-ANIDS, a NIDS that leverages AD and FL to address the privacy concerns associated with centralized models. To detect intrusions, we compute an intrusion score based on the reconstruction error of normal traffic using various AD models, including simple autoencoders, variational autoencoders, and adversarial autoencoders. We thoroughly evaluate Fed-ANIDS using various settings and popular datasets, including USTC-TFC2016, CIC-IDS2017, and CSE-CIC-IDS2018. The proposed method demonstrates its effectiveness by achieving high performance in terms of different metrics while preserving the data privacy of distributed clients. Our findings highlight that autoencoder-based models outperform other generative adversarial network-based models, achieving high detection accuracy coupled with fewer false alarms. In addition, the FL framework (FedProx), which is a generalization and re-parametrization of the standard method for FL (FedAvg), achieves better results."
    },


    {
      "kind": "publication",
      "title": "OMCD: Offensive Moroccan Comments Dataset",
      "url": "https://elmekki.me/articles/#omcd-offensive-moroccan-comments-dataset",
      "paper_url": "https://link.springer.com/article/10.1007/s10579-023-09663-2",
      "summary": "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) · 2023-06-05",
      "authors": "Kabil Essefar, Hassan Ait Baha, Abdelkader El Mahdaouy, Abdellah El Mekki, Ismail Berrada",
      "venue": "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)",
      "year": 2023,
      "content": "Offensive content, such as verbal attacks, demeaning comments, or hate speech, has become widespread on social media. Automatic detection of this content is considered an important and challenging task. Although several research works have been proposed to address this challenge for high-resource languages, research on detecting offensive content in Dialectal Arabic (DA) remains under-explored. Recently, the detection of offensive language in DA has gained increasing interest among researchers in Natural Language Processing (NLP). However, only a limited number of annotated datasets have been introduced for single or multiple coarse-grained dialects. In this paper, we introduce Offensive Moroccan Comments Dataset (OMCD), the first dataset for offensive language detection for the Moroccan dialect. First, we present the data collection steps, the statistical analysis, and the annotation guidelines of the introduced dataset. Then, we evaluate several state-of-the-art Machine Learning (ML) and Deep Learning (DL) based models on the OMCD dataset. Finally, we highlight the impact of emojis on the evaluated models for offensive language detection."
    },


    {
      "kind": "publication",
      "title": "CS-UM6P at SemEval-2022 Task 6: Transformer-based Models for Intended Sarcasm Detection in English and Arabic",
      "url": "https://elmekki.me/articles/#cs-um6p-at-semeval-2022-task-6-transformer-based-models-for-intended-sarcasm-detection-in-english-and-arabic",
      "paper_url": "https://aclanthology.org/2022.semeval-1.117.pdf",
      "summary": "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) · 2022-07-14",
      "authors": "Abdelkader El Mahdaouy, Abdellah El Mekki, Kabil Essefar, Abderrahman Skiredj, Ismail Berrada",
      "venue": "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)",
      "year": 2022,
      "content": "Sarcasm is a form of figurative language where the intended meaning of a sentence differs from its literal meaning. This poses a serious challenge to several Natural Language Processing (NLP) applications such as Sentiment Analysis, Opinion Mining, and Author Profiling. In this paper, we present our participating system to the intended sarcasm detection task in English and Arabic languages. Our system consists of three deep learning-based models leveraging two existing pre-trained language models for Arabic and English. We have participated in all sub-tasks. Our official submissions achieve the best performance on sub-task A for Arabic language and rank second in sub-task B. For sub-task C, our system is ranked 7th and 11th on Arabic and English datasets, respectively."
    },


    {
      "kind": "publication",
      "title": "UM6P-CS at SemEval-2022 Task 11: Enhancing Multilingual and Code-Mixed Complex Named Entity Recognition via Pseudo Labels using Multilingual Transformer",
      "url": "https://elmekki.me/articles/#um6p-cs-at-semeval-2022-task-11-enhancing-multilingual-and-code-mixed-complex-named-entity-recognition-via-pseudo-labels-using-multilingual-transformer",
      "paper_url": "https://aclanthology.org/2022.semeval-1.207.pdf",
      "summary": "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) · 2022-07-14",
      "authors": "Abdellah El Mekki, Abdelkader El Mahdaouy, Mohammed Akallouch, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)",
      "year": 2022,
      "content": "Building real-world complex Named Entity Recognition (NER) systems is a challenging task. This is due to the complexity and ambiguity of named entities that appear in various contexts such as short input sentences, emerging entities, and complex entities. Besides, real-world queries are mostly malformed, as they can be code-mixed or multilingual, among other scenarios. In this paper, we introduce our submitted system to the Multilingual Complex Named Entity Recognition (MultiCoNER) shared task. We approach the complex NER for multilingual and code-mixed queries, by relying on the contextualized representation provided by the multilingual Transformer XLM-RoBERTa. In addition to the CRF-based token classification layer, we incorporate a span classification loss to recognize named entities spans. Furthermore, we use a self-training mechanism to generate weakly-annotated data from a large unlabeled dataset. Our proposed system is ranked 6th and 8th in the multilingual and code-mixed MultiCoNER’s tracks respectively."
    },


    {
      "kind": "publication",
      "title": "AdaSL: An Unsupervised Domain Adaptation framework for Arabic multi-dialectal Sequence Labeling",
      "url": "https://elmekki.me/articles/#adasl-an-unsupervised-domain-adaptation-framework-for-arabic-multi-dialectal-sequence-labeling",
      "paper_url": "https://www.sciencedirect.com/science/article/pii/S0306457322000814",
      "summary": "Information Processing & Management · 2022-07-01",
      "authors": "Abdellah El Mekki, Abdelkader El Mahdaouy, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Information Processing & Management",
      "year": 2022,
      "content": "Dialectal Arabic (DA) refers to varieties of everyday spoken languages in the Arab world. These dialects differ according to the country and region of the speaker, and their textual content is constantly growing with the rise of social media networks and web blogs. Although research on Natural Language Processing (NLP) on standard Arabic, namely Modern Standard Arabic (MSA), has witnessed remarkable progress, research efforts on DA are rather limited. This is due to numerous challenges, such as the scarcity of labeled data as well as the nature and structure of DA. While some recent works have reached decent results on several DA sentence classification tasks, other complex tasks, such as sequence labeling, still suffer from weak performances when it comes to DA varieties with either a limited amount of labeled data or unlabeled data only. Besides, it has been shown that zero-shot transfer learning from models trained on MSA does not perform well on DA. In this paper, we introduce AdaSL, a new unsupervised domain adaptation framework for Arabic multi-dialectal sequence labeling, leveraging unlabeled DA data, labeled MSA data, and existing multilingual and Arabic Pre-trained Language Models (PLMs). The proposed framework relies on four key components: (1) domain adaptive fine-tuning of multilingual/MSA language models on unlabeled DA data, (2) sub-word embedding pooling, (3) iterative self-training on unlabeled DA data, and (4) iterative DA and MSA distribution alignment. We evaluate our framework on multi-dialectal Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The overall results show that the zero-shot transfer learning, using our proposed framework, boosts the performance of the multilingual PLMs by 40.87% in macro-F1 score for the NER task, while it boosts the accuracy by 6.95% for the POS tagging task. For the Arabic PLMs, our proposed framework increases performance by 16.18% macro-F1 for the NER task and 2.22% accuracy for the POS tagging task, and thus, achieving new state-of-the-art zero-shot transfer learning performance for Arabic multi-dialectal sequence labeling."
    },


    {
      "kind": "publication",
      "title": "Deep Multi-Task Models for Misogyny Identification and Categorization on Arabic Social Media",
      "url": "https://elmekki.me/articles/#deep-multi-task-models-for-misogyny-identification-and-categorization-on-arabic-social-media",
      "paper_url": "https://ceur-ws.org/Vol-3159/T5-5.pdf",
      "summary": "Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation (FIRE-WN 2021), Gandhinagar, India · 2021-12-13",
      "authors": "Abdellah El Mekki, Abdelkader El Mahdaouy, Mohammed Akallouch, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation (FIRE-WN 2021), Gandhinagar, India",
      "year": 2021,
      "content": "The prevalence of toxic content on social media platforms, such as hate speech, offensive language, and misogyny, presents serious challenges to our interconnected society. These challenging issues have attracted widespread attention in Natural Language Processing (NLP) community. In this paper, we present the submitted systems to the first Arabic Misogyny Identification shared task. We investigate three multi-task learning models as well as their single-task counterparts. In order to encode the input text, our models rely on the pre-trained MARBERT language model. The overall obtained results show that all our submitted models have achieved the best performances (top three ranked submissions) in both misogyny identification and categorization tasks."
    },


    {
      "kind": "publication",
      "title": "CS-UM6P at SemEval-2021 Task 1: A Deep Learning Model-based Pre-trained Transformer Encoder for Lexical Complexity",
      "url": "https://elmekki.me/articles/#cs-um6p-at-semeval-2021-task-1-a-deep-learning-model-based-pre-trained-transformer-encoder-for-lexical-complexity",
      "paper_url": "https://aclanthology.org/2021.semeval-1.73.pdf",
      "summary": "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) · 2021-08-05",
      "authors": "Nabil El Mamoun, Abdelkader El Mahdaouy, Abdellah El Mekki, Kabil Essefar, Ismail Berrada",
      "venue": "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
      "year": 2021,
      "content": "Lexical Complexity Prediction (LCP) involves assigning a difficulty score to a particular word or expression, in a text intended for a target audience. In this paper, we introduce a new deep learning-based system for this challenging task. The proposed system consists of a deep learning model, based on pre-trained transformer encoder, for word and Multi-Word Expression (MWE) complexity prediction. First, on top of the encoder’s contextualized word embedding, our model employs an attention layer on the input context and the complex word or MWE. Then, the attention output is concatenated with the pooled output of the encoder and passed to a regression module. We investigate both single-task and joint training on both Sub-Tasks data using multiple pre-trained transformer-based encoders. The obtained results are very promising and show the effectiveness of fine-tuning pre-trained transformers for LCP task."
    },


    {
      "kind": "publication",
      "title": "CS-UM6P at SemEval-2021 Task 7: Deep Multi-Task Learning Model for Detecting and Rating Humor and Offense",
      "url": "https://elmekki.me/articles/#cs-um6p-at-semeval-2021-task-7-deep-multi-task-learning-model-for-detecting-and-rating-humor-and-offense",
      "paper_url": "https://aclanthology.org/2021.semeval-1.159.pdf",
      "summary": "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) · 2021-08-05",
      "authors": "Kabil Essefar, Abdellah El Mekki, Abdelkader El Mahdaouy, Nabil El Mamoun, Ismail Berrada",
      "venue": "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
      "year": 2021,
      "content": "Humor detection has become a topic of interest for several research teams, especially those involved in socio-psychological studies, with the aim to detect the humor and the temper of a targeted population (e.g. a community, a city, a country, the employees of a given company). Most of the existing studies have formulated the humor detection problem as a binary classification task, whereas it revolves around learning the sense of humor by evaluating its different degrees. In this paper, we propose an end-to-end deep Multi-Task Learning (MTL) model to detect and rate humor and offense. It consists of a pre-trained transformer encoder and task-specific attention layers. The model is trained using MTL uncertainty loss weighting to adaptively combine all sub-tasks objective functions. Our MTL model tackles all sub-tasks of the SemEval-2021 Task-7 in one end-to-end deep learning system and shows very promising results."
    },


    {
      "kind": "publication",
      "title": "On the Role of Orthographic Variations in Building Multidialectal Arabic Word Embeddings",
      "url": "https://elmekki.me/articles/#on-the-role-of-orthographic-variations-in-building-multidialectal-arabic-word-embeddings",
      "paper_url": "https://assets.pubpub.org/s5qybplo/11621610534420.pdf",
      "summary": "Proceedings of the Canadian Conference on Artificial Intelligence · 2021-06-08",
      "authors": "Abdellah El Mekki, Abdelkader El Mahdaouy, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Proceedings of the Canadian Conference on Artificial Intelligence",
      "year": 2021,
      "content": "Dialectal Arabic (DA) is mostly used by over 400 million people across Arab countries as a communication channel on social media platforms, web forums, and daily life. Building Natural Language Processing systems for each DA variant is a challenging issue due to the lack of data and the noisy nature of the available corpora. In this paper, we propose a method to incorporate orthographic features into word embedding mapping methods, inducing a multidialectal embedding space. Our method can be used for both supervised and unsupervised cross-lingual embedding mapping approaches. The core idea of our method is to project the orthographic features into a shared vector space using Canonical Correlation Analysis (CCA). Then, it extends word embedding vectors using the resulting features and learns the multidialectal mapping. The overall obtained results of our proposed method show that our method enhances Bilingual Lexicon Induction of DA by 3.33% and 17.50% compared to state-of-the-art supervised and unsupervised cross-lingual alignment methods, respectively."
    },


    {
      "kind": "publication",
      "title": "Domain Adaptation for Arabic Cross-Domain and Cross-Dialect Sentiment Analysis from Contextualized Word Embedding",
      "url": "https://elmekki.me/articles/#domain-adaptation-for-arabic-cross-domain-and-cross-dialect-sentiment-analysis-from-contextualized-word-embedding",
      "paper_url": "https://aclanthology.org/2021.naacl-main.226.pdf",
      "summary": "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies · 2021-06-06",
      "authors": "Abdellah El Mekki, Abdelkader El Mahdaouy, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
      "year": 2021,
      "content": "Finetuning deep pre-trained language models has shown state-of-the-art performances on a wide range of Natural Language Processing (NLP) applications. Nevertheless, their generalization performance drops under domain shift. In the case of Arabic language, diglossia makes building and annotating corpora for each dialect and/or domain a more challenging task. Unsupervised Domain Adaptation tackles this issue by transferring the learned knowledge from labeled source domain data to unlabeled target domain data. In this paper, we propose a new unsupervised domain adaptation method for Arabic cross-domain and cross-dialect sentiment analysis from Contextualized Word Embedding. Several experiments are performed adopting the coarse-grained and the fine-grained taxonomies of Arabic dialects. The obtained results show that our method yields very promising results and outperforms several domain adaptation methods for most of the evaluated datasets. On average, our method increases the performance by an improvement rate of 20.8% over the zero-shot transfer learning from BERT."
    },


    {
      "kind": "publication",
      "title": "Deep Multi-Task Model for Sarcasm Detection and Sentiment Analysis in Arabic Language",
      "url": "https://elmekki.me/articles/#deep-multi-task-model-for-sarcasm-detection-and-sentiment-analysis-in-arabic-language",
      "paper_url": "https://aclanthology.org/2021.wanlp-1.42.pdf",
      "summary": "Proceedings of the Sixth Arabic Natural Language Processing Workshop · 2021-04-19",
      "authors": "Abdelkader El Mahdaouy, Abdellah El Mekki, Kabil Essefar, Nabil El Mamoun, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
      "year": 2021,
      "content": "The prominence of figurative language devices, such as sarcasm and irony, poses serious challenges for Arabic Sentiment Analysis (SA). While previous research works tackle SA and sarcasm detection separately, this paper introduces an end-to-end deep Multi-Task Learning (MTL) model, allowing knowledge interaction between the two tasks. Our MTL model’s architecture consists of a Bidirectional Encoder Representation from Transformers (BERT) model, a multi-task attention interaction module, and two task classifiers. The overall obtained results show that our proposed model outperforms its single-task and MTL counterparts on both sarcasm and sentiment detection subtasks."
    },


    {
      "kind": "publication",
      "title": "BERT-based multi-task model for country and province level MSA and dialectal Arabic identification",
      "url": "https://elmekki.me/articles/#bert-based-multi-task-model-for-country-and-province-level-msa-and-dialectal-arabic-identification",
      "paper_url": "https://aclanthology.org/2021.wanlp-1.31.pdf",
      "summary": "Proceedings of the Sixth Arabic Natural Language Processing Workshop · 2021-04-12",
      "authors": "Abdellah El Mekki, Abdelkader El Mahdaouy, Kabil Essefar, Nabil El Mamoun, Ismail Berrada, Ahmed Khoumsi",
      "venue": "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
      "year": 2021,
      "content": "Dialect and standard language identification are crucial tasks for many Arabic natural language processing applications. In this paper, we present our deep learning-based system, submitted to the second NADI shared task for country-level and province-level identification of Modern Standard Arabic (MSA) and Dialectal Arabic (DA). The system is based on an end-to-end deep Multi-Task Learning (MTL) model to tackle both country-level and province-level MSA/DA identification. The latter MTL model consists of a shared Bidirectional Encoder Representation Transformers (BERT) encoder, two task-specific attention layers, and two classifiers. Our key idea is to leverage both the task-discriminative and the inter-task shared features for country and province MSA/DA identification. The obtained results show that our MTL model outperforms single-task models on most subtasks."
    },


    {
      "kind": "publication",
      "title": "Weighted combination of BERT and N-GRAM features for Nuanced Arabic Dialect Identification",
      "url": "https://elmekki.me/articles/#weighted-combination-of-bert-and-n-gram-features-for-nuanced-arabic-dialect-identification",
      "paper_url": "https://aclanthology.org/2020.wanlp-1.27.pdf",
      "summary": "Proceedings of the Fifth Arabic Natural Language Processing Workshop · 2020-12-12",
      "authors": "Abdellah El Mekki, Ahmed Alami, Hamza Alami, Ahmed Khoumsi, Ismail Berrada",
      "venue": "Proceedings of the Fifth Arabic Natural Language Processing Workshop",
      "year": 2020,
      "content": "Around the Arab world, different Arabic dialects are spoken by more than 300M persons, and are increasingly popular in social media texts. However, Arabic dialects are considered to be low-resource languages, limiting the development of machine-learning based systems for these dialects. In this paper, we investigate the Arabic dialect identification task, from two perspectives: country-level dialect identification from 21 Arab countries, and province-level dialect identification from 100 provinces. We introduce an unified pipeline of state-of-the-art models, that can handle the two subtasks. Our experimental studies applied to the NADI shared task, show promising results both at the country-level (F1-score of 25.99%) and the province-level (F1-score of 6.39%), and thus allow us to be ranked 2nd for the country-level subtask, and 1st in the province-level subtask."
    },


    {
      "kind": "publication",
      "title": "Improving driver identification for the next-generation of in-vehicle software systems",
      "url": "https://elmekki.me/articles/#improving-driver-identification-for-the-next-generation-of-in-vehicle-software-systems",
      "paper_url": "https://ieeexplore.ieee.org/abstract/document/8746156",
      "summary": "IEEE Transactions on Vehicular Technology · 2019-06-23",
      "authors": "Abdellah El Mekki, Afaf Bouhoute, Ismail Berrada",
      "venue": "IEEE Transactions on Vehicular Technology",
      "year": 2019,
      "content": "This paper deals with driver identification and fingerprinting and its application for enhanced driver profiling and car security in connected cars. We introduce a new driver identification model based on collected data from smartphone sensors, and/or the OBD-II protocol, using convolutional neural networks, and recurrent neural networks (long short-term memory) RNN/LSTM. Unlike the existing works, we use a cross-validation technique that provides reproducible results when applied on unseen realistic data. We also studied the robustness of the model to sensor data anomalies. The obtained results show that our model accuracy remains acceptable even when the rate of the anomalies increases substantially. Finally, the proposed model was tested on different datasets and implemented in Automotive Grade Linux Framework, as a real-time anti-theft and the driver profiling system."
    },


    {
      "kind": "post",
      "title": "Alexandria: A Dialectal Arabic Machine Translation Dataset for Real-World Arabic MT",
      "url": "https://elmekki.me/blog/alexandria-dialectal-arabic-machine-translation-dataset/",
      "summary": "Alexandria is a human-translated Dialectal Arabic machine translation dataset for English-Arabic MT, low-resource machine translation, Arabic dialect benchmarking, and LLM evaluation across 13 Arab countries.",
      "year": 2026,
      "tags": ["Alexandria","Arabic machine translation","dialectal Arabic machine translation","Arabic MT dataset","Dialectal Arabic dataset","English Arabic translation dataset","Arabic dialect benchmark","low-resource machine translation","Arabic LLM evaluation","conversation-level machine translation","gender-aware machine translation","code-switching","culturally inclusive AI","Hugging Face dataset"],
      "content": " Dataset Release Arabic machine translation still has a large gap between formal written Arabic and the Arabic people use every day. Most Arabic MT systems are strongest on Modern Standard Arabic (MSA), but real conversations across the Arab world happen in local dialects, with city-level variation, code-switching, gendered forms, and domain-specific vocabulary. Alexandria was built to make that gap measurable, trainable, and harder to ignore.[1]   TL;DR: Alexandria is a large, community-driven, human-translated Dialectal Arabic machine translation dataset with 107K turns and 34,488 conversations across 13 Arab countries and 11 domains with high social impact. It is designed for English-Dialectal Arabic MT, Arabic dialect benchmarking, low-resource machine translation research, context-aware translation, and evaluation of Arabic-aware LLMs.[1][2]  This post is written for researchers, engineers, and dataset builders searching for an Arabic machine translation dataset, a Dialectal Arabic MT benchmark, an English Arabic translation dataset, or a low-resource machine translation resource for Arabic dialects.      107K turns   Parallel English and Dialectal Arabic turns grouped into multi-turn conversations.[1]       13 countries   Coverage across Egyptian, Levantine, Gulf, Nile, and Maghrebi dialect groups.[1]       11 domains   Healthcare, education, agriculture, legal and financial services, logistics, workplace communication, tourism, and more.[2]       City-level signal   Metadata goes beyond broad labels such as \"Levantine\" or \"Maghrebi\" and supports finer dialect analysis.[1]   What is Alexandria? Direct answer: Alexandria is a multi-domain, human-translated English-to-Dialectal Arabic and Dialectal Arabic-to-English machine translation dataset. It contains parallel, turn-aligned multi-turn conversations from 13 Arab countries, with metadata for country, city or sub-dialect, domain, persona role, speaker-addressee gender configuration, and split.[1][2]The dataset was created because Arabic MT has a structural evaluation problem. A system can look strong on formal MSA and still fail when users write or speak Egyptian Arabic, Moroccan Darija, Palestinian Arabic, Sudanese Arabic, Mauritanian Hassaniya, Omani Arabic, or Yemeni Arabic. This is not only a vocabulary issue. Dialects differ in morphology, syntax, politeness, register, borrowed words, and gender marking.Earlier dialectal Arabic MT resources helped the field, but many were limited by sentence-level structure, narrow domains, coarse dialect labels, or short utterances. Alexandria expands the design in four directions at once: conversation-level context, broader domain coverage, community translation and revision, and richer metadata for dialect and gender-sensitive analysis.[1]  The main Alexandria figure shows why the dataset is organized around communities, cities, domains, and dialogue turns. The map highlights participant geography and example conversations across dialects, domains, and speaker-addressee gender configurations.[1]What is inside the dataset?Alexandria contains 34,488 multi-turn conversations and approximately 107K total turns. 
Each example belongs to a conversation rather than an isolated sentence, which makes it useful for context-aware machine translation and dialogue-oriented LLM evaluation.[1]The 13 country-level dialect groups are: Egypt Jordan Lebanon Libya Mauritania Morocco Oman Palestine Saudi Arabia Sudan Syria Tunisia YemenThe 11 domains are: Agriculture and farming Commerce and transactions Construction and real estate Education and academia Energy and resources Everyday and social communication Healthcare and medical communication Legal and financial communication Logistics and transportation Professional and workplace communication Tourism and hospitalityThat domain mix matters. Many dialectal Arabic datasets focus on travel phrases, web text, or general-purpose short sentences. Alexandria targets situations where translation quality has real consequences: medical instructions, financial services, academic communication, logistics, workplace coordination, agriculture, and public-facing services. Key point: Alexandria is not just an Arabic dialect list. It is a metadata-rich benchmark for testing whether a model can preserve meaning, produce authentic local dialect, respect gendered forms, and remain robust across domains and cities.Practical use cases1. English-to-Dialectal Arabic machine translationThe most direct use case is training or evaluating systems that translate English into regional Arabic dialects. This is the harder direction in the experiments because the model must generate dialect-authentic Arabic, not merely understand it.[1]This is useful for MT systems, multilingual assistants, customer support, healthcare communication, education platforms, localization workflows, and public-service chatbots that need to speak to users in familiar local Arabic.2. Dialectal Arabic-to-English translationAlexandria also supports dialect-to-English translation. In the experiments, models generally perform better in this direction than in English-to-dialect translation, which suggests that current systems are often better at understanding dialectal input than producing authentic dialectal output.[1]This direction is valuable for search, moderation, multilingual analytics, public-interest monitoring, and accessibility tools that need to interpret dialectal Arabic content.3. Arabic LLM evaluationAlexandria is a benchmark for Arabic-aware LLMs. We evaluated 24 Arabic-capable models under turn-level, context-level, and conversation-level translation settings. This makes the dataset useful for comparing closed and open models, measuring translation robustness, and checking whether improvements hold across dialects rather than only on high-resource varieties.[1]4. Context-aware and conversation-level MTMany translation benchmarks are sentence-level. Alexandria is organized as multi-turn conversations, so it can test whether a system uses previous turns to translate the current turn more accurately. This matters for pronouns, speaker roles, tone, deixis, and other conversational dependencies.5. City-level and sub-dialect robustnessArabic dialects are not clean country-level blocks. A Palestinian rural variety can differ from an urban one. Omani sub-dialects differ across cities and regions. Moroccan, Tunisian, Mauritanian, Egyptian, Saudi, and Yemeni varieties all carry internal diversity. Alexandria’s city-anchored metadata allows researchers to ask whether a model is robust within a country, not only across countries.[1]6. Gender-aware machine translationArabic has many gender-marked forms. 
Alexandria includes speaker-addressee gender configurations, which makes it useful for evaluating whether translations preserve gender agreement and address forms. This is especially relevant for dialogue systems, because the gender of the speaker and the addressee can affect pronouns, verbs, adjectives, and social register.[1]7. Code-switching and register researchMany Arabic-speaking communities naturally mix Arabic with English, French, or other locally common languages, especially in technical, workplace, healthcare, and education settings. Alexandria explicitly allows conventional borrowed terms when they are natural in the target community. That makes it useful for studying code-switching, register control, and the boundary between dialect, MSA, and borrowed terminology.[1]How Alexandria was builtAlexandria was built through a six-month community-driven process involving 55 contributors from 13 Arab countries. This matters because the dataset is not just translated “into Arabic.” It is translated into local varieties by people tied to the target communities, with country leads coordinating local examples, onboarding, guideline interpretation, and quality control.[1]The community design was central to the dataset. Contributors represented city-anchored dialectal varieties, and the project used country teams rather than a single centralized annotation pool. That structure allowed the dataset to capture choices a generic Arabic translation process would often flatten: whether a phrase sounds Egyptian or Sudanese, whether a Moroccan speaker would naturally code-switch into French, whether an Omani term fits one locality but not another, and whether the gendered form matches the speaker and addressee.Project coordination was also part of the data quality process. The team used weekly project checks, a shared Slack workspace, bi-weekly reminders, and country-lead meetings every few weeks to surface recurring issues and refine the workflow.[1]The pipeline had three main phases.  Our data creation workflow has three phases: English source generation, human translation into Dialectal Arabic, and peer revision with correction and validation. The figure also shows the human translator audit step that filters irrelevant source conversations before translation.[1]   1. Source creation  Gemini-2.5 Pro generated English multi-turn conversations conditioned on country, domain, topic, persona, and gender configuration.[1]    2. Human translation  Native speakers translated the English conversations into their local Arabic dialects while preserving meaning, tone, persona, register, and gender direction.[1]    3. Peer revision  A second contributor from the same country reviewed translations for dialect authenticity, gender alignment, register, faithfulness, punctuation, and code-switching consistency.[1] Source generation and screeningThe English source conversations were generated in a controlled way before translation. For each country-domain pair, the pipeline generated 55 subdomains and 10 topics per subdomain, giving 550 topic specifications per country-domain. These topics were then used to create 2-4 turn English conversations conditioned on the target country, domain, persona, and gender configuration.[1]This source step was not treated as automatically correct. Contributors audited source conversations before translating them and skipped examples that did not fit local context, contained problematic cultural assumptions, or introduced factual issues. 
On average, 2.94% of source sentences were skipped as irrelevant. This is a useful reminder for multilingual dataset construction: LLM-generated source text can help scale coverage, but community review is what prevents source artifacts from becoming target-side noise.[1]Human translation and local dialect decisionsThe translation guidelines asked contributors to preserve semantic faithfulness while using natural local dialect rather than forcing the output into MSA. Translators were allowed to use Arabic script without enforcing a single standardized spelling system, which is important for dialectal Arabic because many varieties do not have one universally accepted orthography.The guidelines also allowed code-switching when it was conventional in the target community. That detail matters for domains such as healthcare, education, commerce, logistics, and workplace communication, where English, French, or other borrowed terms may be the most natural local choice.Peer revision and final compilationThe revision phase was human-only. Each translated conversation was reviewed by another participant from the same country, who checked dialect authenticity, gender alignment, register, faithfulness, punctuation, and code-switching consistency. Reviewers marked each item as accepted, minor edit, or major issue. If a reviewer came from a different regional variety inside the same country, the guidelines restricted them to mechanical edits rather than rewriting another local variety into their own.[1]The revision results are important. In the human-only revision phase, 68.4% of turns remained unchanged, 30.6% received minor edits, and 1% were flagged for major issues. The final revised data received high average quality scores for dialectal authenticity, register appropriateness, and semantic faithfulness.[1]Splits built for evaluationAlexandria is released with training, public development, public test, and private test splits. The public development and test sets were stratified across dialect groups, gender configurations, and translators. This design makes the dataset useful not only for training, but also for fairer evaluation and future shared tasks or leaderboards.[1] Data creation takeaway: for Dialectal Arabic machine translation, quality does not come from translation volume alone. It comes from community anchoring, local review, gender-aware metadata, and a revision process that respects within-country dialect diversity.What the experiments show Result summary: current Arabic-aware LLMs are much better at preserving meaning than producing dialect-authentic Arabic. Dialect-to-English translation is consistently easier than English-to-dialect translation. Maghrebi varieties, especially Mauritanian Arabic, remain among the hardest settings. Code-switching often lowers automatic translation scores. Metadata helps some models, but not all models consistently use it well.[1]The evaluation is useful because it avoids treating “Arabic translation” as one flat task. We evaluated 24 Arabic-capable LLMs across turn-level, context-level, and conversation-level settings. The main discussion focuses on the context-level setting because it best matches realistic dialogue MT: the model translates the current turn while seeing previous turns, but not future turns. Conversation-level translation gives higher raw scores, but it is a more permissive offline setup.[1]We chose the automatic metrics carefully. 
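As an aside for anyone reproducing this kind of scoring, the sketch below shows one way to compute spBLEU (BLEU over FLORES SentencePiece tokens) and chrF++ with the sacrebleu library; the 'flores200' tokenizer key depends on the sacrebleu version, and the example strings are illustrative assumptions rather than the paper's exact configuration.
import sacrebleu

hypotheses = ['model translation one', 'model translation two']
references = [['reference translation one', 'reference translation two']]
# spBLEU: BLEU computed over FLORES SentencePiece tokens (needs a sacrebleu
# release that ships the 'flores200' tokenizer); chrF++: chrF with word bigrams.
spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize='flores200')
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
print(round(spbleu.score, 2), round(chrfpp.score, 2))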
We reported spBLEU and chrF++, and avoided COMET because model-based MT metrics are less reliable for dialectal Arabic. Since spBLEU and chrF++ are highly correlated in our experiments, we used spBLEU for the main automatic analysis and reserved chrF++ for the appendix.[1]Dialect-to-English is easier than English-to-dialectAcross the evaluated models, Alexandria shows a strong directional asymmetry. Models perform better when translating dialectal Arabic into English than when translating English into dialectal Arabic. This is a central finding for Arabic MT and LLM evaluation: understanding a dialect is not the same as generating it naturally.[1]For product builders, this means a model that can answer questions about dialectal Arabic input may still sound unnatural when asked to generate local Arabic. For researchers, it means evaluation should separate comprehension from dialect-authentic generation.This directional asymmetry is one of the most actionable findings. If a system is meant to serve Arabic-speaking users, it is not enough to report dialect-to-English scores. English-to-dialect generation should be tested separately because that is where models are more likely to drift into MSA, generic Arabic, or the wrong regional form.Maghrebi dialects remain especially difficultPerformance varies heavily by dialect group. The models tend to do better on Egyptian and Levantine varieties, likely because these varieties are better represented in training data. Maghrebi dialects are harder, and Mauritanian Arabic is consistently among the most challenging in the benchmark.[1]This finding matters for low-resource machine translation because the hardest dialects are often the ones most in need of better resources. A benchmark that only averages across all Arabic varieties can hide these gaps.We also analyze lexical overlap with MSA and find that translation quality tends to be higher when dialectal references are lexically closer to MSA. This helps explain why varieties with stronger distance from MSA, including Maghrebi varieties, are more difficult for current models. The important evaluation lesson is that “Arabic” performance can be inflated by dialects that are closer to MSA while masking failure on more distant varieties.[1]City-level evaluation reveals stable sub-dialect difficultyAlexandria also evaluates selected sub-dialects within countries. We find that relative sub-dialect rankings are broadly consistent across model families. In other words, some sub-dialects are systematically harder across models, not just unlucky for one model.[1]That is exactly why city-level metadata matters. Without it, a benchmark may say “Palestinian Arabic” or “Omani Arabic” while missing important variation inside the label.Domain rankings are stable across modelsAlexandria covers 11 domains, and our experiments show that model ranking is fairly stable across those domains. The strongest models remain strong across topics, and smaller open-weight models remain lower-tier in this setup. We find limited evidence that one model is uniquely specialized for a particular domain under the tested prompting conditions.[1]This is useful for benchmarking because it suggests Alexandria can expose general Arabic dialect MT strength, not only topic-specific tricks.At the same time, domain coverage remains important. Stable model rankings do not mean domains are interchangeable. 
Technical domains still expose lexical gaps, MSA leakage, and code-switching pressure that would be invisible in a benchmark made only of everyday or tourism phrases.LLMs beat NLLB in the tested dialects, but metadata is mixedWe compare a subset of LLMs against NLLB-200-3.3B on the nine Alexandria dialects supported by NLLB. The evaluated LLMs outperform NLLB across those supported dialects, even without metadata in the prompt.[1]The metadata ablation is more nuanced. Full metadata helps Command-A in some cases, but the gains are not universal. For some models and dialects, adding all metadata has little effect or even hurts. This suggests that models differ in how well they use structured context such as participant gender, country, domain, and role information.Code-switching hurts many modelsAlexandria makes code-switching measurable. We compare translation quality for sentences with and without Latin-script tokens and find that code-mixing generally degrades performance for many dialects, including Egyptian, Jordanian, Lebanese, Moroccan, Palestinian, and Tunisian Arabic.[1]This is a practical result. Real Arabic users often code-switch, especially in technical, business, medical, and educational settings. A model that only handles clean Arabic script is not enough for production-grade dialectal Arabic MT.This is also where Alexandria becomes useful beyond translation scores. Because code-switching is measurable by dialect and domain, the dataset can support targeted evaluation of whether a system handles French-influenced Moroccan or Tunisian professional language, English-heavy workplace vocabulary, or technical terms that speakers would not naturally force into colloquial Arabic.Reasoning does not automatically improve translationThe experiments also compare reasoning and non-reasoning configurations for selected models. Reasoning generally does not help and often hurts translation performance, with the main exception being Gemini-3-Flash, where reasoning improved average spBLEU by about 2.0 points for English-to-dialect and about 0.4 points for dialect-to-English.[1]The broader lesson is that “more thinking” is not automatically better for translation. Translation needs faithful, fluent, locally appropriate output, and reasoning traces may not target that objective.Evaluation insights for Arabic MT buildersThe practical evaluation lessons from Alexandria are: Evaluate both directions: dialect-to-English and English-to-dialect answer different questions. Report dialect-level results, not only macro averages over Arabic. Separate semantic adequacy from dialect authenticity because a translation can be meaningful but still sound non-native. Include code-switching and technical domains because they are part of real Arabic use. Test whether metadata helps your specific model instead of assuming that more metadata always improves translation. Use human evaluation when dialectness matters, since automatic metrics do not fully capture register, locality, or naturalness.What the human evaluation addsAutomatic metrics are useful, but dialectal Arabic translation needs human judgment. Alexandria’s human evaluation separates three dimensions: semantic adequacy, gender accuracy, and dialectness or fluency.[1]The results show a clear pattern: Gender accuracy is usually high, often above 90%, when gender constraints are explicit. Semantic adequacy is generally above 3 out of 5 across dialects. 
Dialectness and fluency are lower, sometimes close to 2 out of 5 for difficult model-country pairs.This is one of the most important conclusions from Alexandria. Current models often know what the sentence means, but they do not always know how a native speaker in the target dialect would say it. They preserve semantics better than dialect authenticity.Among the human-evaluated systems, Gemini-3-Flash and Command-A define the strongest adequacy-dialectness trade-off, while some large models still produce weaker dialectness despite preserving meaning.[1]How to load AlexandriaThe dataset is available on Hugging Face at UBC-NLP/alexandria. You need to review and accept the dataset access conditions on Hugging Face before loading the files.[2]
from datasets import load_dataset

repo_id = \"UBC-NLP/alexandria\"
# Example: Morocco subset
train_data = load_dataset(repo_id, name=\"MA\", split=\"train\")
test_data = load_dataset(repo_id, name=\"MA\", split=\"test\")
first_conv = train_data[0]
eng_turn = first_conv[\"english_conversation\"][0]
dialect_turn = first_conv[\"dialectal_conversation\"][0]
print(f\"English: {eng_turn['text']}\")
print(f\"Dialect: {dialect_turn['text']}\")
Responsible use and limitationsAlexandria is intended for research, evaluation, and model development around Dialectal Arabic MT and Arabic-aware LLMs. Before training or redistributing outputs, users should check the Hugging Face access conditions and the CC BY-NC-ND 4.0 license.[2]FAQWhat is Alexandria?Alexandria is a multi-domain Dialectal Arabic machine translation dataset and benchmark. It contains English and Dialectal Arabic multi-turn conversations translated and revised by native speakers from 13 Arab countries.[1]Is Alexandria an Arabic machine translation dataset or an LLM benchmark?It is both. Alexandria can be used as a training resource for English-Dialectal Arabic MT and as a benchmark for Arabic-capable LLMs under dialect, domain, context, and gender variation.[1][2]Why is Dialectal Arabic machine translation hard?Dialectal Arabic is highly variable across countries, cities, social contexts, and domains. It also mixes with MSA and other languages. A model must preserve meaning while producing locally natural vocabulary, morphology, gender marking, and register.Which dialects are included?Alexandria covers Egypt, Jordan, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Saudi Arabia, Sudan, Syria, Tunisia, and Yemen, with finer sub-dialect or city-level metadata where available.[1]What makes Alexandria useful for low-resource machine translation?Many Arabic dialects have limited high-quality parallel data. Alexandria provides human-translated, peer-revised, multi-domain parallel conversations for dialects that are often underrepresented in MT benchmarks and training data.What did the experiments conclude?The main conclusion is that current Arabic-aware LLMs are better at meaning preservation than dialect-authentic generation. Dialect-to-English is easier than English-to-dialect, Maghrebi varieties are among the hardest, code-switching often lowers quality, and metadata helps only when the model can use it effectively.[1]Can Alexandria be used for gender-aware MT?Yes. Alexandria includes speaker-addressee gender configurations, making it useful for studying whether translation systems preserve gendered Arabic forms in dialogue.[1]SummaryAlexandria shows that the next step for Arabic machine translation is not simply more MSA data or broader language labels.
The field needs benchmarks that reflect how Arabic is actually used: local dialects, city-level variation, multi-turn context, gendered speech, code-switching, and high-impact domains.The strongest result is also the simplest: models often understand dialectal Arabic better than they can generate it. Alexandria gives researchers and builders a way to measure that gap directly, improve Arabic MT systems, and build more culturally and linguistically inclusive language technology.References El Mekki, A., Magdy, S. M., Atou, H., AbuHweidi, R., Qawasmeh, B., Nacar, O., and others. (2026). Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs. ACL 2026 Main. arXiv UBC-NLP. (2026). Alexandria Dataset Card. Hugging Face Dataset. https://huggingface.co/datasets/UBC-NLP/alexandria UBC-NLP. Alexandria GitHub Repository. https://github.com/UBC-NLP/Alexandria Alexandria Project Website. https://alexandria.dlnlp.ai/Links arXiv article: https://arxiv.org/abs/2601.13099 Dataset: https://huggingface.co/datasets/UBC-NLP/alexandria GitHub repository: https://github.com/UBC-NLP/Alexandria Project website: https://alexandria.dlnlp.ai/ Disclaimer (May 03, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    },

    {
      "kind": "post",
      "title": "WVS2Persona: World Values Survey Wave 7 Personas for Culture-Aware AI",
      "url": "https://elmekki.me/blog/wvs2persona-world-values-survey-personas/",
      "summary": "WVS2Persona is a Hugging Face dataset that turns World Values Survey Wave 7 respondent records into textual personas for culture-aware AI, persona-based prompting, and cultural alignment research.",
      "year": 2026,
      "tags": ["WVS2Persona","World Values Survey","persona dataset","culture-aware AI","cultural alignment","LLM personas","NileChat","Hugging Face dataset","social values dataset","persona-based prompting"],
      "content": " Dataset Release In recent research, we built NileChat, a culturally aligned LLM for Egyptian and Moroccan Arabic communities. A key part of that work was feeding local personas to the LLM so controlled synthetic data generation could reflect community values, not only surface-level language patterns. For NileChat, those personas were parsed from World Values Survey records for Morocco and Egypt, because these were the two use cases in the paper. With WVS2Persona, I am releasing the same kind of persona resource for all countries covered in this dataset so other researchers and builders can reuse the idea beyond the original NileChat setting.[1][2]   TL;DR: WVS2Persona is Hugging Face dataset of 97,220 respondent-level persona descriptions derived from World Values Survey Wave 7 records, organized into 66 country subsets.[2]  This dataset is relevant to culture-aware AI, LLM cultural alignment, persona-based prompting, social values modeling, and evaluation of value-sensitive generation.      66 country subsets   Country-level configurations such as Morocco, Egypt, United_States, India, and Brazil.[2]       97,220 personas   One textual persona per WVS Wave 7 respondent record.[2]       NileChat method   The construction follows the WVS-to-persona direction used in NileChat for culturally grounded generation.[1]     WVS2Persona turns World Values Survey Wave 7 respondent records into textual personas that can be loaded by country subset and used in culture-aware AI workflows.From NileChat to WVS2PersonaThe motivation comes directly from NileChat. In NileChat paper, we proposed a methodology for adapting LLMs to local communities by considering three axes together: language, cultural heritage, and cultural values. The values component is where WVS-derived personas matter. They make it possible to condition data generation on concrete social profiles instead of vague labels such as “local speaker” or “person from country X.”[1]In NileChat, this idea was applied to Egyptian and Moroccan Arabic. WVS2Persona expands the persona resource itself across the countries available in the release, making it easier for other researchers and builders to reuse the same type of value-grounded conditioning outside the original NileChat experiments.  The persona creation figure from NileChat: survey responses are extracted, decoded into readable text, and formatted into persona descriptions that can be used for prompting and culturally grounded generation.[1]There is one important release detail. The current WVS2Persona dataset provides full deterministic persona descriptions generated from decoded core WVS questionnaire variables. The concise, summarized persona style used for compact prompting in NileChat is planned as a future extension.[2]What is inside the dataset?WVS2Persona is organized by country as Hugging Face subsets. Each subset has one train split and two columns: persona_id: a stable identifier for the persona record. persona: a full English persona description grounded in the respondent’s decoded WVS Wave 7 core-questionnaire answers.The personas are respondent-level renderings. They are not cluster centroids, not invented archetypes, and not synthetic summaries of a demographic group. That distinction matters because a country does not have one value profile. 
A useful culture-aware dataset should preserve within-country variation across age, gender, education, religion, political attitudes, trust, well-being, economic values, family norms, security concerns, and other survey dimensions.The released personas use only the WVS Wave 7 core questionnaire sections, including social values, happiness and well-being, trust and organizational membership, economic values, corruption, migration, security, science and technology, religious values, ethical values, political participation, political culture, and demographics.[2][3] Key point: WVS2Persona is a bridge between social survey data and LLM workflows. It turns structured survey responses into text that can be retrieved, prompted, summarized, filtered, and inspected by the same tools already used for language model experimentation.Why this kind of dataset mattersMost language model datasets are good at representing what people write online. They are much weaker at representing how different communities answer questions about family, trust, religion, democracy, security, migration, gender norms, work, technology, and moral judgments. Those topics are central to cultural alignment, but they are not reliably captured by web text alone.That gap matters for three reasons.First, culture-aware AI needs internal variation. A single country label is too coarse. WVS2Persona keeps respondent-level diversity visible, which helps avoid collapsing a society into one stereotype.Second, persona-grounded generation is easier to audit than free-form prompting. A model can be conditioned on a specific textual profile, and the researcher can inspect the profile that shaped the output.Third, evaluation can move beyond generic benchmarks. If a model claims to represent a community, researchers can test whether its answers, explanations, or generated examples reflect the range of values observed in survey-grounded profiles rather than only dominant internet priors.How to load WVS2PersonaThe dataset is available on Hugging Face at 3ebdola/wvs2persona.[2] Each country is loaded as a subset/config.
from datasets import load_dataset

repo_id = \"3ebdola/wvs2persona\"
ds_morocco = load_dataset(repo_id, \"Morocco\", split=\"train\")
print(ds_morocco)
print(ds_morocco.column_names)
print(ds_morocco[0][\"persona_id\"])
print(ds_morocco[0][\"persona\"][:500])
Country names with spaces use underscores in the subset name:
from datasets import load_dataset

repo_id = \"3ebdola/wvs2persona\"
ds_us = load_dataset(repo_id, \"United_States\", split=\"train\")
ds_gb = load_dataset(repo_id, \"Great_Britain\", split=\"train\")
ds_south_korea = load_dataset(repo_id, \"South_Korea\", split=\"train\")
Practical use cases1. Persona-based promptingThe most direct use is to condition a model on a persona and ask it to generate an answer, conversation, story, or opinionated response from that perspective.
persona = ds_morocco[0][\"persona\"]
prompt = f\"\"\"You are writing a short first-person answer grounded in this persona.
Do not repeat the persona verbatim. Reflect the values and background implicitly.

Persona:
{persona}

Question:
What makes a community trustworthy?\"\"\"
This is useful when building culturally varied synthetic data, simulated user populations, or controlled evaluation prompts. The important constraint is that the persona should guide generation without being treated as a real person’s complete biography.2. Retrieval for culturally grounded generationBecause each persona is plain text, it can be embedded and retrieved.
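A minimal sketch of that retrieval idea, reusing the ds_morocco subset loaded above with a simple TF-IDF retriever; the query string is an illustrative assumption, and any dense or sparse retriever could be substituted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

personas = ds_morocco['persona']  # all persona texts in the subset
vectorizer = TfidfVectorizer(stop_words='english')
persona_matrix = vectorizer.fit_transform(personas)

query = 'trust in institutions and political participation'
scores = linear_kernel(vectorizer.transform([query]), persona_matrix).ravel()
top_idx = scores.argsort()[::-1][:5]  # five most topic-relevant personas
retrieved = [personas[i] for i in top_idx]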
A researcher can retrieve personas relevant to a topic such as trust in institutions, migration, work values, political participation, or family norms, then use those profiles as conditioning context.This is often cleaner than sampling random country labels. Retrieval lets the data pipeline select profiles that actually mention the theme under study.3. Value-sensitive model evaluationWVS2Persona can help create evaluation sets for questions where cultural and social values shape the answer. For example, researchers can sample personas from a country subset, ask a model to answer a question under each profile, and then compare answer patterns across countries, demographic groups, or value dimensions.This does not replace statistical analysis of the original WVS data. It gives LLM researchers a text-native layer for testing how models behave when values are explicit in the prompt.4. Summarization and compression researchThe full persona descriptions are long. That makes the dataset useful for studying controlled summarization: how to compress a survey-grounded profile into a shorter prompt while preserving the value signals that matter for downstream generation.That direction connects back to NileChat, where compact personas were used inside controlled synthetic data generation prompts.[1]Good practicesUse WVS2Persona as a research and prototyping resource for culture-aware AI, not as a source of stereotypes. A persona is a textual rendering of one respondent’s survey answers. It should not be generalized to an entire country, religion, gender, class, or language community.For most experiments, I recommend reporting: which country subsets were used, how personas were sampled, whether full personas or compressed summaries were used, what prompt template conditioned the model, how sensitive attributes were handled, and whether outputs were evaluated at the individual-profile level or aggregated level.This documentation is not busywork. It is what makes persona-based cultural alignment experiments reproducible and less likely to turn into anecdotal claims.FAQWhat is WVS2Persona?WVS2Persona is a dataset that converts World Values Survey Wave 7 respondent records into English textual personas. Each row contains a stable persona_id and a full persona description grounded in decoded survey answers.[2]How many personas are included?The current release contains 97,220 personas across 66 country subsets.[2]Is WVS2Persona synthetic data?It is not synthetic in the sense of inventing people from scratch. Each persona is a deterministic natural-language rendering of one WVS respondent record. However, the text is generated from decoded survey responses, so it is not a verbatim respondent statement.[2]How is it connected to NileChat?The dataset follows the WVS-to-persona construction approach introduced in NileChat. NileChat used WVS-derived personas as part of controlled synthetic data generation for culturally aware LLM adaptation.[1]What can I build with it?Common uses include persona-based prompting, culture-aware synthetic data generation, retrieval over social profiles, value-sensitive model evaluation, and summarization of long persona descriptions into compact prompt-ready profiles.What should I avoid?Avoid treating a persona as a full biography or as representative of an entire group. 
Avoid using individual personas to make claims about countries or communities without aggregation, sampling documentation, and careful interpretation.SummaryWVS2Persona is useful because it makes cultural values operational for LLM workflows. It does not reduce culture to a country tag. Instead, it gives researchers a large set of respondent-level textual profiles that can be sampled, retrieved, summarized, and used as conditioning context.The broader lesson from NileChat still applies: culturally aware AI needs language, local knowledge, and values to be modeled deliberately. WVS2Persona focuses on the values part of that pipeline and makes it easier to reuse across communities.References El Mekki, A., Atou, H., Nacar, O., Shehata, S., and Abdul-Mageed, M. (2025). NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities. EMNLP 2025. ACL Anthology El Mekki, A. (2026). WVS2Persona: Parsed World Values Survey (WVS) Wave 7 records into textual personas. Hugging Face Dataset. Dataset card World Values Survey Association. World Values Survey Wave 7 documentation and questionnaire resources. DocumentationLinks Dataset: https://huggingface.co/datasets/3ebdola/wvs2persona Underlying WVS data-use terms: https://www.worldvaluessurvey.org/AJDownloadLicense.jsp NileChat paper: https://aclanthology.org/2025.emnlp-main.556/ NileChat blog post Disclaimer (April 25, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    },

    {
      "kind": "post",
      "title": "Unsupervised Machine Translation in the Age of LLMs",
      "url": "https://elmekki.me/blog/unsupervised-machine-translation-age-of-llms/",
      "summary": "Unsupervised machine translation still matters in the LLM era. This post explains how self-mined in-context examples can improve translation for low-resource languages without parallel data.",
      "year": 2026,
      "tags": ["unsupervised machine translation","in-context learning","LLM translation","low-resource languages","multilingual LLMs","few-shot prompting","machine translation"],
      "content": " Behind the Paper For years, machine translation improved by scaling data. But scaling parallel data is not an option for most languages. In the LLM era, that bottleneck did not disappear. Few-shot prompting works best when good translation examples already exist, and for many language pairs they do not. In our recent paper, we asked a harder question: can we mine those examples automatically and still translate well?[4]   Paper summary: this post explains why unsupervised machine translation still matters, how in-context learning changes the problem, and how self-mined examples can improve multilingual LLM translation for low-resource languages without large parallel corpora.  This article is especially relevant to unsupervised machine translation, low-resource translation, in-context learning for machine translation, and multilingual LLM adaptation.      What is unsupervised MT?   Translation without human-aligned sentence pairs, bootstrapped from monolingual text and weak cross-lingual signals.[2][4]       What changed with LLMs?   LLMs made few-shot translation practical, but example quality and example selection still matter a lot.[1][3]       What our paper adds   We self-mine word pairs, turn them into weak sentence examples, and rank the best demonstrations with similarity filtering plus BM25.[4]       Why it matters   If translation examples can be mined from small amounts of unlabeled text, more languages can benefit from modern translation systems.[2][4]   What is unsupervised machine translation? Direct answer: Unsupervised machine translation (UMT) is translation between languages without human-labeled parallel sentences. Classical UMT starts from monolingual corpora, a weak cross-lingual initialization, and iterative back-translation that gradually improves translation quality.[2]This line of work matters because conventional machine translation depends heavily on large parallel corpora, while many language pairs have little or no such data. One of the key insights in early UMT was that monolingual text is much easier to obtain than aligned bitext, so the right question is not only “How do we get more labels?” but also “How far can we go without them?”[2]A useful mental model comes from the 2018 UMT literature: first align the problem just enough to get off the ground, then rely on language modeling, denoising, and back-translation to iteratively refine the system. That recipe turned an ill-posed task into something trainable even before LLM prompting entered the picture.[2]  A compact visual summary of the classical UMT recipe: weak initialization, denoising or language modeling, and iterative back-translation. Source: Lample et al. (2018).UMT in the era of LLMs and in-context learningLarge language models changed the interface of translation. Brown et al. framed few-shot learning as giving the model a small number of task demonstrations directly in the prompt, with no gradient updates at inference time. In that setup, a few-shot translation prompt can be as simple as repeated source sentence and target translation pairs followed by the new source sentence to translate.[1]That is powerful, but it does not solve the hardest part for low-resource translation: where do those demonstrations come from? Brown et al. explicitly note that few-shot learning still requires a small amount of task-specific data. 
For machine translation, later work showed that the number and quality of prompt examples matter, that performance varies with prompt design, and that directly using monolingual examples can hurt translation while pseudo-parallel examples help.[1][3] The bottleneck moved, but it did not disappear: LLMs can translate with prompts, yet low-resource settings still suffer from a missing-example problem. The challenge is not only prompting the model, but also constructing the prompt when no parallel data exist.[3][4]  Prior prompting work for machine translation found that monolingual-only demonstrations are usually harmful, while pseudo-parallel examples created by back-translation or forward translation improve prompting quality. Source: Zhang et al. (2023).Our paper: self-mining in-context examples for unsupervised MT One-sentence summary: we treat the missing-demonstration problem as an unsupervised mining problem: first mine reliable word translations, then use them to create and filter sentence-level examples that an LLM can use for translation in context.[4]In our Findings of NAACL 2025 paper, we assume access to a multilingual LLM, vocabularies in the source and target languages, and a small amount of unlabeled text in each language. Importantly, the learning phase uses no human-labeled parallel data, and the paper emphasizes a data-scarce regime with fewer than 1,000 unlabeled sentences per language in the studied setup.[4]   1. Mine word pairs  Use zero-shot prompting to translate frequent source words, reverse the direction, keep consistent back-translations, and rank the remaining pairs by cross-lingual similarity to retain high-quality lexical anchors.[4]    2. Bootstrap with word-level prompts  Feed the best mined word pairs back into the model as in-context examples, refining the word inventory before moving to sentence-level translation.[4]    3. Create weak sentence translations  Translate sentences word by word to obtain rough but semantically useful sentence pairs. They are noisy, but they preserve enough meaning to seed the next stage.[4]    4. Select the right demonstrations  Back-translate to obtain more natural pairs, then choose input-specific examples with a two-step filter: similarity threshold first, BM25 ranking second. The final method is TopK+BM25.[4]  Key idea: instead of assuming demonstrations already exist, the system manufactures them from unlabeled text and then ranks them for each test input. Example selection becomes part of the learning pipeline, not an afterthought.[4]ResultsWe evaluated the approach with Llama-3 8B and Bloom 7B on 288 translation directions from FLORES-200. 
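As a concrete illustration of the selection step described above (similarity threshold first, BM25 ranking second), here is a rough sketch; the rank_bm25 package, the token-overlap similarity, and the threshold value are illustrative assumptions, not the paper's implementation.
from rank_bm25 import BM25Okapi

def token_overlap(a, b):
    # Crude lexical similarity, standing in for the paper's unsupervised similarity filter.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def select_demonstrations(query, pool, sim_threshold=0.1, k=8):
    # pool: mined (source, target) pseudo-parallel pairs; query: sentence to translate.
    candidates = [(s, t) for s, t in pool if token_overlap(query, s) >= sim_threshold]
    if not candidates:
        candidates = pool
    bm25 = BM25Okapi([s.lower().split() for s, _ in candidates])
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [pair for _, pair in ranked[:k]]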
The headline result is that the unsupervised method can be comparable to, and sometimes better than, translation with regular in-context examples drawn from human-annotated data, while also outperforming prior UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]   288 directions  Evaluation scale across FLORES-200 translation directions with two multilingual LLMs.[4]    +7 BLEU  Average improvement over prior state-of-the-art unsupervised MT methods reported in the paper's abstract.[4]    55.76 spBLEU  Average score for TopK+BM25 on the English-involving subset, competitive with regular human-annotated in-context learning.[4]    40.13 BLEU  WMT benchmark average in the paper's Table 2, ahead of the best listed baseline at 33.68.[4]   Translation performance is highest when both source and target languages are high-resource and lower when either side becomes more data-scarce. Source: El Mekki and Abdul-Mageed (2025).  More mined in-context examples generally help, especially when moving from 1 to about 8 examples, after which gains become smaller. Source: El Mekki and Abdul-Mageed (2025).Two findings stood out to me. First, resource level still matters: even with strong multilingual LLMs, translation is easier when the target side is better represented. Second, better prompting is not just about the model. It is also about retrieval and filtering. In our experiments, a carefully selected unsupervised demonstration set was the difference between a rough translation and a competitive one.[4]Why this matters for low-resource translationThe broader significance is straightforward. If translation quality depends on large curated bitexts, then many languages remain blocked by a data collection problem before they can benefit from new models. But if an LLM can bootstrap usable demonstrations from a small amount of unlabeled text, the entry cost drops dramatically.[2][4]That does not mean the problem is solved. Low-resource translation remains harder, and our heatmap makes that visible. But it does mean the path forward looks different. Instead of waiting for perfect parallel corpora, we can start from weak lexical evidence, noisy sentence pairs, and strong multilingual priors, then iteratively mine something useful.[4]My own takeaway is that unsupervised MT is newly relevant in the LLM era. Not because LLMs made supervision obsolete, but because they made bootstrapping supervision more plausible. For underrepresented languages, that distinction matters. It is the difference between “we cannot build this yet” and “we can begin with what we have.”[2][4] Bottom line: if we can mine trustworthy in-context examples from unlabeled data, translation systems no longer have to wait for abundant parallel corpora before they become useful. That is a practical route toward broader language coverage in search, assistants, education, and public-facing digital tools.FAQWhat is unsupervised machine translation?It is translation without human-aligned sentence pairs. Instead of supervised bitext, the system has to bootstrap from monolingual data, weak lexical alignments, denoising, and back-translation.[2][4]Is unsupervised MT the same as zero-shot translation?No. Zero-shot translation is an inference setting where the model receives an instruction but no examples. 
In our work, the main problem is how to create reusable in-context examples from unlabeled data so the model can translate more reliably than plain zero-shot prompting.[1][4]Why not just use monolingual examples as demonstrations?Prior work on prompting for machine translation found that monolingual-only demonstrations generally hurt translation, whereas pseudo-parallel examples created through zero-shot back-translation or forward translation are much more effective.[3]How much unlabeled data does the approach assume?The paper studies a setting with fewer than 1,000 unlabeled sentences in each language, together with source and target vocabularies, a multilingual LLM, and an unsupervised sentence similarity function.[4]What is the main empirical takeaway?The paper reports that self-mined in-context examples can match or beat regular human-annotated in-context learning in many settings, while improving on previous UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]References Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. NeurIPS. PDF Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. A. (2018). Phrase-Based and Neural Unsupervised Machine Translation. EMNLP. PDF Zhang, B., Haddow, B., and Birch, A. (2023). Prompting Large Language Model for Machine Translation: A Case Study. ICML / PMLR. PDF El Mekki, A., and Abdul-Mageed, M. (2025). Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs. Findings of NAACL 2025. PDFLinks Project code: https://github.com/UBC-NLP/sm-umt Paper: https://aclanthology.org/2025.findings-naacl.238/ Disclaimer (April 04, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    },

    {
      "kind": "post",
      "title": "LLM Cultural Alignment: Synthetic Data Generation and Cultural Value Alignment in NileChat",
      "url": "https://elmekki.me/blog/nilechat-cultural-alignment-llms-synthetic-data/",
      "summary": "This article summarizes an LLM cultural alignment pipeline in NileChat, covering cultural bias in LLMs, synthetic data generation for LLMs, and the evaluation of cultural value alignment in LLM systems.",
      "year": 2026,
      "tags": ["LLMs","synthetic data","cultural alignment","llm cultural alignment","cultural bias in LLMs","synthetic data generation","value alignment evaluation","multilingual AI","NileChat"],
      "content": " Behind the Paper NileChat addresses a common problem in low-resource LLM adaptation. Egyptian and Moroccan Arabic are widely spoken, but continued pretraining data in these dialects remains limited. Translating large amounts of English educational content into the target dialect can improve fluency and knowledge transfer. However, translation alone does not ensure that a model captures local heritage, everyday references, or community-specific value patterns. That gap between linguistic adaptation and cultural alignment motivates the NileChat pipeline.   Pipeline summary: The paper combines machine translation of educational content for fluency and knowledge transfer, controlled synthetic data generation for LLM adaptation conditioned on local context, cultural heritage concepts, linguistic expressions, and representative personas, and a retrieval step that queries culturally specific web content. The pipeline is applied to Egyptian and Moroccan Arabic communities.  This work is directly relevant to LLM cultural alignment, cultural bias in LLMs, synthetic data generation for LLMs, and the evaluation of cultural value alignment in LLM systems.      Translation   Educational translation for fluency, coherence, and topical breadth.       Synthetic Generation   Local context documents, heritage concepts, linguistic cues, and representative personas.       Retrieval   Search-and-parse culturally specific web content for local heritage coverage.   Define the community, not just the language labelThe paper treats alignment targets as communities rather than abstract language names. “Arabic” is too broad. Even “Egyptian Arabic” or “Moroccan Arabic” is still incomplete unless the intended speech varieties, scripts, references, and value distributions are specified.This framing is also relevant to discussions of cultural bias in LLMs. If the target community is underspecified, the resulting model may inherit source-language assumptions or dominant-culture defaults instead of local norms.This framing raises several practical design questions: Which forms of speech should feel natural to the model? Which local references should be treated as common knowledge rather than niche trivia? Which social and moral distinctions matter enough to shape the data? Which communities are included, and which are still underrepresented?This framing changes the unit of design. The task is not only to translate a corpus into another language, but to construct a corpus that reflects how a community talks, remembers, and evaluates.  The NileChat pipeline: translation expands linguistic coverage, controlled generation combines local context, cultural concepts, linguistic cues, and personas, and retrieval adds culturally specific web content gathered for pretraining.Each data source served a different roleThe pipeline assigns distinct responsibilities to three layers rather than expecting a single corpus to satisfy every objective.1. Translation provided breadthIn the paper, machine translation is the layer for linguistic fluency and coherence. Educational content is translated from English into Egyptian and Moroccan Arabic.That translated layer was chosen for topical breadth. 
It covers areas such as education, history, health, medicine, and biology, which helps continued pretraining when native dialectal corpora are limited.Translated data can still carry source-language cultural biases, so MT improves language coverage and general knowledge without by itself solving cultural heritage or value alignment.2. Controlled synthetic generation used local context and personasControlled synthetic generation is not open-ended prompting. The teacher model is conditioned on four components: local contextual information from local news websites, core cultural heritage concepts extracted from country-specific Wikipedia portals, linguistic and cultural expressions such as proverbs, idioms, TV dialogue, and local terminology, and representative personas derived from World Values Survey responses.This setup limits open-ended invention. The prompt ties the generated text to realistic documents and a concrete persona profile, and the appendix explicitly instructs the model to rely on the provided context while reflecting the persona’s background. Prompt Structure In The Paper Inputs: a persona description, local context text, a cultural concept, and dialect-specific linguistic cues. Generation task: write a story, personal essay, blog post, review, or conversation in the target dialect rather than Modern Standard Arabic.   Use the provided context when writing.  Reflect the persona's cultural background, values, and worldview.  Incorporate dialectal expressions and local wording supplied in the prompt.  Keep the output in the target dialect and avoid drifting into MSA.  Use the persona implicitly instead of restating the persona description. The pipeline deliberately generates multiple genres: stories, personal essays, blog posts, reviews, and conversations. This variety exposes the model to different discourse patterns rather than one synthetic template repeated at scale.This is the core synthetic data generation for LLMs component of the pipeline. The paper uses synthetic generation to shape local discourse, persona-grounded language, and culturally specific content rather than relying only on translated corpora.3. Retrieval added local cultural heritage materialRetrieval in NileChat is also a data-construction step for pretraining, not an inference-time retriever. The system queries a search engine API using predefined cultural concepts that span categories such as food, clothes, landmarks, festivals and celebrations, geography, handicrafts, architecture, fauna, flora, and music.For each concept, it keeps the top 20 search results and parses the textual content with Trafilatura. This retrieved material adds naturally occurring, culturally specific web text that prompting alone may not provide. Key point: translation can teach a model how to say things in a target variety, but it cannot by itself teach the model what a community treats as obvious, familiar, or socially legible.Role of personasPersonas are the mechanism for bringing moral, demographic, and socioeconomic variation into the synthetic data.A key methodological detail is where they come from. The personas were derived from World Values Survey participant responses, not hand-written stereotypes. Selected survey answers were transformed into textual descriptions and then summarized by an LLM into concise persona profiles that could be plugged into prompts.The paper generates 1,200 persona descriptions from Egyptian and Moroccan WVS participants. 
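As a rough illustration of how those components can be combined into a single generation prompt, here is a minimal sketch; the function name, field wording, and template are assumptions for illustration and do not reproduce the paper's exact prompt.
def build_generation_prompt(persona, local_context, concept, expressions, genre='story'):
    # Illustrative template only; the paper's exact wording is not reproduced here.
    cues = ', '.join(expressions)
    return f\"\"\"Write a {genre} in the target dialect, not Modern Standard Arabic.
Use the provided context and reflect the persona implicitly, without restating it.
Incorporate these dialectal expressions where natural: {cues}

Persona:
{persona}

Local context:
{local_context}

Cultural concept: {concept}\"\"\"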
Once those personas were combined with local context, cultural concepts, and linguistic cues, the synthetic data became more closely tied to the target communities than translation alone.  The persona pipeline in NileChat: participant responses are extracted, parsed into text, and then formatted into a prompt-ready persona. This grounds generation in structured social profiles rather than vague labels such as \"local speaker.\"This step gives the model structured exposure to differences in priorities, beliefs, and social conditions inside the same country-level population. In the paper, the goal is not a generic “local speaker” persona, but a set of promptable profiles grounded in observed survey responses.Pipeline summaryIn summary, NileChat combines three ingredients at scale: a machine-translated educational layer built from 5.5 million Fineweb-edu texts for each dialect, a controlled synthetic layer built from personas, local news context, cultural heritage concepts, and dialectal expressions across stories, personal essays, blog posts, reviews, and conversations, and a retrieval layer built by querying cultural concepts on the web, keeping the top 20 non-social-media results, and parsing the returned pages.The paper continues pretraining Qwen-2.5-3B on that mixture, then performs supervised fine-tuning for Egyptian and Moroccan variants. The main evaluations focus on understanding, translation, cultural knowledge, and value alignment.ResultsCompared with Qwen2.5-3B-Instruct, NileChat substantially improved understanding benchmarks, roughly doubled cultural knowledge scores on Palm for both dialects, and moved value alignment closer to World Values Survey response distributions across most measured dimensions. The paper presents these results as evidence that the combined MT, controlled generation, and retrieval pipeline improved local alignment beyond translation alone.Evaluation of cultural value alignment in LLMsOne of the paper’s more useful contributions is its evaluation of cultural value alignment in LLM systems. Rather than treating generic reasoning benchmarks as a proxy for alignment, it evaluates cultural knowledge and compares model responses with World Values Survey response distributions.This evaluation of cultural value alignment in LLMs matters because a model can be fluent in a target variety while still reflecting cultural bias in LLM behavior inherited from translated or globally dominant source data. In NileChat, the reported gains come not only from language adaptation, but also from explicit testing of whether the model’s answers move closer to community-level value patterns.How the recipe could transfer to another communityThe same structure could be adapted to another setting with a similar sequence: Define the target population in terms of language, cultural heritage, and values. Translate structured educational content for fluency and topical breadth. Build controlled prompts from local context, cultural concepts, linguistic expressions, and representative personas. Add search-based retrieval of culturally specific web pages and parse them into text. 
Evaluate understanding, translation, cultural knowledge, and value alignment explicitly.That last step is easy to skip, but the paper shows why it matters: cultural alignment claims are stronger when they are tested against cultural knowledge and WVS-based value alignment, not inferred from generic benchmarks alone.LimitationsThe paper also notes several limitations: The method depends on a strong teacher model that can already generate the target low-resource variety. The supervised fine-tuning stage still relied heavily on translated data because native instruction data was scarce. A 3B model is still more susceptible to hallucination and incomplete information than larger architectures. Synthetic data generation is computationally expensive.SummaryA central conclusion is that community alignment does not come from one prompt or one translated dataset. In NileChat, it comes from giving MT, controlled generation, and retrieval distinct roles across language, cultural heritage, and values.This structure can transfer beyond Arabic. The exact dialects, cultural concepts, and evaluation sets will change, but the underlying principle stays the same: a model reflects a community more effectively when that community is encoded into the data pipeline on purpose.For more in-depth detail about the data collection approach, see the paper.Links Project resources (models and collected datasets): UBC-NLP NileChat collection Paper: NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities Disclaimer (March 21, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    }

  ]
}
