{
  "service": "elmekki-site-api",
  "resource": "posts",
  "generated_at": "2026-05-03T22:58:18-07:00",
  "total": 4,
  "items": [

    {
      "id": "alexandria-dialectal-arabic-machine-translation-dataset",
      "kind": "post",
      "title": "Alexandria: A Dialectal Arabic Machine Translation Dataset for Real-World Arabic MT",
      "url": "https://elmekki.me/blog/alexandria-dialectal-arabic-machine-translation-dataset/",
      "date": "2026-05-03T00:00:00-07:00",
      "year": 2026,
      "excerpt": "Alexandria is a human-translated Dialectal Arabic machine translation dataset for English-Arabic MT, low-resource machine translation, Arabic dialect benchmarking, and LLM evaluation across 13 Arab countries.",
      "summary": " Dataset Release Arabic machine translation still has a large gap between formal written Arabic and the Arabic people use every day. Most Arabic MT systems are strongest on Modern Standard Arabic (MSA), but real conversations across the Arab world happen in local dialects, wit...",
      "tags": ["Alexandria","Arabic machine translation","dialectal Arabic machine translation","Arabic MT dataset","Dialectal Arabic dataset","English Arabic translation dataset","Arabic dialect benchmark","low-resource machine translation","Arabic LLM evaluation","conversation-level machine translation","gender-aware machine translation","code-switching","culturally inclusive AI","Hugging Face dataset"],
      "content": " Dataset Release Arabic machine translation still has a large gap between formal written Arabic and the Arabic people use every day. Most Arabic MT systems are strongest on Modern Standard Arabic (MSA), but real conversations across the Arab world happen in local dialects, with city-level variation, code-switching, gendered forms, and domain-specific vocabulary. Alexandria was built to make that gap measurable, trainable, and harder to ignore.[1]   TL;DR: Alexandria is a large, community-driven, human-translated Dialectal Arabic machine translation dataset with 107K turns and 34,488 conversations across 13 Arab countries and 11 domains with high social impact. It is designed for English-Dialectal Arabic MT, Arabic dialect benchmarking, low-resource machine translation research, context-aware translation, and evaluation of Arabic-aware LLMs.[1][2]  This post is written for researchers, engineers, and dataset builders searching for an Arabic machine translation dataset, a Dialectal Arabic MT benchmark, an English Arabic translation dataset, or a low-resource machine translation resource for Arabic dialects.      107K turns   Parallel English and Dialectal Arabic turns grouped into multi-turn conversations.[1]       13 countries   Coverage across Egyptian, Levantine, Gulf, Nile, and Maghrebi dialect groups.[1]       11 domains   Healthcare, education, agriculture, legal and financial services, logistics, workplace communication, tourism, and more.[2]       City-level signal   Metadata goes beyond broad labels such as \"Levantine\" or \"Maghrebi\" and supports finer dialect analysis.[1]   What is Alexandria? Direct answer: Alexandria is a multi-domain, human-translated English-to-Dialectal Arabic and Dialectal Arabic-to-English machine translation dataset. It contains parallel, turn-aligned multi-turn conversations from 13 Arab countries, with metadata for country, city or sub-dialect, domain, persona role, speaker-addressee gender configuration, and split.[1][2]The dataset was created because Arabic MT has a structural evaluation problem. A system can look strong on formal MSA and still fail when users write or speak Egyptian Arabic, Moroccan Darija, Palestinian Arabic, Sudanese Arabic, Mauritanian Hassaniya, Omani Arabic, or Yemeni Arabic. This is not only a vocabulary issue. Dialects differ in morphology, syntax, politeness, register, borrowed words, and gender marking.Earlier dialectal Arabic MT resources helped the field, but many were limited by sentence-level structure, narrow domains, coarse dialect labels, or short utterances. Alexandria expands the design in four directions at once: conversation-level context, broader domain coverage, community translation and revision, and richer metadata for dialect and gender-sensitive analysis.[1]  The main Alexandria figure shows why the dataset is organized around communities, cities, domains, and dialogue turns. The map highlights participant geography and example conversations across dialects, domains, and speaker-addressee gender configurations.[1]What is inside the dataset?Alexandria contains 34,488 multi-turn conversations and approximately 107K total turns. Each example belongs to a conversation rather than an isolated sentence, which makes it useful for context-aware machine translation and dialogue-oriented LLM evaluation.[1]The 13 country-level dialect groups are: Egypt Jordan Lebanon Libya Mauritania Morocco Oman Palestine Saudi Arabia Sudan Syria Tunisia YemenThe 11 domains are: Agriculture and farming Commerce and transactions Construction and real estate Education and academia Energy and resources Everyday and social communication Healthcare and medical communication Legal and financial communication Logistics and transportation Professional and workplace communication Tourism and hospitalityThat domain mix matters. Many dialectal Arabic datasets focus on travel phrases, web text, or general-purpose short sentences. Alexandria targets situations where translation quality has real consequences: medical instructions, financial services, academic communication, logistics, workplace coordination, agriculture, and public-facing services. Key point: Alexandria is not just an Arabic dialect list. It is a metadata-rich benchmark for testing whether a model can preserve meaning, produce authentic local dialect, respect gendered forms, and remain robust across domains and cities.Practical use cases1. English-to-Dialectal Arabic machine translationThe most direct use case is training or evaluating systems that translate English into regional Arabic dialects. This is the harder direction in the experiments because the model must generate dialect-authentic Arabic, not merely understand it.[1]This is useful for MT systems, multilingual assistants, customer support, healthcare communication, education platforms, localization workflows, and public-service chatbots that need to speak to users in familiar local Arabic.2. Dialectal Arabic-to-English translationAlexandria also supports dialect-to-English translation. In the experiments, models generally perform better in this direction than in English-to-dialect translation, which suggests that current systems are often better at understanding dialectal input than producing authentic dialectal output.[1]This direction is valuable for search, moderation, multilingual analytics, public-interest monitoring, and accessibility tools that need to interpret dialectal Arabic content.3. Arabic LLM evaluationAlexandria is a benchmark for Arabic-aware LLMs. We evaluated 24 Arabic-capable models under turn-level, context-level, and conversation-level translation settings. This makes the dataset useful for comparing closed and open models, measuring translation robustness, and checking whether improvements hold across dialects rather than only on high-resource varieties.[1]4. Context-aware and conversation-level MTMany translation benchmarks are sentence-level. Alexandria is organized as multi-turn conversations, so it can test whether a system uses previous turns to translate the current turn more accurately. This matters for pronouns, speaker roles, tone, deixis, and other conversational dependencies.5. City-level and sub-dialect robustnessArabic dialects are not clean country-level blocks. A Palestinian rural variety can differ from an urban one. Omani sub-dialects differ across cities and regions. Moroccan, Tunisian, Mauritanian, Egyptian, Saudi, and Yemeni varieties all carry internal diversity. Alexandria’s city-anchored metadata allows researchers to ask whether a model is robust within a country, not only across countries.[1]6. Gender-aware machine translationArabic has many gender-marked forms. Alexandria includes speaker-addressee gender configurations, which makes it useful for evaluating whether translations preserve gender agreement and address forms. This is especially relevant for dialogue systems, because the gender of the speaker and the addressee can affect pronouns, verbs, adjectives, and social register.[1]7. Code-switching and register researchMany Arabic-speaking communities naturally mix Arabic with English, French, or other locally common languages, especially in technical, workplace, healthcare, and education settings. Alexandria explicitly allows conventional borrowed terms when they are natural in the target community. That makes it useful for studying code-switching, register control, and the boundary between dialect, MSA, and borrowed terminology.[1]How Alexandria was builtAlexandria was built through a six-month community-driven process involving 55 contributors from 13 Arab countries. This matters because the dataset is not just translated “into Arabic.” It is translated into local varieties by people tied to the target communities, with country leads coordinating local examples, onboarding, guideline interpretation, and quality control.[1]The community design was central to the dataset. Contributors represented city-anchored dialectal varieties, and the project used country teams rather than a single centralized annotation pool. That structure allowed the dataset to capture choices a generic Arabic translation process would often flatten: whether a phrase sounds Egyptian or Sudanese, whether a Moroccan speaker would naturally code-switch into French, whether an Omani term fits one locality but not another, and whether the gendered form matches the speaker and addressee.Project coordination was also part of the data quality process. The team used weekly project checks, a shared Slack workspace, bi-weekly reminders, and country-lead meetings every few weeks to surface recurring issues and refine the workflow.[1]The pipeline had three main phases.  Our data creation workflow has three phases: English source generation, human translation into Dialectal Arabic, and peer revision with correction and validation. The figure also shows the human translator audit step that filters irrelevant source conversations before translation.[1]   1. Source creation  Gemini-2.5 Pro generated English multi-turn conversations conditioned on country, domain, topic, persona, and gender configuration.[1]    2. Human translation  Native speakers translated the English conversations into their local Arabic dialects while preserving meaning, tone, persona, register, and gender direction.[1]    3. Peer revision  A second contributor from the same country reviewed translations for dialect authenticity, gender alignment, register, faithfulness, punctuation, and code-switching consistency.[1] Source generation and screeningThe English source conversations were generated in a controlled way before translation. For each country-domain pair, the pipeline generated 55 subdomains and 10 topics per subdomain, giving 550 topic specifications per country-domain. These topics were then used to create 2-4 turn English conversations conditioned on the target country, domain, persona, and gender configuration.[1]This source step was not treated as automatically correct. Contributors audited source conversations before translating them and skipped examples that did not fit local context, contained problematic cultural assumptions, or introduced factual issues. On average, 2.94% of source sentences were skipped as irrelevant. This is a useful reminder for multilingual dataset construction: LLM-generated source text can help scale coverage, but community review is what prevents source artifacts from becoming target-side noise.[1]Human translation and local dialect decisionsThe translation guidelines asked contributors to preserve semantic faithfulness while using natural local dialect rather than forcing the output into MSA. Translators were allowed to use Arabic script without enforcing a single standardized spelling system, which is important for dialectal Arabic because many varieties do not have one universally accepted orthography.The guidelines also allowed code-switching when it was conventional in the target community. That detail matters for domains such as healthcare, education, commerce, logistics, and workplace communication, where English, French, or other borrowed terms may be the most natural local choice.Peer revision and final compilationThe revision phase was human-only. Each translated conversation was reviewed by another participant from the same country, who checked dialect authenticity, gender alignment, register, faithfulness, punctuation, and code-switching consistency. Reviewers marked each item as accepted, minor edit, or major issue. If a reviewer came from a different regional variety inside the same country, the guidelines restricted them to mechanical edits rather than rewriting another local variety into their own.[1]The revision results are important. In the human-only revision phase, 68.4% of turns remained unchanged, 30.6% received minor edits, and 1% were flagged for major issues. The final revised data received high average quality scores for dialectal authenticity, register appropriateness, and semantic faithfulness.[1]Splits built for evaluationAlexandria is released with training, public development, public test, and private test splits. The public development and test sets were stratified across dialect groups, gender configurations, and translators. This design makes the dataset useful not only for training, but also for fairer evaluation and future shared tasks or leaderboards.[1] Data creation takeaway: for Dialectal Arabic machine translation, quality does not come from translation volume alone. It comes from community anchoring, local review, gender-aware metadata, and a revision process that respects within-country dialect diversity.What the experiments show Result summary: current Arabic-aware LLMs are much better at preserving meaning than producing dialect-authentic Arabic. Dialect-to-English translation is consistently easier than English-to-dialect translation. Maghrebi varieties, especially Mauritanian Arabic, remain among the hardest settings. Code-switching often lowers automatic translation scores. Metadata helps some models, but not all models consistently use it well.[1]The evaluation is useful because it avoids treating “Arabic translation” as one flat task. We evaluated 24 Arabic-capable LLMs across turn-level, context-level, and conversation-level settings. The main discussion focuses on the context-level setting because it best matches realistic dialogue MT: the model translates the current turn while seeing previous turns, but not future turns. Conversation-level translation gives higher raw scores, but it is a more permissive offline setup.[1]We chose the automatic metrics carefully. We reported spBLEU and chrF++, and avoided COMET because model-based MT metrics are less reliable for dialectal Arabic. Since spBLEU and chrF++ are highly correlated in our experiments, we used spBLEU for the main automatic analysis and reserved chrF++ for the appendix.[1]Dialect-to-English is easier than English-to-dialectAcross the evaluated models, Alexandria shows a strong directional asymmetry. Models perform better when translating dialectal Arabic into English than when translating English into dialectal Arabic. This is a central finding for Arabic MT and LLM evaluation: understanding a dialect is not the same as generating it naturally.[1]For product builders, this means a model that can answer questions about dialectal Arabic input may still sound unnatural when asked to generate local Arabic. For researchers, it means evaluation should separate comprehension from dialect-authentic generation.This directional asymmetry is one of the most actionable findings. If a system is meant to serve Arabic-speaking users, it is not enough to report dialect-to-English scores. English-to-dialect generation should be tested separately because that is where models are more likely to drift into MSA, generic Arabic, or the wrong regional form.Maghrebi dialects remain especially difficultPerformance varies heavily by dialect group. The models tend to do better on Egyptian and Levantine varieties, likely because these varieties are better represented in training data. Maghrebi dialects are harder, and Mauritanian Arabic is consistently among the most challenging in the benchmark.[1]This finding matters for low-resource machine translation because the hardest dialects are often the ones most in need of better resources. A benchmark that only averages across all Arabic varieties can hide these gaps.We also analyze lexical overlap with MSA and find that translation quality tends to be higher when dialectal references are lexically closer to MSA. This helps explain why varieties with stronger distance from MSA, including Maghrebi varieties, are more difficult for current models. The important evaluation lesson is that “Arabic” performance can be inflated by dialects that are closer to MSA while masking failure on more distant varieties.[1]City-level evaluation reveals stable sub-dialect difficultyAlexandria also evaluates selected sub-dialects within countries. We find that relative sub-dialect rankings are broadly consistent across model families. In other words, some sub-dialects are systematically harder across models, not just unlucky for one model.[1]That is exactly why city-level metadata matters. Without it, a benchmark may say “Palestinian Arabic” or “Omani Arabic” while missing important variation inside the label.Domain rankings are stable across modelsAlexandria covers 11 domains, and our experiments show that model ranking is fairly stable across those domains. The strongest models remain strong across topics, and smaller open-weight models remain lower-tier in this setup. We find limited evidence that one model is uniquely specialized for a particular domain under the tested prompting conditions.[1]This is useful for benchmarking because it suggests Alexandria can expose general Arabic dialect MT strength, not only topic-specific tricks.At the same time, domain coverage remains important. Stable model rankings do not mean domains are interchangeable. Technical domains still expose lexical gaps, MSA leakage, and code-switching pressure that would be invisible in a benchmark made only of everyday or tourism phrases.LLMs beat NLLB in the tested dialects, but metadata is mixedWe compare a subset of LLMs against NLLB-200-3.3B on the nine Alexandria dialects supported by NLLB. The evaluated LLMs outperform NLLB across those supported dialects, even without metadata in the prompt.[1]The metadata ablation is more nuanced. Full metadata helps Command-A in some cases, but the gains are not universal. For some models and dialects, adding all metadata has little effect or even hurts. This suggests that models differ in how well they use structured context such as participant gender, country, domain, and role information.Code-switching hurts many modelsAlexandria makes code-switching measurable. We compare translation quality for sentences with and without Latin-script tokens and find that code-mixing generally degrades performance for many dialects, including Egyptian, Jordanian, Lebanese, Moroccan, Palestinian, and Tunisian Arabic.[1]This is a practical result. Real Arabic users often code-switch, especially in technical, business, medical, and educational settings. A model that only handles clean Arabic script is not enough for production-grade dialectal Arabic MT.This is also where Alexandria becomes useful beyond translation scores. Because code-switching is measurable by dialect and domain, the dataset can support targeted evaluation of whether a system handles French-influenced Moroccan or Tunisian professional language, English-heavy workplace vocabulary, or technical terms that speakers would not naturally force into colloquial Arabic.Reasoning does not automatically improve translationThe experiments also compare reasoning and non-reasoning configurations for selected models. Reasoning generally does not help and often hurts translation performance, with the main exception being Gemini-3-Flash, where reasoning improved average spBLEU by about 2.0 points for English-to-dialect and about 0.4 points for dialect-to-English.[1]The broader lesson is that “more thinking” is not automatically better for translation. Translation needs faithful, fluent, locally appropriate output, and reasoning traces may not target that objective.Evaluation insights for Arabic MT buildersThe practical evaluation lessons from Alexandria are: Evaluate both directions: dialect-to-English and English-to-dialect answer different questions. Report dialect-level results, not only macro averages over Arabic. Separate semantic adequacy from dialect authenticity because a translation can be meaningful but still sound non-native. Include code-switching and technical domains because they are part of real Arabic use. Test whether metadata helps your specific model instead of assuming that more metadata always improves translation. Use human evaluation when dialectness matters, since automatic metrics do not fully capture register, locality, or naturalness.What the human evaluation addsAutomatic metrics are useful, but dialectal Arabic translation needs human judgment. Alexandria’s human evaluation separates three dimensions: semantic adequacy, gender accuracy, and dialectness or fluency.[1]The results show a clear pattern: Gender accuracy is usually high, often above 90%, when gender constraints are explicit. Semantic adequacy is generally above 3 out of 5 across dialects. Dialectness and fluency are lower, sometimes close to 2 out of 5 for difficult model-country pairs.This is one of the most important conclusions from Alexandria. Current models often know what the sentence means, but they do not always know how a native speaker in the target dialect would say it. They preserve semantics better than dialect authenticity.Among the human-evaluated systems, Gemini-3-Flash and Command-A define the strongest adequacy-dialectness trade-off, while some large models still produce weaker dialectness despite preserving meaning.[1]How to load AlexandriaThe dataset is available on Hugging Face at UBC-NLP/alexandria. You need to review and accept the dataset access conditions on Hugging Face before loading the files.[2]from datasets import load_datasetrepo_id = \"UBC-NLP/alexandria\"# Example: Morocco subsettrain_data = load_dataset(repo_id, name=\"MA\", split=\"train\")test_data = load_dataset(repo_id, name=\"MA\", split=\"test\")first_conv = train_data[0]eng_turn = first_conv[\"english_conversation\"][0]dialect_turn = first_conv[\"dialectal_conversation\"][0]print(f\"English: {eng_turn['text']}\")print(f\"Dialect: {dialect_turn['text']}\")Responsible use and limitationsAlexandria is intended for research, evaluation, and model development around Dialectal Arabic MT and Arabic-aware LLMs. Before training or redistributing outputs, users should check the Hugging Face access conditions and the CC BY-NC-ND 4.0 license.[2]FAQWhat is Alexandria?Alexandria is a multi-domain Dialectal Arabic machine translation dataset and benchmark. It contains English and Dialectal Arabic multi-turn conversations translated and revised by native speakers from 13 Arab countries.[1]Is Alexandria an Arabic machine translation dataset or an LLM benchmark?It is both. Alexandria can be used as a training resource for English-Dialectal Arabic MT and as a benchmark for Arabic-capable LLMs under dialect, domain, context, and gender variation.[1][2]Why is Dialectal Arabic machine translation hard?Dialectal Arabic is highly variable across countries, cities, social contexts, and domains. It also mixes with MSA and other languages. A model must preserve meaning while producing locally natural vocabulary, morphology, gender marking, and register.Which dialects are included?Alexandria covers Egypt, Jordan, Lebanon, Libya, Mauritania, Morocco, Oman, Palestine, Saudi Arabia, Sudan, Syria, Tunisia, and Yemen, with finer sub-dialect or city-level metadata where available.[1]What makes Alexandria useful for low-resource machine translation?Many Arabic dialects have limited high-quality parallel data. Alexandria provides human-translated, peer-revised, multi-domain parallel conversations for dialects that are often underrepresented in MT benchmarks and training data.What did the experiments conclude?The main conclusion is that current Arabic-aware LLMs are better at meaning preservation than dialect-authentic generation. Dialect-to-English is easier than English-to-dialect, Maghrebi varieties are among the hardest, code-switching often lowers quality, and metadata helps only when the model can use it effectively.[1]Can Alexandria be used for gender-aware MT?Yes. Alexandria includes speaker-addressee gender configurations, making it useful for studying whether translation systems preserve gendered Arabic forms in dialogue.[1]SummaryAlexandria shows that the next step for Arabic machine translation is not simply more MSA data or broader language labels. The field needs benchmarks that reflect how Arabic is actually used: local dialects, city-level variation, multi-turn context, gendered speech, code-switching, and high-impact domains.The strongest result is also the simplest: models often understand dialectal Arabic better than they can generate it. Alexandria gives researchers and builders a way to measure that gap directly, improve Arabic MT systems, and build more culturally and linguistically inclusive language technology.References El Mekki, A., Magdy, S. M., Atou, H., AbuHweidi, R., Qawasmeh, B., Nacar, O., and others. (2026). Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs. ACL 2026 Main. arXiv UBC-NLP. (2026). Alexandria Dataset Card. Hugging Face Dataset. https://huggingface.co/datasets/UBC-NLP/alexandria UBC-NLP. Alexandria GitHub Repository. https://github.com/UBC-NLP/Alexandria Alexandria Project Website. https://alexandria.dlnlp.ai/Links arXiv article: https://arxiv.org/abs/2601.13099 Dataset: https://huggingface.co/datasets/UBC-NLP/alexandria GitHub repository: https://github.com/UBC-NLP/Alexandria Project website: https://alexandria.dlnlp.ai/ Disclaimer (May 03, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    },

    {
      "id": "wvs2persona-world-values-survey-personas",
      "kind": "post",
      "title": "WVS2Persona: World Values Survey Wave 7 Personas for Culture-Aware AI",
      "url": "https://elmekki.me/blog/wvs2persona-world-values-survey-personas/",
      "date": "2026-04-25T00:00:00-07:00",
      "year": 2026,
      "excerpt": "WVS2Persona is a Hugging Face dataset that turns World Values Survey Wave 7 respondent records into textual personas for culture-aware AI, persona-based prompting, and cultural alignment research.",
      "summary": " Dataset Release In recent research, we built NileChat, a culturally aligned LLM for Egyptian and Moroccan Arabic communities. A key part of that work was feeding local personas to the LLM so controlled synthetic data generation could reflect community values, not only surface...",
      "tags": ["WVS2Persona","World Values Survey","persona dataset","culture-aware AI","cultural alignment","LLM personas","NileChat","Hugging Face dataset","social values dataset","persona-based prompting"],
      "content": " Dataset Release In recent research, we built NileChat, a culturally aligned LLM for Egyptian and Moroccan Arabic communities. A key part of that work was feeding local personas to the LLM so controlled synthetic data generation could reflect community values, not only surface-level language patterns. For NileChat, those personas were parsed from World Values Survey records for Morocco and Egypt, because these were the two use cases in the paper. With WVS2Persona, I am releasing the same kind of persona resource for all countries covered in this dataset so other researchers and builders can reuse the idea beyond the original NileChat setting.[1][2]   TL;DR: WVS2Persona is Hugging Face dataset of 97,220 respondent-level persona descriptions derived from World Values Survey Wave 7 records, organized into 66 country subsets.[2]  This dataset is relevant to culture-aware AI, LLM cultural alignment, persona-based prompting, social values modeling, and evaluation of value-sensitive generation.      66 country subsets   Country-level configurations such as Morocco, Egypt, United_States, India, and Brazil.[2]       97,220 personas   One textual persona per WVS Wave 7 respondent record.[2]       NileChat method   The construction follows the WVS-to-persona direction used in NileChat for culturally grounded generation.[1]     WVS2Persona turns World Values Survey Wave 7 respondent records into textual personas that can be loaded by country subset and used in culture-aware AI workflows.From NileChat to WVS2PersonaThe motivation comes directly from NileChat. In NileChat paper, we proposed a methodology for adapting LLMs to local communities by considering three axes together: language, cultural heritage, and cultural values. The values component is where WVS-derived personas matter. They make it possible to condition data generation on concrete social profiles instead of vague labels such as “local speaker” or “person from country X.”[1]In NileChat, this idea was applied to Egyptian and Moroccan Arabic. WVS2Persona expands the persona resource itself across the countries available in the release, making it easier for other researchers and builders to reuse the same type of value-grounded conditioning outside the original NileChat experiments.  The persona creation figure from NileChat: survey responses are extracted, decoded into readable text, and formatted into persona descriptions that can be used for prompting and culturally grounded generation.[1]There is one important release detail. The current WVS2Persona dataset provides full deterministic persona descriptions generated from decoded core WVS questionnaire variables. The concise, summarized persona style used for compact prompting in NileChat is planned as a future extension.[2]What is inside the dataset?WVS2Persona is organized by country as Hugging Face subsets. Each subset has one train split and two columns: persona_id: a stable identifier for the persona record. persona: a full English persona description grounded in the respondent’s decoded WVS Wave 7 core-questionnaire answers.The personas are respondent-level renderings. They are not cluster centroids, not invented archetypes, and not synthetic summaries of a demographic group. That distinction matters because a country does not have one value profile. A useful culture-aware dataset should preserve within-country variation across age, gender, education, religion, political attitudes, trust, well-being, economic values, family norms, security concerns, and other survey dimensions.The released personas use only the WVS Wave 7 core questionnaire sections, including social values, happiness and well-being, trust and organizational membership, economic values, corruption, migration, security, science and technology, religious values, ethical values, political participation, political culture, and demographics.[2][3] Key point: WVS2Persona is a bridge between social survey data and LLM workflows. It turns structured survey responses into text that can be retrieved, prompted, summarized, filtered, and inspected by the same tools already used for language model experimentation.Why this kind of dataset mattersMost language model datasets are good at representing what people write online. They are much weaker at representing how different communities answer questions about family, trust, religion, democracy, security, migration, gender norms, work, technology, and moral judgments. Those topics are central to cultural alignment, but they are not reliably captured by web text alone.That gap matters for three reasons.First, culture-aware AI needs internal variation. A single country label is too coarse. WVS2Persona keeps respondent-level diversity visible, which helps avoid collapsing a society into one stereotype.Second, persona-grounded generation is easier to audit than free-form prompting. A model can be conditioned on a specific textual profile, and the researcher can inspect the profile that shaped the output.Third, evaluation can move beyond generic benchmarks. If a model claims to represent a community, researchers can test whether its answers, explanations, or generated examples reflect the range of values observed in survey-grounded profiles rather than only dominant internet priors.How to load WVS2PersonaThe dataset is available on Hugging Face at 3ebdola/wvs2persona.[2] Each country is loaded as a subset/config.from datasets import load_datasetrepo_id = \"3ebdola/wvs2persona\"ds_morocco = load_dataset(repo_id, \"Morocco\", split=\"train\")print(ds_morocco)print(ds_morocco.column_names)print(ds_morocco[0][\"persona_id\"])print(ds_morocco[0][\"persona\"][:500])Country names with spaces use underscores in the subset name:from datasets import load_datasetrepo_id = \"3ebdola/wvs2persona\"ds_us = load_dataset(repo_id, \"United_States\", split=\"train\")ds_gb = load_dataset(repo_id, \"Great_Britain\", split=\"train\")ds_south_korea = load_dataset(repo_id, \"South_Korea\", split=\"train\")Practical use cases1. Persona-based promptingThe most direct use is to condition a model on a persona and ask it to generate an answer, conversation, story, or opinionated response from that perspective.persona = ds_morocco[0][\"persona\"]prompt = f\"\"\"You are writing a short first-person answer grounded in this persona.Do not repeat the persona verbatim. Reflect the values and background implicitly.Persona:{persona}Question:What makes a community trustworthy?\"\"\"This is useful when building culturally varied synthetic data, simulated user populations, or controlled evaluation prompts. The important constraint is that the persona should guide generation without being treated as a real person’s complete biography.2. Retrieval for culturally grounded generationBecause each persona is plain text, it can be embedded and retrieved. A researcher can retrieve personas relevant to a topic such as trust in institutions, migration, work values, political participation, or family norms, then use those profiles as conditioning context.This is often cleaner than sampling random country labels. Retrieval lets the data pipeline select profiles that actually mention the theme under study.3. Value-sensitive model evaluationWVS2Persona can help create evaluation sets for questions where cultural and social values shape the answer. For example, researchers can sample personas from a country subset, ask a model to answer a question under each profile, and then compare answer patterns across countries, demographic groups, or value dimensions.This does not replace statistical analysis of the original WVS data. It gives LLM researchers a text-native layer for testing how models behave when values are explicit in the prompt.4. Summarization and compression researchThe full persona descriptions are long. That makes the dataset useful for studying controlled summarization: how to compress a survey-grounded profile into a shorter prompt while preserving the value signals that matter for downstream generation.That direction connects back to NileChat, where compact personas were used inside controlled synthetic data generation prompts.[1]Good practicesUse WVS2Persona as a research and prototyping resource for culture-aware AI, not as a source of stereotypes. A persona is a textual rendering of one respondent’s survey answers. It should not be generalized to an entire country, religion, gender, class, or language community.For most experiments, I recommend reporting: which country subsets were used, how personas were sampled, whether full personas or compressed summaries were used, what prompt template conditioned the model, how sensitive attributes were handled, and whether outputs were evaluated at the individual-profile level or aggregated level.This documentation is not busywork. It is what makes persona-based cultural alignment experiments reproducible and less likely to turn into anecdotal claims.FAQWhat is WVS2Persona?WVS2Persona is a dataset that converts World Values Survey Wave 7 respondent records into English textual personas. Each row contains a stable persona_id and a full persona description grounded in decoded survey answers.[2]How many personas are included?The current release contains 97,220 personas across 66 country subsets.[2]Is WVS2Persona synthetic data?It is not synthetic in the sense of inventing people from scratch. Each persona is a deterministic natural-language rendering of one WVS respondent record. However, the text is generated from decoded survey responses, so it is not a verbatim respondent statement.[2]How is it connected to NileChat?The dataset follows the WVS-to-persona construction approach introduced in NileChat. NileChat used WVS-derived personas as part of controlled synthetic data generation for culturally aware LLM adaptation.[1]What can I build with it?Common uses include persona-based prompting, culture-aware synthetic data generation, retrieval over social profiles, value-sensitive model evaluation, and summarization of long persona descriptions into compact prompt-ready profiles.What should I avoid?Avoid treating a persona as a full biography or as representative of an entire group. Avoid using individual personas to make claims about countries or communities without aggregation, sampling documentation, and careful interpretation.SummaryWVS2Persona is useful because it makes cultural values operational for LLM workflows. It does not reduce culture to a country tag. Instead, it gives researchers a large set of respondent-level textual profiles that can be sampled, retrieved, summarized, and used as conditioning context.The broader lesson from NileChat still applies: culturally aware AI needs language, local knowledge, and values to be modeled deliberately. WVS2Persona focuses on the values part of that pipeline and makes it easier to reuse across communities.References El Mekki, A., Atou, H., Nacar, O., Shehata, S., and Abdul-Mageed, M. (2025). NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities. EMNLP 2025. ACL Anthology El Mekki, A. (2026). WVS2Persona: Parsed World Values Survey (WVS) Wave 7 records into textual personas. Hugging Face Dataset. Dataset card World Values Survey Association. World Values Survey Wave 7 documentation and questionnaire resources. DocumentationLinks Dataset: https://huggingface.co/datasets/3ebdola/wvs2persona Underlying WVS data-use terms: https://www.worldvaluessurvey.org/AJDownloadLicense.jsp NileChat paper: https://aclanthology.org/2025.emnlp-main.556/ NileChat blog post Disclaimer (April 25, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    },

    {
      "id": "unsupervised-machine-translation-age-of-llms",
      "kind": "post",
      "title": "Unsupervised Machine Translation in the Age of LLMs",
      "url": "https://elmekki.me/blog/unsupervised-machine-translation-age-of-llms/",
      "date": "2026-04-04T00:00:00-07:00",
      "year": 2026,
      "excerpt": "Unsupervised machine translation still matters in the LLM era. This post explains how self-mined in-context examples can improve translation for low-resource languages without parallel data.",
      "summary": " Behind the Paper For years, machine translation improved by scaling data. But scaling parallel data is not an option for most languages. In the LLM era, that bottleneck did not disappear. Few-shot prompting works best when good translation examples already exist, and for many...",
      "tags": ["unsupervised machine translation","in-context learning","LLM translation","low-resource languages","multilingual LLMs","few-shot prompting","machine translation"],
      "content": " Behind the Paper For years, machine translation improved by scaling data. But scaling parallel data is not an option for most languages. In the LLM era, that bottleneck did not disappear. Few-shot prompting works best when good translation examples already exist, and for many language pairs they do not. In our recent paper, we asked a harder question: can we mine those examples automatically and still translate well?[4]   Paper summary: this post explains why unsupervised machine translation still matters, how in-context learning changes the problem, and how self-mined examples can improve multilingual LLM translation for low-resource languages without large parallel corpora.  This article is especially relevant to unsupervised machine translation, low-resource translation, in-context learning for machine translation, and multilingual LLM adaptation.      What is unsupervised MT?   Translation without human-aligned sentence pairs, bootstrapped from monolingual text and weak cross-lingual signals.[2][4]       What changed with LLMs?   LLMs made few-shot translation practical, but example quality and example selection still matter a lot.[1][3]       What our paper adds   We self-mine word pairs, turn them into weak sentence examples, and rank the best demonstrations with similarity filtering plus BM25.[4]       Why it matters   If translation examples can be mined from small amounts of unlabeled text, more languages can benefit from modern translation systems.[2][4]   What is unsupervised machine translation? Direct answer: Unsupervised machine translation (UMT) is translation between languages without human-labeled parallel sentences. Classical UMT starts from monolingual corpora, a weak cross-lingual initialization, and iterative back-translation that gradually improves translation quality.[2]This line of work matters because conventional machine translation depends heavily on large parallel corpora, while many language pairs have little or no such data. One of the key insights in early UMT was that monolingual text is much easier to obtain than aligned bitext, so the right question is not only “How do we get more labels?” but also “How far can we go without them?”[2]A useful mental model comes from the 2018 UMT literature: first align the problem just enough to get off the ground, then rely on language modeling, denoising, and back-translation to iteratively refine the system. That recipe turned an ill-posed task into something trainable even before LLM prompting entered the picture.[2]  A compact visual summary of the classical UMT recipe: weak initialization, denoising or language modeling, and iterative back-translation. Source: Lample et al. (2018).UMT in the era of LLMs and in-context learningLarge language models changed the interface of translation. Brown et al. framed few-shot learning as giving the model a small number of task demonstrations directly in the prompt, with no gradient updates at inference time. In that setup, a few-shot translation prompt can be as simple as repeated source sentence and target translation pairs followed by the new source sentence to translate.[1]That is powerful, but it does not solve the hardest part for low-resource translation: where do those demonstrations come from? Brown et al. explicitly note that few-shot learning still requires a small amount of task-specific data. For machine translation, later work showed that the number and quality of prompt examples matter, that performance varies with prompt design, and that directly using monolingual examples can hurt translation while pseudo-parallel examples help.[1][3] The bottleneck moved, but it did not disappear: LLMs can translate with prompts, yet low-resource settings still suffer from a missing-example problem. The challenge is not only prompting the model, but also constructing the prompt when no parallel data exist.[3][4]  Prior prompting work for machine translation found that monolingual-only demonstrations are usually harmful, while pseudo-parallel examples created by back-translation or forward translation improve prompting quality. Source: Zhang et al. (2023).Our paper: self-mining in-context examples for unsupervised MT One-sentence summary: we treat the missing-demonstration problem as an unsupervised mining problem: first mine reliable word translations, then use them to create and filter sentence-level examples that an LLM can use for translation in context.[4]In our Findings of NAACL 2025 paper, we assume access to a multilingual LLM, vocabularies in the source and target languages, and a small amount of unlabeled text in each language. Importantly, the learning phase uses no human-labeled parallel data, and the paper emphasizes a data-scarce regime with fewer than 1,000 unlabeled sentences per language in the studied setup.[4]   1. Mine word pairs  Use zero-shot prompting to translate frequent source words, reverse the direction, keep consistent back-translations, and rank the remaining pairs by cross-lingual similarity to retain high-quality lexical anchors.[4]    2. Bootstrap with word-level prompts  Feed the best mined word pairs back into the model as in-context examples, refining the word inventory before moving to sentence-level translation.[4]    3. Create weak sentence translations  Translate sentences word by word to obtain rough but semantically useful sentence pairs. They are noisy, but they preserve enough meaning to seed the next stage.[4]    4. Select the right demonstrations  Back-translate to obtain more natural pairs, then choose input-specific examples with a two-step filter: similarity threshold first, BM25 ranking second. The final method is TopK+BM25.[4]  Key idea: instead of assuming demonstrations already exist, the system manufactures them from unlabeled text and then ranks them for each test input. Example selection becomes part of the learning pipeline, not an afterthought.[4]ResultsWe evaluated the approach with Llama-3 8B and Bloom 7B on 288 translation directions from FLORES-200. The headline result is that the unsupervised method can be comparable to, and sometimes better than, translation with regular in-context examples drawn from human-annotated data, while also outperforming prior UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]   288 directions  Evaluation scale across FLORES-200 translation directions with two multilingual LLMs.[4]    +7 BLEU  Average improvement over prior state-of-the-art unsupervised MT methods reported in the paper's abstract.[4]    55.76 spBLEU  Average score for TopK+BM25 on the English-involving subset, competitive with regular human-annotated in-context learning.[4]    40.13 BLEU  WMT benchmark average in the paper's Table 2, ahead of the best listed baseline at 33.68.[4]   Translation performance is highest when both source and target languages are high-resource and lower when either side becomes more data-scarce. Source: El Mekki and Abdul-Mageed (2025).  More mined in-context examples generally help, especially when moving from 1 to about 8 examples, after which gains become smaller. Source: El Mekki and Abdul-Mageed (2025).Two findings stood out to me. First, resource level still matters: even with strong multilingual LLMs, translation is easier when the target side is better represented. Second, better prompting is not just about the model. It is also about retrieval and filtering. In our experiments, a carefully selected unsupervised demonstration set was the difference between a rough translation and a competitive one.[4]Why this matters for low-resource translationThe broader significance is straightforward. If translation quality depends on large curated bitexts, then many languages remain blocked by a data collection problem before they can benefit from new models. But if an LLM can bootstrap usable demonstrations from a small amount of unlabeled text, the entry cost drops dramatically.[2][4]That does not mean the problem is solved. Low-resource translation remains harder, and our heatmap makes that visible. But it does mean the path forward looks different. Instead of waiting for perfect parallel corpora, we can start from weak lexical evidence, noisy sentence pairs, and strong multilingual priors, then iteratively mine something useful.[4]My own takeaway is that unsupervised MT is newly relevant in the LLM era. Not because LLMs made supervision obsolete, but because they made bootstrapping supervision more plausible. For underrepresented languages, that distinction matters. It is the difference between “we cannot build this yet” and “we can begin with what we have.”[2][4] Bottom line: if we can mine trustworthy in-context examples from unlabeled data, translation systems no longer have to wait for abundant parallel corpora before they become useful. That is a practical route toward broader language coverage in search, assistants, education, and public-facing digital tools.FAQWhat is unsupervised machine translation?It is translation without human-aligned sentence pairs. Instead of supervised bitext, the system has to bootstrap from monolingual data, weak lexical alignments, denoising, and back-translation.[2][4]Is unsupervised MT the same as zero-shot translation?No. Zero-shot translation is an inference setting where the model receives an instruction but no examples. In our work, the main problem is how to create reusable in-context examples from unlabeled data so the model can translate more reliably than plain zero-shot prompting.[1][4]Why not just use monolingual examples as demonstrations?Prior work on prompting for machine translation found that monolingual-only demonstrations generally hurt translation, whereas pseudo-parallel examples created through zero-shot back-translation or forward translation are much more effective.[3]How much unlabeled data does the approach assume?The paper studies a setting with fewer than 1,000 unlabeled sentences in each language, together with source and target vocabularies, a multilingual LLM, and an unsupervised sentence similarity function.[4]What is the main empirical takeaway?The paper reports that self-mined in-context examples can match or beat regular human-annotated in-context learning in many settings, while improving on previous UMT systems by an average of 7 BLEU points in the paper’s summary results.[4]References Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. NeurIPS. PDF Lample, G., Ott, M., Conneau, A., Denoyer, L., and Ranzato, M. A. (2018). Phrase-Based and Neural Unsupervised Machine Translation. EMNLP. PDF Zhang, B., Haddow, B., and Birch, A. (2023). Prompting Large Language Model for Machine Translation: A Case Study. ICML / PMLR. PDF El Mekki, A., and Abdul-Mageed, M. (2025). Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs. Findings of NAACL 2025. PDFLinks Project code: https://github.com/UBC-NLP/sm-umt Paper: https://aclanthology.org/2025.findings-naacl.238/ Disclaimer (April 04, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    },

    {
      "id": "nilechat-cultural-alignment-llms-synthetic-data",
      "kind": "post",
      "title": "LLM Cultural Alignment: Synthetic Data Generation and Cultural Value Alignment in NileChat",
      "url": "https://elmekki.me/blog/nilechat-cultural-alignment-llms-synthetic-data/",
      "date": "2026-03-21T00:00:00-07:00",
      "year": 2026,
      "excerpt": "This article summarizes an LLM cultural alignment pipeline in NileChat, covering cultural bias in LLMs, synthetic data generation for LLMs, and the evaluation of cultural value alignment in LLM systems.",
      "summary": " Behind the Paper NileChat addresses a common problem in low-resource LLM adaptation. Egyptian and Moroccan Arabic are widely spoken, but continued pretraining data in these dialects remains limited. Translating large amounts of English educational content into the target dial...",
      "tags": ["LLMs","synthetic data","cultural alignment","llm cultural alignment","cultural bias in LLMs","synthetic data generation","value alignment evaluation","multilingual AI","NileChat"],
      "content": " Behind the Paper NileChat addresses a common problem in low-resource LLM adaptation. Egyptian and Moroccan Arabic are widely spoken, but continued pretraining data in these dialects remains limited. Translating large amounts of English educational content into the target dialect can improve fluency and knowledge transfer. However, translation alone does not ensure that a model captures local heritage, everyday references, or community-specific value patterns. That gap between linguistic adaptation and cultural alignment motivates the NileChat pipeline.   Pipeline summary: The paper combines machine translation of educational content for fluency and knowledge transfer, controlled synthetic data generation for LLM adaptation conditioned on local context, cultural heritage concepts, linguistic expressions, and representative personas, and a retrieval step that queries culturally specific web content. The pipeline is applied to Egyptian and Moroccan Arabic communities.  This work is directly relevant to LLM cultural alignment, cultural bias in LLMs, synthetic data generation for LLMs, and the evaluation of cultural value alignment in LLM systems.      Translation   Educational translation for fluency, coherence, and topical breadth.       Synthetic Generation   Local context documents, heritage concepts, linguistic cues, and representative personas.       Retrieval   Search-and-parse culturally specific web content for local heritage coverage.   Define the community, not just the language labelThe paper treats alignment targets as communities rather than abstract language names. “Arabic” is too broad. Even “Egyptian Arabic” or “Moroccan Arabic” is still incomplete unless the intended speech varieties, scripts, references, and value distributions are specified.This framing is also relevant to discussions of cultural bias in LLMs. If the target community is underspecified, the resulting model may inherit source-language assumptions or dominant-culture defaults instead of local norms.This framing raises several practical design questions: Which forms of speech should feel natural to the model? Which local references should be treated as common knowledge rather than niche trivia? Which social and moral distinctions matter enough to shape the data? Which communities are included, and which are still underrepresented?This framing changes the unit of design. The task is not only to translate a corpus into another language, but to construct a corpus that reflects how a community talks, remembers, and evaluates.  The NileChat pipeline: translation expands linguistic coverage, controlled generation combines local context, cultural concepts, linguistic cues, and personas, and retrieval adds culturally specific web content gathered for pretraining.Each data source served a different roleThe pipeline assigns distinct responsibilities to three layers rather than expecting a single corpus to satisfy every objective.1. Translation provided breadthIn the paper, machine translation is the layer for linguistic fluency and coherence. Educational content is translated from English into Egyptian and Moroccan Arabic.That translated layer was chosen for topical breadth. It covers areas such as education, history, health, medicine, and biology, which helps continued pretraining when native dialectal corpora are limited.Translated data can still carry source-language cultural biases, so MT improves language coverage and general knowledge without by itself solving cultural heritage or value alignment.2. Controlled synthetic generation used local context and personasControlled synthetic generation is not open-ended prompting. The teacher model is conditioned on four components: local contextual information from local news websites, core cultural heritage concepts extracted from country-specific Wikipedia portals, linguistic and cultural expressions such as proverbs, idioms, TV dialogue, and local terminology, and representative personas derived from World Values Survey responses.This setup limits open-ended invention. The prompt ties the generated text to realistic documents and a concrete persona profile, and the appendix explicitly instructs the model to rely on the provided context while reflecting the persona’s background. Prompt Structure In The Paper Inputs: a persona description, local context text, a cultural concept, and dialect-specific linguistic cues. Generation task: write a story, personal essay, blog post, review, or conversation in the target dialect rather than Modern Standard Arabic.   Use the provided context when writing.  Reflect the persona's cultural background, values, and worldview.  Incorporate dialectal expressions and local wording supplied in the prompt.  Keep the output in the target dialect and avoid drifting into MSA.  Use the persona implicitly instead of restating the persona description. The pipeline deliberately generates multiple genres: stories, personal essays, blog posts, reviews, and conversations. This variety exposes the model to different discourse patterns rather than one synthetic template repeated at scale.This is the core synthetic data generation for LLMs component of the pipeline. The paper uses synthetic generation to shape local discourse, persona-grounded language, and culturally specific content rather than relying only on translated corpora.3. Retrieval added local cultural heritage materialRetrieval in NileChat is also a data-construction step for pretraining, not an inference-time retriever. The system queries a search engine API using predefined cultural concepts that span categories such as food, clothes, landmarks, festivals and celebrations, geography, handicrafts, architecture, fauna, flora, and music.For each concept, it keeps the top 20 search results and parses the textual content with Trafilatura. This retrieved material adds naturally occurring, culturally specific web text that prompting alone may not provide. Key point: translation can teach a model how to say things in a target variety, but it cannot by itself teach the model what a community treats as obvious, familiar, or socially legible.Role of personasPersonas are the mechanism for bringing moral, demographic, and socioeconomic variation into the synthetic data.A key methodological detail is where they come from. The personas were derived from World Values Survey participant responses, not hand-written stereotypes. Selected survey answers were transformed into textual descriptions and then summarized by an LLM into concise persona profiles that could be plugged into prompts.The paper generates 1,200 persona descriptions from Egyptian and Moroccan WVS participants. Once those personas were combined with local context, cultural concepts, and linguistic cues, the synthetic data became more closely tied to the target communities than translation alone.  The persona pipeline in NileChat: participant responses are extracted, parsed into text, and then formatted into a prompt-ready persona. This grounds generation in structured social profiles rather than vague labels such as \"local speaker.\"This step gives the model structured exposure to differences in priorities, beliefs, and social conditions inside the same country-level population. In the paper, the goal is not a generic “local speaker” persona, but a set of promptable profiles grounded in observed survey responses.Pipeline summaryIn summary, NileChat combines three ingredients at scale: a machine-translated educational layer built from 5.5 million Fineweb-edu texts for each dialect, a controlled synthetic layer built from personas, local news context, cultural heritage concepts, and dialectal expressions across stories, personal essays, blog posts, reviews, and conversations, and a retrieval layer built by querying cultural concepts on the web, keeping the top 20 non-social-media results, and parsing the returned pages.The paper continues pretraining Qwen-2.5-3B on that mixture, then performs supervised fine-tuning for Egyptian and Moroccan variants. The main evaluations focus on understanding, translation, cultural knowledge, and value alignment.ResultsCompared with Qwen2.5-3B-Instruct, NileChat substantially improved understanding benchmarks, roughly doubled cultural knowledge scores on Palm for both dialects, and moved value alignment closer to World Values Survey response distributions across most measured dimensions. The paper presents these results as evidence that the combined MT, controlled generation, and retrieval pipeline improved local alignment beyond translation alone.Evaluation of cultural value alignment in LLMsOne of the paper’s more useful contributions is its evaluation of cultural value alignment in LLM systems. Rather than treating generic reasoning benchmarks as a proxy for alignment, it evaluates cultural knowledge and compares model responses with World Values Survey response distributions.This evaluation of cultural value alignment in LLMs matters because a model can be fluent in a target variety while still reflecting cultural bias in LLM behavior inherited from translated or globally dominant source data. In NileChat, the reported gains come not only from language adaptation, but also from explicit testing of whether the model’s answers move closer to community-level value patterns.How the recipe could transfer to another communityThe same structure could be adapted to another setting with a similar sequence: Define the target population in terms of language, cultural heritage, and values. Translate structured educational content for fluency and topical breadth. Build controlled prompts from local context, cultural concepts, linguistic expressions, and representative personas. Add search-based retrieval of culturally specific web pages and parse them into text. Evaluate understanding, translation, cultural knowledge, and value alignment explicitly.That last step is easy to skip, but the paper shows why it matters: cultural alignment claims are stronger when they are tested against cultural knowledge and WVS-based value alignment, not inferred from generic benchmarks alone.LimitationsThe paper also notes several limitations: The method depends on a strong teacher model that can already generate the target low-resource variety. The supervised fine-tuning stage still relied heavily on translated data because native instruction data was scarce. A 3B model is still more susceptible to hallucination and incomplete information than larger architectures. Synthetic data generation is computationally expensive.SummaryA central conclusion is that community alignment does not come from one prompt or one translated dataset. In NileChat, it comes from giving MT, controlled generation, and retrieval distinct roles across language, cultural heritage, and values.This structure can transfer beyond Arabic. The exact dialects, cultural concepts, and evaluation sets will change, but the underlying principle stays the same: a model reflects a community more effectively when that community is encoded into the data pipeline on purpose.For more in-depth detail about the data collection approach, see the paper.Links Project resources (models and collected datasets): UBC-NLP NileChat collection Paper: NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities Disclaimer (March 21, 2026): The latest version of this blog post was post-edited and formatted using an LLM."
    }

  ]
}