WVS2Persona is a Hugging Face dataset that converts World Values Survey Wave 7 respondent records into English textual personas. Each record has a stable persona_id and a persona description grounded in one respondent's decoded survey answers.

How many records are in WVS2Persona?

The current release contains 97,220 respondent-level personas organized into 66 country subsets. Each subset has a train split.

How is WVS2Persona related to NileChat?

WVS2Persona follows the persona construction approach introduced in NileChat, where World Values Survey responses are decoded into text and used to ground culturally aware data generation.

How do I use WVS2Persona?

Use the Hugging Face datasets library and load a country subset, for example load_dataset('3ebdola/wvs2persona', 'Morocco', split='train'). The returned dataset contains persona_id and persona columns.

What are the main limitations of WVS2Persona?

The personas are textual renderings of survey responses, not complete biographies or verbatim respondent statements. The release is in English, uses only core WVS variables, and may reflect survey limitations, response noise, and country-specific coding artifacts.

WVS2Persona: World Values Survey Wave 7 Personas for Culture-Aware AI

10 minute read

Published: April 25, 2026

Dataset Release

In recent research, we built NileChat, a culturally aligned LLM for Egyptian and Moroccan Arabic communities. A key part of that work was feeding local personas to the LLM so controlled synthetic data generation could reflect community values, not only surface-level language patterns. For NileChat, those personas were parsed from World Values Survey records for Morocco and Egypt, because these were the two use cases in the paper. With WVS2Persona, I am releasing the same kind of persona resource for all countries covered in this dataset so other researchers and builders can reuse the idea beyond the original NileChat setting.^[1]^[2]

TL;DR: WVS2Persona is Hugging Face dataset of 97,220 respondent-level persona descriptions derived from World Values Survey Wave 7 records, organized into 66 country subsets.^[2]

This dataset is relevant to culture-aware AI, LLM cultural alignment, persona-based prompting, social values modeling, and evaluation of value-sensitive generation.

66 country subsets

Country-level configurations such as Morocco, Egypt, United_States, India, and Brazil.^[2]

97,220 personas

One textual persona per WVS Wave 7 respondent record.^[2]

NileChat method

The construction follows the WVS-to-persona direction used in NileChat for culturally grounded generation.^[1]

WVS2Persona dataset overview showing World Values Survey Wave 7 records converted into textual personas across 66 country subsets and 97,220 respondent-level personas. — WVS2Persona turns World Values Survey Wave 7 respondent records into textual personas that can be loaded by country subset and used in culture-aware AI workflows.

From NileChat to WVS2Persona

The motivation comes directly from NileChat. In NileChat paper, we proposed a methodology for adapting LLMs to local communities by considering three axes together: language, cultural heritage, and cultural values. The values component is where WVS-derived personas matter. They make it possible to condition data generation on concrete social profiles instead of vague labels such as “local speaker” or “person from country X.”^[1]

In NileChat, this idea was applied to Egyptian and Moroccan Arabic. WVS2Persona expands the persona resource itself across the countries available in the release, making it easier for other researchers and builders to reuse the same type of value-grounded conditioning outside the original NileChat experiments.

NileChat persona creation pipeline from participant response extraction to response parsing and persona formatting. — The persona creation figure from NileChat: survey responses are extracted, decoded into readable text, and formatted into persona descriptions that can be used for prompting and culturally grounded generation.^[1]

There is one important release detail. The current WVS2Persona dataset provides full deterministic persona descriptions generated from decoded core WVS questionnaire variables. The concise, summarized persona style used for compact prompting in NileChat is planned as a future extension.^[2]

What is inside the dataset?

WVS2Persona is organized by country as Hugging Face subsets. Each subset has one train split and two columns:

persona_id: a stable identifier for the persona record.
persona: a full English persona description grounded in the respondent’s decoded WVS Wave 7 core-questionnaire answers.

The personas are respondent-level renderings. They are not cluster centroids, not invented archetypes, and not synthetic summaries of a demographic group. That distinction matters because a country does not have one value profile. A useful culture-aware dataset should preserve within-country variation across age, gender, education, religion, political attitudes, trust, well-being, economic values, family norms, security concerns, and other survey dimensions.

The released personas use only the WVS Wave 7 core questionnaire sections, including social values, happiness and well-being, trust and organizational membership, economic values, corruption, migration, security, science and technology, religious values, ethical values, political participation, political culture, and demographics.^[2]^[3]

Key point: WVS2Persona is a bridge between social survey data and LLM workflows. It turns structured survey responses into text that can be retrieved, prompted, summarized, filtered, and inspected by the same tools already used for language model experimentation.

Why this kind of dataset matters

Most language model datasets are good at representing what people write online. They are much weaker at representing how different communities answer questions about family, trust, religion, democracy, security, migration, gender norms, work, technology, and moral judgments. Those topics are central to cultural alignment, but they are not reliably captured by web text alone.

That gap matters for three reasons.

First, culture-aware AI needs internal variation. A single country label is too coarse. WVS2Persona keeps respondent-level diversity visible, which helps avoid collapsing a society into one stereotype.

Second, persona-grounded generation is easier to audit than free-form prompting. A model can be conditioned on a specific textual profile, and the researcher can inspect the profile that shaped the output.

Third, evaluation can move beyond generic benchmarks. If a model claims to represent a community, researchers can test whether its answers, explanations, or generated examples reflect the range of values observed in survey-grounded profiles rather than only dominant internet priors.

How to load WVS2Persona

The dataset is available on Hugging Face at 3ebdola/wvs2persona.^[2] Each country is loaded as a subset/config.

from datasets import load_dataset

repo_id = "3ebdola/wvs2persona"

ds_morocco = load_dataset(repo_id, "Morocco", split="train")

print(ds_morocco)
print(ds_morocco.column_names)
print(ds_morocco[0]["persona_id"])
print(ds_morocco[0]["persona"][:500])

Country names with spaces use underscores in the subset name:

from datasets import load_dataset

repo_id = "3ebdola/wvs2persona"

ds_us = load_dataset(repo_id, "United_States", split="train")
ds_gb = load_dataset(repo_id, "Great_Britain", split="train")
ds_south_korea = load_dataset(repo_id, "South_Korea", split="train")

Practical use cases

1. Persona-based prompting

The most direct use is to condition a model on a persona and ask it to generate an answer, conversation, story, or opinionated response from that perspective.

persona = ds_morocco[0]["persona"]

prompt = f"""
You are writing a short first-person answer grounded in this persona.
Do not repeat the persona verbatim. Reflect the values and background implicitly.

Persona:
{persona}

Question:
What makes a community trustworthy?
"""

This is useful when building culturally varied synthetic data, simulated user populations, or controlled evaluation prompts. The important constraint is that the persona should guide generation without being treated as a real person’s complete biography.

2. Retrieval for culturally grounded generation

Because each persona is plain text, it can be embedded and retrieved. A researcher can retrieve personas relevant to a topic such as trust in institutions, migration, work values, political participation, or family norms, then use those profiles as conditioning context.

This is often cleaner than sampling random country labels. Retrieval lets the data pipeline select profiles that actually mention the theme under study.

3. Value-sensitive model evaluation

WVS2Persona can help create evaluation sets for questions where cultural and social values shape the answer. For example, researchers can sample personas from a country subset, ask a model to answer a question under each profile, and then compare answer patterns across countries, demographic groups, or value dimensions.

This does not replace statistical analysis of the original WVS data. It gives LLM researchers a text-native layer for testing how models behave when values are explicit in the prompt.

4. Summarization and compression research

The full persona descriptions are long. That makes the dataset useful for studying controlled summarization: how to compress a survey-grounded profile into a shorter prompt while preserving the value signals that matter for downstream generation.

That direction connects back to NileChat, where compact personas were used inside controlled synthetic data generation prompts.^[1]

Good practices

Use WVS2Persona as a research and prototyping resource for culture-aware AI, not as a source of stereotypes. A persona is a textual rendering of one respondent’s survey answers. It should not be generalized to an entire country, religion, gender, class, or language community.

For most experiments, I recommend reporting:

which country subsets were used,
how personas were sampled,
whether full personas or compressed summaries were used,
what prompt template conditioned the model,
how sensitive attributes were handled,
and whether outputs were evaluated at the individual-profile level or aggregated level.

This documentation is not busywork. It is what makes persona-based cultural alignment experiments reproducible and less likely to turn into anecdotal claims.

FAQ

What is WVS2Persona?

WVS2Persona is a dataset that converts World Values Survey Wave 7 respondent records into English textual personas. Each row contains a stable persona_id and a full persona description grounded in decoded survey answers.^[2]

How many personas are included?

The current release contains 97,220 personas across 66 country subsets.^[2]

Is WVS2Persona synthetic data?

It is not synthetic in the sense of inventing people from scratch. Each persona is a deterministic natural-language rendering of one WVS respondent record. However, the text is generated from decoded survey responses, so it is not a verbatim respondent statement.^[2]

How is it connected to NileChat?

The dataset follows the WVS-to-persona construction approach introduced in NileChat. NileChat used WVS-derived personas as part of controlled synthetic data generation for culturally aware LLM adaptation.^[1]

What can I build with it?

Common uses include persona-based prompting, culture-aware synthetic data generation, retrieval over social profiles, value-sensitive model evaluation, and summarization of long persona descriptions into compact prompt-ready profiles.

What should I avoid?

Avoid treating a persona as a full biography or as representative of an entire group. Avoid using individual personas to make claims about countries or communities without aggregation, sampling documentation, and careful interpretation.

Summary

WVS2Persona is useful because it makes cultural values operational for LLM workflows. It does not reduce culture to a country tag. Instead, it gives researchers a large set of respondent-level textual profiles that can be sampled, retrieved, summarized, and used as conditioning context.

The broader lesson from NileChat still applies: culturally aware AI needs language, local knowledge, and values to be modeled deliberately. WVS2Persona focuses on the values part of that pipeline and makes it easier to reuse across communities.

References

El Mekki, A., Atou, H., Nacar, O., Shehata, S., and Abdul-Mageed, M. (2025). NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities. EMNLP 2025. ACL Anthology
El Mekki, A. (2026). WVS2Persona: Parsed World Values Survey (WVS) Wave 7 records into textual personas. Hugging Face Dataset. Dataset card
World Values Survey Association. World Values Survey Wave 7 documentation and questionnaire resources. Documentation