Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Anonymous Authors

Abstract

Large Language Models (LLMs) increasingly support culturally sensitive decision-making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

Code link: https://anonymous.4open.science/r/OG-MAR-Toward-Culturally-Aligned-LLMs-through-Ontology-Guided-Multi-Agent-Reasoning-50B1/README.md

Proposed Framework

Figure: main architecture
Overall architecture of the OG-MAR framework. The pipeline begins with Data Preprocessing & Ontology Construction (left). During inference, for a given query and target demographics, it performs Ontology & Demographic Retrieval (center) to gather relevant context. This context is used to instantiate multiple Persona Agents (top right), whose outputs are synthesized by a Judgment Agent (bottom right) to produce the final, culturally aligned prediction.
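Concretely, the stages in the figure can be summarized as the following control flow. This is a schematic illustration with our own function names, not the released implementation; the retrieval, persona, and judge components are passed in as callables.

```python
def og_mar(query, demographics, retrieve, personas, judge):
    """Schematic OG-MAR control flow: retrieve context, run persona agents, adjudicate."""
    # Ontology & Demographic Retrieval: ontology-consistent relations + similar profiles
    context = retrieve(query, demographics)
    # Each value-persona agent answers from its own worldview given the shared context
    answers = [agent(query, context) for agent in personas]
    # The judgment agent synthesizes persona outputs into one final prediction
    return judge(query, answers)
```

With stub components (e.g., a majority-vote judge), this runs end to end and makes the data flow between the three stages explicit.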

Data Preprocessing & Structuring

Figure: ontology final stage main
Figure: ontology plt legend 1
Visualization of the final ontology structure. The ontology comprises 76 classes and 150 object properties, forming a comprehensive semantic network.

Experiment Results

Accuracy of baseline methods across regional datasets. Bold text indicates the best performance, underlined text the second-best performance. * denotes significant improvements (paired t-test with Holm--Bonferroni correction, p < 0.05) over all baselines. † denotes our proposed method.
Method EVS (Europe) GSS (U.S.) CGSS (China) ISD (India) AFRO (Africa) LAPOP (L. America) Avg
GPT-4o mini
Zero-shot 0.5606 0.5164 0.5847 0.6139 0.5324 0.5760 0.5640
Role (2024) 0.5892 0.5184 0.6014 0.6060 0.5505 0.5674 0.5722
Self-consistency (2022) 0.5558 0.4920 0.5631 0.5976 0.5224 0.5551 0.5477
Debate (2025) 0.5985 0.5509 0.5993 0.6568 0.5343 0.5306 0.5784
ValuesRAG (2025) 0.6127 0.5589 0.5889 0.6420 0.5654 0.6085 0.5961
OG-MAR (Ours)† 0.6206* 0.5480 0.6509* 0.6192 0.5389 0.6268 0.6007*
Gemini 2.5 Flash Lite
Zero-shot 0.5681 0.4957 0.6467 0.5000 0.5282 0.6225 0.5602
Role (2024) 0.5786 0.4992 0.6669 0.5521 0.5313 0.5852 0.5689
Self-consistency (2022) 0.5489 0.4728 0.6063 0.4705 0.5182 0.6268 0.5406
Debate (2025) 0.5977 0.5138 0.6348 0.6335 0.5046 0.5331 0.5696
ValuesRAG (2025) 0.6075 0.5376 0.6084 0.6041 0.5472 0.5339 0.5731
OG-MAR (Ours)† 0.6249* 0.5489* 0.7017* 0.7007* 0.5701* 0.6385* 0.6308*
QWEN 2.5
Zero-shot 0.5199 0.5069 0.2704 0.7222 0.4814 0.4908 0.4986
Role (2024) 0.5357 0.5037 0.3463 0.7452 0.5014 0.4712 0.5172
Self-consistency (2022) 0.5096 0.4975 0.3289 0.6278 0.4080 0.4975 0.4782
Debate (2025) 0.5511 0.5174 0.4578 0.6320 0.4875 0.4332 0.5132
ValuesRAG (2025) 0.5538 0.5215 0.4697 0.6591 0.4724 0.5268 0.5339
OG-MAR (Ours)† 0.5898* 0.5325* 0.5220* 0.6599 0.5180 0.6005 0.5705*
EXAONE 3.5
Zero-shot 0.5143 0.5311 0.2885 0.6041 0.4054 0.5006 0.4740
Role (2024) 0.5319 0.5326 0.3129 0.6048 0.4077 0.4602 0.4750
Self-consistency (2022) 0.5490 0.5266 0.2697 0.6122 0.4086 0.5368 0.4838
Debate (2025) 0.5713 0.5407 0.5624 0.6773 0.4995 0.4939 0.5575
ValuesRAG (2025) 0.5172 0.5520 0.5833 0.6446 0.4794 0.5913 0.5631
OG-MAR (Ours)† 0.6080* 0.5636 0.6307* 0.7810* 0.5045* 0.7002* 0.6313*

Ablation Studies

Varying the Number of Retrieved Individuals

Figure: ablation k overall avg
Performance comparison of four models across K in {1, 3, 5, 10} on average. Red vertical dashed lines indicate the best K and gray horizontal lines show the overall mean accuracy.

Impact of Value Inference Generation

Figure: value profile vs main 2x2 pastel
Performance comparison between OG-MAR and the Value Inference Variant. Accuracy over four models on six regional datasets. Dashed lines show per-method average accuracy; red boxes report the average gap (Avg Delta = OG-MAR - Variant).

Impact of Multi-Persona Reasoning

Average accuracy of the full OG-MAR framework and the single-judge variant across four LLMs.
Model Method Avg. Accuracy
GPT-4o mini OG-MAR 0.6007
Single-Judge 0.5987
Gemini 2.5 OG-MAR 0.6308
Single-Judge 0.6022
QWEN 2.5 OG-MAR 0.5705
Single-Judge 0.5311
EXAONE 3.5 OG-MAR 0.6316
Single-Judge 0.5627
Figure: human evaluation average results
Average human evaluation scores (5-point Likert scale) across three tasks: Persona Fidelity (Consistency, Grounding), Judgment Logic (Synthesis, Context), and Retrieval Validity (Relevance). Scores are averaged over nine expert raters.

Discussion

Figure: acc mae vs token rolefixed
Performance--cost trade-off across methods. Left: accuracy vs total tokens (higher is better). Right: MAE vs total tokens (lower is better). Markers denote methods; the dashed line shows performance changes as token usage increases.

Training Details and Loss Curves

DeBERTa-v2-xxlarge Fine-tuning

Figure: deberta loss epochs latex
Training and validation loss curves for DeBERTa-v2-xxlarge fine-tuning. The x-axis represents epochs. Training loss (blue solid line) exhibits minor fluctuations typical of small-batch optimization, while validation loss (red dashed line, evaluated every 48 steps) decreases monotonically from 1.61 to 0.31 across three epochs, indicating effective learning without overfitting.

Value Category Classification

Topic classification performance on six regional datasets. Top-k: fraction of questions whose true category appears among the top-k predictions. F1macro: macro-averaged F1 across 12 categories. All metrics lie in [0,1].
Dataset Top-1 Top-2 Top-3 F1macro
Afrobarometer 0.5037 0.6875 0.7574 0.3070
CGSS 0.3375 0.5079 0.6656 0.2480
EVS 0.4315 0.5560 0.6680 0.3485
GSS 0.4545 0.6667 0.7765 0.3636
ISD 0.5439 0.7071 0.7950 0.2799
LAPOP 0.4396 0.6577 0.7349 0.3146
WVS (val) 0.9583 1.0000 1.0000 0.8250
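As a reference for how these metrics are computed, a minimal stdlib-only sketch follows; the function names are our own and do not come from the released code.

```python
def top_k_accuracy(y_true, y_ranked, k):
    """Fraction of items whose true category appears in the top-k ranked predictions."""
    hits = sum(1 for t, ranked in zip(y_true, y_ranked) if t in ranked[:k])
    return hits / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

Because macro-F1 averages per-class scores without frequency weighting, rare categories with poor predictions pull it well below Top-1 accuracy, which matches the gap visible in the table.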

Dataset Details

Figure: world map
Geographic coverage of cultural value datasets used in this study. Each country is colored according to its primary data source, prioritizing regional surveys over the global World Values Survey. Regional datasets include the General Social Survey for the United States, the European Values Study for Europe, Afrobarometer for Africa, the Chinese General Social Survey for China, and the India Social Dataset for India. Countries without regional coverage are represented by WVS data shown in gray. Color intensity reflects participant count on a logarithmic scale, ranging from 447 to 29,999 respondents per country. This multi-source approach ensures both regional specificity and global breadth in cultural alignment research.
Summary statistics of the retrieval corpus (WVS) and six test datasets. For WVS, we report the Wave 7 (2017--2022) subset and preprocessing as defined in WorldValuesBench (Zhao et al.). For test datasets, "#Value Qs" denotes the value-related items retained after our preprocessing/topic mapping (not necessarily the full questionnaire length).
Dataset Type Region Wave / Year #Countries #Respondents #Value Qs
Retrieval Corpus
WVS (World Values Survey) Retrieval Global 2017--2022 64 94,728 239
Test Datasets
EVS (European Values Study) Test Europe 2017 -- 59,438 211
GSS (General Social Survey) Test U.S. (N. America) 2021--2022 -- 8,181 44
CGSS (Chinese General Social Survey) Test China (E. Asia) 2021 -- ~8,148 58
ISD (Pew India Survey Dataset) Test India (S. Asia) 2019--2020 -- 29,999 33
LAPOP (AmericasBarometer) Test Latin America 2021 -- 64,352 48
Afrobarometer Test Africa 2022 -- ~48,100 144
Data sources used in our experiments. We use the World Values Survey (WVS) as the retrieval corpus, and evaluate generalization on six external test datasets (EVS, GSS, CGSS, ISD, LAPOP, and Afrobarometer), with official access links provided for reproducibility.
Dataset Link
Retrieval Corpus
WVS https://www.worldvaluessurvey.org/wvs.jsp
Test Datasets
EVS (European Values Study) https://europeanvaluesstudy.eu
GSS (General Social Survey) https://gss.norc.org
CGSS (Chinese General Social Survey) https://cgss.ruc.edu.cn
ISD (Pew India Survey Dataset) https://www.pewresearch.org/dataset/india-survey-dataset/
LAPOP (AmericasBarometer) https://www.vanderbilt.edu/lapop
Afrobarometer https://www.afrobarometer.org
Figure: Regional Dataset Distribution Pie
Distribution of selected value questions across regional datasets.
Distribution of values-related questions in WVS. The questions were categorized into 12 topics, with a total of 253 questions covering most dimensions of values.
Topic Count
Social Values, Norms, Stereotypes 45
Happiness and Wellbeing 11
Social Capital, Trust and Organizational Membership 47
Economic Values 6
Perceptions of Corruption 9
Perceptions of Migration 10
Perceptions of Security 21
Perceptions about Science and Technology 6
Religious Values 12
Ethical Values 23
Political Interest and Political Participation 35
Political Culture and Political Regimes 25

Extracting Representative Samples via Clustering

Figure: voronoi cluster
Voronoi visualization of Faiss k-means centroids for six embedding datasets. Blue crosses denote cluster centroids, colored dots indicate embedded samples, and light polygons show Voronoi regions in a 2D projection, providing an intuitive overview of the spatial distribution and structure of the embedding space across datasets.
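The representative-extraction step after clustering can be sketched as follows. We assume centroids have already been produced by Faiss k-means; each cluster's representative is then the sample closest to its centroid. This is a pure-stdlib illustration with our own function names, not the released code.

```python
import math

def nearest_representatives(samples, centroids):
    """For each centroid, return the index of the closest sample (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    reps = []
    for c in centroids:
        # The sample nearest to the centroid serves as that cluster's representative
        best = min(range(len(samples)), key=lambda i: dist(samples[i], c))
        reps.append(best)
    return reps
```

The centroids partition the embedding space into the Voronoi regions shown in the figure; the selected indices are the points closest to the blue crosses.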

Human Evaluation of Reasoning Traces

Results

Human evaluation results (N=9) on a 5-point Likert scale. Task 1 measures Persona Fidelity (Consistency, Grounding), Task 2 measures Judgment Logic (Synthesis, Context), and Task 3 measures Retrieval Validity (Relevance).
Dataset (Region) Consistency Grounding Synthesis Context Relevance
GSS (N.A.) 3.76 3.97 3.79 3.79 3.63
CGSS (E. Asia) 3.76 4.02 3.65 3.65 3.56
AFRO (Africa) 3.86 3.89 3.77 3.77 3.60
EVS (Europe) 3.77 3.80 3.77 3.77 3.72
ISD (S. Asia) 3.82 3.80 3.67 3.67 3.62
LAPOP (L. Am.) 3.70 3.78 3.67 3.67 3.71
Average 3.78 3.88 3.72 3.72 3.64

Prompt

Prompt 01 Persona Agent Single-Part
Prompt for Persona Agent.
Prompt Template

Task:

  • You are Persona Agent {persona_id}.
  • Given {question} and {options_text}, select exactly one option that this persona would choose, based only on the persona's internal worldview.
  • Use only the provided persona-defining inputs: {demographics_text}, {value_summaries_text}, and {hyper_edges_text}.
  • Prohibited: any external knowledge, culturally neutral/common-sense reasoning, or unstated assumptions beyond the inputs.

Inputs:

  • [DEMOGRAPHICS]: {demographics_text}
  • [VALUE PROFILES]: {value_summaries_text}
  • [ONTOLOGY HYPER-NODES]: {hyper_nodes_text}
  • [RESPONSE OPTIONS]: {options_text}
  • [USER QUESTION]: {question}

Strict Rules:

  • Stay in persona; use only the provided inputs; no external knowledge or assumptions.
  • Integrate all value summaries and apply all hyper-edges explicitly (e.g., support/conflict/amplification).
  • Cite >= 2 demographic attributes; explain internal alignment, at least one conflict, and how it is resolved.
  • Choose exactly one option; output only one valid JSON object and nothing else.
  • reasoning must be >= 250 words and explicitly cover value/edge integration and the most influential demographics.

Output Format (JSON only):

{
  "persona_id": "{persona_id}",
  "chosen_answer": "<value>: <text>",
  "reasoning": "...",
  "alignment_factors": {
    "demographic": "...",
    "value_summaries_used": [],
    "hyper_edges_used": [],
    "integration_rationale": "..."
  }
}
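Since the Persona Agent must emit exactly one valid JSON object, downstream code typically validates the reply before adjudication. A minimal sketch follows; the field names come from the template above, but the validator itself is our illustration, not the released implementation.

```python
import json

REQUIRED = {"persona_id", "chosen_answer", "reasoning", "alignment_factors"}

def parse_persona_output(raw, valid_options):
    """Parse a Persona Agent reply; check required fields and the chosen option."""
    obj = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    missing = REQUIRED - obj.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if obj["chosen_answer"] not in valid_options:
        raise ValueError(f"invalid option: {obj['chosen_answer']}")
    return obj
```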
Prompt 02 Judgment Agent Single-Part
Prompt for Judgment Agent.
Prompt Template

Task:

  • You are the Judgment Agent.
  • Given {question_text}, {options_text}, persona outputs, and a pre-computed vote summary, select exactly one final option by adjudicating only the Persona Agents' outputs.
  • Your decision must be based exclusively on: (1) Persona outputs (primary evidence) and (2) Vote summary (secondary context; do not recompute).
  • Prohibited: adding new facts or inventing any demographics/values/edges beyond what personas explicitly stated.

Inputs:

  • [USER QUESTION]: {question_text}
  • [RESPONSE OPTIONS]: {options_text}
  • [VOTE SUMMARY]: {vote_summary}
  • [PERSONA OUTPUTS]: {persona_outputs}

Strict Rules:

  • Use only information in [PERSONA OUTPUTS] and [VOTE SUMMARY].
  • Treat vote counts as correct and immutable; do not recount, estimate, or modify them.
  • Do not introduce any new persona attributes unless explicitly stated in persona outputs.
  • Do not use value/edge labels as standalone evidence; summarize evidence in natural language grounded in persona statements.

Decision Procedure:

  • A) Evidence Strength (Primary): Prefer the option supported by explicit, internally consistent persona reasoning grounded in stated demographics/values/edges.
  • B) Vote Summary (Secondary): Use vote counts only to break ties or confirm when evidence strength is comparable.
  • C) Relevance (Tie-breaker): If still tied, prefer evidence whose explicitly stated demographics are more directly relevant to the question.

Output Format (JSON only):

{
  "final_answer": "<value>: <text>",
  "reasoning": "..."
}
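The three-stage decision procedure can be sketched as a simple aggregation over persona outputs. The evidence and relevance scores here are placeholders for whatever the Judgment Agent extracts from persona reasoning; this is our illustration of the A/B/C ordering, not the released implementation.

```python
def adjudicate(persona_outputs, vote_summary):
    """Pick one option: evidence strength first, then vote counts, then relevance.

    persona_outputs: dicts with 'chosen_answer', 'evidence_score', 'relevance_score'.
    vote_summary: option -> pre-computed vote count (treated as immutable).
    """
    evidence, relevance = {}, {}
    for p in persona_outputs:
        opt = p["chosen_answer"]
        evidence[opt] = evidence.get(opt, 0.0) + p["evidence_score"]
        relevance[opt] = max(relevance.get(opt, 0.0), p["relevance_score"])
    # A) evidence strength; B) vote counts break ties; C) relevance breaks remaining ties
    return max(evidence, key=lambda o: (evidence[o], vote_summary.get(o, 0), relevance[o]))
```

Ordering the criteria lexicographically in the `max` key enforces that votes and relevance only matter when the primary evidence comparison is tied.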
Prompt 03 Object-Property Generation Agent Part 1 of 4
Prompt for Object-Property Generation Agent.
Prompt Template

Header:

  • You are an expert ontology engineer specialised in OWL 2 ontologies using Turtle syntax.
  • Your task is to generate only object properties that model directional relationships between value-derived classes of the World Values Survey (WVS) ontology.
  • You are working with an existing ontology. Its full class hierarchy is provided below:

Ontology Snapshot:

  • The following ontology snippet defines all OWL classes you are allowed to use.
  • You must not invent any new OWL classes.
  • All rdfs:domain and rdfs:range assignments must reference classes that appear in this snippet.

{ONTOLOGY_TTL}

  • Your job is not to modify the existing hierarchy.
  • Your job is to add only OWL object properties that express relations implied by the current Competency Question (CQ).
  • You follow a memoryless CQ-by-CQ pattern:
    • You handle exactly one CQ per call.
    • You forget all previous calls.
    • You never reuse previous object properties unless explicitly shown.
    • You never assume prior ontology state beyond what is in this prompt.

Helper:

  • You must generate OWL object properties in valid Turtle syntax under the following rules:

1. Object properties only

  • Each new property MUST declare rdf:type owl:ObjectProperty and specify exactly one existing class as rdfs:domain and one existing class as rdfs:range.
  • You MUST NOT create new classes, data properties, individuals, subclass axioms, owl:Restriction, reifications, inverse properties, or property chains.

2. Directionality

  • Domain = conceptual source (cause/driver)
  • Range = conceptual target (effect/outcome)

3. Naming of object properties (IRI)

  • Use prefix wvs:
  • The local name MUST be:
    • a single English verb in base form, e.g., reduce, increase, undermine, OR
    • a short verb phrase written in snake_case that clarifies the directionality, e.g., reduce_support, increase_concern, weaken_trust.
  • You MUST NOT embed any domain or range class names (e.g., reduce_outgroup_tolerance is forbidden).
  • The local name must use only lowercase letters and underscores (snake_case), never CamelCase.
Prompt 03 Object-Property Generation Agent Part 2 of 4
Prompt for Object-Property Generation Agent (continued).
Prompt Template (continued)

4. Labels (natural-language)

  • Each object property MUST include exactly one rdfs:label (@en).
  • The label MUST be a full declarative English sentence that includes:
    • the domain class concept (with capitalization matching its label, e.g., "Generalized Trust"),
    • the verb,
    • the range class concept (with capitalization matching its label, e.g., "Institutional Confidence").
  • The sentence MUST begin with a capital letter, use standard English spacing, avoid CamelCase inside the sentence, not end with a period, and reflect the correct direction.

5. Minimality

  • It is common and acceptable to create zero object properties.
  • Only create object properties if the CQ implies an actual directional conceptual relation that you can justify.
  • If NO meaningful directional relation exists, output zero properties: only output the prefix header + ontology declaration.

6. Class selection

  • Always choose the most specific allowed class that appears in the ontology snippet.
  • Avoid using top-level categories unless the CQ clearly refers to high-level concepts.

Story:

  • You are modelling cross-domain value relations in a WVS-based ontology to support a hypergraph-style retrieval-augmented generation system.
  • Nodes (hypernodes) correspond to value concepts (OWL classes), such as:
    • wvs:GeneralizedTrust
    • wvs:OutgroupTolerance
    • wvs:ReligiousImportance
    • wvs:PerceptionsOfMigration
    • wvs:PerceptionsOfSecurity
    • wvs:PoliticalParticipationActivities
    • etc.
  • Edges (hyperedges) will be derived from your object properties:
    • The domain class and the range class of each object property become the endpoints of a directional edge.
    • The semantic content of the edge is given by the object property label.

Runtime inputs

  • Your ontology will be used to answer competency questions (CQs), such as:
    • "How do subclasses of Happiness and wellbeing influence subclasses of the Perceptions of migration domain?"
    • "How do subclasses of Perceptions about science and technology influence subclasses of the Religious values domain?"
  • At runtime, the user message will always contain:
    • One current CQ in natural language, clearly marked.
    • One RESPONDENT_DATA_JSON block (the current respondent).
Prompt 03 Object-Property Generation Agent Part 3 of 4
Prompt for Object-Property Generation Agent (continued).
Prompt Template (continued)

Your task for each call is to:

  • Read the CQ and identify the main source and target value concepts.
  • Map them to the best-matching existing classes in the WVS ontology (prefer specific subclasses whenever possible).
  • Decide the most appropriate direction (domain -> range).
  • Choose a concise English verb phrase that describes the relationship.
  • Declare one or more new object properties in Turtle that capture these relations:
    • Create new properties ONLY IF the CQ genuinely implies a directional semantic relation between two existing WVS classes.
    • If the CQ does NOT express any meaningful or inferable relation between classes, do NOT create any object property; in that case, output only the required prefix header and ontology declaration.

For this call, you must handle the following CQ:

{CQS}

Focus within the CQ:

  • In this CQ, your primary focus is on the value domains that are explicitly mentioned in the question (for example, Economic Values, Social Values, Perceptions of Security, Perceptions of Migration, etc.).
  • Treat these high-level domains only as anchors: your actual modelling must happen at the level of their specific subclasses, not at the level of the broad domain classes.

Concretely:

  • Identify which domains the CQ linguistically treats as sources/causes/drivers and which domains it treats as targets/effects/outcomes.
  • Within the source domains, select the most appropriate subclasses as candidates for rdfs:domain.
  • Within the target domains, select the most appropriate subclasses as candidates for rdfs:range.
  • Prefer connections between concrete subclasses across domains, and avoid using generic top-level domain classes when a more specific subclass is available.

Respondent-data grounding:

  • The data that grounds these concepts comes from WVS respondent data.
  • Each API call provides one current respondent in JSON form, with a structure similar to:

RESPONDENT_DATA_JSON (Python-style dict or JSON object):

{
  "Q1": {
    "category": "Social Values, Norms, Stereotypes",
    "question": "On a scale of 1 to 4 ... how important is family in your life?",
    "response": "Very important"
  },
  "Q46": {
    "category": "Happiness and Wellbeing",
    "question": "Taking all things together, how would you rate your overall happiness?",
    "response": "Very happy"
  },
  "Q57": {
    "category": "Social Capital, Trust and Organizational Membership",
    "question": "Generally speaking, would you say that most people can be trusted ... ?",
    "response": "Need to be very careful"
  },
  ...
}

Current respondent data:

{{RESPONDENT_DATA_JSON}}

Important:

  • The categories in the JSON correspond exactly to the 12 value domains above.
  • The questions and responses give you an intuition about how a concrete person might link different value dimensions (e.g. high religiosity + low tolerance + strong security concerns).
  • However, you are not modelling this single person.
  • You are modelling general conceptual relations between classes that could explain, in the abstract, such patterns.

Use the respondent data as story-like grounding:

  • to observe which value domains the respondent expresses strongly or weakly,
  • to infer whether the relation suggested by the CQ is likely positive or negative,
  • to select a concise English verb that best matches the respondent's pattern,
  • to ensure that the chosen direction and verb feel plausible given the respondent's tendencies,
  • but never to create individuals or encode question IDs directly.

Footer:

  • When you answer, you must obey the following hard constraints:

1. Output format

  • Your entire answer must be valid Turtle.
  • Do not include any natural language explanation, bullets, or comments.
  • Do not include section headers such as [Header], [Helper], [Story], or [Footer] in your output.
  • Do not include # comments in the Turtle.
  • The output must be directly loadable by an OWL 2 tool such as Protege.

2. Prefixes

  • At the very top of your output, always include exactly the following prefix and base declarations:
@prefix : <http://cultural-alignment.org/wvs#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix wvs: <http://cultural-alignment.org/wvs#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@base <http://cultural-alignment.org/wvs#> .

<http://cultural-alignment.org/wvs#> rdf:type owl:Ontology .
Prompt 03 Object-Property Generation Agent Part 4 of 4
Prompt for Object-Property Generation Agent (continued).
Prompt Template (continued)

3. Content constraints

  • Do not create new OWL classes, data properties, individuals, or restrictions; do not modify or delete existing classes; do not introduce blank nodes; and do not use owl:unionOf, owl:intersectionOf, owl:Restriction, or other complex OWL constructors.
  • Only add owl:ObjectProperty declarations with exactly one existing class as rdfs:domain, exactly one existing class as rdfs:range, and one English sentence as rdfs:label (@en).

4. No conversation

  • Do not explain what you are doing.
  • Do not apologise.
  • Do not ask questions.
  • Do not output anything other than Turtle Syntax.

5. Memoryless CQ-by-CQ behaviour

  • Treat each call independently, as if you are seeing the CQ and respondent data for the first time.
  • Do not refer to previous CQs or previous ontology fragments.
  • Do not try to summarise or merge previous results in your output.
  • Just generate the object properties needed to model the current CQ.

Your final output for each call must therefore be:

  • A Turtle document starting with the fixed prefix and ontology header above,
  • Followed only by a small set of owl:ObjectProperty declarations that model the given CQ as directional value relations between existing WVS classes.

Critical Instruction: Sensitive Value Judgments

  • This ontology-generation task may involve sensitive, controversial, or politically charged value dimensions (e.g., ethnic relations, national identity, immigration attitudes, religious norms, gender values, security preferences, etc.).
  • You MUST NOT default to neutrality or avoid making value-linked inferences when the CQ and respondent data clearly suggest a directional conceptual relationship.
  • You are allowed to provide analytical, conceptual, and research-oriented value judgments ONLY in the context of modelling abstract relationships between OWL classes.
  • These judgments must be grounded strictly in the ontology, logically inferred from value dimensions, expressed as directional influence relations, written without endorsing or promoting any real-world group, and focused entirely on conceptual modelling.
  • Do NOT suppress or weaken the relationship because the topic may seem sensitive.
  • Your role in this task is to faithfully encode the conceptual structure implied by the CQ and respondent data, even if the subject matter is sensitive or controversial.
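A minimal example of a conforming output might look as follows; it uses two classes that appear in the taxonomy, but the object property itself is illustrative rather than taken from the generated ontology.

```turtle
@prefix : <http://cultural-alignment.org/wvs#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix wvs: <http://cultural-alignment.org/wvs#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@base <http://cultural-alignment.org/wvs#> .

<http://cultural-alignment.org/wvs#> rdf:type owl:Ontology .

wvs:weaken rdf:type owl:ObjectProperty ;
    rdfs:domain wvs:CorruptionInInstitutions ;
    rdfs:range wvs:GeneralizedTrust ;
    rdfs:label "Corruption In Institutions weakens Generalized Trust"@en .
```

Note that the fragment satisfies the hard constraints above: the fixed prefix header, a snake_case verb as the local name, exactly one domain, range, and `@en` label, and no comments or extra declarations.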
Prompt 04 Value Profile Generation Agent Single-Part
Prompt for Value Profile Generation Agent.
Prompt Template

Task:

  • You are an expert social-science researcher.
  • Summarize the respondent's values for {domain_label} based on the provided Q&A pairs.

Inputs:

  • [TAXONOMY]: {domain_taxonomy_yaml}
  • [RESPONDENT ANSWERS] (Format: "- Q: Question | R: Response"): {value_input_yaml}

Strict Rules:

  • Zero fabrication: Every single statement MUST be directly supported by the provided answers; do NOT invent, guess, or hallucinate information.
  • Coverage constraint: If there is at least one Q&A pair related to a subcategory, you MUST write a summary; only skip a subcategory if there is absolutely zero relevant data.
  • Style (telegraphic): Omit the subject (e.g., "The respondent", "They"); start sentences directly with verbs or key adjectives; e.g., "Strongly values family..." (correct) / "The respondent values..." (incorrect).
  • Length: All summaries must be concise (approximately 50 tokens). For {domain_label}, do NOT list details; provide a high-level synthesis. For subcategories, focus on specific beliefs and attitudes.
  • Do NOT output any text other than the YAML block.

Output Format (YAML only):

{domain_label}: >
  (High-level synthesis of value orientation, starting with verb)
Subcategory 1: >
  (Specific summary, starting with verb)
Subcategory 2: >
  (Specific summary, starting with verb)

Case Study

Case Study: GSS

Figure: case1
Case study (GSS): Evangelizing Preferences. For a target respondent profile, retrieved summaries from demographically similar individuals provide contextual signals about how faith commitment interacts with respect for others' autonomy. Aggregating these perspectives yields a final answer that reflects the target's most plausible choice while mitigating stereotype-driven inference and improving values alignment.

Case Study: CGSS

Figure: case2
Case study (CGSS): Purpose of Marriage. Retrieved summaries complement the target profile with family- and responsibility-oriented value cues, supporting nuanced interpretation of what marriage primarily represents. The final answer is inferred by consolidating similar individuals' perspectives, capturing contemporary norm-sensitive reasoning beyond generic common sense.

Case Study: EVS

Figure: case3
Case study (EVS): Social Distance Toward Jews. The target profile is enriched with retrieved summaries that foreground tolerance-related value cues and their interactions, helping interpret social-distance judgments with contextual sensitivity. By aggregating similar perspectives, the model infers the target's most likely response while reducing demographic over-attribution and stereotyping.

Ontology Details

CQ Examples

CQ Examples. Each CQ specifies two domains and asks about the relationships between their subclasses.
CQ Content
CQ1 How do subclasses of Economic Values influence subclasses of the Political culture and political regimes domain?
CQ2 How do subclasses of Ethical values influence subclasses of the Perceptions of corruption domain?
CQ3 How do subclasses of Happiness and wellbeing influence subclasses of the Religious values domain?
CQ4 How do subclasses of Perceptions about science and technology influence subclasses of the Religious values domain?
CQ5 How do subclasses of Perceptions of corruption influence subclasses of the Social capital, trust and organizational membership domain?
CQ6 How do subclasses of Perceptions of migration influence subclasses of the Social capital, trust and organizational membership domain?
CQ7 How do subclasses of Perceptions of security influence subclasses of the Social values, norms, stereotypes domain?
CQ8 How do subclasses of Political culture and political regimes influence subclasses of the Social values, norms, stereotypes domain?
CQ9 How do subclasses of Political interest and political participation influence subclasses of the Social capital, trust and organizational membership domain?
CQ10 How do subclasses of Social capital, trust and organizational membership influence subclasses of the Social values, norms, stereotypes domain?
Pre-defined value taxonomy manually constructed through systematic analysis of WVS survey questions. This ontology taxonomy comprises 12 top-level categories and 64 subcategories, providing a fixed knowledge structure for ontology-grounded retrieval and multi-agent cultural reasoning.
Value Domain Fine-grained Categories
Economic Values Economic Equality Preference, Environment Versus Growth Preference, Government Responsibility Preference, Market Competition Preference, Ownership Preference, Work Success Beliefs
Ethical Values Justifiability of Dishonest Behaviors, Moral Ambiguity Perception, Sexual Behavior Ethics, State Surveillance Rights, Violence Ethics
Happiness and Wellbeing Basic Needs Security, Health Status, Intergenerational Comparison, Perceived Life Control, Subjective Wellbeing
Perceptions about Science and Technology Importance of Science Knowledge, Science and Technology Optimism, Technology World Impact Evaluation
Perceptions of Corruption Accountability Risk Perception, Bribe Experience, Corruption Gender Stereotype, Corruption In Institutions
Perceptions of Migration Immigration Effects Perception, Immigration Policy Preference, Specific Immigration Impact Beliefs
Perceptions of Security Economic Security Worry, National Defense Willingness, Neighborhood Safety Incidence, Neighborhood Security Feelings, Political Security Concerns, Security-related Behavior, Value Trade-off Preferences, Victimization Experience
Political Culture and Political Regimes Democratic Characteristics Importance, Democratic Governance Perception, Human Rights Perception, Ideological Self-placement, National Identity, Regime System Approval, Territorial Attachment
Political Interest and Political Participation Election Importance and Voice, Electoral Integrity And Efficacy, News Media Use For Politics, Political Interest, Political Participation Activities, Voting Behavior
Religious Values Belief in Religious Concepts, Religion versus Science, Religious Authority Attitudes, Religious Exclusivism, Religious Identity, Religious Importance
Social Capital, Trust and Organizational Membership Civic Organization Membership, Generalized Trust, Institutional Confidence, Interpersonal Trust
Social Values, Norms, Stereotypes Attitudes Toward Future Social Change, Child Rearing Values, Family and Social Duty Attitudes, Gender Role Attitudes, Importance In Life, Outgroup Tolerance, Work Obligation Attitudes
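The fixed taxonomy above lends itself to a simple two-level mapping from value domains to fine-grained categories, with a reverse lookup from category to parent domain. The sketch below is illustrative only (it is not the released OG-MAR code) and includes an excerpt of three of the 12 domains:

```python
# Illustrative excerpt of the fixed value taxonomy (3 of the 12 domains),
# represented as a mapping from value domain to fine-grained categories.
TAXONOMY_EXCERPT = {
    "Economic Values": [
        "Economic Equality Preference", "Environment Versus Growth Preference",
        "Government Responsibility Preference", "Market Competition Preference",
        "Ownership Preference", "Work Success Beliefs",
    ],
    "Religious Values": [
        "Belief in Religious Concepts", "Religion versus Science",
        "Religious Authority Attitudes", "Religious Exclusivism",
        "Religious Identity", "Religious Importance",
    ],
    "Social Values, Norms, Stereotypes": [
        "Attitudes Toward Future Social Change", "Child Rearing Values",
        "Family and Social Duty Attitudes", "Gender Role Attitudes",
        "Importance In Life", "Outgroup Tolerance", "Work Obligation Attitudes",
    ],
}

def domain_of(category):
    """Return the parent value domain of a fine-grained category, or None."""
    for domain, categories in TAXONOMY_EXCERPT.items():
        if category in categories:
            return domain
    return None

print(domain_of("Religious Exclusivism"))  # Religious Values
```

Because the taxonomy is fixed before ontology construction, such a lookup table can serve both triple validation (every subject/object must be a known class) and grouping of retrieved classes by domain.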
Representative ontology triples for each value domain. The "Domain Category" column indicates the high-level category to which the subject class of each triple belongs. The last row (*) lists cross-domain triples whose object class falls under "Social Values, Norms, Stereotypes".
Domain Category Ontology Triples
Economic Values
<Work Success Beliefs, reinforces, Work Obligation Attitudes>
<Government Responsibility Preference, reduces, Economic Security Worry>
<Market Competition Preference, may slightly increase, Political Interest>
Ethical Values
<State Surveillance Rights, may strengthen, Institutional Confidence>
<Justifiability of Dishonest Behaviors, consistently heightens perception of, Corruption In Institutions>
<Moral Ambiguity Perception, erodes feeling of, Perceived Life Control>
Happiness and Wellbeing
<Perceived Life Control, can weakly reduce, Economic Security Worry>
<Subjective Wellbeing, consistently fosters, Outgroup Tolerance>
<Basic Needs Security, tends to alleviate, Economic Security Worry>
Perceptions about Science and Technology
<Technology World Impact Evaluation, may foster openness to, Attitudes Toward Future Social Change>
<Science and Technology Optimism, tends to alleviate, Economic Security Worry>
<Science and Technology Optimism, tends to positively promote, Attitudes Toward Future Social Change>
Perceptions of Corruption
<Corruption In Institutions, dampens, Political Interest>
<Bribe Experience, may reduce, Interpersonal Trust>
<Accountability Risk Perception, may slightly increase, Economic Security Worry>
Perceptions of Migration
<Immigration Effects Perception, significantly reduces, Generalized Trust>
<Immigration Effects Perception, tends to polarize towards exclusivism, Religious Exclusivism>
<Specific Immigration Impact Beliefs, may motivate, Political Participation Activities>
Perceptions of Security
<Neighborhood Security Feelings, consistently enhances, Interpersonal Trust>
<Political Security Concerns, erodes, Institutional Confidence>
<Economic Security Worry, reinforces, Work Obligation Attitudes>
Political Culture and Political Regimes
<Democratic Governance Perception, fundamentally underpins, Institutional Confidence>
<National Identity, may boost, Voting Behavior>
<Regime System Approval, actively encourages participation in, Voting Behavior>
Political Interest and Participation
<Voting Behavior, may reinforce, Institutional Confidence>
<Political Participation Activities, strongly drives, Civic Organization Membership>
<Political Participation Activities, tends to foster acceptance of, Outgroup Tolerance>
Religious Values
<Religious Importance, strongly reinforces sense of, Family and Social Duty Attitudes>
<Religious Importance, actively promotes participation in, Civic Organization Membership>
<Religious Exclusivism, severely undermines, Outgroup Tolerance>
Social Capital, Trust and Org. Membership
<Generalized Trust, fundamentally underpins, Outgroup Tolerance>
<Interpersonal Trust, helps cultivate, Outgroup Tolerance>
*
<Subjective Wellbeing, tends to heighten appreciation of, Importance In Life>
<Work Success Beliefs, reinforces, Work Obligation Attitudes>
<Science and Technology Optimism, tends to positively promote, Attitudes Toward Future Social Change>
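Ontology-consistent retrieval over triples like those above can be sketched as a match on either end of each (subject, relation, object) tuple. This is a minimal illustration using triples from the table, not the authors' actual retrieval implementation:

```python
# A small store of ontology triples taken from the table above.
TRIPLES = [
    ("Generalized Trust", "fundamentally underpins", "Outgroup Tolerance"),
    ("Interpersonal Trust", "helps cultivate", "Outgroup Tolerance"),
    ("Religious Exclusivism", "severely undermines", "Outgroup Tolerance"),
    ("Subjective Wellbeing", "consistently fosters", "Outgroup Tolerance"),
    ("Bribe Experience", "may reduce", "Interpersonal Trust"),
    ("Neighborhood Security Feelings", "consistently enhances", "Interpersonal Trust"),
]

def retrieve(value_class, triples=TRIPLES):
    """Return every triple in which the given class appears as subject or object."""
    return [t for t in triples if value_class in (t[0], t[2])]

for s, r, o in retrieve("Interpersonal Trust"):
    print(f"<{s}, {r}, {o}>")
```

Retrieving on "Interpersonal Trust" returns both triples where it is the object (what shapes it) and the triple where it is the subject (what it shapes), which is the kind of bidirectional context a persona agent can condition on.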
Figure: ontology init
Visualization of the initial stage of our value ontology, in which only the taxonomy is defined, before the ontology is enriched using competency questions (CQs). Nodes with the same color represent classes belonging to the same category. The large nodes denote the 12 parent classes directly under owl:Thing, while the small nodes correspond to their subclasses. All grey edges in this figure represent subClassOf relations.
Figure: ontology stage 2
Visualization of the intermediate stage of ontology construction. Subclasses from the Economic domain are now interconnected with subclasses from other domains, establishing semantic relationships across categories, e.g., <Economic Equality Preference, may increase, Immigration Effects Perception> and <Market Competition Preference, widely promotes, Science and Technology Optimism>. The ontology progressively accumulates fine-grained relationships as each competency question (CQ) is processed iteratively.
Figure: ontology degree
Figure: ontology plt legend
Final ontology structure with 76 classes and 150 object-property pairs. Node colors show the 12 parent value categories, and node size scales with the sum of in-degree and out-degree, so that larger nodes mark classes that are frequently instantiated in ontology triples and maintain rich relational connections to many other classes.
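The node-sizing rule in the figure (size proportional to in-degree plus out-degree in the directed relation graph) can be sketched with a plain degree count. The edges below are an illustrative excerpt of the ontology triples, not the full 150-relation graph:

```python
from collections import Counter

# Excerpt of directed relation edges (subject -> object) from the ontology.
edges = [
    ("Generalized Trust", "Outgroup Tolerance"),
    ("Interpersonal Trust", "Outgroup Tolerance"),
    ("Religious Exclusivism", "Outgroup Tolerance"),
    ("Subjective Wellbeing", "Outgroup Tolerance"),
    ("Neighborhood Security Feelings", "Interpersonal Trust"),
]

# Node size = in-degree + out-degree, so hub classes are drawn largest.
deg = Counter()
for subject, obj in edges:
    deg[subject] += 1  # out-degree contribution
    deg[obj] += 1      # in-degree contribution

print(deg.most_common(1))  # [('Outgroup Tolerance', 4)] -> drawn largest
```

In the full graph, classes such as those with many incoming "fosters"/"undermines" relations end up as the large hub nodes the caption describes.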

Ablation Study Details

VARYING THE NUMBER OF RETRIEVED INDIVIDUALS: Full Figures

Figure: k ablation all datasets
Detailed ablation study on retrieval size K across six regional datasets. Each subplot shows the performance comparison of four models (GPT-4o mini, Gemini 2.5, QWEN 2.5, EXAONE 3.5) across K in {1, 3, 5, 10}. Red vertical dashed lines indicate the best K for each dataset, and black horizontal dashed lines show the dataset-specific mean accuracy. The results demonstrate that K=5 achieves optimal or near-optimal performance across most datasets, while K=10 often leads to performance degradation due to increased noise in the retrieved context.
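Retrieving the K most demographically similar respondents can be sketched as a nearest-neighbor selection over profile attributes. The profiles, attribute names, and match-count similarity below are hypothetical simplifications for illustration (actual WVS profiles and the paper's similarity measure may differ):

```python
# Hypothetical respondent profiles (not WVS data).
PROFILES = [
    {"id": 1, "country": "KR", "age": "30s", "sex": "F", "education": "tertiary"},
    {"id": 2, "country": "KR", "age": "30s", "sex": "M", "education": "secondary"},
    {"id": 3, "country": "US", "age": "50s", "sex": "F", "education": "tertiary"},
    {"id": 4, "country": "KR", "age": "60s", "sex": "F", "education": "primary"},
]

def top_k(target, profiles, k=5):
    """Return the k profiles with the most attributes matching the target."""
    keys = [key for key in target if key != "id"]
    def sim(profile):
        return sum(profile[key] == target[key] for key in keys)
    return sorted(profiles, key=sim, reverse=True)[:k]

query = {"country": "KR", "age": "30s", "sex": "F", "education": "tertiary"}
ids = [p["id"] for p in top_k(query, PROFILES, k=2)]
print(ids)  # profile 1 matches on all four attributes and ranks first
```

The ablation's finding that K=10 degrades performance is consistent with this picture: once the closest matches are exhausted, additional neighbors share fewer attributes with the target and mostly add noise to the context.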

IMPACT OF MULTI-PERSONA REASONING: Full Table

Detailed breakdown of accuracy scores by region for the full OG-MAR framework compared to the Single-Judge variant (referenced in Section 5.2.3). The highest score between the two methods for each region is highlighted in bold.
Model Method EVS GSS CGSS ISD AFRO LAPOP Avg. Acc.
GPT-4o mini OG-MAR 0.6206 0.5480 0.6509 0.6192 0.5389 0.6268 0.6007
GPT-4o mini Single-Judge 0.5773 0.6000 0.6440 0.6996 0.5293 0.5419 0.5987
Gemini 2.5 OG-MAR 0.6249 0.5489 0.7017 0.7007 0.5701 0.6385 0.6308
Gemini 2.5 Single-Judge 0.5870 0.6222 0.5960 0.6551 0.5411 0.6116 0.6022
QWEN 2.5 OG-MAR 0.5898 0.5325 0.5220 0.6599 0.5180 0.6005 0.5705
QWEN 2.5 Single-Judge 0.5266 0.5777 0.4067 0.6485 0.4494 0.5779 0.5311
EXAONE 3.5 OG-MAR 0.6080 0.5636 0.6307 0.7810 0.5045 0.7022 0.6316
EXAONE 3.5 Single-Judge 0.5013 0.6444 0.4237 0.6900 0.4725 0.6444 0.5627
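The reported Avg. Acc. column is the unweighted mean of the six regional scores, and the bolded entries are per-region winners; both can be checked directly from the GPT-4o mini rows (small last-digit differences elsewhere in the table come from rounding):

```python
# Sanity check on the GPT-4o mini rows of the table above.
regions = ["EVS", "GSS", "CGSS", "ISD", "AFRO", "LAPOP"]
gpt4o_mini = {
    "OG-MAR":       [0.6206, 0.5480, 0.6509, 0.6192, 0.5389, 0.6268],
    "Single-Judge": [0.5773, 0.6000, 0.6440, 0.6996, 0.5293, 0.5419],
}

# Avg. Acc. = unweighted mean over the six regional datasets.
for method, scores in gpt4o_mini.items():
    print(f"{method}: {sum(scores) / len(scores):.4f}")

# Per-region winner (the bold entries in the table).
winners = [
    "OG-MAR" if a > b else "Single-Judge"
    for a, b in zip(gpt4o_mini["OG-MAR"], gpt4o_mini["Single-Judge"])
]
print(dict(zip(regions, winners)))  # OG-MAR wins 4 of 6 regions
```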

Additional Ablation Study: IMPACT OF RETRIEVED ONTOLOGY TRIPLES

Figure: hyper nodes ablation 2x4
Ablation study on the ontology-triple retrieval size N. Performance comparison across N in {1, 3, 5, 7, 9} for four LLM backbones on six regional datasets and their average. Red dashed vertical lines mark the best N, where average accuracy across all models peaks for each dataset. Gray dashed horizontal lines show the overall mean accuracy with values displayed. Results demonstrate that N=3 achieves competitive or near-optimal performance across most datasets.