CulturalFrames tackles a key question: can today's text-to-image models and the metrics that judge them meet our cultural expectations? Existing benchmarks fixate on literal prompt matching or concept-centric checks, overlooking the nuanced cultural context that guides human judgment—and our study shows this blind spot is wide.
CulturalFrames is built with rigorous quality standards to capture the richness and complexity of real-world cultural scenarios.
To construct our dataset, we first curated culturally grounded knowledge from five key categories (greetings, family structure, etiquette, religion, and dates of significance) using the Cultural Atlas database. We then prompted LLMs to ground this knowledge into culturally reflective image generation prompts. Each prompt was reviewed by three annotators from the respective country, and only prompts with majority agreement on their cultural appropriateness were retained. Images were generated using four state-of-the-art text-to-image models. To collect reliable annotations, we ran multiple pilot studies to refine the rating instructions, and implemented a quality control loop that included annotator filtering and continuous feedback for high-performing annotators. This process yields a culturally rich dataset with over 10,000 human ratings, each consisting of a score and a free-text rationale.
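As a concrete illustration, the sketch below shows the kind of majority-vote filter applied during prompt validation. The data layout and field names are hypothetical; only the logic mirrors the three-annotator check described above.

```python
from collections import Counter

def keep_prompt(annotations, min_votes=2):
    """Keep a candidate prompt only if a majority of the three in-country
    annotators judged it culturally appropriate (hypothetical data layout)."""
    votes = Counter(a["is_appropriate"] for a in annotations)
    return votes[True] >= min_votes

# Example: two of three annotators approve, so the prompt is retained.
annotations = [
    {"annotator": "JP-01", "is_appropriate": True},
    {"annotator": "JP-02", "is_appropriate": True},
    {"annotator": "JP-03", "is_appropriate": False},
]
retained = keep_prompt(annotations)  # True
```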
Browse through samples from different countries in the CulturalFrames dataset.
Explore human evaluations of cultural alignment, image quality, stereotypes, and overall satisfaction across different countries.
Our evaluation reveals significant gaps in cultural alignment across both text-to-image models and evaluation metrics.
Human evaluations show GPT-Image leads in prompt alignment (0.85) and overall preference, with Imagen3 rated highest for image quality. Open-source models SD-3.5-Large and Flux lag, with SD performing worst due to low quality and higher stereotype rates. Stereotypical outputs occur in 10-16% of images, most for SD and least for Flux. Cross-country differences are notable, with Asian countries giving lower scores. Of all sub-perfect ratings, 50.3% are explicit errors, 31.2% implicit, and 17.9% both, showing persistent cultural nuance challenges.
Model performance varies considerably across countries and criteria. Images generated for Asian countries such as Japan and Iran generally receive lower scores on every criterion. The plots below show performance metrics and error distributions across countries.
The words raters flagged as problematic in prompts reveal two main error patterns. Country demonyms (e.g., Iranian, Brazilian) are often marked when an image lacks the expected country-specific element or when annotators cannot relate to its content. Other frequent errors involve broad cultural signifiers, such as rituals, social roles, and iconic objects, showing that T2I models often misrepresent these elements.
Automatic metrics correlate poorly with human judgments. VIEScore and UnifiedReward align most closely, though still below human-human agreement, and all metrics struggle on image quality. Overall, VLM-based metrics best capture culturally grounded preferences.
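For reference, the snippet below shows how such metric-human agreement can be measured with a rank correlation; the scores are illustrative placeholders, not values from our study.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder scores: human alignment ratings vs. two automatic metrics
# for the same image-prompt pairs (values are illustrative only).
human    = np.array([5, 3, 4, 2, 5, 1, 4, 3])
metric_a = np.array([8.0, 6.5, 7.0, 5.0, 9.0, 4.0, 7.5, 6.0])  # e.g. a 0-10 VLM judge score
metric_b = np.array([0.9, 0.4, 0.6, 0.5, 0.8, 0.3, 0.7, 0.4])  # e.g. a reward-model score

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    rho, p = spearmanr(human, scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")
```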
The rationales produced by automatic metrics often diverge from human judgments. Below we present examples where automated evaluation metrics disagree with human cultural assessments.
Based on our analysis of cultural misalignment in text-to-image models and their evaluation metrics, we highlight three key directions for improvement.
We expand culturally implicit prompts in CulturalFrames by automatically adding missing cues (cultural objects, family roles, setting details, mood/atmosphere) identified in our analysis of model failures. We generate images for the new prompts using Flux.1-Dev and evaluate alignment with VIEScore. This targeted, culturally informed expansion raises VIEScore from 7.3 on the original prompts to 8.4 on the expanded prompts, showing that making implicit cultural cues explicit leads to better-aligned image generation.
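A minimal sketch of this expansion-and-generation step is shown below, assuming access to the Flux.1-Dev weights through the diffusers FluxPipeline. The `expand_prompt` helper is a hypothetical stub standing in for the LLM-based rewriting step; the generated images would then be scored with VIEScore as in the main evaluation.

```python
import torch
from diffusers import FluxPipeline

def expand_prompt(prompt: str, country: str) -> str:
    """Hypothetical stub: in practice an instruction-following LLM adds the
    implicit cultural cues (objects, family roles, setting, mood) that our
    failure analysis found missing from the original prompt."""
    return (
        f"{prompt} Depict it in {country}, including culturally accurate "
        "objects, clothing, family roles, setting details, and atmosphere."
    )

# Load Flux.1-Dev and generate an image for the expanded prompt.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

original = "A family sharing a New Year meal"
expanded = expand_prompt(original, country="Japan")
image = pipe(expanded, num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("expanded_prompt_sample.png")
```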
We rewrote VIEScore's GPT-4o instructions using our human rater guidelines to make both implicit and explicit cues salient, then re-evaluated image-prompt alignment. This raised Spearman correlation with human ratings from 0.30 to 0.32 and improved explanation alignment from 2.19 to 2.37 (5-point scale). Carefully crafted, culturally informed instructions thus boost both scores and rationales.
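The sketch below illustrates how a culturally informed judging instruction can be passed to GPT-4o. The rubric here is a paraphrase in the spirit of our rater guidelines, not the exact instructions used for VIEScore, and the function name is illustrative.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Paraphrased, culturally informed rubric (not the exact VIEScore instructions).
RUBRIC = (
    "You are rating how well an image matches a prompt for a specific country. "
    "Check both explicit elements (objects, people, text named in the prompt) "
    "and implicit cultural expectations (etiquette, attire, rituals, setting). "
    "Return a 0-10 alignment score and a one-sentence rationale."
)

def judge(image_path: str, prompt: str, country: str) -> str:
    """Send the image plus the culturally informed rubric to GPT-4o."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{RUBRIC}\n\nCountry: {country}\nPrompt: {prompt}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```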
We compare a preference-trained judge (UnifiedReward on Qwen2.5-VL-7B) to its backbone and find consistently higher correlations with human judgments. It even edges out GPT-4o-based VIEScore on alignment (0.31 vs 0.30). Preference-based judge training, even without culture-specific data, meaningfully improves the cultural alignment of metric scores, and CulturalFrames can be used to push this further.