PRANJAL CHITALE: Hi, Gretchen. Thanks for having me.
HUIZINGA: So, Pranjal, give us an overview of this paper. In a few sentences, what problem are you trying to solve and why should people care?
CHITALE: So we are witnessing exciting times as LLMs rapidly evolve as tools for countless use cases. While most of these LLMs were initially used for natural language processing tasks, they are now being extended across languages and modalities. However, a major gap is the availability of multimodal data for languages other than English. As a result, most multimodal models either do not fully cover non-English languages or rely heavily on translating the text in English-centric datasets in order to support multiple languages. The downside of this approach is that it often misses the cultural nuances of local languages. Another reason it is not optimal is that the images are mostly Western-centric and therefore would not reflect the local culture of many regions well. This kind of bias can tilt these models toward a Western perspective, raising concerns about the inclusiveness and safety of the content they generate when serving a global population of multicultural and multilingual users. For a truly inclusive AI ecosystem, models must demonstrate cultural understanding to ensure that the content they generate is safe and respectful of diverse communities. Assessing cultural awareness is extremely difficult, though, because the definition of culture itself remains an unresolved problem. In this work, we try to take a step toward a measure of cultural understanding.
HUIZINGA: Well, talk about how you did that. What methodology did you use for this paper and what were your major findings?
CHITALE: Now that we have defined our broader problem, it is important to decide the scope of our solution because, as we discussed, culture is an umbrella term, so we need to narrow the scope of this problem. We chose visual question answering, which is one of the most critical multimodal tasks, as the scope of this work. Aware of the limitations of existing VQA benchmarks, which often rely on translations and lack cultural representation, we developed CVQA, which is a Culturally diverse multilingual VQA benchmark. CVQA covers 30 countries and 31 languages and offers more than 10,000 culturally nuanced questions developed by native speakers and cultural experts. Our goal was to create questions that require what we call good cultural sense to answer. For example, with just the image, it is not possible to answer the question; you need some awareness of the local culture to be able to answer it. These questions are therefore grounded in knowledge of the local culture. An important aspect of this dataset is that we include both local-language and English variants of the same question, to enable robust testing of models across languages. I would say the crux of this effort is that while most previous efforts were modest in terms of language coverage (they were mostly specific to a language group or a country), we wanted this to be a much larger, global, collaborative effort, covering 31 languages across 30 countries. To create CVQA, we worked with trained volunteers from diverse age groups and genders, ensuring that the questions authentically represented their cultures. The images collected were required to be royalty-free, culturally grounded, and safe for work, with strict guidelines to avoid images that reinforce stereotypes or violate privacy. We also had 10 categories, covering topics ranging from daily life, sports, and cuisine to the history of the region, to give a holistic view of a region's culture. Each question was designed as a multiple-choice task with challenging answer options that require both the image and cultural knowledge to solve. We also used a maker-checker approach to ensure quality and consistency.
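To make the setup concrete, here is a minimal sketch of what one CVQA-style record might contain: an image, the same question in the local language and in English, multiple-choice options, and cultural metadata. The field names and the toy example are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CVQAExample:
    """One culturally grounded, multiple-choice VQA item (illustrative schema only)."""
    image_path: str        # royalty-free, culturally grounded, safe-for-work image
    question_local: str    # question written by a native speaker in the local language
    question_english: str  # English variant of the same question
    options: List[str]     # answer choices; distractors require cultural knowledge
    answer_index: int      # index of the correct option
    category: str          # one of the 10 topic categories, e.g., "Cooking and food"
    country: str           # hypothetical values for illustration
    language: str

# A toy, fabricated record showing the shape of a single entry:
example = CVQAExample(
    image_path="images/0001.jpg",
    question_local="<question text in the local language>",
    question_english="Which festival is being celebrated in this image?",
    options=["Option A", "Option B", "Option C", "Option D"],
    answer_index=2,
    category="Pop culture",
    country="India",
    language="Hindi",
)
```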
HUIZINGA: So you built the benchmark and you tested it. What were your major findings?
CHITALE: Now that we have created the benchmark, the next step is to evaluate how these multimodal models perform on it. We compared several state-of-the-art multimodal models, including both open-source models like CLIP, BLIP, and LLaVA-1.5 and proprietary offerings like GPT-4o and Gemini 1.5 Flash. What we observed is that there is a huge performance gap between the proprietary offerings and the open-source models. GPT-4o was the best-performing model, with an accuracy of 75.4% on English prompts and 74.3% on local-language prompts. However, the story is completely different when we move to open-source models, which lag significantly behind the proprietary ones. And one of the key findings is that these open-source models perform even worse when prompted in the local language than when prompted in English. This potentially highlights that these models lack multilingual understanding capabilities, which may be because multilingual training data is quite sparse.
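As a rough illustration of the evaluation described here, the sketch below scores a model on multiple-choice accuracy with either English or local-language prompts, building on the illustrative `CVQAExample` schema above. The `predict` callable is a hypothetical stand-in for any multimodal model, not an actual API.

```python
from typing import Callable, Iterable, List

def accuracy(
    examples: Iterable[CVQAExample],
    predict: Callable[[str, str, List[str]], int],
    use_english: bool = True,
) -> float:
    """Fraction of items where the model selects the correct option.

    `predict` is a placeholder: given an image path, a prompt, and the answer
    options, it returns the index of the chosen option.
    """
    total, correct = 0, 0
    for ex in examples:
        prompt = ex.question_english if use_english else ex.question_local
        choice = predict(ex.image_path, prompt, ex.options)
        correct += int(choice == ex.answer_index)
        total += 1
    return correct / max(total, 1)

# Usage with some collection of CVQAExample records and a model wrapper:
# english_acc = accuracy(dataset, my_model_predict, use_english=True)
# local_acc   = accuracy(dataset, my_model_predict, use_english=False)
```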
HUIZINGA: Yeah.
CHITALE: LLaVA-1.5 turned out to be the best open-source model. One thing to note is that LLaVA-1.5 performs well on a broad set of English VQA benchmarks, but when it comes to cultural understanding, it is a pretty weak model. Additionally, we performed some ablations to understand whether adding location-specific information to the text prompts has any impact, but we found that it does not result in any significant performance improvement. We also performed a category-wise analysis. As mentioned, there are 10 categories to which these images belong, and we observed that some categories, like people and everyday life, consistently showed higher accuracy across a wide range of models. This is likely due to the abundance of human-activity data in the training datasets. However, when it comes to niche categories like cooking and food or pop culture, which are much harder, especially in local languages, these models struggle. These categories involve very diverse cultural contexts where models still need to improve.
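A category-wise breakdown like the one described here could be computed with a small extension of the same sketch; again, this assumes the illustrative schema above rather than the released evaluation code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

def accuracy_by_category(
    examples: Iterable[CVQAExample],
    predict: Callable[[str, str, List[str]], int],
    use_english: bool = True,
) -> Dict[str, float]:
    """Per-category accuracy, to see which cultural topics a model struggles with."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        prompt = ex.question_english if use_english else ex.question_local
        choice = predict(ex.image_path, prompt, ex.options)
        correct[ex.category] += int(choice == ex.answer_index)
        total[ex.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```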
HUIZINGA: What impact will this work have outside the lab, in the real world?
CHITALE: CVQA is important because it fills a fundamental gap in how we evaluate vision-language and multimodal models today. While proprietary models are making impressive progress, open-source models, which are more accessible and easier to deploy, lag significantly behind in terms of cultural awareness and safety. CVQA provides a much-needed benchmark to help us identify these gaps in the first place; in order to close them, we must first identify them, and this benchmark lets us measure whether or not we are making progress. For the real world, this benchmark has far-reaching implications. Models that understand culture are not only better technically, but they would also create interactions that are much more engaging, natural, and safe for users from diverse backgrounds. The benchmark opens up a whole new axis of improvement: cultural awareness and linguistic diversity. By probing a model's ability to address culturally nuanced questions, CVQA ensures that researchers and developers think beyond accuracy and also focus on cultural awareness and inclusiveness before putting these models into production.
HUIZINGA: Pranjal, what are the unanswered questions or unresolved issues in this area, and what do you plan to do about them?
CHITALE: So, although CVQA takes a step toward accounting for cultural and linguistic diversity, there is still much to explore in this area. This dataset covers only 31 languages and cultures, which is just a subset of the incredible diversity that exists globally. Many languages and cultures remain underrepresented, including some that are endangered or have limited digital resources, so expanding CVQA to include more of these languages would be a natural next step. Second, CVQA focuses only on single-turn question-answer pairs, but in reality, human interaction is often conversational and multi-layered. A multi-turn version of CVQA could better simulate real-world use cases and challenge models to maintain cultural and contextual awareness across extended dialogues. Another interesting area is personalization. It would be very interesting to teach models to adapt in real time to a user's cultural context, preferences, or even regional nuances. This remains a significant challenge, but this benchmark could help us take a step toward that larger goal.
(MUSIC)
HUIZINGA: Well, Pranjal Chitale, this is extremely important research, and thank you for joining us today. To our listeners, thanks for listening. If you would like to read more about this paper, you can find it at aka.ms/abstracts. You can also find it on arXiv and on the NeurIPS website. And if you're at NeurIPS, you can go hear about it there too. See you next time on Abstracts!
(MUSIC FADE IN)