In an April 2, 2025 paper, researchers from the Transitional Artificial Intelligence Research Group at UNSW Sydney and the Centre for Artificial Intelligence and Innovation at Pingla Institute, Sydney, evaluated how large language models (LLMs) handle complex, culturally and emotionally rich Indian texts.
The study compared the performance of OpenAI’s GPT-3.5 and GPT-4o, Google’s Gemini, and Google Translate when translating from Sanskrit, Telugu, and Hindi into English.
The researchers examined three texts spanning different genres and eras: the Bhagavad Gita, a foundational text of Hinduism encompassing practices, spirituality, and philosophy; Maha Prasthanam, an anthology of poems; and Tamas, a novel.
Expert human translations of these texts served as benchmarks. The researchers generated translations using LLMs and Google Translate and then compared them against the human references using sentiment and semantic analysis. (Sentiment analysis measures the emotional tone expressed, whereas semantic analysis determines whether the translated text accurately conveys the intended meaning.)
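To make the comparison concrete, here is a minimal, self-contained sketch of the two kinds of checks involved. This is an illustration only, not the authors' actual pipeline: the study used trained sentiment and semantic models, whereas this toy version uses a tiny hand-written sentiment lexicon and word-count cosine similarity as stand-ins.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Crude semantic-overlap proxy: cosine similarity of word-count vectors.
    Real studies would use sentence embeddings instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy sentiment lexicon (illustrative assumption; the paper used trained models).
LEXICON = {"joy": 1, "peace": 1, "sorrow": -1, "grief": -1}

def sentiment_score(text: str) -> int:
    """Sum of lexicon polarities over the tokens in the text."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

human = "the warrior found peace and joy in duty"      # reference translation
machine = "the warrior found peace and happiness in duty"  # candidate translation

print(round(cosine_similarity(human, machine), 2))  # → 0.88
print(sentiment_score(human), sentiment_score(machine))  # → 2 1
```

A gap between the two sentiment scores (here, "happiness" is missing from the toy lexicon) mirrors the kind of emotional-tone drift the study measured; a low cosine score would indicate semantic divergence.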
According to the researchers, capturing the nuanced sentiment and semantic depth embedded in culturally and historically significant works remains a challenge — even for top-tier LLMs.
“LLMs are generally better at translation for capturing sentiments when compared to Google Translate.” — Chandra et al.
“Our findings suggest that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in figurative and philosophical contexts,” they said.
However, they noted that “LLMs are generally better at translation for capturing sentiments when compared to Google Translate.”
The Most Consistent and Reliable Model
Their analysis revealed clear differences in how the models captured sentiment. Specifically, GPT-4o delivered sentiment distributions remarkably close to those of human experts, indicating a strong alignment with human understanding.
Gemini followed but showed some shifts in emotional tone, while Google Translate exhibited more erratic swings — alternating between an overly optimistic and a starkly pessimistic tone — indicating “inconsistency in emotional tone” and raising questions about its reliability in emotionally charged contexts.
In terms of semantic alignment, GPT-4o consistently achieved the highest overall match with human translations for Sanskrit and Hindi. Gemini excelled with Telugu by capturing the rhythmic and figurative language inherent in poetic texts. In contrast, Google Translate tended to provide literal translations, struggling with the layered complexity of both ancient and modern texts.
According to the researchers, GPT-4o is “the most consistent and reliable model” among those tested, as it better maintains the meaning, sentence structure, and context of the original text.
Nonetheless, they acknowledged that human expert translations still retain the deeper poetic and contextual nuances of the original text.
Need for More Culturally Sensitive Systems
The researchers also found that LLMs face particular difficulties with abstract, poetic, and philosophical content. Emotional and metaphoric expressions present further challenges, leading to a "tendency for models to rephrase rather than directly translate poetic language." LLMs also struggle with context-dependent translation; honorifics and formal expressions are often translated inconsistently.
They noted that “while LLM can recognize broad emotional trends, they still exhibit model-specific biases in sentiment interpretation” and generally translate text “without integrating background knowledge from the source material.”
They suggest that providing contextual prompts could improve translation quality by ensuring that the emotional tone better aligns with the original text. They propose that further research should explore whether incorporating historical, cultural, or philosophical context can further improve sentiment preservation.
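One way such contextual prompting might look in practice is sketched below. This is a hypothetical template: the paper does not prescribe a specific prompt format, and the wording, function name, and parameters here are assumptions for illustration.

```python
def build_contextual_prompt(source_text: str, context: str,
                            src_lang: str = "Sanskrit") -> str:
    """Hypothetical prompt that prepends cultural/philosophical background
    before the translation request, so the model can align emotional tone
    with the original. The exact phrasing is an assumption, not the paper's."""
    return (
        f"Background: {context}\n"
        f"Translate the following {src_lang} passage into English, "
        f"preserving its emotional tone and philosophical nuance:\n"
        f"{source_text}"
    )

prompt = build_contextual_prompt(
    source_text="<verse text here>",
    context="A dialogue on duty and detachment from the Bhagavad Gita.",
)
print(prompt)
```

The resulting string would then be sent to the LLM in place of a bare "translate this" request, giving the model the historical or philosophical framing the researchers suggest exploring.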
“While LLM can recognize broad emotional trends, they still exhibit model-specific biases in sentiment interpretation.” — Chandra et al.
“The results of this study could help in the development of more accurate and culturally sensitive translation systems,” the researchers concluded.
Authors: Rohitash Chandra, Aryan Chaudhari, and Yeshwanth Rayavarapu