For translating pediatric discharge instructions, Google Translate and ChatGPT were largely on par with professional translations for Spanish and Brazilian Portuguese, but didn't do so well with Haitian Creole, researchers found.
Professional translations of instructions from English into Haitian Creole outscored the online tools on every measure, significantly so in all but one comparison. Each measure was rated on a 5-point Likert scale: adequacy in preserving information, meaning consistent with the original intent, fluency in readability, and severity of potential harm introduced by the translation.
Google Translate and ChatGPT performed best on Spanish, Ryan Brewster, MD, of Boston Children's Hospital, and colleagues reported in Pediatrics.
For Spanish, Google Translate and ChatGPT actually scored higher than professional translations for adequacy (4.6 and 4.7 vs 4.2, respectively, P=0.003 and P≤0.001), fluency (4.5 and 4.7 vs 4.2, P=0.007 and P≤0.001), and meaning (4.7 and 4.8 vs 4.4, P=0.011 and P<0.001). Ratings for severity of potential harm differed significantly between professional translations and ChatGPT (4.6 vs 4.8, P=0.026) but not between professional translations and Google Translate (P=0.164).
For Brazilian Portuguese, adequacy ratings were higher for ChatGPT than for professional translations (4.7 vs 4.5, P=0.025) but all other measures were similar across translation sources.
Overall, evaluators preferred Google Translate and ChatGPT in the majority of cases. Professional translations were preferred over the online tools in only 15% of cases for Spanish discharge instructions, 43% for Portuguese, and 48% for Haitian Creole.
"The differential results for Haitian Creole and other languages of limited diffusion may expose already at-risk patient populations to excess clinical harm and poorer quality of care," Brewster said in an email. "Additional steps, such as incorporating a human translator to review translation outputs, may help mitigate associated risks."
If patients and their families don't fully understand discharge instructions, they risk missing follow-up appointments and misunderstanding medication recommendations, the researchers noted. That can contribute to more frequent unplanned healthcare utilization, adverse patient safety events, and increased healthcare costs, they said.
"Many institutions employ in-person or virtual interpreter services and develop standardized clinical documents in different languages," but they are not universally available across practice environments, languages, and clinical circumstances, Brewster's group noted.
Machine translation engines are rapidly evolving and hold potential as a solution. However, "ensuring equity, safety, and quality will require a systematic understanding of their merits and limitations," Brewster and colleagues further wrote.
The group examined performance on the most commonly spoken languages other than English within their health systems: Spanish, Brazilian Portuguese, and Haitian Creole.
Overall, 20 standardized discharge instructions for pediatric conditions were translated into those languages by professional translation services, Google Translate, and ChatGPT-4.
Translations were rated by nine multilingual evaluators (77.8% female) who were blinded to the source of translation. All of the evaluators had at least professional working proficiency in the languages, and two-thirds of the group had more than 10 years of residence in the U.S.
Professional translations into Haitian Creole "consistently outscored" those by Google Translate and ChatGPT (scores listed as professional vs Google Translate and ChatGPT, respectively):
- Adequacy (4.5 vs 4.0 and 3.9, P=0.005 and P<0.001)
- Fluency (3.9 vs 3.6 and 3.4, P=0.167 and P=0.008)
- Meaning (4.0 vs 3.7 and 3.6, P=0.028 and P=0.009)
- Severity of potential harm (4.5 vs 4.0 and 3.8, P=0.014 and P<0.001)
The study also showed discrepancies across languages in the potential for translations to result in clinical harm or delay, Brewster and colleagues reported. Clinically meaningful errors in Haitian Creole translations were found in 8.3% done by professionals compared with 23.3% with Google Translate (P=0.024) and 33.3% with ChatGPT (P<0.001). Potentially harmful translations were less common for Spanish (professional 5%, Google Translate 6.7%, and ChatGPT 3.3%) and Portuguese (6.7%, 16.7%, and 5%), with no statistically significant differences across translation sources.
Among the study's limitations, the assessment of translations "relied on bilingual clinicians whose responses may not be representative of the average patient or level of health literacy," Brewster and colleagues noted. Furthermore, standardized discharge instructions "do not necessarily reflect variations in the style and readability of free text content," they added.
"We are currently working on a follow-up study to capture a more holistic perspective of machine translation performance that features not only clinicians, but also translators and patients and families themselves," Brewster said.
Disclosures
Brewster received grant funding from Boston Children's Hospital and Boston Medical Center outside the submitted work. A co-author reported serving as a health equity consultant to the New York City Department of Health and Mental Hygiene's Office of the Chief Medical Officer, advising the Rise to Health Coalition, serving as a commissioner on The Lancet's Commission on Antiracism and Solidarity, and receiving grant funding from Boston Children's Hospital and Boston Medical Center.
Primary Source
Pediatrics
Brewster RCL, et al "Performance of ChatGPT and Google Translate for pediatric discharge instruction translation" Pediatrics 2024; DOI: 10.1542/peds.2023-065573.