Evaluating Large Language Models on Aerospace Medicine Principles

Kyle D. Anderson, MD, PhD; Cole A. Davis, BS; Shawn M. Pickett, BS, MBA; Michael S. Pohlen, MD

Wilderness & Environmental Medicine · April 2025 · DOI: 10.1177/10806032251330628

Abstract

Large language models (LLMs) hold immense potential as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting. Method: This work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced, as well as a custom retrieval-augmented generation (RAG) LLM, on factual knowledge and clinical reasoning in aerospace medicine. The models were assessed for their consistency and reasoning using board-style questions. Results: ChatGPT-4, Gemini Advanced, and the RAG LLM achieved high, yet varied, scores. Nevertheless, all models exhibited gaps in factual knowledge and inconsistencies that could prove harmful. Quantitatively, the RAG LLM achieved the highest accuracy. Conclusion: There is considerable promise for LLM use in autonomous medical operations in spaceflight, but the current limitations of these models indicate that further advancements in model training and healthcare-specific fine-tuning are required.

PubMed: 40289627