Dear Editor,
We thank our colleagues for their valuable comments [1] on our study [2]. Below, we address the points raised in the context of our study’s aim and design choices, and we outline potential improvements for future research.
The primary aim of our work was to provide an objective baseline assessment of a general-purpose large language model (LLM) in the most straightforward and realistic “out-of-the-box,” browser-based scenario. This approach was intentionally chosen to highlight the current limitations of model architectures and training data, particularly the absence of radiology-specific pretraining. Recent reviews [3-5] have highlighted that LLMs continue to face challenges in data scarcity, coarse visual embeddings, and limited explainability, all of which contribute to difficulties in capturing subtle signal or texture patterns.
We agree that radiologists base their decisions on volumetric, multiplanar, and multiphase image series. Our study, however, was designed as a standardized and ethically safe browser-based scenario to establish a “minimum requirement” baseline. Future studies will incorporate sequential and volumetric inputs, as well as multiphase evaluation. Of course, this will require LLMs to become technically capable of ingesting larger and more complex inputs.
To isolate image-based signal interpretation, patient history was excluded, allowing us to measure the model’s pure image-based performance. In clinical reality, the integration of imaging and history is essential. Yet, as Bulut et al. [6] recently demonstrated, even when clinical findings were provided, overall accuracy remained low. Although their work focused on pneumothorax detection, these results still indicate that the performance of current models remains questionable, even in a clinical context.
Moving forward, we believe improvements should include: (i) the ingestion of sequential/volumetric and multiphase data; (ii) radiology-specific pretraining and/or adapter fine-tuning; (iii) structured prompt libraries and chain-of-thought reasoning; (iv) the integration of clinical metadata; and (v) blinded multicenter studies comparing different levels of radiologist expertise. Industry collaboration will be crucial to achieving these goals.
In conclusion, although many of the limitations highlighted by our colleagues were already acknowledged in our original manuscript, we view these comments as an opportunity to expand the scope of our subsequent studies and to establish a concrete roadmap for future research.


