  • Letter to the Editor
  • Open access

Visual Turing test is not sufficient to evaluate the performance of medical generative models

The Original Article was published on 30 November 2022

To the Editor,

We read with great interest the article by Wang et al. [1], reporting that generative adversarial networks (GANs) could generate synthetic ground glass opacities (GGOs) in computed tomography. While we appreciate their ambitious research to advance clinical radiology, we feel that the performance evaluation of the GANs is insufficient for their aim.

In their study, the authors stated that the model performance was evaluated by both subjective and objective approaches, namely the visual Turing test (VTT) and the distribution of radiomic features. We agree that VTT is a suitable approach to assess the realism of synthesized medical images [2], but a low VTT score does not guarantee the diversity of the generated data; it tells us only that the images look real. As the authors acknowledged as a limitation in the “Discussion” section, the distributions of about 40% of the radiomic features (e.g., NGTDM coarseness) differed significantly between generated and original images. We therefore suspect that their generative model may be able to produce only biased images because of the so-called mode collapse phenomenon [3], in which a generator covers only part of the real data distribution. If this were the case, it would diminish the usefulness of data augmentation for classification tasks.
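As a rough illustration of what a feature-level comparison can reveal, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, that is, the largest gap between two empirical distribution functions, for a single radiomic feature. This is not the statistical test used by Wang et al., which is not specified here; the function name ks_statistic and all feature values are hypothetical and chosen only to mimic a generator that has collapsed onto one value.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs
    of samples a and b (0.0 = identical, 1.0 = fully separated).
    """
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        # Step both empirical CDFs past the next smallest observed value.
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical values of one radiomic feature (illustrative only).
real_feature = [0.1, 0.2, 0.3, 0.4, 0.5]
generated_feature = [0.3, 0.3, 0.3, 0.3, 0.3]  # every sample identical

print(ks_statistic(real_feature, real_feature))
print(ks_statistic(real_feature, generated_feature))
```

Each generated value here is individually plausible, since it lies within the real range, yet the statistic is clearly non-zero because the spread of the real feature is not reproduced. A per-image realism judgement such as the VTT cannot capture this.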

It is true that there is no single universal metric to assess the model performance and the quality of generated data; therefore, we need to combine several indicators, such as inception score, Fréchet inception distance, and geometry score [4, 5]. In addition to these, the image quality can also be evaluated quantitatively by NIQE, PIQE, and BRISQUE scores, as Oyelade and colleagues have demonstrated for mammography images [6]. As a practical matter, the images presented in the article are so small in size and resolution that readers cannot fully appreciate what kind of images the GAN model has produced.
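To make one of these indicators concrete, the following minimal sketch computes the inception score from per-image class probabilities, using its usual definition as the exponential of the mean Kullback-Leibler divergence between each conditional distribution p(y|x) and the marginal p(y). In the original formulation the probabilities come from a pretrained Inception network; here they are hypothetical hand-written values, and which classifier would be appropriate for GGO images is left open as an assumption.

```python
import math


def inception_score(probs):
    """Inception score from class probabilities.

    probs: list of per-image probability lists p(y|x), each summing to 1.
    Returns exp(mean over images of KL(p(y|x) || p(y))).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all generated images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        mean_kl += sum(pj * math.log(pj / marginal[j])
                       for j, pj in enumerate(p) if pj > 0)
    mean_kl /= n
    return math.exp(mean_kl)


# Hypothetical classifier outputs for six generated images, three classes.
collapsed = [[1.0, 0.0, 0.0]] * 6            # every image falls in one class
diverse = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]] * 2              # classes covered evenly

print(inception_score(collapsed))
print(inception_score(diverse))
```

In this toy setting the collapsed set scores the minimum value of 1, while the evenly spread set approaches the number of classes, which is why the score is sensitive to a lack of diversity across classes. It does not detect collapse within a single class, which is one reason to combine it with the other indicators named above.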

In summary, we believe that the authors need to provide more example images of the generated GGO and evaluate their GAN in several other ways to ensure the quality of data synthesis.

Availability of data and materials

Not applicable.


References

  1. Wang Z, Zhang Z, Feng Y et al (2022) Generation of synthetic ground glass nodules using generative adversarial networks (GANs). Eur Radiol Exp 6:59.

  2. Higaki A, Kawada Y, Hiasa G, Yamada T, Okayama H (2022) Using a visual Turing test to evaluate the realism of generative adversarial network (GAN)-based synthesized myocardial perfusion images. Cureus 14:e30646.

  3. Bau D, Zhu J-Y, Wulff J et al (2019) Seeing what a GAN cannot generate. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, pp 4501–4510.

  4. Shmelkov K, Schmid C, Alahari K (2018) How good is my GAN? In: Computer Vision – ECCV 2018. Springer, Cham, pp 218–234.

  5. Borji A (2019) Pros and cons of GAN evaluation measures. Comput Vis Image Underst 179:41–65.

  6. Oyelade ON, Ezugwu AE, Almutairi MS, Saha AK, Abualigah L, Chiroma H (2022) A generative adversarial network for synthetization of regions of interest based on digital mammograms. Sci Rep 12:1–30.


Funding

The authors declare that they received no external funding concerning this article.

Author information

Authors and Affiliations



Contributions

AH conceptualized and drafted the manuscript. SY reviewed and revised the manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Akinori Higaki.

Ethics declarations

Ethics approval and consent to participate

This article is based on previously conducted studies and does not contain any studies with human participants or animals performed by the authors.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Yamamoto, S., Higaki, A. Visual Turing test is not sufficient to evaluate the performance of medical generative models. Eur Radiol Exp 7, 31 (2023).
