Visual Turing test is not sufficient to evaluate the performance of medical generative models
European Radiology Experimental 7, 31 (2023)
To the Editor,
We read with great interest the article by Wang et al. [1], reporting that generative adversarial networks (GANs) can generate synthetic ground glass opacities (GGOs) in computed tomography. While we appreciate their ambitious research to advance clinical radiology, we feel that the performance evaluation of the GANs is insufficient for their aim.
In their study, the authors stated that the model performance was evaluated by both subjective and objective approaches, namely the visual Turing test (VTT) and the distribution of radiomic features. We agree that VTT is a suitable approach to assess the realism of synthesized medical images [2], but a low VTT score does not guarantee the diversity of the generated data; it tells us only that the images look real. As the authors acknowledged as a limitation in the “Discussion” section, about 40% of the distributions of the radiomic features (e.g., NGTDM coarseness) were significantly different between generated and original images. Therefore, we suspect that their generative model may only be able to produce biased images due to the so-called mode collapse phenomenon [3]. If this were the case, it would diminish the usefulness of the data augmentation for classification tasks.
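Such distributional divergence between real and generated images can be screened systematically with a two-sample test applied per radiomic feature. A minimal sketch using SciPy's Kolmogorov-Smirnov test follows; the dictionary layout of the feature values and the feature name shown are our illustrative assumptions, not the authors' pipeline:

```python
import numpy as np
from scipy import stats

def compare_feature_distributions(real_feats, synth_feats, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test per radiomic feature.

    real_feats, synth_feats: dicts mapping feature name -> 1-D array
    of values (one value per image). Returns the names of features
    whose distributions differ significantly between the two sets.
    """
    differing = []
    for name in real_feats:
        # ks_2samp compares the empirical CDFs of the two samples
        _, p_value = stats.ks_2samp(real_feats[name], synth_feats[name])
        if p_value < alpha:
            differing.append(name)
    return differing
```

A high proportion of flagged features, as in the roughly 40% reported by the authors, would be one concrete signal that the generator is not reproducing the full variability of the source data.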
It is true that there is no single universal metric to assess the model performance and the quality of generated data; therefore, we need to combine several indicators, such as inception score, Fréchet inception distance, and geometry score [4, 5]. In addition to these, the image quality can also be evaluated quantitatively by NIQE, PIQE, and BRISQUE scores, as Oyelade and colleagues have demonstrated for mammography images [6]. As a practical matter, the images presented in the article are so small in size and resolution that readers cannot fully appreciate what kind of images the GAN model has produced.
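Of these indicators, the Fréchet inception distance is perhaps the most widely used. Once feature vectors have been extracted from real and generated images (conventionally Inception-v3 pool activations, though the choice of extractor is up to the evaluator), it reduces to a closed-form distance between two Gaussians fitted to the features. A minimal NumPy/SciPy sketch, with the feature-extraction step assumed to have been done elsewhere:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_a, feat_b):
    """Fréchet distance between two sets of feature vectors.

    feat_a, feat_b: arrays of shape (n_samples, n_features), e.g.
    activations from a pretrained feature extractor. Fits a Gaussian
    to each set and returns
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2*sqrt(C_a @ C_b)).
    """
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    # sqrtm can return small imaginary parts from numerical error
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical feature sets yield a distance near zero, while a mode-collapsed generator, whose features cluster in a narrow region, would show an inflated distance even if each individual image passes the VTT.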
In summary, we believe that the authors need to provide more example images of the generated GGO and evaluate their GAN in several other ways to ensure the quality of data synthesis.
References
1. Wang Z, Zhang Z, Feng Y et al (2022) Generation of synthetic ground glass nodules using generative adversarial networks (GANs). Eur Radiol Exp 6:59. https://doi.org/10.1186/s41747-022-00311-y
2. Higaki A, Kawada Y, Hiasa G, Yamada T, Okayama H (2022) Using a visual Turing test to evaluate the realism of generative adversarial network (GAN)-based synthesized myocardial perfusion images. Cureus 14:e30646. https://doi.org/10.7759/cureus.30646
3. Bau D, Zhu J-Y, Wulff J et al (2019) Seeing what a GAN cannot generate. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp 4501–4510. https://doi.org/10.1109/ICCV.2019.00460
4. Shmelkov K, Schmid C, Alahari K (2018) How good is my GAN? In: Computer Vision – ECCV 2018. Springer, pp 218–234. https://doi.org/10.1007/978-3-030-01216-8
5. Borji A (2019) Pros and cons of GAN evaluation measures. Comput Vis Image Underst 179:41–65. https://doi.org/10.1016/j.cviu.2018.10.009
6. Oyelade ON, Ezugwu AE, Almutairi MS, Saha AK, Abualigah L, Chiroma H (2022) A generative adversarial network for synthetization of regions of interest based on digital mammograms. Sci Rep 12:1–30. https://doi.org/10.1038/s41598-022-09929-9
Funding
The authors declare that they received no external funding concerning this article.
Ethics approval and consent to participate
This article is based on previously conducted studies and does not contain any studies with human participants or animals performed by the authors.
Competing interests
The authors declare that they have no competing interests.
Cite this article
Yamamoto, S., Higaki, A. Visual Turing test is not sufficient to evaluate the performance of medical generative models. Eur Radiol Exp 7, 31 (2023). https://doi.org/10.1186/s41747-023-00347-8