This retrospective study was approved by the institutional research ethics board. The requirement for patient informed consent was waived. The authors had full control of the data and the information submitted for publication.
Study cohort
Data were extracted from our large departmental electronic database of de-identified computed tomography (CT) images from two university hospitals. Contrast-enhanced CT scans were acquired between 2010 and 2015 with 0.7–1.2 mm pixel spacing, 1.25–5 mm slice thickness, 120 kVp, and various convolution kernels, on scanners from different manufacturers (General Electric, Milwaukee, USA; Siemens, Erlangen, Germany; Philips, Best, Netherlands).
Two investigators selected the tumours to reflect the variability in location, size, and shape of liver and lung metastases, CT acquisition, reconstruction, and body mass, which all affect the contrast-to-noise ratio and therefore the ease of determination of tumour borders. However, tumours were selected irrespective of primary tumour type or other patient demographics. The number of segmentations was calculated to evaluate the precision of manual segmentation depending on reader experience, on different organs, and using two different segmentation methods. The number of tumours and readers involved in this study was adjusted to ensure sufficient statistical power and a total number of image segmentations greater than 500.
Image analysis
Datasets were imported into OsiriX, version 5.9 (OsiriX, Geneva, Switzerland), an open-source DICOM image analysis suite and picture archiving and communication system workstation designed for the Apple Macintosh platform. Twenty readers independently analysed CT data from 13 pre-identified, untreated index metastases (six liver and seven lung) using two different methods. Ten readers were radiologists with 1 to 25 years of experience (group 1) and ten readers were scientists with basic knowledge of image segmentation (group 2).
Method 1 consisted of selecting, for a given tumour, the slice on which the diameter could be measured according to RECIST or WHO methods, followed by manual 2D contouring of the tumour on this slice. Although this is not representative of typical radiological practice, the maximal diameter was extracted automatically from this contour to simplify the experimental design. For patients with multiple tumours, the approximate tumour location was indicated by a range of slices in which the tumour could be found.
Method 2 consisted of performing the same manual contouring, but the readers were aware of the slice number and tumour location. Method 2 was performed after method 1. Both groups performed both methods. Regions of interest (ROIs) were exported to the Federative Platform for Research in Computer Science and Mathematics (PlaFRIM). The PlaFRIM experimental test bed was used to perform the statistical analysis.
Statistical analysis
Only adequate segmentations were selected for subsequent evaluation. A segmentation was considered inadequate if it was performed at least two slices away from the slice most often selected by all readers, or if it did not cover only the pre-identified nodule; such segmentations were excluded from the analysis. A χ² test was used to test for independence. The mean, minimum and maximum values, and standard deviation (SD) of the tumour diameter and area were obtained according to organ, group of readers, and method. To minimize the effect of tumour size, measurement variability was expressed as a percentage of the mean diameter or area measurement; that is, the mean SD was divided by the mean diameter or area (mean SD/diameter or area). Mean values were compared using the Wilcoxon signed-rank test.
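The exclusion rule and the size-normalized variability described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the `(slice index, diameter)` input format, and the example readings are hypothetical, ties for the most frequently selected slice are resolved arbitrarily, and the second exclusion criterion (contours not restricted to the pre-identified nodule) is not modelled.

```python
from collections import Counter
from statistics import mean, stdev


def adequate_segmentations(readings, max_slice_offset=1):
    """Keep readings drawn within max_slice_offset slices of the modal slice.

    readings: list of (slice_index, measurement) pairs, one per reader.
    A reading two or more slices away from the most often selected slice
    is treated as inadequate, hence the default offset of 1.
    """
    # Most frequently selected slice (assumption: ties broken by first seen).
    modal_slice = Counter(s for s, _ in readings).most_common(1)[0][0]
    return [(s, m) for s, m in readings if abs(s - modal_slice) <= max_slice_offset]


def relative_variability(values):
    """Sample SD expressed as a percentage of the mean measurement."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical readings for one tumour: (slice index, diameter in mm).
readings = [(42, 20.0), (42, 22.0), (43, 21.0), (42, 19.0), (45, 30.0)]

kept = adequate_segmentations(readings)      # the reading on slice 45 is dropped
diameters = [d for _, d in kept]
cv_percent = relative_variability(diameters)
print(f"{len(kept)} adequate readings, SD/mean = {cv_percent:.1f}%")
```

Dividing the SD by the mean makes the variability of small and large tumours comparable on a single percentage scale, which is the stated purpose of the normalization.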
To determine interobserver agreement, the between-subject SD and within-subject SD of each variable were compared. Intraclass correlation coefficients (ICCs) were calculated based on repeated measures ANOVA [12, 13]. ICC results were interpreted according to the following criteria: poor (ICC < 0.50), moderate (0.50 ≤ ICC < 0.75), good (0.75 ≤ ICC ≤ 0.90), and excellent (ICC > 0.90).
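As an illustration of how an ICC can be derived from ANOVA mean squares, the sketch below computes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The text does not state which ICC form was used, so this choice is an assumption, as are the function names, the handling of exact boundary values (0.50, 0.75, 0.90) in `interpret_icc`, and the example table.

```python
def icc_2_1(table):
    """ICC(2,1) from a complete tumours-by-readers table.

    table[i][j] is the measurement of tumour i by reader j.
    Assumed form: two-way random effects, absolute agreement, single rater.
    """
    n = len(table)        # number of tumours (subjects)
    k = len(table[0])     # number of readers (raters)
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(table[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-tumour mean square
    msc = ss_cols / (k - 1)                 # between-reader mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(icc):
    """Map an ICC to the qualitative bands (boundary handling is assumed)."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# Hypothetical diameters (mm): 4 tumours measured by 3 readers.
diameters = [
    [12.0, 13.0, 12.5],
    [25.0, 24.0, 26.0],
    [18.0, 18.5, 17.5],
    [31.0, 30.0, 32.0],
]
icc = icc_2_1(diameters)
print(f"ICC(2,1) = {icc:.3f} ({interpret_icc(icc)})")
```

In practice a statistical package would report the ICC together with its confidence interval; the hand computation here only makes the role of the three mean squares explicit.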
The SD was considered to reflect the variation of segmentation. The mean SD of each diameter or area was plotted against the respective diameter or area of the tumours in lung and liver. A regression analysis was performed to derive the 95% confidence interval (95% CI) of diameter and area in each organ and for each size. This 95% CI reflects the uncertainty of segmentation whatever the diameter or area of the tumour. The same 95% CI was also applied to the limits of the RECIST 1.1 criteria for PD (+20%) and PR (−30%), on either diameter or area. The purpose was to detect overlap between the 95% CI of diameter or area and the limits of PD or PR. RECIST was extended to area (A) by adapting the limits of PD and PR using the formula A = πr². Where such an overlap was identified, a cut-off value of diameter or area was determined at its intersection. A p value less than 0.050 was considered to indicate a significant difference. All analyses were conducted using Stata 12.0 (StataCorp, College Station, Texas, United States).
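The adaptation of the PD and PR limits from diameter to area can be made concrete with a short Python sketch. It assumes a circular lesion whose size changes uniformly, so that a relative change in diameter by a factor f changes the area by f² (from A = πr²); the constant and function names are hypothetical.

```python
import math

PD_DIAMETER_CHANGE = 0.20    # RECIST 1.1 progressive disease: +20% in diameter
PR_DIAMETER_CHANGE = -0.30   # RECIST 1.1 partial response: -30% in diameter


def circle_area(diameter):
    """Area of a circle with the given diameter, A = pi * r^2."""
    return math.pi * (diameter / 2.0) ** 2


def area_change_from_diameter_change(rel_change):
    """Relative area change implied by a relative diameter change.

    Assumes a circular lesion: scaling the diameter by (1 + rel_change)
    scales the area by (1 + rel_change) squared.
    """
    return (1.0 + rel_change) ** 2 - 1.0


pd_area_change = area_change_from_diameter_change(PD_DIAMETER_CHANGE)
pr_area_change = area_change_from_diameter_change(PR_DIAMETER_CHANGE)
print(f"PD limit on area: {pd_area_change:+.0%}")
print(f"PR limit on area: {pr_area_change:+.0%}")
```

Under this assumption the +20% diameter limit corresponds to an area increase of about 44%, and the −30% limit to an area decrease of about 51%.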