Denoising diffusion-based MRI to CT image translation enables automated spinal segmentation

Background: Automated segmentation of spinal magnetic resonance imaging (MRI) plays a vital role both scientifically and clinically. However, accurately delineating posterior spine structures is challenging.

Methods: This retrospective study, approved by the ethical committee, involved translating T1-weighted and T2-weighted images into computed tomography (CT) images in a total of 263 pairs of CT/MR series. Landmark-based registration was performed to align image pairs. We compared two-dimensional (2D) paired methods — Pix2Pix, denoising diffusion implicit models (DDIM) image mode, and DDIM noise mode — with unpaired methods (SynDiff, contrastive unpaired translation) for image-to-image translation, using the peak signal-to-noise ratio as the quality measure. A publicly available segmentation network segmented the synthesized CT datasets, and Dice similarity coefficients (DSC) were evaluated on in-house test sets and the "MRSpineSeg Challenge" volumes. The 2D findings were extended to three-dimensional (3D) Pix2Pix and DDIM.

Results: The 2D paired methods and SynDiff exhibited similar translation performance and DSC on paired data. DDIM image mode achieved the highest image quality. SynDiff, Pix2Pix, and DDIM image mode demonstrated similar DSC (0.77). For craniocaudal axis rotations, at least two landmarks per vertebra were required for registration. The 3D translation outperformed the 2D approach, resulting in improved DSC (0.80) and anatomically accurate segmentations with higher spatial resolution than that of the original MRI series.

Conclusions: Registration with two landmarks per vertebra enabled paired image-to-image translation from MRI to CT, which outperformed all unpaired approaches. The 3D techniques provided anatomically correct segmentations, avoiding underprediction of small structures like the spinous process.

Relevance statement: This study addresses the unresolved issue of translating spinal MRI to CT, making CT-based tools usable for MRI data. It generates whole-spine segmentation, previously unavailable in MRI, a prerequisite for biomechanical modeling and feature extraction for clinical applications.

Key points:
• Unpaired image translation falls short in converting spine MRI to CT effectively.
• Paired translation requires registration with at least two landmarks per vertebra.
• Paired image-to-image translation enables segmentation transfer to other domains.
• 3D translation enables super-resolution from MRI to CT.
• 3D translation prevents underprediction of small structures.

Supplementary Information: The online version contains supplementary material available at 10.1186/s41747-023-00385-2.


Background
The different contrasts of CT and MRI offer distinct clinical utilities. Generally, segmentation is a prerequisite to automatically extract biomarkers, especially in large cohorts like the German National Cohort (GNC) [1] or the UK Biobank [2]. While the extraction of the precise bone structure of the spine from CT is publicly available [3, 4], neither a segmentation nor an annotated ground truth dataset for the whole spine including the posterior elements is currently available for MRI. Accurate segmentations are not only vital for scientific studies but also enable the exact localization of abnormalities in clinical routine. Unlike CT, MRI provides additional information about bone marrow edema-like changes, intervertebral disc degeneration, degenerative endplate changes, ligaments, joint effusions, and the spinal cord.
Robust and precise segmentation and quantification of such spinal structures is a prerequisite, e.g., for evaluating large epidemiologic studies or enabling automated reporting.
An alternative to labor-intensive manual annotations is the potential use of image-to-image translation to extract bony structures. This approach may overcome challenges like partial volume effects (e.g., at the spinous process) and subtle signal differences (e.g., of vertebral endplates and ligaments in MRI), which are easily distinguishable in high-resolution CT but not in MRI.
Image-to-image translation involves transforming images from one domain to another, and several deep learning methods have been employed for this purpose, including Pix2Pix [5], CycleGAN [6], and contrastive unpaired translation (CUT) [7]. These methods have been used in various studies to generate missing sequences, translate to different domains, enhance image quality, and improve resolution [8]. In the medical domain, these methods have shown success in rigid structures like the brain, head, and pelvis, where registration guarantees that both domains have similar tissue distributions and anomalies [8]. However, if biases are not accounted for, the model may hallucinate new structures to fit both distributions [9]. Due to this difficulty, translating warpable structures like the spine is less explored in the literature. Some successful implementations have shown that translated images can be similar to the target images and might mislead medical experts [10–14].
However, none of these works have focused on using translations for downstream tasks, such as segmentations in the output domain.
This study aimed to develop and compare different image translation networks for pretrained CT-based segmentation models when applied to MRI datasets (Figure 1). The primary focus was on segmenting the entire spine, with special attention to accurately translating the posterior spine structures, as they pose challenges in MRI delineation. We compared GAN-based approaches [5, 7] with new denoising diffusion models [15–17]. Denoising diffusion models function fundamentally differently from GANs: they add and remove noise from an image instead of relying on the discriminator–generator zero-sum game of GANs. In the computer vision domain, denoising diffusion models have outperformed GANs in various tasks, including upscaling, inpainting, image restoration, and paired image-to-image translation [18]. While diffusion has been applied to medical image translation tasks in a limited number of papers [17, 19–22], we adapted conditional denoising diffusion for paired image-to-image translation in 2D and 3D.
The purpose of this paper was (1) to improve existing image-to-image translation for spine MRI to CT by improving all steps of the process: from data alignment, through the implementation of new denoising diffusion translations and their comparison to GANs, to the extension of our findings to 3D translation; (2) to utilize the translated CT images for automatic segmentation of the entire spine, eliminating the need for a manually labeled segmentation mask in the original MRI domain; and (3) to develop the ability to generate full-spine segmentations on MRI, which are currently not available. Aligned images were used to train our image-to-image models. Finally, the MRIs of the validation and test sets were translated to CT images. Segmentation was performed on the synthesized CT images, and the resulting masks were consequently perfectly aligned with the original MRIs (blue box from left to right; prediction). The generated segmentations can be used to generate additional and new center points to iteratively optimize the registration.

Materials and Methods
In brief, we aligned CT and MR spine images through rigid landmark registration [23]. With this paired data, we trained various image-to-image models to generate CT images. We used an available CT segmentation algorithm [3, 4] to generate vertebral masks in these synthesized CTs for the original MRI. The resulting segmentations were subsequently used to generate new landmarks for new training data (Figure 1). We compared different landmark registrations and 2D models. Finally, we adapted the results into 3D models and assessed the accuracy of the resulting segmentations.

In this retrospective study, we collected sagittal T1w/T2w MR and corresponding CT images of the spine, acquired from the same patient within a week. Approval from the local ethics committee was obtained, and informed consent was waived. Figure 2 illustrates our data selection process. Sixty-two T1w image series (18 males and 44 females; average age 66±15 / 72±13 years) were used from another unpublished in-house study, including five thoracic and 57 lumbar volumes. Additionally, a new dataset of 201 T2w image series (50 males and 42 females; average age 65±20 / 69±17 years) was collected from 92 patients, including 38 cervical, 99 thoracic, and 70 lumbar volumes. Patients with fractures and degenerative changes were included, while those with motion artifacts, metastases, and foreign objects were excluded, because for segmentation purposes it is desirable that the translation suppresses such anomalies. We performed rigid registration of the matching MRIs and CTs based on the centers of mass of the vertebral body and the spinous process (see Figure 1, bottom left). The in-house test, training, and validation sets were split patient-wise across different MR acquisitions of other spine regions. For validation, six T1w and nine T2w MRIs were used, as they could not be aligned with the CTs due to substantially different patient positioning.

Data
We used 172 lumbar MR and segmentation volumes from the MRSpineSeg Challenge (MRSSegClg) [24, 25] for external evaluation of Dice scores. This dataset focuses on the lumbar region, but its segmentation exceeds the bony borders, which calls its validity into question. One subject was used for pipeline development and validation. Validation sets were used to find optimal inference parameters and to avoid overfitting. Since the labels in MRSSegClg encompass not only the bony spine but also adjacent ligaments and soft tissue, we manually adjusted the labels for a subset of 20 volumes to restrict them solely to the bone. We analyzed these subsets as two distinct datasets.

Image preprocessing
CT and MR datasets were rigidly registered [23] using landmarks to facilitate paired image translation. For the single-landmark approach, we selected the center of mass (CM) of the vertebral bodies. To address rotational misalignment around the craniocaudal axis, which was frequently observed, the CM of the spinous processes was added for the two-landmark approach. Landmarks for CT were automatically determined based on vertebral and subregion segmentations (Figure 1). For the T2w images, we manually identified the CM points for both the vertebral bodies and the spinous processes.
The manual centroid selection and ground truth segmentation corrections in the test sets were performed by J. S., a radiologist with three years of experience. To obtain the points for the T1w images, we synthesized them by adapting the T2w-to-CT translation, generating segmentations, and extracting the CM for T1w. Roughly 10 to 20% of the cases initially failed; these were first excluded and then translated with models that were trained on the other T1w images. This proved sufficient to generate all CM points. To assess the impact of additional landmarks on registration, we computed the Dice score with our pipeline on the T2w dataset using the manual ground truth as a reference.
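For illustration, the automatic landmark extraction from a CT subregion segmentation can be sketched as follows; this is a minimal sketch, and the label values are hypothetical, depending on the segmentation scheme used.

```python
import numpy as np
from scipy.ndimage import center_of_mass

def vertebra_landmarks(seg: np.ndarray, body_label: int, spinous_label: int):
    """Centers of mass of the vertebral body and the spinous process,
    used as the two registration landmarks per vertebra.

    `seg` is a label volume; `body_label` and `spinous_label` are
    hypothetical label IDs of the subregion segmentation.
    """
    body_cm = center_of_mass(seg == body_label)
    spinous_cm = center_of_mass(seg == spinous_label)
    return body_cm, spinous_cm
```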
CT images were transformed to the range of [-1, 1] by dividing the values by 1000 HU and clamping outliers to retain air, soft tissue, and bone while suppressing extreme intensities.
Linear rescaling was applied to the MRI data, converting the range from [0, max] to [-1, 1]. To account for varying intensities, MRIs were augmented with a random color jitter (brightness and contrast randomization: 0.2). Image pairs were resampled to a uniform spatial resolution of 1 × 1 mm in the sagittal plane and a slice thickness of 2.5–3.5 mm, as acquired in the MRI. To enlarge the training data by a factor of 10 and to simulate weak scoliosis and unaligned acquisition, we introduced 3D image deformations using the elastic deformation Python plugin [26]. Subsequently, the volumes were sliced into 2D sagittal images, and slices without segmentation were removed. Random cropping was performed to adjust the image size to 256 × 256 pixels.
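A minimal sketch of the intensity normalization described above is given below; the exact clamping and rescaling details beyond the stated [-1, 1] mapping are assumptions.

```python
import numpy as np

def normalize_ct(ct_hu: np.ndarray) -> np.ndarray:
    # Divide by 1000 HU and clamp outliers, retaining air (~-1000 HU),
    # soft tissue (~0 HU), and bone (~+1000 HU) in the [-1, 1] range.
    return np.clip(ct_hu / 1000.0, -1.0, 1.0)

def normalize_mri(mri: np.ndarray) -> np.ndarray:
    # Linearly rescale MRI intensities from [0, max] to [-1, 1].
    return (mri.astype(np.float32) / mri.max()) * 2.0 - 1.0
```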

Models for Image-to-image Translation
To compare various image-to-image translation methods, we implemented two unpaired methods, namely CUT [7] and SynDiff [17], along with three paired methods: Pix2Pix [5], DDIM noise mode, and DDIM image mode. The training process involved unregistered and registered data using both single- and two-landmark approaches. For DDIM, we employed a UNet [26] architecture with convolutional self-attention and embeddings for the timesteps, which we refer to as the self-attention U-network (SA-UNet) [18, 27, 28]. The diffusion mechanism predicted either the noise or the image, with the other computed during inference. A learning rate of 0.00002 was used, and we set the DDIM inference parameter to t = 20 timesteps. The value of η = 1 (noise generation is fully random) was determined by optimizing on the validation set. We compared our approach to CUT [7], Pix2Pix [5], and SynDiff [17]. During our experiments, we performed a hyperparameter search for the reference ResNet and UNet.
Additionally, we introduced a weighted structural similarity index metric (SSIM) loss from a recent paper [29] to update the loss formulation. To further explore the impact of different models and methods, we also tested CUT and Pix2Pix with the SA-UNet. All models were randomly initialized. In our analysis of DDIM, we ablated three inference parameters [16, 30].
However, the results did not show substantial effects, and we have included them in the supplementary material along with brief descriptions of the tested methods.
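As a rough illustration of such a combined loss, a weighted mixture of SSIM and L1 terms might look as follows; the actual weighting scheme of the referenced loss [29] may differ, and the 0.5 weight is an assumption.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party package: pytorch-msssim

def weighted_ssim_l1_loss(fake_ct: torch.Tensor, real_ct: torch.Tensor,
                          ssim_weight: float = 0.5) -> torch.Tensor:
    # L1 term penalizes per-pixel intensity errors.
    l1 = F.l1_loss(fake_ct, real_ct)
    # SSIM term rewards structural agreement; images lie in [-1, 1],
    # hence data_range=2.
    s = ssim(fake_ct, real_ct, data_range=2.0)
    return ssim_weight * (1.0 - s) + (1.0 - ssim_weight) * l1
```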

Image Quality
The evaluation of image quality involved comparing actual and synthesized CT images. To quantify this, we used the peak signal-to-noise ratio (PSNR) metric. In this context, the reference image serves as the "signal," and the divergence between the two images is considered the "noise." A PSNR value above 30 dB indicates that the difference between the two images is imperceptible to the human eye [10]. It is important to note that we did not control the correspondence of soft tissue, as it fell outside the scope of our downstream task.
To handle this in our evaluation, we masked pixels that were further than 10 pixels away from a segmented spine structure, setting them to zero. We also computed the absolute difference (L1), mean squared error (MSE), structural similarity index measure (SSIM), and visual information fidelity (VIFp).
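A minimal sketch of this masked PSNR computation is given below, assuming images normalized to [-1, 1] (data range 2) and a binary spine mask; the distance-based masking implementation is an assumption.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def masked_psnr(real_ct: np.ndarray, fake_ct: np.ndarray,
                spine_mask: np.ndarray, margin: int = 10,
                data_range: float = 2.0) -> float:
    """PSNR restricted to the spine neighborhood.

    Pixels farther than `margin` pixels from the segmented spine are
    set to zero in both images before computing the metric.
    """
    # Distance of every pixel to the nearest spine pixel.
    dist_to_spine = distance_transform_edt(spine_mask == 0)
    near_spine = dist_to_spine <= margin
    a = np.where(near_spine, real_ct, 0.0)
    b = np.where(near_spine, fake_ct, 0.0)
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```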

Downstream task: Segmentation
We utilized a publicly available segmentation algorithm [3,4] on the synthesized CT images.
We then compared the Dice scores globally and on a vertebral level between the synthesized and ground truth segmentations in four datasets: (1, 2) the segmentation ground truth of the in-house datasets, derived from the aligned CT images and manually corrected; (3) the segmentation of the MRSSegClg, which is known to exceed the bony structures; and (4) a manually corrected subset of MRSSegClg [24, 25]. In Figure 3 C/D, the segmentation reaching beyond the bony structures of MRSSegClg is highlighted. For analysis purposes, we excluded structures that the CT segmentation algorithm could not segment, such as the sacrum and partially visualized vertebrae.
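For reference, a per-label Dice similarity coefficient, as used throughout the evaluation, can be computed as follows (a straightforward sketch, not the exact evaluation code).

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
    # Dice similarity coefficient for a single vertebra label.
    p = pred == label
    g = gt == label
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else float("nan")
```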

3D Image Translation with Diffusion
The first implementations of both DDIM and Pix2Pix in 3D, analogous to the 2D approach, did not converge. We thus implemented changes according to the recommendations of Bieder et al. [31]. To optimize GPU memory, we eliminated attention layers and replaced concatenation skip connections with addition operations. Additionally, we introduced a position embedding by concatenating ramps, ranging from zero to one over the original image's full dimensions, into the input. The training was done on 3D patches, and our approach used a patch size of 128 × 128 × 32, where the left/right side was limited to 32 pixels due to the image shape. This setup is "fully convolutional," which means that during inference an image of any size can be computed by the network as long as the sides are divisible by 8. To the best of our knowledge, this represents the first 3D image-to-image translation with diffusion. Since 3D translation requires including the left/right direction, we resampled all images to 1 mm isotropic.
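The position embedding can be sketched as follows: each 3D patch receives three extra channels holding its normalized coordinates within the full volume. This is a sketch under our reading of the description; function and argument names are ours, not from the original implementation.

```python
import torch

def add_position_ramps(patch: torch.Tensor, patch_origin, full_shape) -> torch.Tensor:
    """Concatenate three coordinate channels (ramps from 0 to 1 over the
    *full* volume extent) to a 3D patch of shape (B, C, D, H, W).

    `patch_origin` is the (z, y, x) index of the patch's first voxel in
    the full volume; `full_shape` is the full volume's (D, H, W).
    """
    b, _, d, h, w = patch.shape
    z0, y0, x0 = patch_origin
    # Per-axis ramps, normalized by the full volume extent.
    zz = torch.arange(z0, z0 + d) / max(full_shape[0] - 1, 1)
    yy = torch.arange(y0, y0 + h) / max(full_shape[1] - 1, 1)
    xx = torch.arange(x0, x0 + w) / max(full_shape[2] - 1, 1)
    grid = torch.stack(torch.meshgrid(zz, yy, xx, indexing="ij"))  # (3, D, H, W)
    grid = grid.unsqueeze(0).expand(b, -1, -1, -1, -1).to(patch)
    return torch.cat([patch, grid], dim=1)
```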

Statistical Analysis and Software
We employed a paired t-test to assess the significance of PSNR and Dice score differences between models. To achieve a fixed size of 256 × 256 pixels for assessing image quality, we used one crop per image slice. When reporting differences across multiple experiments, we present the worst (i.e., highest) p-value. We skip significance calculations for the other image quality metrics because the results are redundant. For 3D data, we pad the test data, and the 3D models generate 1 mm isotropic volumes, which are later resampled to the original MRI size.
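A brief sketch of this reporting convention, using SciPy's paired t-test; the aggregation helper is ours.

```python
from scipy import stats

def worst_p_value(scores_a, other_models_scores):
    # Paired t-tests of model A against each competing model on the same
    # cases; report the worst (highest) p-value across the comparisons.
    return max(stats.ttest_rel(scores_a, scores_b).pvalue
               for scores_b in other_models_scores)
```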

Results

Influence of Rigid Registration
Networks trained on unregistered data were incapable of learning the difference between soft tissue and bone. During our early testing, we noticed that most methods could correctly identify the vertebral body, but translating the posterior structures was impossible. Especially the spinous process was often omitted in the translation, as shown in Figure 4. "One point per vertebra" registration was sufficient for the vertebral body translation, but the spine could still rotate around the cranio-caudal axis. This caused the spinous process to disappear in translated images (see Figure 4 A/B). Additionally, confusion between epidural fat and bone shifted the entire posterior elements towards the spinal cord. Overcoming this issue required accounting for rotation by adding additional points to the rigid registration (Figure 4). Beyond the visual findings, we observed a significant increase in Dice from one- to two-point-per-vertebra registration: Pix2Pix 0.68 to 0.73 (p < 0.003); SynDiff 0.74 to 0.77 (p < 0.001); DDIM noise 0.55 to 0.72 (p < 0.011); and DDIM image 0.70 to 0.75 (p < 0.001). Notably, the best unpaired method, SynDiff, could not learn posterior structure translation without registration.

Image Quality
The unpaired CUT models performed worse than all others (p < 0.001), while all other models performed on a similar level. See Table 1 for PSNR and other common metrics. Example outputs from the test sets can be seen in Figure 5. The Pix2Pix with the SA-UNet performed better on T1 and worse on T2 than the smaller UNet (T1: p < 0.001; T2: p = 0.041). Even though SynDiff has an unpaired formulation, it achieved results similar to our paired Pix2Pix and DDIM noise mode (slightly worse in T1w and better in T2w, all p < 0.003). The DDIM image mode performed slightly better than the DDIM noise mode (p < 0.001), SynDiff (p < 0.001), and Pix2Pix (p < 0.001). DDIM image mode produces images with less noise than the original data, which should make segmentation easier. Overall, the DDIM image mode was our best-performing 2D model.

Downstream task: Segmentation
DDIM noise mode and the Pix2Pix UNet performed on a similar level (DDIM noise vs. Pix2Pix UNet, p = 0.972) and were worse than the three best models (p < 0.001). The CUT reconstruction was unsuited for segmentation and was the worst model (CUT vs. all, p < 0.001). An example of the segmentation from different translations for a full spine can be found in Figure 6 in an example dataset from the GNC [1]. We observed comparable rankings in the MRSSegClg [24, 25] and T1w datasets when excluding the vertebral body (Table 3). In the in-house T2w test set, SynDiff had a considerably higher Dice score than Pix2Pix SA-UNet and DDIM image mode (p < 0.001), indicating a better performance in the "more complicated" anatomical structures for this dataset only.
The correction of the MRSSegClg segmentations resulted in an increased Dice score of up to 0.02.The rankings of all methods on the original versus the corrected MRSSegClg dataset were mostly consistent, indicating that no method had exploited the false delineation by overpredicting the segmentation.
Overall, Pix2Pix SA-UNet, DDIM image mode, and SynDiff were equally capable of producing CT images for the segmentation algorithm, closely followed by DDIM noise mode and the Pix2Pix UNet.

3D Image Translation with Diffusion
All 3D models increased the Dice scores compared to our 2D models (p < 0.006). Pix2Pix 3D and DDIM 3D noise performed on a similar level, while DDIM 3D image mode was consistently slightly better, close to the rounding threshold (p < 0.001). PSNR showed a drop compared to the 2D variants. The 3D models outperformed all 2D models on posterior structures (see Figure 7; T2w: p < 0.024; MRSSegClg (ours): p < 0.005 for DDIM 3D image, p < 0.062 for DDIM 3D noise, and p < 0.462 for Pix2Pix 3D; posterior structures are unavailable in the original MRSSegClg). With the rescaling to 1 mm isotropic, we obtain a super-resolution of our mask in the thick-slice direction that resembles a more realistic 3D shape than the native resolution (Figure 7).

Discussion
This study successfully demonstrated the feasibility of translating standard sagittal spine MRI into the CT domain, enabling subsequent CT-based image processing. Specifically, the registration process, with a minimum of two points per vertebra, enables accurately translating posterior structures, which are typically challenging for image translation and segmentation. To achieve this, a low-data registration technique was introduced for pairing CT and MRI images, which can be automated by our translation and segmentation pipeline.
In our low-data domain, paired translation methods performed on a similar level, with DDIM in image mode being the single best model. The spinous process was not always correctly translated in our 2D approaches. We resolved this issue by changing the process to 3D. Our 3D methods had a drop in image quality compared to the 2D translation. We believe this is due to the required resampling from the 1 mm isotropic output to the native resolution of the test data. Ultimately, the image-to-image translation facilitated MRI segmentation using a pretrained CT segmentation algorithm for all spine regions.
Our results extend prior works that have been limited to translations from high-resolution gradient-echo Dixon T1w sequences to CT [14, 32, 33], as well as to intra-modality MR translations between different contrasts, from standard T1w and T2w TSE sequences to short tau inversion recovery [34] or T2w fat-saturated images [35], which are frequently used in spinal MRI. Commercial products are available for MRI to CT translation [36, 37]. However, in contrast to our approach, they require a dedicated, isotropic gradient-echo sequence and are unavailable for standard T1w or T2w TSE sequences. Acquiring an additional, dedicated image only for segmentation is resource- and time-demanding in everyday medical practice and not possible at all for existing data, such as available large epidemiological studies like the GNC.
Mature preprocessing pipelines enable image translation in other body regions [8]. For example, in brain MRI, every sample can be rigidly registered to an atlas, and the non-brain tissue is removed. However, in the spine, where vertebrae may move between acquisitions, such simple, rigid preprocessing is impossible. Additionally, the mapping of intensities from the MR to the CT domain is highly dependent on the anatomy: e.g., fat and water have similar signals in T2w MRI but substantially different density values in CT, despite being in close anatomical proximity with a high inter-subject variability.
Consequently, a network cannot learn the relationship between anatomy and intensity translation based on unpaired images: the tested unpaired method CUT [7] would require additional constraints to learn an anatomically correct translation. SynDiff [17] has an unpaired CycleGAN [6] in its formulation and worked on paired datasets similarly to the paired methods. Still, it could not correctly translate the posterior structures on unmatched data. We demonstrated that our rigid registration is a required preprocessing step for a correct translation, even for SynDiff, and we believe that better processing, such as deformable registration, can lead to better results. However, accounting for inter-vertebra movement between two acquisitions, due to different patient lying positions between CT and MR acquisitions, would require whole vertebra segmentation. Other papers combat this issue by using axial slices, which only need a local vertebra registration [10–12], or by focusing only on the lumbar spine [5–9], where acquisitions can be performed in a more standardized patient positioning than in the cervical spine. Oulbacha and Kadoury [38] also use sagittal slices, as in our study.
However, they face similar challenges with incorrectly translating posterior structures, as observed in their figures. To address these issues, we employed dedicated preprocessing techniques and transitioned to a 3D approach.

Limitations
Our pipeline enables us to generate segmentations that are available in other modalities.
This method cannot produce segmentations of structures that are visible in the input domain but not covered by the target-domain segmentation. We observed weaknesses in translating the neck and thoracic regions when using external images, especially for the 2D methods. The posterior elements in the thoracic region remained the most difficult; both the segmentation and the translation showed more errors there compared to other regions. Classifier-free guidance showed substantial improvement in language-based DDIM generation [30] and had a visible impact in 2D translation on an out-of-training distribution like the GNC images. Still, the differences in image quality and Dice scores were too small to measure. Therefore, we excluded classifier-free guidance [30] from our analysis, as its effect was too small to be investigated with the available test sets. The same is true for testing different numbers of time steps and the determinism parameter eta.
We provide more detail on these inference parameters in the supplementary material.

Conclusion
We were able to show that image segmentations can be generated in a novel target domain without manual annotations if segmentations exist for another image domain and paired data for both domains can be obtained. For the spine, we identified minimum registration requirements for paired image-to-image translations. With this approach, SynDiff, Pix2Pix, and DDIM enabled translation of 2D images, resulting in similarly good downstream segmentations. We introduced a 3D variant of conditional diffusion for image-to-image translation that improved the segmentation of posterior spinal elements compared to 2D translation. The synthesized segmentations represent a novel ground truth for MRI-based spine segmentations that are prerequisites for spine studies involving large cohorts.
Unpaired Translation Methods

Unpaired translation methods must match the intensity distributions of both domains, even if they are different. Consequently, issues like changing sizes, forgotten elements, or hallucinations may occur. We tested a model called SynDiff [7], which is similar to CycleGAN. SynDiff includes a CycleGAN that generates image pairs for a DDPM [2]. The DDPM operates in image mode with a fixed step size of 4. CUT [8], on the other hand, stands out because it does not utilize a cycle-consistency loss; instead, it employs a contrastive loss: in certain layers, patches from the same region are enforced to be similar across different layers, whereas patches from different regions should be dissimilar.

Denoising Diffusion
Denoising diffusion is a generative deep learning model. The model is forced to predict Gaussian noise in an image where the noise has varying strengths. Gaussian noise is purely random, so the model has no choice but to learn what the images of the dataset look like.
Given random noise without an image, the model introduces image features into the noise. The noise strengths are defined over $t$ timesteps, where step 0 is the image without noise and at step $t$ the image is fully replaced by noise. We always used $t = 1000$ for all experiments. From one step $i-1$ to $i$, Gaussian noise is added according to $q(x_i \mid x_{i-1}) = \mathcal{N}\!\left(x_i;\ \sqrt{1-\beta_i}\, x_{i-1},\ \beta_i I\right)$, where $\beta_i$ controls the strength of the noise, starting at 0 for the noise-free image ($i = 0$) and reaching 1 for the completely noised image ($i = t$). We use a quadratic cosine curve for $\beta$ as in Nichol et al. [1]. During training, we optimize the model to predict either the input noise $\epsilon$ or the image $x_0$, using the absolute difference as the loss. For any timestep, we can compute a noised image with the "forward formula" $x_i = \sqrt{\bar{\alpha}_i}\, x_0 + \sqrt{1-\bar{\alpha}_i}\, \varepsilon$, where $\varepsilon$ is drawn from a standard normal distribution, $\alpha_i = 1 - \beta_i$, and $\bar{\alpha}_i = \prod_{j=0}^{i} \alpha_j$.

During inference, we iterate over the time steps. The model predicts either the noise $\hat{\epsilon}$ or the final image $\hat{x}_0$; the other can be computed by inserting $x_i$ and the prediction into the forward formula and solving for the missing value, e.g., in the case of noise prediction, $\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_i}}\left(x_i - \sqrt{1-\bar{\alpha}_i}\, \hat{\epsilon}\right)$. With $\hat{x}_0$ and $x_i$, we can compute the next $x_j$ for time step $j$. While the denoising diffusion probabilistic model (DDPM) [2] injects random noise at every step, DDIM can use random noise only in the initial step and compute the rest from the previous input and the model prediction. This makes the inference deterministic and enables interpolation of the DDIM output. This behavior is controlled by the parameter $\eta$: $\eta = 0$ means fully deterministic, and $\eta = 1$ means that every step receives fully random noise. A third inference parameter is the classifier-free guidance $w$ [10]. If $w$ is not zero, we sample the model twice in each step: one pass receives the conditional MRI input, and the other receives a black image. The conditioned output is multiplied by $w + 1$, the unconditioned output is multiplied by $-w$, and both are added together. The idea is to push the output away from the general bias of the network towards the condition. All three parameters can be used on a DDPM-trained network without requiring retraining and were tested on the 2D network exclusively.

We performed an ablation on our DDIM, changing only one parameter at a time while keeping the rest fixed at $w = 0$, $\eta = 1$, and $t = 20$. We chose $t = 10$, $t = 20$, and $t = 50$ as the numbers of timesteps for the ablation. The results led to no conclusion as to whether any $t$ yields better image quality; we could reduce $t$ even further than 20 without sacrificing quality. An $\eta$ of zero had a small negative impact in noise mode (T1 = 27.48; T2 = 26.81; p < 0.001) and a positive impact in image mode (T1 = 27.95; T2 = 27.41; p < 0.001). We suspect that the parameters $t$ and $\eta$ have too minor an impact to say in general what value must be set to obtain an optimal result. Classifier-free guidance has a small impact on the test data, and we see no pattern indicating that any $w$ improves the image quality. We turned off classifier-free guidance for our 3D models.
The inference hyperparameters of DDIM image mode did not impact the segmentation results, while there were noticeable differences for DDIM in noise mode. The DDIM with t = 10 (T1 = 0.81, T2 = 0.75, MRSSegClg = 0.77) had the best scores in image quality but was the lowest-performing inference type in the Dice metric (t = 10 vs. t = 20, p = 0.09). The parameter η impacted the Dice score in an inconsistent way. We see no correlation between small differences in the quality metrics and the Dice scores.
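To make the inference procedure concrete, the following is a minimal sketch of a conditional DDIM sampling loop in noise mode, following the formulas above; the model interface and the exact step spacing are assumptions, not the original implementation.

```python
import torch

@torch.no_grad()
def ddim_sample(model, mri, alpha_bar, t_steps=20, t_train=1000, eta=1.0):
    """Conditional DDIM sampling sketch for MRI-to-CT translation.

    `model(x, mri, i)` is assumed to predict the noise eps_hat at
    timestep i (noise mode); `alpha_bar` holds the cumulative products
    of alpha over the t_train training steps.
    """
    x = torch.randn_like(mri)  # start from pure noise
    steps = torch.linspace(t_train - 1, 0, t_steps).long()
    for idx, i in enumerate(steps):
        eps_hat = model(x, mri, i)
        ab_i = alpha_bar[i]
        # Recover the image estimate from the forward formula.
        x0_hat = (x - (1 - ab_i).sqrt() * eps_hat) / ab_i.sqrt()
        if idx + 1 == len(steps):
            return x0_hat
        ab_j = alpha_bar[steps[idx + 1]]
        # eta interpolates between deterministic DDIM (eta=0) and
        # fully random, DDPM-like noise injection (eta=1).
        sigma = eta * ((1 - ab_j) / (1 - ab_i)).sqrt() * (1 - ab_i / ab_j).sqrt()
        dir_xt = (1 - ab_j - sigma ** 2).clamp(min=0).sqrt() * eps_hat
        x = ab_j.sqrt() * x0_hat + dir_xt + sigma * torch.randn_like(x)
    return x0_hat
```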

Figure 1: Our training pipeline. In our datasets, we identified the center of the vertebral body and the spinous process as landmarks.

Figure 2: Datasets, preparation, exclusion, and split. MR data were acquired with 12 different scanners from 3 different vendors. Additionally, we used the MRSSegClg for external evaluation.

Figure 3: Difficulties of the MRI data for unpaired training and issues with the MRSSegClg segmentation.

Figure 4: Comparison of one and two registration points per vertebra on real data.

.
We marked multiple values if they were below the rounding threshold. The ground truth is registered real CTs. The image pairs are from the test set of our in-house data. MSE = mean squared error, PSNR = peak signal-to-noise ratio, SSIM = structural similarity index metric, VIFp = visual information fidelity, DDIM = denoising diffusion implicit model, DDPM = denoising diffusion probabilistic model, CUT = contrastive unpaired translation, SA-UNet = self-attention U-network

Figure 5: Translation from test sets T1w/T2w to CT, from the neck to the lumbar vertebrae.

Figure 6: Translation from T2w MR to CT and the segmentation results in an external full-spine dataset from the GNC.

Figure 7: 3D visualization of the segmentation from subjects out of the GNC and in-house datasets.

Table 2: The vertebral body is removed from the calculation by an automatic subregion segmentation on the T1w, T2w, and MRSSegClg (ours). The unchanged MRSSegClg could not be subregion-segmented. We marked the best values from the 2D cases with † and the overall best with *. vol. = volume, vert. = vertebra, MRSSegClg = MRSpineSeg Challenge, DDIM = denoising diffusion implicit model, CUT = contrastive unpaired translation, SA-UNet = self-attention U-network


Table 1: Image quality for T1w and T2w to CT translation. Arrows indicate whether smaller or bigger is better. As a visual aid, the best values are marked.

Table 3: Average posterior-structure Dice score (↑) per volume and per vertebra.