Fully‑automated deep‑learning segmentation of pediatric cardiovascular magnetic resonance of patients with complex congenital heart diseases

Background For the growing patient population with congenital heart disease (CHD), improving clinical workflow, diagnostic accuracy, and efficiency of analyses are considered unmet clinical needs. Cardiovascular magnetic resonance (CMR) imaging offers non-invasive and non-ionizing assessment of CHD patients. However, although CMR data facilitate reliable analysis of cardiac function and anatomy, the clinical workflow still relies mostly on manual analysis of CMR images, which is time consuming. An automated and accurate segmentation platform dedicated exclusively to pediatric CMR images can therefore significantly improve the clinical workflow, as the present work aims to establish. Methods Training artificial intelligence (AI) algorithms for CMR analysis requires large annotated datasets, which are not readily available for pediatric subjects, particularly CHD patients. To mitigate this issue, we devised a novel method that uses a generative adversarial network (GAN) to synthetically augment the training dataset by generating synthetic CMR images and their corresponding chamber segmentations. In addition, we trained and validated a deep fully convolutional network (FCN) on a dataset consisting of 64 pediatric subjects with complex CHD, which we made publicly available. The Dice metric, Jaccard index, and Hausdorff distance, as well as clinically relevant volumetric indices, are reported to assess and compare our platform with other algorithms, including U-Net and cvi42, which is used in clinics.
Results For the congenital CMR dataset, our FCN model yields an average Dice metric of 91.0% and 86.8% for the LV at end-diastole and end-systole, respectively, and 84.7% and 80.6% for the RV at end-diastole and end-systole, respectively.
Using the same dataset, cvi42 resulted in 73.2%, 71.0%, 54.3%, and 53.7% for the LV and RV at end-diastole and end-systole, and the U-Net architecture resulted in 87.4%, 83.9%, 81.8%, and 74.8% for the LV and RV at end-diastole and end-systole, respectively. Conclusions The chamber segmentation results from our fully automated method showed strong agreement with manual segmentation, and no significant statistical difference was found by two independent statistical analyses, whereas the cvi42 and U-Net segmentation results failed to pass the t-test. Based on these outcomes, it can be inferred that, by taking advantage of GANs, our method is clinically relevant and can be used for pediatric and congenital CMR segmentation and analysis.


Background
Congenital heart diseases (CHDs) are the most common birth defects [1]. It is currently estimated that 83% of newborns with CHD in the U.S. survive infancy [2]. These patients require routine imaging follow-ups. Cardiovascular magnetic resonance (CMR) imaging is the imaging modality of choice for assessment of cardiac function and anatomy in children with CHD. Not only does CMR deliver images with high spatial and acceptable temporal resolution, but it is also non-invasive and non-ionizing [3,4]. On the other hand, CMR analysis in pediatric CHD patients is among the most challenging, time-consuming, and operator-intensive clinical tasks.
Presently, artificial intelligence (AI), and particularly deep learning, shows strong promise for automatic segmentation of CMR images [5-8]. While current AI-based methods have been successfully used for delineating adult heart disease, they are not yet reliable for segmenting the CMR images of CHD patients, particularly children [8,9]. The foremost basis for this shortcoming is the anatomical heterogeneity of CHD and the lack of large CMR databases that include data from a diverse group of CHD subjects acquired with diverse scanners and pulse sequences. As indicated by Bai et al. [7], a major limitation of the existing learning methods is the use of homogeneous datasets in which the majority of the CMR data are from adult subjects with healthy or nearly healthy hearts, e.g., the Second Annual Data Science Bowl [10] and the UK CMR Biobank [11], among others [12,13].
Training neural networks requires a large set of data that does not currently exist for complex CHD subjects. Another limitation is overfitting, especially over-training to image patterns in a specific dataset that includes images from the same scanner model/vendor, as also reported by Bai et al. [7]. Dealing with limited data is a major challenge in designing effective neural networks for pediatric CMR, particularly for CHD subjects, and necessitates innovative approaches [9].
Among the learning-based algorithms, supervised deep learning is currently considered the state-of-the-art for CMR segmentation [14]. Nevertheless, a major limitation of deep-learning methods is their dependency on large amounts of manually annotated training data [15]. Small datasets can incur a large bias, which makes these methods ineffective and unreliable when the heart shape lies outside the learning set, as frequently observed in CHD subjects.
To mitigate the need for large datasets of manually-annotated CHD data, in this study, we employ a Deep Convolutional Generative Adversarial Network (DCGAN) [16] that generates synthetically segmented CMR images and further enriches the training data beyond the classical affine transformations. DCGAN has enabled our deep-learning algorithms to successfully and accurately segment CMR images of complex CHD subjects beyond the existing AI methods.

Dataset
Our dataset includes 64 CMR studies from pediatric patients, aged 2 to 18 years, scanned at the Children's Hospital Los Angeles (CHLA). The CMR dataset includes scans from patients with Tetralogy of Fallot (TOF; n = 20), Double Outlet Right Ventricle (DORV; n = 9), Transposition of the Great Arteries (TGA; n = 9), Cardiomyopathy (n = 8), Coronary Artery Anomaly (CAA; n = 9), Pulmonary Stenosis or Atresia (n = 4), Truncus Arteriosus (n = 3), and Aortic Arch Anomaly (n = 2). All TGA cases were D-type but had been repaired with an arterial switch operation. The study was reviewed by the Children's Hospital Los Angeles Institutional Review Board and was granted an exemption per 45 CFR 46.104(d)(4)(iii) and a waiver of HIPAA authorization per the Privacy Rule (45 CFR Part 160 and Subparts A and E of Part 164).

CMR studies
Imaging studies were performed on either a 1.5 T (Achieva, Philips Healthcare, Best, the Netherlands) or a 3 T (Ingenia, Philips Healthcare) scanner. CMR images for ventricular volume and function analysis were obtained using a standard balanced steady-state free precession (bSSFP) sequence without contrast. Each dataset includes 12–15 short-axis slices encompassing both the right ventricle (RV) and left ventricle (LV) from base to apex, with 20–30 frames per cardiac cycle. Typical scan parameters were a slice thickness of 6–10 mm, in-plane spatial resolution of 1.5–2 mm², repetition time of 3–4 ms, echo time of 1.5–2 ms, and flip angle of 60 degrees. Images were obtained with the patients free-breathing; 3 signal averages were obtained to compensate for respiratory motion. Manual image segmentation was performed by a board-certified pediatric cardiologist sub-specialized in CMR with experience consistent with Society for Cardiovascular Magnetic Resonance (SCMR) Level 3 certification. Endocardial contours were drawn on end-diastolic and end-systolic images. Ventricular volumes and ejection fractions were then computed from these contours. Manual annotations were performed according to SCMR guidelines with cvi42 software (Circle Cardiovascular Imaging, Calgary, Alberta, Canada) without the use of automated segmentation tools. The ventricular cavity in the basal slice was identified by evaluating wall thickening and cavity shrinking in systole.

Post-processing of CMR data
The original image size was 512 × 512 pixels. The original dataset was first preprocessed by center-cropping each image to 445 × 445 pixels to remove patient identifiers. Subsequently, all images were examined to ensure that both the heart and the segmentation mask were present. To reduce the dimensionality, each cropped image was then resized to 128 × 128 using the imresize function in the open-source Python library SciPy. The entire process was performed using two different down-sampling methods: (1) nearest-neighbor down-sampling and (2) bi-cubical down-sampling. For training data, twenty-six patients (10 TOFs, 4 DORVs, 4 TGAs, 4 CAAs, and 4 cardiomyopathy patients) were randomly selected, whereas the remaining 38 patients were used as test data.
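The crop-then-resize pipeline above can be sketched in a few lines. The pure-NumPy `center_crop` and `resize_nearest` helpers below are illustrative stand-ins for the SciPy `imresize` call used in the study, not the original implementation:

```python
import numpy as np

def center_crop(img, size):
    """Crop a square region of `size` x `size` from the centre of `img`."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def resize_nearest(img, out_size):
    """Nearest-neighbour down-sampling to `out_size` x `out_size`."""
    h, w = img.shape[:2]
    rows = (np.arange(out_size) * h / out_size).astype(int)
    cols = (np.arange(out_size) * w / out_size).astype(int)
    return img[rows][:, cols]

# 512x512 scan -> 445x445 crop (strips burned-in identifiers) -> 128x128
scan = np.random.rand(512, 512)
small = resize_nearest(center_crop(scan, 445), 128)
assert small.shape == (128, 128)
```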

Image segmentation using fully convolutional networks
A fully convolutional network (FCN), in comparison with a U-Net [17] and cvi42, was used for automated pixelwise image segmentation. Convolutional networks are a family of artificial neural networks composed of a series of convolutional and pooling layers in which data features are learned at various levels of abstraction. These networks are most useful when the data is an image or a map, such that the proximity among pixels represents how closely associated they are. Examples of FCNs used for segmenting healthy adult CMR images include [7,18]. While these FCNs yield good segmentation accuracy for healthy adult CMR images, they perform poorly on CHD subjects [7]. Inspired by the "skip" architecture used by Long et al. [19] and the FCN model introduced by Tran [18], we designed a novel 19-layer FCN for automated pixelwise image segmentation in CHD subjects.

FCN architecture
The design architecture of our 19-layer FCN model and the number of filters for each convolution layer are specified in Fig. 1; four max-pooling layers with a pooling size of 3 are employed to reduce the dimension of the previous layer's output. Fine, elementary visual features of an image, e.g., edges and corners, are learned in the network's shallow layers, whereas coarse semantic information is generated over the deeper layers. These coarse and fine features are combined to learn the filters of the up-sampling layers, which are transposed convolution layers with a kernel size of 4. The FCN's input is a 128 × 128 image and the network's output is a 128 × 128 dense heatmap predicting class membership for each pixel of the input image. The technical details of the FCN architecture are fully described in the Appendix.
Despite incorporating ℓ2-regularization and dropout in the FCN architecture, as explained in the Appendix, overfitting was still present due to the lack of a large set of annotated training data. A standard solution to this problem is to artificially augment the training dataset using various known image transformations [20]. Classic data augmentation techniques include affine transformations such as rotation, flipping, and shearing [21]. To conserve the characteristics of the heart chambers, only rotation and flipping were used; transformations such as shearing that induce shape deformation were avoided. Each image was first rotated 10 times at the angles θ = 0°, 20°, 40°, ..., 180°. Subsequently, each rotated image either remained unchanged or was flipped horizontally, vertically, or both. As a result of this augmentation, the number of training data was multiplied by a factor of 10 × 4 = 40.
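The rotation-plus-flip scheme above can be sketched as follows. `scipy.ndimage.rotate` is a plausible stand-in for whichever rotation routine was actually used, and the `order=0` (nearest-neighbour) interpolation is an assumption made here to keep the masks binary:

```python
import numpy as np
from scipy.ndimage import rotate

def augment(image, mask):
    """Rotations (10 angles) x flips (4 variants) -> 40 image/mask pairs."""
    pairs = []
    for angle in range(0, 181, 20):                # theta = 0, 20, ..., 180
        img_r = rotate(image, angle, reshape=False, order=0)
        msk_r = rotate(mask, angle, reshape=False, order=0)
        for flip in (lambda a: a,                  # unchanged
                     np.fliplr,                    # horizontal flip
                     np.flipud,                    # vertical flip
                     lambda a: np.flipud(np.fliplr(a))):  # both flips
            pairs.append((flip(img_r), flip(msk_r)))
    return pairs

pairs = augment(np.random.rand(128, 128), np.zeros((128, 128)))
assert len(pairs) == 40   # the 10 x 4 = 40 factor stated in the text
```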

FCN training procedure
The dataset was randomly split into training/validation sets with a ratio of 0.8/0.2. The validation set was used to provide an unbiased performance estimate of the final tuned model when evaluated on unseen data. Each image was then normalized to zero mean and unit variance. Network parameters were initialized according to Glorot's uniform scheme [22]. To learn the model parameters, stochastic gradient descent (SGD) with a learning rate of 0.002 and momentum of 0.9 was used to accelerate SGD in the relevant direction and dampen oscillations. To improve the optimization process, Nesterov momentum updates [23] were used, assessing the gradient at the "look-ahead" position instead of the current position. The network was trained with a batch size of 5 for 450 epochs, i.e., passes over the training dataset, to minimize the negative Dice coefficient between the predicted and manual ground-truth segmentations. While classic data augmentation techniques increased the number of training data by a factor of 40, they did not solve the overfitting issue. To mitigate that, generative adversarial networks (GANs) were used to artificially synthesize CMR images and their corresponding chamber segmentations. GANs are a specific family of generative models used to learn a mapping from a known distribution, e.g., random noise, to the data distribution. A DCGAN was designed to synthesize CMR images to augment the training data. The architecture of both the generator and discriminator networks, along with their training procedures, is described next.
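The training loss, the negative Dice coefficient between predicted and ground-truth masks, can be written compactly. The NumPy soft-Dice sketch below is illustrative; the smoothing constant `eps` is an assumption (the paper does not state one), and in practice this loss is minimized by the SGD configuration described above:

```python
import numpy as np

def neg_dice(pred, truth, eps=1e-6):
    """Negative soft Dice coefficient, the quantity minimised during training.
    `pred` is the per-pixel probability map, `truth` the binary mask."""
    intersection = np.sum(pred * truth)
    return -(2.0 * intersection + eps) / (np.sum(pred) + np.sum(truth) + eps)

# A perfect prediction drives the loss to its minimum of -1.
mask = np.zeros((128, 128))
mask[40:80, 40:80] = 1.0
assert abs(neg_dice(mask, mask) + 1.0) < 1e-4
```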

DCGAN architecture
The generator's architecture is shown in Fig. 2. The input to the generator network is a random noise z ∈ R 100 drawn from a standard normal distribution N (0, I) . The input is passed through six 2D transposed convolution, also known as fractionally-strided convolution, layers with kernel size of 4 × 4 to up-sample the input into a 128 × 128 image. In the first transposed convolution layer, a stride of 1 pixel is used while a stride of 2 pixels is applied to the cross-correlation in the remaining layers. The number of channels for each layer is shown in Fig. 2. All 2D transposed convolution layers except the last one are followed by a rectified linear unit (ReLU) layer. The last layer is accompanied by a Tanh activation function. The generator network's output includes two channels where the first is used for the synthetic CMR image and the second contains the corresponding chamber's segmentation mask.
The discriminator network's architecture is a deep convolutional neural network (CNN) as shown in Fig. 2. The discriminator network's input is a 2 × 128 × 128 image whose output is a scalar representing the probability that the input is a real pair of image with its corresponding segmentation mask. The model includes six 2D convolution layers with kernel size of 4 × 4 and stride of 2 pixels except for the last layer for which a 1− pixel stride value is used. The number of channels for each convolution layer is shown in Fig. 2. All layers except the last one

DCGAN training procedure
The training data was normalized to zero mean and unit variance to stabilize the DCGAN learning process. Each training sample was then rotated 19 times at angles θ = 0°, 10°, 20°, ..., 180°, while each rotated image either remained unchanged or was flipped horizontally, vertically, or both. As a result of this augmentation process, the number of training data was multiplied by a factor of 19 × 4 = 76. The DCGAN's two known issues are mode collapse and gradient vanishing [24]. Mode collapse refers to the case in which too many values of the input noise are mapped to the same value in the data space. This happens when the generator is over-trained with respect to the discriminator. Conversely, gradient vanishing refers to the situation in which the discriminator becomes so successful at distinguishing real from synthetic images that no gradient is backpropagated to the generator. In this case, the generator network cannot learn to generate synthetic images that are similar to the real images. To address these concerns, first, the network parameters were initialized according to a Gaussian distribution with zero mean and a variance of 0.02. To learn the network parameters, the Adam optimizer [25] was used for both the generator and discriminator networks. Additional information is provided in the Appendix. Each iteration of the learning procedure included the following two steps. First, a single optimization step was performed to update the discriminator: a batch of 5 real image samples and their corresponding segmentation masks was randomly selected from the training data. Label 1 was assigned to them since they are real samples. These pairs of real images and their masks were then passed through the discriminator network, and the gradient of the loss, i.e., the binary cross-entropy between predicted and true labels, was backpropagated to adjust the discriminator weights accordingly.
Then, a batch of five noise samples was drawn from the standard normal distribution and passed through the generator network to create five pairs of images and their corresponding masks. These pairs were then labeled with 0 since they were synthetic samples. This batch of synthetic data was then passed through the discriminator and the gradient of the loss was backpropagated to fine-tune the discriminator weights.
Second, an additional optimization step was performed to update the generator: Each pair of synthetic image and its corresponding segmentation mask from the previous step was labeled 1 to mislead the discriminator and create the perception that the pair is real. These samples were then passed through the discriminator and the gradient of the loss was backpropagated to adjust the generator weights.
In summary, in the first step, the discriminator was fine-tuned while the generator was unchanged, and in the second step, the generator was trained while the discriminator remained unchanged. The training process continued for 40,000 iterations, or until the model converged and an equilibrium between the generator and discriminator networks was established.
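The label bookkeeping of this two-step loop can be illustrated with the binary cross-entropy alone. The constant discriminator outputs below are placeholders standing in for real network responses, chosen only to make the label assignments concrete:

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy between predicted probabilities and a target label."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred)).mean()

# Placeholder discriminator outputs (probability of "real") for one iteration.
d_real = np.full(5, 0.9)   # discriminator on a batch of 5 real pairs
d_fake = np.full(5, 0.1)   # discriminator on a batch of 5 synthetic pairs

# Step 1: discriminator update -- real pairs labelled 1, synthetic pairs 0.
loss_d = bce(d_real, 1.0) + bce(d_fake, 0.0)

# Step 2: generator update -- the same synthetic pairs relabelled 1, so the
# backpropagated gradient pushes the generator towards fooling the discriminator.
loss_g = bce(d_fake, 1.0)

assert loss_g > loss_d   # early on, fakes are easy to spot, so loss_g is large
```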

DCGAN post-processing
The pixel value in each real mask is either 1 or 0, indicating whether each pixel belongs to one of the ventricles or not. Therefore, the value of each pixel in a synthesized chamber mask was quantized to 0 when it was less than 0.5 and rounded up to 1 otherwise. To avoid very small or large mask areas, only the synthetic samples for which the ratio of the mask area to the total area was within a certain range were retained. For nearest-neighbor down-sampling, the range was between 0.005 and 0.025, while for bi-cubical down-sampling, the range was between 0.02 and 0.05. Finally, the connected components in each binary mask were located using the MATLAB (Mathworks, Natick, Massachusetts, USA) function bwconncomp. If there was more than one connected component and the ratio of the area of the largest component to that of the second largest component was less than 20, that pair of image and mask was removed from the set of synthetically generated data.
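The quantization, area-ratio, and connected-component filters can be sketched as a single acceptance test. `scipy.ndimage.label` serves here as a Python analogue of MATLAB's bwconncomp; the default bounds correspond to the nearest-neighbor case:

```python
import numpy as np
from scipy.ndimage import label

def keep_mask(raw_mask, lo=0.005, hi=0.025, dominance=20):
    """Decide whether a synthesised mask survives post-processing.
    `lo`/`hi` are the area-ratio bounds for nearest-neighbour data."""
    mask = (raw_mask >= 0.5).astype(int)      # quantise pixels to {0, 1}
    ratio = mask.sum() / mask.size
    if not (lo <= ratio <= hi):               # reject tiny or huge masks
        return False
    labelled, n = label(mask)                 # Python analogue of bwconncomp
    if n > 1:
        areas = sorted(np.bincount(labelled.ravel())[1:], reverse=True)
        if areas[0] / areas[1] < dominance:   # fragmented mask: reject
            return False
    return True

mask = np.zeros((128, 128))
mask[60:70, 60:74] = 0.9          # one 10 x 14 blob, area ratio ~ 0.0085
assert keep_mask(mask)
```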

Network training and testing

Fully convolutional networks using real dataset
For each chamber, one FCN was trained on the CMR images of 26 patients and their augmentation via geometric transformations. Each model was jointly trained on both end-diastolic (ED) and end-systolic (ES) images for each heart chamber. These networks are called LV-FCN and RV-FCN in the results section.

Fully convolutional networks using synthetically augmented dataset
Two separate DCGAN models were designed for the LV and RV to further augment the training data. The designed DCGAN was used to generate 6000 pairs of synthetic images and their corresponding segmentation masks. After applying the DCGAN post-processing step, a set of 2500 synthetic images, out of the 6000 generated pairs, was used for each chamber. Each of the 2500 selected images then either remained unchanged, was flipped horizontally or vertically, or was rotated 4 times at angles θ = 45°, 90°, 135°, 180°. Thus, 2500 × 7 = 17,500 synthetic CMR images and their corresponding segmentation masks were generated for each ventricle. Finally, our synthetically augmented repertoire included the CMR images of 26 patients and their augmentation via geometric transformations plus the 17,500 generated synthetic CMR images. Using this synthetically augmented dataset, another FCN was trained for each chamber. Each model was jointly trained on both ED and ES images. The networks designed using the synthetically augmented dataset (SAD) are called LV-FCN-SAD and RV-FCN-SAD in the results section.

U-Net architecture
In addition to our network architecture described above, a traditional U-Net model was designed to compare its results with those of our designed FCN. For this purpose, a customized U-Net architecture with an input size of 128 × 128 was used. The architecture of the U-Net model is shown in Fig. 3 and its code is available at https://github.com/karolzak/keras-unet. As with our FCN, for each chamber a network was trained on the training set of 26 patients and its augmentation via geometric transformations. In the results section, these networks are referred to as LV-UNet and RV-UNet. For each chamber, another network was trained on the synthetically segmented CMR images, as was done for FCN-SAD. These networks are referred to as LV-UNet-SAD and RV-UNet-SAD. Each network was jointly trained on both ED and ES images for each chamber.

Commercially available segmentation software
The results generated by our models were compared with the results from cvi42 (Circle Cardiovascular Imaging Inc.) on our test set, which included CMR images from 38 patients. All volumetric measures were calculated using OsiriX Lite software (Pixmeo, Bernex, Switzerland). To calculate the volumes, small modifications were applied to the open-source plugin available at https://github.com/chrischute/numpy2roi to make the format consistent with our dataset. The segmented CMR images were converted into OsiriX's .roi files using the modified plugin. The resulting .roi files were imported into the OsiriX Lite software for volume calculation through its built-in 3D reconstruction algorithm. Our method was developed using Python 2.7.12 and run on a workstation with an Intel(R) Core(TM) i7-5930K CPU at 3.50 GHz and four NVIDIA GeForce GTX 980 Ti GPUs, on a 64-bit Ubuntu platform.

Metrics for performance verification
Our results were compared head-to-head with U-Net and cvi42. Two different classes of metrics are used to compare the performance of cardiac chamber segmentation methods.
One class uses clinical indices, such as volumetric measures, that are crucial for clinical decision making. These indices may not capture the geometric point-by-point differences between automated and manually delineated segmentations.
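The core volumetric indices follow directly from the end-diastolic and end-systolic volumes; for instance:

```python
def stroke_volume(edv, esv):
    """Stroke volume (mL): blood ejected per beat, EDV minus ESV."""
    return edv - esv

def ejection_fraction(edv, esv):
    """Ejection fraction (%): fraction of the end-diastolic volume ejected."""
    return 100.0 * (edv - esv) / edv

# Illustrative (hypothetical) LV volumes in mL:
assert stroke_volume(120.0, 48.0) == 72.0
assert ejection_fraction(120.0, 48.0) == 60.0
```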
The other class uses geometric metrics that indicate how mathematically close the automatic segmentation is to the ground-truth. These include the average Dice metric, Jaccard index, Hausdorff distance (HD), and mean contour distance (MCD).
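These geometric metrics have standard definitions, sketched below for binary masks. The Hausdorff distance here is computed in pixel units between mask point sets, whereas a clinical implementation would convert to millimeters using the voxel spacing:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(a, b):
    """Dice metric: 2|A intersect B| / (|A| + |B|) for binary masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def hausdorff(a, b):
    """Symmetric Hausdorff distance between the point sets of two masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])

auto = np.zeros((128, 128), bool);   auto[40:80, 40:80] = True   # 1600 px
manual = np.zeros((128, 128), bool); manual[40:80, 40:60] = True #  800 px
assert abs(dice(auto, manual) - 2 * 800 / (1600 + 800)) < 1e-9
assert abs(jaccard(auto, manual) - 800 / 1600) < 1e-9
```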

Generalizability to additional training and test subjects
To evaluate the generalizability of our framework on subjects not included in our dataset, our method was tested on the 2017 MICCAI Automated Cardiac Diagnosis Challenge (ACDC) dataset. For a consistent image size, five subjects were removed and the remaining 95 subjects were zero-padded to 256 × 256 and then down-sampled to 128 × 128 using the nearest-neighbor down-sampling method. Three subjects from each group were randomly selected as training data and the remaining 80 subjects were left as test data.
For each chamber, one FCN was trained on the combined CMR images of both training sets, i.e., 26 patients from our dataset and 15 from the ACDC dataset, and their augmentation via geometric transformations. For each heart chamber, another FCN was trained on the dataset further augmented with the previously generated set of synthetically segmented CMR images. Each model was jointly trained on both ED and ES images for each heart chamber. The first and second segmentation networks are referred to as FCN-2.0 and FCN-SAD-2.0, respectively. FCN-2.0 and FCN-SAD-2.0 were evaluated on the combined set of test subjects, i.e., 38 patients from our dataset and 80 patients from the ACDC dataset.

Statistical methods
The paired Student's t-test and the intraclass correlation coefficient (ICC) were used for statistical analysis of the predicted volumes. The p-value of the paired Student's t-test can be interpreted as the evidence against the null hypothesis that predicted and ground-truth volumes have the same mean values. A p-value greater than 0.05 is considered as passing the statistical hypothesis test. The intraclass correlation coefficient describes how strongly measurements within the same group resemble each other. The intraclass correlation first proposed by Fisher et al. [26] was used; it focuses on the paired predicted and ground-truth measurements. The guidelines proposed by Koo and Li [27] were used to interpret the ICC values, as defined below: (a) less than 0.50: poor; (b) between 0.50 and 0.75: moderate; (c) between 0.75 and 0.90: good; and (d) more than 0.90: excellent.
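Both tests can be reproduced in a few lines. The `fisher_icc` helper below implements one simple pooled-moment form of Fisher's intraclass correlation and is an illustrative sketch, not necessarily the exact estimator used in the study; the volume values are hypothetical:

```python
import numpy as np
from scipy.stats import ttest_rel

def fisher_icc(x, y):
    """A simple pooled-moment form of Fisher's intraclass correlation:
    both members of each pair are referred to the pooled mean and variance."""
    pooled = np.concatenate([x, y])
    m, s2 = pooled.mean(), pooled.var()
    return np.sum((x - m) * (y - m)) / (len(x) * s2)

# Hypothetical ground-truth volumes (mL) and predictions with small deviations.
truth = np.array([50.0, 80.0, 120.0, 65.0, 95.0])
pred = truth + np.array([1.5, -2.0, 3.0, -1.0, 0.5])

p = ttest_rel(pred, truth).pvalue
assert p > 0.05                        # no significant mean difference: passes
assert fisher_icc(pred, truth) > 0.90  # "excellent" per the Koo and Li scale
```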

Results
Characteristics of the cohort are reported first. Then, our synthetically generated CMR images and the corresponding automatically generated segmentation masks are presented. Different performance metrics and clinical indices for our fully automatic method, compared with those of manual segmentation (ground-truth), are reported. In addition, the same indices calculated by the cvi42 software and U-Net are presented for a head-to-head performance comparison.

Characteristics of the cohort

Characteristics of the cohort are reported in Tables 1 and 2. All chamber volumes in these tables are calculated based on the manual delineation.

Real and synthetically generated CMR images
A sample batch of real CMR images, including their manually segmented LV masks, is compared with a sample batch of synthetically generated CMR images and their corresponding automatically generated LV masks in Fig. 4. A similar comparison is made for the RV in Fig. 5.

Segmentation performance
As mentioned in the Methods section, two separate down-sampling methods, nearest-neighbor and bi-cubical, were applied, and their training/testing were performed independently. The results for both methods are reported here.

Segmentation performance for nearest-neighbor down-sampling
The average Dice metric, Jaccard index, Hausdorff distance (HD), mean contour distance (MCD), and coefficient of determination for volumes (R²vol) for FCN and FCN-SAD, computed against the ground-truth, are reported in Table 3.
Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) are summarized in Table 4.
For both methods, the average absolute and average relative deviations of the automatically segmented volumes, stroke volumes, and ejection fractions from the manually segmented ones are reported in Table 5. A smaller deviation indicates better conformity between automatically and manually derived contours.
The ranges of LV end-diastolic volume (LVEDV), LV end-systolic volume (LVESV), LV stroke volume (LVSV), and LV ejection fraction (LVEF) for the 38 test subjects were 10 mL to 202 mL, 4 mL to 91 mL, 6 mL to 128 mL, and 30% to 75%, respectively. The ranges of RV end-diastolic volume (RVEDV), end-systolic volume (RVESV), stroke volume (RVSV), and ejection fraction (RVEF) for the 38 test subjects were 20 mL to 265 mL, 6 mL to 130 mL, 12 mL to 138 mL, and 32% to 84%, respectively. The p-values for the paired sample t-test of LVEDV, LVESV, RVEDV, and RVESV, testing the null hypothesis that predicted and ground-truth volumes have identical expected values, are tabulated in Table 6. A p-value greater than 0.05 is considered as passing the t-test and is boldfaced in Table 6. The ICC values for the paired predicted and ground-truth values of LVEDV, LVESV, RVEDV, and RVESV are also listed in Table 6. An ICC value greater than 0.90 is considered excellent agreement and is boldfaced in Table 6.
Exemplary LV and RV segmentations at ES and ED are shown in Fig. 6. Red contours correspond to the ground-truth (i.e., manual annotation), whereas green and yellow contours correspond to the delineations predicted by the FCN and FCN-SAD methods, respectively.
The correlation and Bland-Altman plots are shown in Figs. 7, 8, 9 and 10. The FCN-SAD results are depicted by blue dots. As shown in Figs. 7 and 8, the points deviating from the line y = x reflect the mismatch between prediction and ground-truth. Bland-Altman diagrams are commonly used to evaluate the agreement among clinical measures and to identify any systematic difference (e.g., fixed bias, outliers, etc.). The bias values of the FCN for LVEDV, LVESV, RVEDV, and RVESV were 3.9 mL, 3.0 mL, 8.9 mL, and 3.3 mL, respectively, whereas the bias values of the FCN-SAD for LVEDV, LVESV, RVEDV, and RVESV were 1.9 mL, 0.5 mL, 3.1 mL, and −0.8 mL, respectively. The 95% confidence interval of the difference between automatic segmentation and ground-truth is shown as dashed lines representing ±1.96 standard deviations.

Segmentation performance for bi-cubical down-sampling
The results for the bi-cubical down-sampling method are reported in Table 7. The FCN-SAD method's Dice metrics for LVED, LVES, RVED and RVES were 91.0%, 86.8%, 84.7% and 80.6%, respectively, and its t-test p-values were 0.27, 0.09, 0.08 and 0.66, respectively. The FCN-SAD method therefore passes the paired-sample t-test for both LV and RV at both the ED and ES phases.
The correlation and Bland-Altman plots for ES and ED ventricular volumes, ejection fractions and stroke volumes for the bi-cubical down-sampling method are depicted in Figs. 11, 12, 13 and 14.

Segmentation performance for cvi42
The cvi42-associated Dice metrics were 73.2%, 71.0%, 54.3% and 53.7% for LVED, LVES, RVED and RVES, respectively. The corresponding sensitivity, specificity, PPV and NPV are summarized in Table 4. The absolute and relative deviations of the automatically- from the manually-segmented results for LV and RV volumes at ED and ES, as well as SV and EF, are summarized in the third column of Table 5.

Segmentation performance for U-Net with nearest-neighbor down-sampling
Simulations were carried out on the images that were down-sampled using the nearest-neighbor method. The average Dice metric, Jaccard index, Hausdorff distance, mean contour distance, and R²_vol for U-Net and U-Net-SAD, computed against the ground truth, are reported in Table 3. The Dice metrics for the U-Net method were 84.5%, 79.4%, 77.7% and 71.3% for LVED, LVES, RVED and RVES, respectively. The corresponding Dice metrics for the U-Net-SAD method were 87.1%, 82.3%, 81.8% and 74.8%, respectively.
Sensitivity, specificity, PPV and NPV for U-Net and U-Net-SAD are summarized in Table 4.
The absolute and relative differences between predicted and ground-truth volumes for the LV and RV chambers at ED and ES, as well as SV and EF, are summarized in the last two columns of Table 5.

Segmentation performance for U-Net with bi-cubical down-sampling
Using the images that were down-sampled according to the bi-cubical method, the average Dice metric, Jaccard index, relative volume difference and R²_vol for U-Net and U-Net-SAD are reported in Table 7. The Dice metrics for the U-Net method were 85.5%, 81.6%, 76.5% and 70.2% for LVED, LVES, RVED and RVES, respectively. The corresponding Dice metrics for the U-Net-SAD method were 87.4%, 83.9%, 81.8% and 74.8%, respectively.

Discussion
Many challenges currently exist in segmenting cardiac chambers from CMR images, notably in pediatric and CHD patients [12, 28–30]. In the past few years, a great deal of work has addressed CMR segmentation using learning-based approaches [5–8]. Despite their relative successes, these methods still have certain limitations. Small datasets introduce a large bias into the segmentation, which makes such methods unreliable when the heart shape lies outside the learning set (e.g., CHDs and post-surgically remodeled hearts). In brief, in pediatric cardiac imaging, learning-based methods remain computationally difficult and their predictive performance is less than optimal, owing to the complexity of estimating parameters, as their convergence is not guaranteed [31]. While traditional deep-learning methods achieve good results for subjects with relatively normal structure, they are not as reliable for segmenting the CMR images of CHD patients [7, 8]. It is believed that the absence of large databases that include CMR studies from heterogeneous CHD subjects significantly limits the performance of these traditional models [32]. To address this shortcoming, our new method simultaneously generates synthetic CMR images and their corresponding segmentations. Our DCGAN-based FCN model was tested on a heterogeneous dataset of pediatric patients with complex CHDs. Current software platforms designed for adult patients, such as cvi42 by Circle Cardiovascular Imaging Inc., have previously been reported to have many shortcomings when used for pediatric or CHD applications. Children are not scaled-down adults: pediatric patient characteristics, such as cardiac anatomy, function, higher heart rates, degree of cooperativity, and smaller body size, all affect post-processing approaches to CMR, and there is currently no CMR segmentation tool dedicated to pediatric patients.
Our major motivation for this study was the fact that current clinically available segmentation tools cannot be reliably used for children.

Table 5 Mean (SD) of the volume/stroke volume (SV)/ ejection fraction (EF) differences between predicted and manual segmentations for nearest-neighbor downsampling
The LV and RV volumes computed using our automatic segmentation methods, the U-Net model, and cvi42 (version 5.10.1) were compared with the ground-truth volumes. As reported in Table 5, cvi42's rendered volumes led to a significant difference between the predicted and true volumetric measures, even though cvi42 uses the original high-quality, high-resolution CMR images coming from the scanner for its predictions. Synthetic data augmentation also improved volume prediction for the U-Net. In addition, as shown in Table 5, the FCN-SAD method outperforms U-Net-SAD for both chambers at end-systole and end-diastole. As reported in Table 7, our FCN-SAD passed the t-test's null hypothesis that the predicted and ground-truth volumes have identical expected values for LVED, LVES, RVED and RVES, whereas cvi42 only passed the t-test for LVED. Since the p-value is strongly affected by the sample size, the ICC values are also reported for all models in Table 6. Our FCN and FCN-SAD models led to excellent correlation coefficients for both LV and RV at ED and ES. U-Net-SAD also yielded ICC values greater than 0.90; however, U-Net failed to achieve the excellent threshold for LVES, and all of cvi42's ICC values fall below the excellent threshold as well. Although the exact deep-learning architecture of cvi42 is not known to us, in our opinion the main reason for its relatively poor performance on pediatric CHD patients is that its neural network was trained on the UK Biobank (as declared on the vendor's website), which is limited to adult CMR images. More precisely, the UK Biobank dataset does not represent features that are inherent to the hearts of children with CHD. As indicated in Tables 3 and 4, our method outperforms cvi42 in Dice metric, Jaccard index, HD, MCD, volume correlation, sensitivity, specificity, PPV and NPV. For LV segmentation, FCN-SAD improved the Dice metric over cvi42 from 73.2% to 90.6% at end-diastole and from 71.0% to 85.0% at end-systole.
Similar improvement was observed for RV segmentation, where the Dice metric was improved from 54.3% to 84.4% and from 53.7% to 79.2% at end-diastole and end-systole, respectively. FCN-SAD also reduced the average Hausdorff and mean contour distances compared to cvi42, reflecting better alignment between the contours for both LV and RV at ED and ES. Similar improvement was observed for FCN-SAD over U-Net-SAD. For LV segmentation, FCN-SAD improved the Dice metric over U-Net-SAD from 87.1% to 90.6% at ED and from 82.3% to 85.0% at ES. Similarly, for RV segmentation, FCN-SAD improved on U-Net-SAD from 81.8% to 84.4% at ED and from 74.8% to 79.2% at ES. The data augmentation using DCGAN improved the Dice metric values by about 3% in FCN-SAD compared to our FCN method. Improvement was observed in Jaccard index, HD, MCD, volume correlation, sensitivity, specificity, PPV and NPV as well.

As shown in Table 3, synthetic data augmentation improved both Dice and Jaccard indices by about 3% for U-Net, which shows that synthetic data augmentation can improve the performance of fully convolutional segmentation networks regardless of the specific architecture. Compared to the U-Net method, similar improvement was observed in U-Net-SAD for both HD and MCD. Table 3 also reveals that our FCN method outperforms U-Net, and our FCN-SAD method outperforms U-Net-SAD in all metrics for LVED, LVES, RVED and RVES.
Synthetic data augmentation also improved both Dice and Jaccard indices by about 4% for FCN-2.0. Similar improvement was observed in FCN-SAD-2.0 for both HD and MCD, which indicates better alignment between predicted and manual segmentation contours.
As expected, for all methods, RV segmentation proved to be more challenging than LV segmentation due to the complex RV shape and anatomy. The sophisticated crescent shape of RV as well as the considerable variations among the CHD subjects make it harder for the segmentation models to learn the mapping from a CMR image to its corresponding mask. Another major limiting factor that affects the performance of RV segmentation is the similarity of the signal intensities for RV trabeculations and myocardium.
Our methodology has overcome some of these limiting issues by learning the generative process through which each RV chamber is segmented. This information is then passed to the segmentation model via synthetic samples obtained from that generative process. Corroborating the observation of Yu et al. [33], larger contours can be delineated more precisely than smaller ones. Segmentation of the CMR slices near the apex, particularly at end-systole, is more challenging due to their small and irregular shape. Table 3 shows that both Dice and Jaccard indices are higher at ED than at ES for both ventricles. Another possible reason for the lower performance at ES is the small mask area and hence the smaller denominator in Eq. (3), so that even a few misclassified pixels have a major effect on the final metric values. Figures 7a and b show that the results generated by our FCN-SAD model lead to high correlation for LVEDV and LVESV. This in turn leads to high correlation in EF and SV, as shown in Figs. 8a and c, in addition to the R²_vol values in Table 3. Similarly, a high correlation was observed for RVEDV and RVESV in Figs. 7c and d, which subsequently leads to high correlation in EF and SV, as shown in Figs. 8b and d, as well as the R²_vol scores in Table 3. Bland-Altman analyses in Figs. 9 and 10 show negligible bias for the FCN-SAD model trained on the synthetically augmented data. The Bland-Altman plots show that the FCN-SAD method reduced the mean and standard deviation of the error in predicted volumes and tightened the confidence interval compared to the other methods.
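The denominator effect described above is easy to see numerically. In the sketch below (pixel counts are hypothetical), the prediction misses the same 10 pixels of the true mask in both cases; the large end-diastolic mask barely notices, while the small apical end-systolic mask is penalized heavily:

```python
def dice_after_erosion(true_pixels: int, missed: int) -> float:
    """Dice = 2|A∩B| / (|A| + |B|) when the prediction B equals the true
    mask A minus `missed` pixels (no false positives)."""
    overlap = true_pixels - missed
    return 2.0 * overlap / (true_pixels + overlap)

# Hypothetical large ED mask: 10 missed pixels barely move the metric...
big = dice_after_erosion(2000, 10)    # ≈ 0.9975
# ...while the same 10 pixels cost a small apical ES mask almost 10 points.
small = dice_after_erosion(60, 10)    # ≈ 0.909
```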
The average elapsed time to segment a typical image on our GPU-accelerated computing platform is 10 ms. Overall, our model takes 0.1 s to process each patient's CMR data. Simulations show that even on a common CPU-based computing platform, our method requires about 1.3 s to segment each patient's CMR images, which indicates the clinical applicability of our automated segmentation model. Similar quantitative and volumetric results were observed when the whole training and validation procedure was repeated with a different random split of training and test subjects, indicating that no noticeable bias was introduced by the way subjects were assigned to the training and test sets.
Finally, we would like to emphasize the significance of the choice of down-sampling method for the segmentation performance. The entire process of training and testing was repeated using both the nearest-neighbor and bi-cubical down-sampling methods. Compared to nearest-neighbor down-sampling, bi-cubical down-sampling provides better performance for almost all studied models, except for the segmentation of the RVED using U-Net and U-Net-SAD. For example, the bi-cubical FCN-SAD results unequivocally passed the t-test for all chambers, indicating that the predicted and ground-truth volumes have identical expected values, whereas the nearest-neighbor FCN-SAD did not pass for LVED. In our opinion, the main reason behind the superior performance of the bi-cubical down-sampling method is its larger mask area compared to the nearest-neighbor method.
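The effect of the interpolation order on a down-sampled binary mask can be demonstrated with a toy experiment. This is not the paper's pipeline: the circular mask is a hypothetical stand-in for a ground-truth label map, and SciPy's `zoom` stands in for whatever resampling the study used.

```python
import numpy as np
from scipy.ndimage import zoom

# Synthetic circular "chamber" mask at the original resolution.
yy, xx = np.mgrid[:128, :128]
mask = (((yy - 64) ** 2 + (xx - 64) ** 2) <= 40 ** 2).astype(float)

# Down-sample by 4x with nearest-neighbor (order=0) vs. cubic (order=3)
# spline interpolation, then re-binarize at 0.5.
nn = zoom(mask, 0.25, order=0) > 0.5
bc = zoom(mask, 0.25, order=3) > 0.5

# The two interpolation orders generally recover different mask areas,
# which in turn shifts overlap metrics and derived volumes.
area_nn, area_bc = int(nn.sum()), int(bc.sum())
```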

Limitations
As a limitation, our method was applied to the CMR datasets of patients with two ventricles and has not yet been trained to analyze patients with a systemic RV. To the network, CMR images of hypoplastic left heart syndrome hearts are effectively entirely different objects; therefore, a new training procedure is needed to analyze single-ventricle hearts. We are currently designing such a model, which is beyond the scope of the present work. A second limitation is that our method must be calibrated before it can be applied to CMR images acquired from another scanner or with different cohort characteristics. It should also be mentioned that we used the Fréchet Inception Distance (FID) to discriminate between real and synthetic CMR images. While the FID is commonly used, human judgment remains the best measure, although it is subjective and depends on experience. To derive a statistically significant validation, a large cohort of imaging physicians is needed, which we aim to assemble in the near future.
We used the OsiriX Lite software to calculate the volumes; however, OsiriX Lite may underestimate the volume if an image slice has no predicted segmentation because of its small chamber size. This was the case for the outliers at the bottom of Figs. 7c and d. Since our dataset did not include epicardial ground-truth contours, cardiac mass was not calculated. Another limitation of this work is the lack of intra- and inter-observer variability assessments, since only one set of manual segmentations was available. Finally, the loss of resolution caused by down-sampling was an inevitable limitation, which entailed a compromise among speed, model accuracy and data dimension.

Conclusions
Manual segmentation is subjective, less reproducible, time consuming and requires dedicated experts. Therefore, fully automated and accurate segmentation methods are desirable to provide precise and reproducible clinical indices, such as ventricular ejection fraction and chamber volumes, in a clinically actionable time-frame. Our learning-based framework provides an automated, fast, and accurate model for LV and RV segmentation, and its outstanding performance in children with complex CHDs implies its potential for clinical use across the pediatric age group. Contrary to many existing automated approaches, our framework does not make any assumption about the image or the structure of the heart, and performs the segmentation by learning features of the image at different levels of abstraction in the hierarchy of network layers. To improve the robustness and accuracy of our segmentation method, a novel generative adversarial network is introduced to enlarge the training dataset.

Generative adversarial networks

Generative adversarial networks are based on a zero-sum non-cooperative game, i.e., a two-player minimax game in which the generator $G$ and discriminator $D$ are trained by optimizing the following objective function [34]:

$$\min_{G}\max_{D} V(D,G)=\mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]+\mathbb{E}_{z\sim p_{z}(z)}\big[\log\big(1-D(G(z))\big)\big], \qquad (1)$$

where $\mathbb{E}[\cdot]$ represents expectation. The adversarial model converges when the generator and discriminator reach a Nash equilibrium, which is the optimal point of the objective function in Eq. (1). Since both $G$ and $D$ strive to undermine each other, a Nash equilibrium is achieved when the generator recovers the underlying data distribution and the output of $D$ is ubiquitously $\frac{1}{2}$, i.e., the discriminator can no longer distinguish between real and synthetic data. The optimal generator and discriminator at the Nash equilibrium are denoted by $G^*$ and $D^*$, respectively. New data samples are generated by feeding random noise samples to the optimal generator $G^*$.
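To make the equilibrium statement concrete, the inner maximization of Eq. (1) for a fixed generator admits a standard closed-form solution. This derivation follows the general GAN literature [34] and is not specific to the present model:

```latex
% Optimal discriminator for a fixed generator G with model density p_g:
D^{*}(x) \;=\; \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{g}(x)}.
% At the Nash equilibrium the generator matches the data distribution,
% p_g = p_data, so
D^{*}(x) \;=\; \tfrac{1}{2} \quad \text{for all } x,
% and the objective of Eq. (1) attains its equilibrium value
V(D^{*}, G^{*}) \;=\; \log\tfrac{1}{2} + \log\tfrac{1}{2} \;=\; -\log 4 .
```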

DCGAN optimization
The learning rate, parameter β₁, and parameter β₂ of the Adam optimizer were set to 0.0002, 0.5, and 0.999, respectively. The binary cross entropy between the target and the output was minimized. Since Adam, like any other gradient-based optimizer, is a local optimization method, only a local Nash equilibrium can be established between the generator and discriminator. A common method to quantify the quality of the generated synthetic samples is the FID, originally proposed by Heusel et al. [35]. In FID, features of both real and synthetic data are extracted via a specific layer of the Inception v3 model [36]. These features are then modeled as multivariate Gaussians, and the estimated mean and covariance parameters are used to calculate the distance as [35]:

$$\mathrm{FID}(s,r)=\left\Vert \mu_{s}-\mu_{r}\right\Vert_{2}^{2}+\mathrm{Tr}\left(\Sigma_{s}+\Sigma_{r}-2\left(\Sigma_{s}\Sigma_{r}\right)^{\frac{1}{2}}\right), \qquad (2)$$

where $(\mu_s,\Sigma_s)$ and $(\mu_r,\Sigma_r)$ are the mean and covariance of the features extracted from the synthetic and real data, respectively. Lower FID values indicate better image quality and diversity among the set of synthetic samples.
Once the locally optimal generator was obtained, various randomly selected subsets of the generated synthetic images were considered and the one with the lowest FID distance to the set of real samples was chosen.
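The FID formula of Eq. (2) can be sketched directly from its Gaussian statistics. The feature vectors below are randomly generated placeholders; in practice the features would come from an Inception v3 layer applied to real and synthetic CMR images:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_s, cov_s, mu_r, cov_r):
    """FID(s, r) = ||mu_s - mu_r||^2_2 + Tr(cov_s + cov_r - 2 (cov_s cov_r)^{1/2})."""
    covmean = sqrtm(cov_s @ cov_r)
    if np.iscomplexobj(covmean):      # discard tiny imaginary numerical noise
        covmean = covmean.real
    diff = mu_s - mu_r
    return float(diff @ diff + np.trace(cov_s + cov_r - 2.0 * covmean))

# Placeholder "real" feature statistics.
rng = np.random.default_rng(0)
feats_r = rng.normal(size=(500, 8))
mu_r, cov_r = feats_r.mean(0), np.cov(feats_r, rowvar=False)

d0 = fid(mu_r, cov_r, mu_r, cov_r)          # identical statistics: FID ≈ 0
d1 = fid(mu_r + 1.0, cov_r, mu_r, cov_r)    # unit mean shift in 8 dims: FID ≈ 8
```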

Metrics definition
The Dice and Jaccard indices, as defined in Eqs. (3) and (4), measure the overlap between the predicted and true segmentations, where a higher index value indicates a better match:

$$\mathrm{Dice}(A,B)=\frac{2\,|A\cap B|}{|A|+|B|}, \qquad (3)$$

$$\mathrm{Jaccard}(A,B)=\frac{|A\cap B|}{|A\cup B|}, \qquad (4)$$

where $A$ and $B$ are the true and predicted segmentations, respectively. Hausdorff and mean contour distances are two other standard measures that show how far apart the predicted and ground-truth contours are. These metrics are defined as:

$$\mathrm{HD}(A,B)=\max\left\{\max_{a\in\partial A} d(a,\partial B),\; \max_{b\in\partial B} d(b,\partial A)\right\},$$

$$\mathrm{MCD}(A,B)=\frac{1}{2}\left(\frac{1}{|\partial A|}\sum_{a\in\partial A} d(a,\partial B)+\frac{1}{|\partial B|}\sum_{b\in\partial B} d(b,\partial A)\right),$$

where $\partial A$ and $\partial B$ denote the contours of the segmentations $A$ and $B$, respectively, and $d(a,\partial B)$ is the minimum Euclidean distance from point $a$ to contour $\partial B$. Lower values of these metrics indicate better agreement between automated and manual segmentation. The ICC for paired data values $x_i, x'_i$, $i=1,\dots,N$, originally proposed in [26], is defined as:

$$\mathrm{ICC}=\frac{1}{N s^{2}}\sum_{i=1}^{N}(x_i-\bar{x})(x'_i-\bar{x}), \qquad \bar{x}=\frac{1}{2N}\sum_{i=1}^{N}(x_i+x'_i), \qquad s^{2}=\frac{1}{2N}\sum_{i=1}^{N}\left[(x_i-\bar{x})^{2}+(x'_i-\bar{x})^{2}\right],$$

where the ICC is a descriptive statistic that quantifies the similarity of samples in the same group.
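These overlap and contour metrics can be sketched in a few lines. The brute-force pairwise distance below is for illustration on small contours, not an optimized implementation of the paper's evaluation code:

```python
import numpy as np

def dice(a, b):
    """Dice(A,B) = 2|A∩B| / (|A| + |B|), for boolean masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard(A,B) = |A∩B| / |A∪B|, for boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def _directed_dists(pa, pb):
    """Minimum Euclidean distance from each point of contour pa to contour pb."""
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return d.min(axis=1)

def hausdorff(pa, pb):
    """HD: the larger of the two directed maximum contour distances."""
    return max(_directed_dists(pa, pb).max(), _directed_dists(pb, pa).max())

def mean_contour_distance(pa, pb):
    """MCD: the average of the two directed mean contour distances."""
    return 0.5 * (_directed_dists(pa, pb).mean() + _directed_dists(pb, pa).mean())

# Toy check: identical masks give Dice = Jaccard = 1 and zero distances.
m = np.zeros((8, 8), bool)
m[2:6, 2:6] = True
pts = np.argwhere(m).astype(float)
```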