Fully automated quantification of biventricular volumes and function in cardiovascular magnetic resonance: applicability to clinical routine settings

Background Cardiovascular magnetic resonance (CMR) represents the clinical gold standard for the assessment of biventricular morphology and function. Since manual post-processing is time-consuming and prone to observer variability, efforts have been directed towards automated volumetric quantification. In this study, we sought to validate the accuracy of a novel approach providing fully automated quantification of biventricular volumes and function in a “real-world” clinical setting.
Methods Three hundred CMR examinations were randomly selected from the local database. Fully automated quantification of left ventricular (LV) mass, LV and right ventricular (RV) end-diastolic and end-systolic volumes (EDV/ESV), stroke volume (SV) and ejection fraction (EF) was performed overnight using commercially available software (suiteHEART®, Neosoft, Pewaukee, Wisconsin, USA). Parameters were compared to manual assessments (QMass®, Medis Medical Imaging Systems, Leiden, Netherlands). Sub-group analyses were further performed according to image quality, scanner field strength, the presence of implanted aortic valves and repaired Tetralogy of Fallot (ToF).
Results Biventricular automated segmentation was feasible in all 300 cases. Overall agreement between fully automated and manually derived LV parameters was good (LV-EF: intra-class correlation coefficient [ICC] 0.95; bias −2.5% [SD 5.9%]), whilst RV agreement was lower (RV-EF: ICC 0.72; bias 5.8% [SD 9.6%]). Lowest agreement was observed in case of severely altered anatomy, e.g. marked RV dilation but normal LV dimensions in repaired ToF (LV parameters ICC 0.73–0.91; RV parameters ICC 0.41–0.94) and/or reduced image quality (LV parameters ICC 0.86–0.95; RV parameters ICC 0.56–0.91), which was more common on 3.0 T than on 1.5 T.
Conclusions Fully automated assessment of biventricular morphology and function is robust and accurate in a clinical routine setting with good image quality and can be performed without any user interaction. However, in case of demanding anatomy (e.g. repaired ToF, severe LV hypertrophy) or reduced image quality, quality check and manual re-contouring are still required.
Electronic supplementary material The online version of this article (10.1186/s12968-019-0532-9) contains supplementary material, which is available to authorized users.


Introduction
Cardiovascular magnetic resonance (CMR) imaging is the gold standard for the assessment of cardiac function and morphology [1,2]. Left ventricular (LV) ejection fraction (EF) is the most established parameter for cardiac functional assessments in clinical routine and is used for the evaluation of disease severity, treatment follow-up and risk assessment for adverse events [3,4].
To extract clinically relevant information such as LV mass, LV and right ventricular (RV) end-diastolic and end-systolic volume (EDV/ESV), stroke volume (SV) and EF, accurate post-processing of the cine CMR images is required. In daily clinical routine, post-processing is typically performed manually by delineating endocardial and epicardial LV borders as well as endocardial RV borders in all short-axis (SAX) slices covering the ventricles from atrioventricular ring to apex, in both end-diastolic and end-systolic phases. This task is time-consuming, tedious and subject to observer variability [2,5,6,7]. Emerging post-processing software based on deep-learning algorithms using convolutional neural networks now offers a fully automated approach for LV and RV volume assessments and has recently become commercially available [8]. Initial evaluations of these automated approaches are promising [9]; however, they are often based on pre-selected cases with excellent image quality or on 'cropped' data [9,10], i.e. SAX stacks that were manually triaged to include only end-diastolic and end-systolic images effectively covering the ventricles before the automatic algorithm was applied. Importantly, most of the observer variability results from discrepancies in defining the most apical and basal SAX slices [11], a source of error that is masked by prior manual 'cropping' of SAX stacks.
At the current time, it remains unknown whether fully automated quantification of biventricular morphology and function is feasible and accurate in clinical CMR routine. Accordingly, the aim of the present study was to evaluate the feasibility and accuracy of fully automated biventricular assessment of morphology and function in a variety of CMR data (neither pre-selected nor pre-processed) taken from a real-world data base of a tertiary care CMR unit.

Study design
The study population consisted of 300 randomly selected patients referred to CMR within clinical routine care between 2016 and 2018. CMR imaging was performed on clinical 1.5 or 3 Tesla scanners (Magnetom Symphony or Magnetom Skyra, Siemens Healthineers, Erlangen, Germany). Protocols were employed as appropriate for clinical routine, all of which included electrocardiogram (ECG)-gated balanced steady-state free precession (bSSFP) cine sequences for a SAX stack. Typical imaging parameters were as follows: 25 frames/cardiac cycle, pixel spacing 0.8 mm × 0.8 mm, 8 mm slice thickness as well as inter-slice gap, TE 1.5 ms, TR 3 ms. The study was approved by the Ethics Committee of the University Hospital Goettingen and complied with the Declaration of Helsinki. The Ethics Committee gave permission to waive informed consent for this retrospective analysis. Furthermore, agreement was assessed between the fully automated algorithm and expert consensus contours based on the Society for Cardiovascular Magnetic Resonance (SCMR) consensus data consisting of 15 cases with different pathologies [12].

CMR analyses
Volumetric analyses were performed manually in short-axis orientations by an experienced investigator according to standardized recommendations [11] using commercially available post-processing software (QMass®, Version 3.1.16.0, Medis Medical Imaging Systems, Leiden, Netherlands). Fully automated segmentation was performed employing dedicated commercially available software (suiteHEART®, Version 4.0.6, Neosoft, Pewaukee, Wisconsin, USA). With both the manual (QMass®) and the automated (suiteHEART®) software, the papillary muscles were included within the myocardium and trabecular tissue was excluded from the myocardial mass. Manual segmentation was performed by simple delineation of the LV endocardial and epicardial borders and the RV endocardial border with Bézier smoothing at end-diastole and end-systole. No thresholding or edge detection was applied. Cross-referencing from long-axis locations was used to adjust for systolic atrioventricular ring descent. Fully automated segmentation was done overnight without any user interaction, i.e. with neither pre-processing of the acquired short-axis stack nor post-processing of the automatically traced borders. Analyses included LV mass and biventricular EDV, ESV, SV and EF. Agreement was tested between manual and fully automated analyses. Reproducibility was tested by reapplying the fully automated tracking algorithm on 20 randomly selected patients and by manual volumetric analyses by two experienced investigators including intra- and inter-observer reproducibility. All operators were blinded to each other's results. Furthermore, the analysis time needed to perform manual segmentations was measured in the subset of 20 patients. The presence and relevance of artefacts impacting image quality were graded adopting the criteria proposed by Klinke et al. [13], taking wrap-around, respiratory ghost, cardiac ghost, image blurring, metal and shimming artefacts into account (Table 1).
One point was given if the artefact impeded the visualization of > 1/3 of the ventricular endocardial border at end-systole and/or end-diastole on a single SAX slice. If such artefact involved 2 slices or ≥ 3 slices, 2 and 3 points were given, respectively. Furthermore, accurate short-axis orientation was evaluated, resulting in an image quality score between 0 (= excellent quality) and 6 (= poor quality). Image quality scores were separately assessed for the LV and RV myocardium.
For the SCMR consensus data, only LV parameters were compared between automated analyses and manual expert consensus parameters, since RV parameters were not provided. According to the method described by Suinesiaputra et al. [12], papillary muscles and trabecular tissue were excluded from the myocardial mass.
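The volumetric parameters named above follow from the contoured slice areas by simple arithmetic. The sketch below is illustrative only and does not represent the internal method of either software package. It assumes the standard summation-of-disks approach and a myocardial density of 1.05 g/ml (a commonly used value that is not stated in the text); all function names and input areas are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumption: commonly used myocardial density, not taken from the text.
MYOCARDIAL_DENSITY_G_PER_ML = 1.05


def stack_volume_ml(areas_mm2: Sequence[float], slice_spacing_mm: float) -> float:
    """Summation of disks: sum of contour areas times slice spacing (mm^3 -> ml)."""
    return sum(areas_mm2) * slice_spacing_mm / 1000.0


@dataclass
class VentricularFunction:
    edv_ml: float      # end-diastolic volume
    esv_ml: float      # end-systolic volume
    sv_ml: float       # stroke volume = EDV - ESV
    ef_percent: float  # ejection fraction = SV / EDV * 100


def ventricular_function(
    ed_areas_mm2: Sequence[float],
    es_areas_mm2: Sequence[float],
    slice_spacing_mm: float,
) -> VentricularFunction:
    """Derive EDV, ESV, SV and EF from endocardial areas of one ventricle."""
    edv = stack_volume_ml(ed_areas_mm2, slice_spacing_mm)
    esv = stack_volume_ml(es_areas_mm2, slice_spacing_mm)
    sv = edv - esv
    return VentricularFunction(edv, esv, sv, 100.0 * sv / edv)


def lv_mass_g(
    epi_areas_mm2: Sequence[float],
    endo_areas_mm2: Sequence[float],
    slice_spacing_mm: float,
) -> float:
    """LV mass: (epicardial volume - endocardial volume) times density."""
    wall_ml = stack_volume_ml(epi_areas_mm2, slice_spacing_mm) - stack_volume_ml(
        endo_areas_mm2, slice_spacing_mm
    )
    return wall_ml * MYOCARDIAL_DENSITY_G_PER_ML


# Hypothetical example: five slices, 10 mm between slice centres.
example = ventricular_function([2000.0] * 5, [800.0] * 5, slice_spacing_mm=10.0)
example_mass = lv_mass_g([3000.0] * 5, [2000.0] * 5, slice_spacing_mm=10.0)
```

Here slice_spacing_mm denotes the distance between the centres of adjacent slices. It is left as a parameter because it depends on how slice thickness and inter-slice gap are combined in a given acquisition.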

Statistics
Continuous variables were checked for normal distribution using the Shapiro-Wilk test and are presented as median with interquartile range (IQR). Biventricular volumes and LV mass were indexed to body surface area. Dependent variables were tested using the Wilcoxon signed-rank test. Agreement of manual and automated analyses as well as intra- and inter-observer variability was assessed first using Bland-Altman analysis [mean difference between measurements with 95% confidence interval (CI)] [14], second using intra-class correlation coefficients (ICC) based on a model of absolute agreement, considered excellent if ICC > 0.74, good between 0.60 and 0.74, fair between 0.4 and 0.59 and poor below 0.4 [15], and third using the coefficient of variation (CoV, = standard deviation [SD] of the differences divided by the mean) [16,17]. P-values provided are two-sided; a p-value below 0.05 was considered statistically significant. Statistical analyses were performed using IBM SPSS Statistics Version 24 (International Business Machines, Armonk, New York, USA) and Microsoft Excel (Microsoft, Redmond, Washington, USA).
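The three agreement measures described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SPSS procedure used in the study. It assumes that the limits of agreement are the bias ± 1.96 SD of the differences, that the ICC is the two-way, single-measure, absolute-agreement form, and that the CoV denominator is the mean of both methods. All function names are hypothetical.

```python
import numpy as np


def bland_altman(auto, manual):
    """Return bias, SD of the differences and limits of agreement (bias +/- 1.96 SD)."""
    diff = np.asarray(auto, dtype=float) - np.asarray(manual, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)


def coefficient_of_variation(auto, manual):
    """SD of the differences divided by the mean of both methods, in percent."""
    a = np.asarray(auto, dtype=float)
    m = np.asarray(manual, dtype=float)
    return 100.0 * (a - m).std(ddof=1) / np.mean((a + m) / 2.0)


def icc_absolute_agreement(auto, manual):
    """Two-way, single-measure ICC for absolute agreement (assumed model)."""
    x = np.column_stack([auto, manual]).astype(float)
    n, k = x.shape  # n subjects, k = 2 methods
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between methods
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def classify_icc(icc):
    """Apply the thresholds quoted in the text [15]."""
    if icc > 0.74:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"


# Hypothetical EF values (%): the automated method reads 2 points higher throughout.
manual_ef = np.array([50.0, 55.0, 60.0, 65.0, 70.0])
auto_ef = manual_ef + 2.0
bias, sd, loa = bland_altman(auto_ef, manual_ef)
icc = icc_absolute_agreement(auto_ef, manual_ef)
```

Because the ICC model is one of absolute agreement, a constant offset between methods lowers the coefficient even when the two methods are perfectly correlated, which is the behaviour wanted when comparing an automated method against a manual reference.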

Study population
Patient characteristics and cardiac volumes for both manual and automated assessments are presented in Table 2. Biventricular automatic segmentation was feasible in all 300 cases. In comparison with manual evaluations, automatic assessments depicted higher LV volumes, lower LVEF, higher LV mass as well as higher RV EDV, RV SV and RVEF (p < 0.001 for all). The study population consisted of 100 referrals to evaluate ischemic heart disease, 120 patients with myocardial disease, 70 patients with congenital heart disease and 10 others. Table 3

Image quality and post-processing
LV image quality was graded with 1.0 (SD 1.3) points (Score 0 n = 168, Score 1 n = 19, Score 2 n = 46, and Score 3 n = 67). RV image quality was graded with 1.1 (SD 1.3) points (Score 0 n = 151, Score 1 n = 37, Score 2 n = 39, and Score 3 n = 73). Appropriate short-axis orientation was fulfilled in 298 cases, and the highest image quality score assigned was 3. Manual post-processing took on average 11.3 ± 1.5 min as opposed to < 1 min per SAX stack for automated processing. Representative examples of high and low segmentation accuracy are given in Fig. 1. Corresponding videos including automatic segmentation of all phases and SAX slices can be found in Additional file 1.
Table 1 Quality assessment of cine short-axis (SAX) images. The image quality score corresponds to the sum of qualitative scoring based on 6 criteria (range of score: 0-5). One point was given if an artefact impeded the visualization of > 1/3 of the ventricular endocardial border at end-systole and/or end-diastole on a single SAX slice. If such artefact involved 2 slices or ≥ 3 slices, 2 and 3 points were given, respectively. Incorrect short-axis orientation was graded with 2 points.
Continuous variables are expressed as median and interquartile range (IQR) and were tested using the Wilcoxon signed-rank test. LV/RV: left/right ventricle, EDV/ESV: end-diastolic/systolic volume, SV: stroke volume, EF: ejection fraction. Numbers in bold type indicate a significant difference.

Agreement of manual and automated analyses
Results comparing automated and manual volume assessments including mean differences with corresponding SD, ICC and CoV are presented in Tables 4 and 5 (Table 3).
When classified according to image quality score, 0 to 1 point was associated with considerably better agreement than 2 to 3 points, both for LV and RV automated analyses. Considering an image quality score of ≤ 1, both LV and RV agreements were excellent for all variables with a bias of −0.6% (limits of agreement [LOA] −7.6 to 6.4%) and 3.0% (LOA −9.2 to 15.2%) for LV EF and RV EF, respectively. However, large differences were observed in case of reduced image quality (score ≥ 2) with a bias of −5.6% (LOA −20.6 to 9.4%) and 10.6% (LOA −13.6 to 34.8%) for LV EF and RV EF, respectively. LV and RV stroke volumes were very consistent in automated analyses (median LV 48.1 ml/m² compared to RV 47.6 ml/m², p = 0.435).
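As a plausibility check, the SD of the differences implied by each reported LOA can be recovered, assuming the usual Bland-Altman definition of the limits (bias ± 1.96 SD), which the text does not state explicitly. The symbols \bar{d} (bias) and s_d (SD of the differences) are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{LOA} &= \bar{d} \pm 1.96\, s_d
\quad\Longrightarrow\quad
s_d = \frac{\mathrm{LOA}_{\mathrm{upper}} - \mathrm{LOA}_{\mathrm{lower}}}{2 \times 1.96} \\[4pt]
\text{LV EF, score} \le 1: \quad s_d &= \frac{6.4 - (-7.6)}{3.92} \approx 3.6\% \\
\text{RV EF, score} \le 1: \quad s_d &= \frac{15.2 - (-9.2)}{3.92} \approx 6.2\% \\
\text{LV EF, score} \ge 2: \quad s_d &= \frac{9.4 - (-20.6)}{3.92} \approx 7.7\% \\
\text{RV EF, score} \ge 2: \quad s_d &= \frac{34.8 - (-13.6)}{3.92} \approx 12.3\%
\end{align*}
```

Under this assumption, the scatter of the differences roughly doubles for both ventricles when image quality is reduced, in addition to the larger bias.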
Results from the comparison between automatically derived LV parameters and manually derived expert consensus LV parameters based on the SCMR consensus data are provided in Additional file 1. In accordance with the study's results, agreement was excellent in the majority of cases (ICC ≥ 0.95 for all LV parameters) (Additional file 2: Tables S1 and S2). There was one patient with a 20% difference in LV EF between automatic and manual expert results (Case # 15), a patient with severe LV hypertrophy (Additional file 2: Table S1, Figure S1), similar to the case shown in Fig. 1c.

Reproducibility
Reproducibility for manual segmentations was better for LV than for RV measurements. The automated algorithm yielded exactly the same results when being reapplied. Table 6 shows ICC, CoV and mean differences (SD) within and between observers.

Discussion
The present study demonstrates the feasibility of fully automated quantification of biventricular morphology and function and reveals its current pitfalls and limitations in a 'real-world' clinical setting. Several notable findings should be considered. First, automatically and manually derived volumes agree well in case of good image quality; however, substantial differences occur in case of reduced image quality. Second, agreement is better for LV than for RV volumes. Third, demanding anatomical circumstances (e.g. in patients with repaired ToF) result in lower agreement. Fourth, different field strengths or the presence of valve replacements do not impede automated assessments as long as image quality is preserved.

Agreement of automated and manual assessment
CMR represents the reference standard for cardiac volume assessment [1] with incremental accuracy and reproducibility as compared to echocardiography [18]. However, CMR acquisition time is long and further requires time-consuming post-processing to extract clinically relevant information. Thus, efforts have been directed towards automated post-processing analyses based on deep-learning algorithms within the last decade [8,19,20]. The current literature demonstrates excellent agreement for automated and manual LV volume assessments [6,10]; however, studies concerned with automatic RV segmentation are scarce [21]. Of note, the study by Queirós et al. [10] applied an automatic algorithm to cropped data, that is, after manually defining the most basal and apical slices and subsequently cropping the SAX stack to include only images effectively covering the LV. Furthermore, ED and ES time points were manually pre-selected. However, the correct definition of the most basal slice is amongst the most challenging steps in SAX volume assessment and one of the most important sources of observer variability [22]; manual pre-selection therefore represents a clear bias in testing the reliability of an automated algorithm. In the present study, we sought to simulate a real-world clinical scenario by randomly selecting patients from clinical routine imaging. We applied a commercially available automatic algorithm to clinically acquired SAX stacks, occasionally comprising both atria and ventricles, without any manual pre- or post-processing. The data were acquired on 1.5 and 3.0 T scanners. Our data demonstrate excellent agreement between automatically and manually derived volumes in case of good image quality, with overall better agreement for LV than for RV measurements. Indeed, quantification of RV volumes is generally more challenging than that of LV volumes due to the complex RV anatomy [23,24].
Nevertheless, LV and RV stroke volumes were consistent in automated analyses in this patient group without intra- or extracardiac shunt. Manual post-processing took on average more than 11 min as compared to < 1 min for fully automated assessments. Importantly, automatic analyses of multiple CMR examinations (in this case 300 scans) ran completely user-independently and were performed overnight. Furthermore, automated analyses promise to overcome limitations in observer variability, since the algorithm yields exactly the same measures when being reapplied by different users. Thus, the automated framework provides a highly reproducible approach and can markedly shorten post-processing times of CMR examinations, with subsequent potential to improve cost-effectiveness [25]. Furthermore, the framework may provide 'on-the-fly' post-processing in parallel to finishing the CMR scan (e.g. during late gadolinium enhancement acquisitions).
Table 4 Agreement between manual and automated segmentations. Agreement was analysed in the entire study group (n = 300) as well as in subgroups according to field strength, aortic valve replacement and repaired Tetralogy of Fallot.

Impact of image quality
Our data demonstrate that image quality is the leading determinant of accuracy for fully automatic volume assessment. In case of good image quality (image quality score of ≤ 1, adapted from the criteria proposed by Klinke et al. [13], Table 1), the bias of both LV and RV function was within acceptable limits. However, in case of reduced image quality (image quality score ≥ 2), a large bias of > 5% was observed for both LV and RV EF with wide LOA, particularly for RV EF. Importantly, the relevance of RV function and volumes is increasingly recognized in various diseases [26]. For example, the diagnosis of arrhythmogenic right ventricular cardiomyopathy is challenging and heavily relies on the assessment of RV EDV and RV EF [27]. If considered for clinical use and decision making, precise volume assessment is of utmost importance and cannot yet be achieved with the proposed fully automatic algorithm in case of impaired image quality.

Technical and anatomical considerations
To further elucidate limitations of the commercially available software, we compared the agreement of automated and manually derived volumes for subgroups according to field strength, the presence of aortic valve replacements and repaired ToF. Agreement was better at 1.5 T than at 3 T; however, considerably more artefacts (mainly due to inadequate breath-holding and shimming) were present at 3 T. Reduced agreement at 3 T is therefore more likely a result of lower image quality. Due to the growing number of percutaneously implanted aortic valves [28] and increasing indications for CMR imaging [29] including aortic valve stenosis [30], the presence of valve replacements in CMR studies is likely to grow. As long as image quality was preserved in these patients, agreement of LV volumes remained acceptable, enabling the use of automated algorithms in this group of patients. In contrast, in patients with repaired ToF, both RV and LV agreement were considerably decreased, although low image quality affected solely the RV (metal artefacts resulting from sternal wires). Since LV image quality was good, reduced agreement is most likely due to the more demanding anatomy in these patients (distinctly larger RV than LV volumes), which highlights the current limitations of fully automated analysis. Here, it remains to be investigated whether the proposed automatic deep-learning framework is able to learn further from these cases with subsequent improvement of accuracy.

Limitations
The study's conclusions are derived from the comparison of 300 automatically and manually quantified clinical CMR examinations from a single centre. Although manual contouring was performed by experienced observers, intra- and inter-observer variability may limit its use as a reference standard. Details of the automatic algorithm are not disclosed by the software vendor and therefore cannot be reported. RV mass was not measured, since the automatic algorithm does not provide RV mass quantification.
Fig. 4 Agreement of automatically and manually derived right ventricular parameters according to image quality. Bland-Altman plots (automatic − manual) are shown for studies with good image quality (score ≤ 1, n = 188) and for studies with reduced image quality (score ≥ 2, n = 112). RV: right ventricle, EDV/ESV: end-diastolic/systolic volume, SV: stroke volume, EF: ejection fraction, Δ: difference.

Conclusion
Fully automated quantification of biventricular morphology and function is feasible and accurate in the majority of cases in a clinical routine setting and has the potential to markedly shorten post-processing times and to improve reproducibility. However, in case of limited image quality or in patients with demanding anatomy (e.g. repaired ToF), the proposed fully automatic framework does not yet provide satisfactory results and still requires manual re-contouring.