MVnet: automated time-resolved tracking of the mitral valve plane in CMR long-axis cine images with residual neural networks: a multi-center, multi-vendor study

Background Mitral annular plane systolic excursion (MAPSE) and left ventricular (LV) early diastolic velocity (e’) are key metrics of systolic and diastolic function, but not often measured by cardiovascular magnetic resonance (CMR). Its derivation is possible with manual, precise annotation of the mitral valve (MV) insertion points along the cardiac cycle in both two and four-chamber long-axis cines, but this process is highly time-consuming, laborious, and prone to errors. A fully automated, consistent, fast, and accurate method for MV plane tracking is lacking. In this study, we propose MVnet, a deep learning approach for MV point localization and tracking capable of deriving such clinical metrics comparable to human expert-level performance, and validated it in a multi-vendor, multi-center clinical population. Methods The proposed pipeline first performs a coarse MV point annotation in a given cine accurately enough to apply an automated linear transformation task, which standardizes the size, cropping, resolution, and heart orientation, and second, tracks the MV points with high accuracy. The model was trained and evaluated on 38,854 cine images from 703 patients with diverse cardiovascular conditions, scanned on equipment from 3 main vendors, 16 centers, and 7 countries, and manually annotated by 10 observers. Agreement was assessed by the intra-class correlation coefficient (ICC) for both clinical metrics and by the distance error in the MV plane displacement. For inter-observer variability analysis, an additional pair of observers performed manual annotations in a randomly chosen set of 50 patients. Results MVnet achieved a fast segmentation (<1 s/cine) with excellent ICCs of 0.94 (MAPSE) and 0.93 (LV e’) and a MV plane tracking error of −0.10 ± 0.97 mm. In a similar manner, the inter-observer variability analysis yielded ICCs of 0.95 and 0.89 and a tracking error of −0.15 ± 1.18 mm, respectively. Conclusion A dual-stage deep learning approach for automated annotation of MV points for systolic and diastolic evaluation in CMR long-axis cine images was developed. The method is able to carefully track these points with high accuracy and in a timely manner. This will improve the feasibility of CMR methods which rely on valve tracking and increase their utility in a clinical setting. Supplementary Information The online version contains supplementary material available at 10.1186/s12968-021-00824-2.


Background
The mitral valve (MV) is a fibrous region that separates the left ventricle (LV) and the left atrium with two leaflets. In the normal heart, the MV remains closed during systole and the MV plane rapidly descends with contraction of the ventricles; in early diastole the MV opens, and the MV plane quickly springs back to an equilibrium plane where it pauses during diastasis, and then ascends further during the left atrial kick in late diastole. The analysis of MV plane motion provides structural and functional systolic and diastolic information [1]. Its measurement yields peak displacement of the plane during systole, known as mitral annular plane systolic excursion (MAPSE), and the LV early diastolic velocity, known as LV e' , which is itself a key metric of diastolic function in echocardiography [2].
Cardiovascular magnetic resonance (CMR) is a reproducible imaging modality and is considered the reference standard for cardiac volume assessment. Its accuracy and reliability hold promise for serial examinations of MAPSE and LV e' , as already reported in our previous work [3]. These metrics were obtained by tracking the MV insertion points in every frame in long-axis cine images. MV plane tracking has also been used to enable slice-following for assessment of valvular flow with a phase-contrast sequence, either retrospectively [4] or prospectively [5], where it allows an evaluation of mitral regurgitation, which would not be possible without valve tracking. The MV plane location can assist any automated segmentation of the LV or left atrium, as it demarcates these chambers [6], and its tracking can provide an estimate of global longitudinal strain [7]. Finally, MV plane dynamics could be useful in providing information on the timing of the cardiac rest-periods, which is important in CMR [8][9][10].
Even with significant improvements in semi-automated tracking methods of the MV points in cine CMR [11][12][13][14][15][16], and validation of clinical metrics against echocardiography [3,16], MV plane tracking still requires a manual initialization and refinement. A fully automated, fast and accurate method for tracking the MV points using standard clinical CMR images is lacking. The rapid evolving field of deep learning has great potential to provide such method. In CMR deep learning applications [17], and deep learning in general, it is strongly suggested to have a large cohort of training data coming from different centers, vendors, pathologies, and manually labeled by different experts. Such diversity in the data leads to a consistent, robust method to perform a determined task.
In this study, we develop MVnet, a dual-stage deep learning approach for MV tracking using residual neural networks, trained and tested in a multi-center, multivendor population of 703 patients, with a wide range of pathologies, and manually labeled by 10 experts. Additionally, we show the importance of using a rich training dataset by structurally analyzing two main scenarios with direct impact on the learning and application: a singlecenter, single-vendor, single-expert dataset compared to a multi-center, multi-vendor, multi-expert dataset. We also evaluate the derived clinical parameters (MAPSE and LV e') in contrast with their counterpart by experts. Finally, with this dual-stage pipeline, we present technical novelty by applying an automated linear transformation to the images after the first stage, which substantially reduces variance and boosts performance in accuracy.

Manual annotation
Using an analysis tool, freely available in the software Segment [34], 10 trained human observers with different backgrounds and levels of CMR experience, ranging from 4 to 20 years, performed these annotations in all temporal frames of 458 subjects and only at end-diastolic and end-systolic frames of 245 subjects (belonging to the Lund dataset). The experts placed the MV insertion points in two-chamber view data, as anterior and inferior points, and in four-chamber view data, as lateral and septal points, at each phase in the cardiac cycle, resulting in two points for each view in each image (Fig. 1a). With no society recommendations on how to annotate and track the MV points, the Yale and Lund datasets were annotated according to two different principles. A single observer at Yale University (Yale observer) placed the MV points in the intersection between the MV and the LV myocardium, whereas the 9 observers at Lund University (Lund observers) placed the MV points at the most basal part of the compact LV myocardium. Both ways yielded similar MV motions but with a different reference point. However, due to the inherent observer bias and the high number of Lund observers, these annotations presented more variability. Such differences in manual annotation are very common in deep learning applications.

Dual-stage residual neural network
Based on our recent work to track the tricuspid valve [19] as proof of principle, we adapted and expanded our dual-stage residual neural network to track the MV. The proposed framework (Fig. 2) involved two stages for each chamber view. The first stage uses a trained network to track the MV points with sufficient accuracy to define the MV plane, and the second stage uses these points to perform a linear transformation on the original images to standardize the images regarding location and orientation of the valve plane. The second network then predicts the MV points in the automatically preprocessed image with higher accuracy. Each stage uses an artificial Table 1 Multi-vendor, multi-center database (n = 703) description by pathology, gender and age The mean ± standard deviation are reported for age.
Subjects were scanned at 1.5T magnetic resonance scanners (n = 661). †  neural network with a residual framework of 50 layers, ResNet-50 [35], adapted to predict a series of four numbers (representing two pairs of coordinates {x, y} ) on an individual grayscale cine image (Fig. 3).

Stage 1
The first stage only involves the use of one network to perform a coarse annotation in all temporal frames. The network was trained on manually annotated images resized to 160 × 160 with cubic interpolation, leading to anisotropic pixel dimensions. This initial coarse annotation serves to localize the MV plane and follow its motion.

Stage 2
The second stage uses the output coordinates of the previous stage to apply a linear transformation task, i.e., standardize the cine, by (i) interpolating each image to a spatial isotropic resolution of 0.75 mm, (ii) rotating each image such that the MV is oriented horizontally with the apex pointing down and with the anterior and lateral points placed on the left and the inferior and septal points placed on the right, for two-chamber and four-chamber views, respectively, and (iii) cropping each image around the MV center for a size of 118 × 162.

Pipeline
As an overview, a given input cine of a chamber view with different parameters (Fig. 2a) is fed into the first stage with a fixed size for the network to read (Fig. 2b). The first trained network (Fig. 3) outputs the predicted points with acceptable accuracy. The coarse annotation on the first temporal frame is used as a reference to determine the orientation and centering tasks, i.e., to horizontally orient the MV in the center, whereas the coarse annotation on the remaining temporal frames determine the motion direction, i.e., to ensure the apex points down. The automated linear transformation task from the second stage standardizes the spatial resolution, the heart orientation and positioning and the type of cropping (Fig. 2c). The second network (Fig. 3), trained on linearly transformed images, is then used to track the MV points with high accuracy (Fig. 2d).
These new predicted points are readjusted, with an inverse standardization transformation, to match the original input cine images (Fig. 2e). This stage can be performed in an iterative manner as indicated, i.e, the points predicted by the second stage can initialize again the second stage (linear transformation task and second network) to yield a more accurate annotation.

Network training
Both networks were trained with the same process but independently for each stage and chamber view. Transfer learning for weights initialization was applied to reduce convergence time and aid the learning process, using a ResNet-50 pretrained on more than one million images from the ImageNet database [36] for a classification task  into 1000 different categories of objects and animals photographs. Standard data preparation involved pixel distribution normalization by the median and interquartile range to ensure generalizability [37]. Training data was augmented 10 times by scaling ±10%, rotating ±10 • and translating ±3 pixels to add more inherent variability in the first stage, and to compensate for any error introduced from the first stage to the second stage, i.e., a slight misalignment of the MV plane center from the ground truth. The mean square error loss function was optimized by the Adam method [38] with a learning rate of 1 × 10 −4 , for 20 epochs and mini-batch size of 8. The pipelines were developed on MATLAB R2019b (Mathworks, Natick, MA) with a NVIDIA Titan RTX GPU.

Clinical metric derivation
The MV plane displacement was calculated for each chamber view as the average perpendicular distance of the MV points to the initial plane set in end-diastole (Fig. 1b). The resultant MV plane displacement was measured as the average from both chamber views. MAPSE was derived from the maximal MV displacement and LV e' was the second global peak of the time-derivative of the displacement curve.

Data variability analysis
Both Yale and Lund datasets were trained and tested separately and collectively to assess the influence of employing a single-center, single-vendor, single-expert dataset (Yale) and a multi-center, multi-vendor, multiexpert dataset (Lund

Spatial annotation accuracy
The test set of each MVnet was evaluated against manual annotation. The spatial annotation error was measured in each chamber view with: (i) the Euclidean distance, which measured in millimeters the distance error between the ground-truth and predicted annotations, (ii) the angular distance, which measured in degrees the inner intersection angle of the groundtruth and automated planes defined by both MV points, and (iii) the MV displacement, which measured in millimeters the difference in the MV plane displacement in every temporal frame between the ground-truth and automated displacement curves. All metrics were calculated comparing the ground-truth with the predicted annotations on the original input images.

Clinical-metric accuracy
Clinical metric (MAPSE and LV e') comparisons were performed using linear regression analysis, Bland-Altman plots, and the intra-class correlation coefficient (ICC) between the automated and manual measures.
As 47 subjects (out of 111) from the Lund test set were only annotated at end-diastole and end-systole, minimum amount of temporal frames required for MAPSE, LV e' comparisons were not performed in that subset. The threshold for statistical significance was considered to be p<0.05 for this study.

Dual-stage influence
Both spatial annotation and clinical-metric accuracy evaluations were performed for the results of the first stage (stage 1), both stages (stage 1 + 2) and an additional iteration (stage 1 + 2 + 2), to show the influence of the additional stage (Fig. 2c, d).

Inter-observer variability analysis
For inter-observer variability analysis, an additional pair of observers performed manual annotations in a randomly chosen of 50 subjects. Specifically, a second observer of Yale dataset (Yale observer 2) and Lund dataset (Lund observer 2) manually annotated a subset of 25 subjects each on its corresponding dataset and the same evaluation was assessed.

Implementation
MVnet was implemented in the medical image analysis software Segment v3.

Dual-stage influence
The annotation accuracy of each MVnet after the first stage (stage 1), the second stage (stage 1+2) and an iteration of the second stage (stage 1+2+2) is reported in Fig. 4, in terms of spatial annotation and clinical-metric accuracy, between every ground-truth and predicted measures of an MVnet with its corresponding test set.
In terms of Euclidean and angular distance agreement (Fig. 4a, b), the mean percentage error of both metrics from the first to the second stage decreased 34%, 14%, and 10% for MVnet Yale , MVnet Lund and MVnet Mixed , respectively. In a similar manner, the overall error from the first stage to the second iteration of the second stage decreased 41%, 15%, and 11%, respectively. Accuracy and agreement were consistently improved after each stage. No substantial differences were found in a specific MV point and chamber view (Fig. 5), meaning they all achieved a similar accuracy. However, the angular distance error on the four-chamber view was larger with a higher discordance in the septal point placement from the difference in the two annotation principles. Regarding MV displacement, MAPSE and LV e' agreement (Fig. 4c-e), with the addition of the second stage, the mean percentage error decreased 42%, 29%, and 11% for MVnet Yale , MVnet Lund and MVnet Mixed , respectively, whereas the iteration of the second stage reduced the initial error 39%, 32%, and 18%. This iteration improved the agreement with clinical metrics, except for MVnet Yale which showed a small reduction in agreement. Although the improvement of the iteration was moderate in the spatial annotation accuracy for MVnet Lund and MVnet Mixed , the clinical-metric accuracy was further improved.

Dataset variability influence
With the output of each MVnet considered to be the predictions after an iteration of the second stage (stage 1+2+2), the accuracy of every model evaluated on every test set (the Yale, Lund and mixed test sets) is reported as a heatmap for each metric in Fig. 6, where the best performance among the models is highlighted in blue and the worst in red. Although both MVnet Yale and MVnet Lund performed well on their corresponding datasets, with generally superior performance for Lund, the accuracy was noticeably reduced when the Yale (or Lund) network was applied to Lund (or Yale) test dataset, with Lund (or Yale) annotations. Then there was a 2 to 3 times-fold error increase in Euclidean and angular distances, mainly as a result of difference in annotation pattern. The MV displacement metric, however, achieved a better agreement in this scenario, but still with lower performance.
While MVnet Yale achieved the lowest Euclidean distance error (although trained on less data), its error on the Lund test set was 3.3 times higher. Vice versa,

Clinical-metric accuracy
Choosing MVnet Mixed as the proposed pipeline, the accuracy of clinical metrics for MAPSE and LV e' of the automated method compared against the manual metrics, evaluated in the mixed test set, are shown in Table 3.
The model estimated both metrics with excellent agreement with a mean error of −0.2±1.3 mm (ICC = 0.94) and 0.0±1.5 cm/s (ICC = 0.93), for MAPSE and LV e' , respectively. The regression and Bland-Altman plots for the MV parameters between the automated and manual measurements are presented in Fig. 7, where an excellent correlation and good agreement were observed for each of the three parameters, including the MV displacement. All reported correlation values are significant (p<0.0001).

Inter-observer variability analysis
The inter-observer clinical-metric agreement is shown in Table 4. The results were on par with the automated predictions, with a mean error of −0.3±1.2 mm (ICC = 0.95) and 0.3±1.7 cm/s (ICC = 0.89), for MAPSE and LV e' , respectively. In a similar manner, the regression and Bland-Altman plots for the clinical parameters between the second and first group of observers are presented in Fig. 8. All reported correlation values are also significant (p<0.0001).

Spatial annotation accuracy comparison
The spatial annotation agreement of the automated annotations by MVnet Mixed against manual annotations as well as the inter-observer spatial variability are presented in Table 5. Interestingly, while the Euclidean and angular distance errors seem to lower the pipeline performance, the automated reproducibility of clinical metrics is very high resulting in a very low MV displacement error. This shows that tracking the motion is more reproducible, and more relevant, than tracking the spatial location of each individual point. Therefore, the individual distance errors are not necessarily a good metric for evaluating the performance of a valve plane movement and may be misleading when the accuracy of the annotation model is very high. This discrepancy was also present in the studied inter-observer variability analysis as the Yale observer 2 yielded a Euclidean distance error of 2.7 ± 2.6 mm against ground truth, whereas the error of the Lund observer 2 was 3.9 ± 3.0 mm. Although this difference may flag a potential pitfall in the manual annotation, the clinical-metric agreement showed the opposite as the latter achieved an average ICC = 0.97, whereas the former an average ICC = 0.82, indicating that annotation consistency along temporal frames prevails above a specific annotation pattern.
Additional movie files demonstrated MV tracking in cines from the Yale test set (Additional file 1) and in the Lund data set (Additional file 2), using the automated annotations by MVnet Mixed . Additional files also show further analysis of the inter-pipeline variability (Additional file 3) and clinical-metric agreement of LV s' and LV a' , compared with the inter-observer variability (Additional file 4).

Discussion
In this work, we proposed a dual-stage residual learning framework, MVnet, for time-resolved annotation of the MV in two-chamber and four-chamber views from standard long-axis cine CMR images. The proposed method was fast, fully automated and showed excellent agreement with manual annotation by expert readers in term of valve points positioning as well as with the subsequently extracted LV function parameters. Tedious manual labor is not needed, reducing the processing time from 8 to 20 minutes to 5 or 1 second with batch processing. This enables fully-automated, accurate, fast, and reproducible assessment of LV function in clinical routine. Moreover, this method can be applied retrospectively to any two and four-chamber image acquisition, which are routinely acquired in a standard CMR exam. Additionally, we have systematically shown the advantages and pitfalls of using a single-center, single-vendor, single-expert dataset compared with a multi-center, multi-vendor, multi-expert dataset. The technical contribution of this work includes the accuracy improvement provided by the second stage. While both networks were trained under the same domain and task, meaning each of them predicts two pairs of coordinates in a given image, the second network processes only highly standardized images, obtained by applying the proposed linear transformation. We showed how much each additional stage, up to one iteration (stage 1+2+2), improved the overall performance. While this assessment could be performed with more iterations (e.g., 1+2+2+2), the accuracy does not further improve, as evaluated in our proof-of-principle work [19]. This adoption of a dual-stage deep learning pipeline in biomedical applications has recently gained some interest to compensate for the technical limitation of one single model, even when trained with a large amount of data. For instance, some dual-stage pipelines with a segmentation task [39,40] first localize the region of interest with a bounding box and then segment the bounded image, which improved the accuracy from single models. One limitation for an approach consisting of two different tasks is that an error for the localization task can hamper the performance. In our case, we employed the same annotation task for both stages, with a good accuracy on the first stage and an increased performance with the second one. The first stage is enriched by the large-scale study, from different centers, vendors and observers, and data augmentation to further increase the sample diversity [41] allowing model generalization [42]. This approach is commonly employed by one-stage deep learning applications where it is believed that data diversity will solve the image processing task. The specific technical contribution of our work is that we used the benefit of data diversity but also further increased the accuracy by using these results to standardize the images in a novel way. Our proposed method achieved human-level performance with high robustness and consistency across centers, vendors and observers in a diverse range of conditions. The value of including images from different centers, vendors and conditions aids the generalization of the trained model through seeing all potential variations of the images in a real case scenario [42], whereas including different observers reduces the bias of a single observer [43]. Although this is not the first data diversity study in CMR, as another multi-vendor, multi-center study [44] also evaluated the incremental training strategy for a LV segmentation task with a thorough assessment, our work additionally assessed the annotation pattern diversity. We showed how a model trained on one center could yield a high accuracy in its own dataset but underperformed in datasets from different centers, to achieve generalizable multi-center development [37], and how a different annotation pattern generated discordance against another pattern, even with diverse training inputs.
We evaluated the performance against ground truth with a wide range of parameters including the Euclidean distance, angular distance, MV displacement, MAPSE and LV e' . We showed that the proposed pipeline yielded high reproducibility in a very demanding task with excellent ICC for MV displacement, MAPSE and LV e' , and how the Euclidean distance error may be misleading. This error discrepancy between training labels and clinical metrics has been noted by others. A recent learningbased approach for myocardial segmentation on T1 maps [45] achieved a near-perfect accuracy on estimating T1 values of LV myocardium, even while the segmentation accuracy only yielded a Dice similarity coefficient [46] of 0.85. Although this metric could be misleading, the clinical-metric agreement prevails over image processing performance, as the most important metric. In our study, we showed how a Euclidean distance error could be higher, but the MV plane displacement error was lower.

Limitations
One limitation of our study is the lack of consensus on how to annotate the MV points as a specific pattern   Table 5 Automated and manual inter-observer spatial annotation agreement The mean error ± standard deviation are reported for Euclidean and angular distances, and mitral valve displacement error between ground-truth and automated ( MVnet Mixed ) or a pair of second observers manual annotations. MV mitral valve would have homogenized the bias and yielded a lower error. However, our thorough analysis demonstrated that the difference in the annotation patterns did not hamper the performance but instead the MVnet reduced such biases and learned a consensual pattern from the diverse observers, confirming the value of multiple observers for a deep learning application [43]. Another limitation of our study is the missing benefit from incorporating a recurrent neural network architecture to learn spatiotemporal dependencies across the cardiac cycle instead learning from one temporal frame at a time. However, the technical contribution of this work relies on the automated image standardization algorithm to boost both image processing performance and clinical metric agreement, implementing this pipeline to a recurrent neural network architecture may also benefit its learning.
The clinical value of this work is that it provides an automated method for MV plane motion, including established metrics such as MAPSE and LV e' , and also utility for slice-following applications, automated cardiac rest-period identification, among others. Additionally, it does not need any added work in the clinical routine and the post-processing cost is negligible.

Conclusion
MVnet is a deep learning approach for automated delineation of MV points for MV plane displacement evaluation in CMR long-axis cine images. The method is able to track the MV points, accurately, rapidly and consistently. This will improve the feasibility of CMR methods which rely on valve tracking, such as measurement of e' , or slicefollowing phase-contrast, and increase their utility in a clinical setting.