Quantification of LV function and mass by cardiovascular magnetic resonance: multi-center variability and consensus contours

Background: High reproducibility of LV mass and volume measurement from cine cardiovascular magnetic resonance (CMR) has been shown within single centers. However, the extent to which contours may vary from center to center, due to different training protocols, is unknown. We aimed to quantify sources of variation between many centers, and provide a multi-center consensus ground truth dataset for benchmarking automated processing tools and facilitating training for new readers in CMR analysis.
Methods: Seven independent expert readers, representing seven experienced CMR core laboratories, analyzed fifteen cine CMR data sets in accordance with their standard operating protocols and SCMR guidelines. Consensus contours were generated for each image according to a statistical optimization scheme that maximized contour placement agreement between readers.
Results: Reader-consensus agreement was better than inter-reader agreement (end-diastolic volume 14.7 ml vs 15.2–28.4 ml; end-systolic volume 13.2 ml vs 14.0–21.5 ml; LV mass 17.5 g vs 20.2–34.5 g; ejection fraction 4.2 % vs 4.6–7.5 %). Compared with consensus contours, readers were very consistent (small variability across cases within each reader), but bias varied between readers due to differences in contouring protocols at each center. Although larger contour differences were found at the apex and base, the main effect on volume was due to small but consistent differences in the position of the contours in all regions of the LV.
Conclusions: A multi-center consensus dataset was established for the purposes of benchmarking and training. Achieving consensus on contour drawing protocol between centers before analysis, or bias correction after analysis, is required when collating multi-center results.


Background
Left ventricular (LV) mass and volumes are essential for the management of patients with cardiovascular disease. In particular, LV mass (LVM) is an independent predictor of cardiovascular events [1], and end-diastolic volume (EDV) and end-systolic volume (ESV) are associated with adverse remodeling [2]. Cardiovascular magnetic resonance (CMR) is currently the most accurate and reproducible method for quantifying LV mass and volumes [3]. CMR is non-invasive, does not require geometrical assumptions, is non-ionising and has high signal-to-noise ratio. Because of these advantages, CMR is becoming widely used for the measurement of ventricular volumes, function and mass in many clinical centers, as well as in large research studies including the Multi-Ethnic Study of Atherosclerosis (MESA) [4] and the UK Biobank [5].
LV mass and volume quantification requires accurate delineation of the blood pool and myocardium. Although the contrast between flowing blood and the myocardium in steady-state free precession (SSFP) images is typically excellent, the precise placement of the contours is reader dependent [6]. High reproducibility of LV mass and volume measurement based on cine CMR has been shown within single centers [7,8], but differences in training and standard operating procedures may occur between centers [6]. A multi-center consensus ground truth dataset would be valuable for evaluating sources of variability, training new readers, establishing standard protocols for multi-center studies and validating new computer algorithms for automated contouring. Such a dataset is difficult to establish, due to the time-consuming nature of manual contouring. Although there are several resources that offer large breadth [5,9,10], no resource offers depth of expert analysis from multiple centers. Greater depth of readers is valuable for evaluating sources of variation and for deriving an unbiased consensus. The aim of this study was to develop a consensus ground truth LV contour dataset for SSFP cine images, derived from expert readers representing seven independent CMR centers from many countries around the world, in accordance with the SCMR post-processing guidelines [11].

Participants
Cine CMR images from 15 subjects (five healthy volunteers, six patients with myocardial infarction, two patients with heart failure, and two patients with LV hypertrophy) were included in this study. CMR images were acquired with contiguous short-axis slices and 2–3 long-axis slices in accordance with SCMR guidelines using scanners from three different vendors (4 GE, 5 Siemens and 6 Philips). Spatial resolution varied with FOV, ranging from 92 × 72 to 280 × 280 mm² (see Table 1). Temporal resolution was typically 20–30 cardiac frames, except for one case with 60 cardiac frames. Slice thickness was either 8 or 10 mm. Short-axis view series covering the LV from apex to base were defined in 10–15 slices. Anonymized images were contributed to the Cardiac Atlas Project database with the approval of local institutional review boards. Written informed consent was obtained from all participants.

Analysis
Seven expert readers, representing seven CMR core laboratories from five countries around the world (two USA, Canada, two UK, Germany and Netherlands), analyzed all cases independently. The contours were reviewed by the core laboratory principal investigator and represented the standard practice of each core laboratory. Each center was able to use its usual software, with the condition that all contours were placed manually, or manually corrected if initial contours were found automatically. Contouring was performed in accordance with the SCMR guidelines on standardized image interpretation and postprocessing of CMR images [11]. Trabeculae and papillary muscles were included in the blood pool and excluded from the LV mass. No attempt was made to train the readers, influence their analysis or achieve consistent results between readers.
Epicardial and endocardial contours were drawn on the short-axis slices at end-diastole (ED) and endocardial contours drawn at end-systole (ES). The ED and ES frames were pre-determined for all readers by the coordinating center, based on smallest area in a mid-ventricular slice. All readers were asked to contour slices covering the whole ventricle from apex to base, but there was no restriction on which slices to include or exclude. Readers used a range of software packages. Two readers used OsiriX (Pixmeo, Geneva, Switzerland), five readers used QMass (Medis, Leiden, the Netherlands), and two readers used CMR42 (Circle Cardiovascular Imaging Inc., Calgary, Canada). Two readers used two different software packages. Contours were imported from these software packages and pre-processed using Matlab R2010a (Mathworks, Natick, MA, USA). Consensus contours were only generated if most of the readers (i.e. four or more) contoured the slice; otherwise, no consensus contours were produced.

Consensus contour estimation
Consensus contours were estimated using the Simultaneous Truth and Performance Level Estimation (STAPLE) method [12]. This method calculated unique contours in each slice that maximized the conditional probability of the consensus given the readers' contours. Briefly, contours from each reader were first converted to binary images (1 for pixels within the contour, 0 for pixels outside the contour). Since the contour resolution was higher than the original images, the binary images were calculated at a resolution 4x higher than the original image. Given an estimate of the consensus contours, the sensitivity of each reader was calculated as the proportion of pixels inside the consensus contours that were also inside the reader contours. Similarly, the reader specificity was calculated as the proportion of pixels outside the consensus contours, which were also outside the reader contours. The STAPLE method uses Expectation-Maximization [13] to calculate the optimal consensus contour, as well as the reader sensitivity and specificity. There are two steps that are performed iteratively until convergence. The first step (Expectation) estimates the consensus probability given the reader contours and current estimates of sensitivity and specificity. The second step (Maximization) updates the sensitivity and specificity of each reader based on this consensus probability. The result is not the same as simple averaging or pixel voting, since specificity and sensitivity behave as weights during the optimization process, which are not assumed to be equal across all readers. Instead, the voting solution is used as the initial estimate to start the iteration. The STAPLE method has been successfully applied in several medical imaging applications, and was recently used to estimate consensus ground truth contours from automated CMR analysis methods [9].
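The two alternating steps described above can be illustrated with a minimal sketch of binary STAPLE for a single slice. This is not the implementation used in the study: the function name `staple_binary`, the flattened-mask input, the fixed foreground prior and the convergence tolerance are all assumptions made for illustration, and the rasterization of contours to 4x-resolution binary images is not shown.

```python
from typing import List, Tuple


def staple_binary(
    masks: List[List[int]],
    max_iter: int = 100,
    tol: float = 1e-6,
) -> Tuple[List[float], List[float], List[float]]:
    """Binary STAPLE sketch on flattened masks (one list of 0/1 per reader).

    Returns (w, p, q): per-pixel consensus probability, per-reader
    sensitivity and per-reader specificity.
    """
    n_readers = len(masks)
    n_pixels = len(masks[0])
    eps = 1e-12  # guards against division by zero

    # Initial estimate: the voting solution (fraction of readers marking a pixel).
    w = [sum(m[x] for m in masks) / n_readers for x in range(n_pixels)]
    # Assumption: the foreground prior is fixed to the mean of the initial votes.
    prior = sum(w) / n_pixels

    p = [0.0] * n_readers
    q = [0.0] * n_readers
    for _ in range(max_iter):
        # Maximization: update sensitivity p and specificity q from the current w.
        sum_w = sum(w)
        sum_not_w = n_pixels - sum_w
        for r, m in enumerate(masks):
            p[r] = sum(wx for wx, d in zip(w, m) if d == 1) / (sum_w + eps)
            q[r] = sum(1.0 - wx for wx, d in zip(w, m) if d == 0) / (sum_not_w + eps)

        # Expectation: consensus probability given the reader masks, p and q.
        new_w = []
        for x in range(n_pixels):
            a = prior          # evidence that the pixel is inside the consensus
            b = 1.0 - prior    # evidence that the pixel is outside the consensus
            for r, m in enumerate(masks):
                if m[x] == 1:
                    a *= p[r]
                    b *= 1.0 - q[r]
                else:
                    a *= 1.0 - p[r]
                    b *= q[r]
            new_w.append(a / (a + b + eps))

        change = max(abs(n - o) for n, o in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return w, p, q
```

Thresholding `w` at 0.5 gives a binary consensus mask whose boundary can be traced as the consensus contour. Readers who agree more closely with the evolving consensus obtain higher sensitivity and specificity and therefore carry more weight, which is why the result can differ from simple voting.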

Cavity volumes and myocardial mass
LV cavity volumes at ED (EDV) and ES (ESV) were computed by slice summation. Two-dimensional cavity areas were multiplied by the inter slice distance to compute a slice volume. The myocardial mass (LVM) was defined at ED by subtracting EDV from epicardial volume and multiplying by 1.05 g/ml.
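As a worked illustration of slice summation, the sketch below computes a volume from per-slice areas and derives LV mass with the 1.05 g/ml factor given above. The function names and the example areas are hypothetical, and the ejection fraction helper uses the conventional definition EF = (EDV − ESV) / EDV, which is assumed here rather than stated in the text.

```python
from typing import Sequence

MYOCARDIAL_DENSITY_G_PER_ML = 1.05  # factor stated in the text


def slice_summation_volume_ml(areas_mm2: Sequence[float], slice_distance_mm: float) -> float:
    """Sum of (slice area x inter-slice distance), converted from mm^3 to ml."""
    return sum(area * slice_distance_mm for area in areas_mm2) / 1000.0


def lv_mass_g(
    epi_areas_mm2: Sequence[float],
    endo_areas_mm2: Sequence[float],
    slice_distance_mm: float,
) -> float:
    """LV mass at ED: (epicardial volume - EDV) x 1.05 g/ml."""
    epi_volume = slice_summation_volume_ml(epi_areas_mm2, slice_distance_mm)
    edv = slice_summation_volume_ml(endo_areas_mm2, slice_distance_mm)
    return (epi_volume - edv) * MYOCARDIAL_DENSITY_G_PER_ML


def ejection_fraction_percent(edv_ml: float, esv_ml: float) -> float:
    """Conventional EF definition (an assumption; not spelled out in the text)."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Hypothetical example: three slices, 10 mm inter-slice distance.
endo_ed = [1000.0, 2000.0, 1500.0]   # endocardial areas at ED, mm^2
epi_ed = [2000.0, 3000.0, 2500.0]    # epicardial areas at ED, mm^2
edv = slice_summation_volume_ml(endo_ed, 10.0)
lvm = lv_mass_g(epi_ed, endo_ed, 10.0)
```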

Functional assessment
Root mean squared errors (RMSE) in volumes and mass were computed to measure the agreement between each reader and all the other readers. This was defined as

$$E_i(F) = \sqrt{\frac{1}{N(R-1)} \sum_{j \neq i} \sum_{k=1}^{N} \left(F_{ik} - F_{jk}\right)^2}$$

where E_i indicates the RMSE for reader i, j indicates all other readers, F indicates either EDV, ESV, LVM or ejection fraction (EF), R is the number of readers, k indicates the cases, and N is the number of cases. A similar RMSE was also applied to the consensus, denoted by E_C(F), to measure the agreement between the consensus and all readers:

$$E_C(F) = \sqrt{\frac{1}{NR} \sum_{i=1}^{R} \sum_{k=1}^{N} \left(F_{Ck} - F_{ik}\right)^2}$$

Smaller values of E_C(F) compared to E_i(F) for all i = 1, 2, … , R were taken to indicate functionally acceptable consensus contours.
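The agreement measure can be sketched in a few lines of code. The version below assumes that the squared differences are averaged over all other readers and all cases, i.e. a normalisation by N(R − 1) for a reader and by NR for the consensus; this normalisation is inferred from the symbol definitions in the text. The function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def reader_rmse(values: Sequence[Sequence[float]], i: int) -> float:
    """RMSE between reader i and all other readers.

    values[r][k] is the measurement F (e.g. EDV) of reader r for case k.
    Assumed form: sqrt( sum_{j != i} sum_k (F_ik - F_jk)^2 / (N * (R - 1)) ).
    """
    n_readers = len(values)
    n_cases = len(values[i])
    total = sum(
        (values[i][k] - values[j][k]) ** 2
        for j in range(n_readers)
        if j != i
        for k in range(n_cases)
    )
    return math.sqrt(total / (n_cases * (n_readers - 1)))


def consensus_rmse(consensus: Sequence[float], values: Sequence[Sequence[float]]) -> float:
    """RMSE between the consensus and all readers.

    Assumed form: sqrt( sum_i sum_k (F_Ck - F_ik)^2 / (N * R) ).
    """
    n_readers = len(values)
    n_cases = len(consensus)
    total = sum(
        (consensus[k] - values[r][k]) ** 2
        for r in range(n_readers)
        for k in range(n_cases)
    )
    return math.sqrt(total / (n_cases * n_readers))


# Hypothetical EDV values (ml): three readers, two cases.
edv_by_reader = [[100.0, 120.0], [110.0, 120.0], [90.0, 130.0]]
edv_consensus = [100.0, 123.0]
e_0 = reader_rmse(edv_by_reader, 0)
e_c = consensus_rmse(edv_consensus, edv_by_reader)
```

In this toy example `e_c` is smaller than `e_0`, which is the pattern used in the study as the criterion for a functionally acceptable consensus.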

Visual assessment
An independent reader with over 10 years of experience in CMR, who was not affiliated with any of the participating core laboratories, visually assessed the consensus contours, scoring each as either acceptable or unacceptable according to whether the contour was clinically plausible.

Statistics
Bland-Altman analysis was performed to evaluate variation in volumes and mass across all readers relative to the consensus. The limits of agreement were defined as the 95 % interval around the bias. Individual reader bias was quantified by the mean of the differences from the consensus, and reader precision was quantified by the standard deviation of the differences. Volumes and mass estimated from the consensus contours were reported with the standard error estimated between the readers and the consensus. Statistical analysis was performed using the open source R statistics package (The R Foundation for Statistical Computing, ver. 3.1.1).
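Bias and precision as defined above can be computed directly from paired measurements. The sketch below is illustrative only (the study used R); the function name, the example values and the use of the conventional 1.96 multiplier for the 95 % limits of agreement are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def bland_altman(reader: Sequence[float], consensus: Sequence[float]) -> Dict[str, float]:
    """Bias, precision and 95 % limits of agreement of one reader vs the consensus."""
    diffs = [r - c for r, c in zip(reader, consensus)]
    bias = mean(diffs)        # mean of the differences
    precision = stdev(diffs)  # sample standard deviation of the differences
    half_width = 1.96 * precision  # assumption: conventional normal-theory multiplier
    return {
        "bias": bias,
        "precision": precision,
        "lower_limit": bias - half_width,
        "upper_limit": bias + half_width,
    }


# Hypothetical EDV values (ml) for one reader and the consensus over three cases.
result = bland_altman([100.0, 110.0, 120.0], [98.0, 112.0, 115.0])
```

A reader with a large bias but small precision is contouring consistently, only with a systematic offset from the consensus.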

Visual assessment
The independent reader found no unacceptable consensus contours in any of the 15 cases. Figure 1 shows a representative case of consensus contours estimated in this study. Although large disagreements between readers could be found in some of the apex, base and outflow tract slices, the consensus contours generated from these difficult slices (shown in Fig. 2) were visually acceptable. Table 2 shows the estimated consensus LV function per case in terms of EDV, ESV, LVM and EF. Standard errors between the readers and the consensus were used to indicate confidence intervals for the estimated values.

Functional assessment
The RMSE values for the consensus (E_C) were always smaller than the RMSE values for any reader (E_i). For EDV, E_C was 14.7 ml while E_i ranged from 15.2 to 28.4 ml. For ESV, E_C was 13.2 ml while E_i ranged from 14.0 to 21.5 ml. For LVM, E_C was 17.5 g, while E_i ranged from 20.2 to 34.5 g. For EF, E_C was 4.2 % while E_i ranged from 4.6 to 7.5 %.

Bias and precision
Figure 3 shows reader bias and precision using the estimated consensus values (Table 2) as the reference. All readers showed good precision, with low standard deviations of the differences. The precision in EDV ranged from 5.0 to 11.6 ml (average 8.9 ml); precision in ESV ranged from 7.0 to 11.8 ml (average 9.5 ml); precision in LVM ranged from 10.0 to 12.9 g (average 10.9 g); precision in EF ranged from 1.5 to 5.6 % (average 3.6 %). However, differences in analysis protocols between readers were evident, with some readers exhibiting smaller endocardial and larger epicardial contours, whilst others showed larger endocardial and smaller epicardial contours. The bias in EDV ranged from −36.6 to 40.5 ml (average 0.9 ml); bias in ESV ranged from −32.9 to 41.2 ml (average 0.8 ml); bias in LVM ranged from −44.5 to 59.6 g (average 0.7 g); bias in EF ranged from −11.7 to 12.8 % (average 0.0 %).
Individual reader reports were generated automatically. Figure 4 shows an example report demonstrating similarity of the reader contour with the consensus, with the disparity between expert readers (dark red bands).

Discussion
Readers were consistent within themselves, which was indicated by the relatively small standard deviations (precision) from the consensus within each reader (Fig. 3). However, different readers could be larger or smaller than the consensus (bias). This was due primarily to differing practices at each core lab, where contouring was consistently smaller or larger compared with the other readers. We also examined other sources of possible bias, in particular in the outflow and apical slices (Fig. 2), but the bias contributions from these slices to the final mass and volume estimate were insignificant. For clinical studies collating results across core labs, steps should be taken to reduce bias between core labs, either prior to the analysis (by training) or after analysis (by bias correction). This study provides a standard set of cases that could be used as a training set. Alternatively, automated post-hoc bias correction methods [14] can also be applied.
Fig. 1 Estimated consensus contours from a myocardial infarction case (panels: ED and ES)

Consensus contour quality
The consensus LV function estimates agreed better with the readers as a group than any individual reader did: the RMSE values for EDV, ESV, mass and EF were all smaller for the consensus than for any reader.
As demonstrated by Fig. 1, the consensus contours were visually acceptable in all cases. The greatest disagreements between readers appeared in areas where tissue contrast was low, such as in the apical slices and in the outflow tract. Even for these slices, the estimated consensus contours were visually acceptable. Note that the outflow tract was contoured as part of the LV cavity as recommended in the SCMR guidelines [11]. The STAPLE method was performed on each slice and each contour type independently, without taking into account any information about the geometry or anatomy of the myocardium or any pixel intensity values. The STAPLE method is not a vote counting mechanism, and therefore the functional consensus is not a simple average of mass or volume across readers. An example of the difference between voting and STAPLE consensus contours is shown in Fig. 5, demonstrating that STAPLE found an acceptable epicardial apex contour whereas voting did not.

Reader assessment
In this study, we developed a CMR resource dataset in which myocardial contours were defined through a consensus of seven independent expert readers. The dataset will be useful for training and assessing new readers on contouring CMR images. Feedback to new readers can include graphical displays of areas of maximum disagreement (Fig. 4). Quantitative feedback can be expressed in terms of the mean, maximum and standard deviation of the distance from the new contour to the consensus.
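The quantitative feedback described above can be sketched as follows. This is an illustrative approximation rather than the report generator used in the study: contours are assumed to be densely sampled lists of (x, y) points in mm, the distance for each point of the new contour is taken to the nearest sampled point of the consensus contour, and the function name and example coordinates are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def contour_distance_stats(new_contour: Sequence[Point], consensus: Sequence[Point]) -> Dict[str, float]:
    """Mean, maximum and SD of nearest-point distances from a new contour to the consensus."""
    distances = [
        min(math.hypot(x - cx, y - cy) for cx, cy in consensus)
        for x, y in new_contour
    ]
    return {
        "mean": mean(distances),
        "max": max(distances),
        "sd": pstdev(distances),  # population SD over the sampled points
    }


# Hypothetical example: points at 1, 2 and 3 mm from a straight consensus segment.
consensus_pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
new_pts = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
stats = contour_distance_stats(new_pts, consensus_pts)
```

Because the distance is measured to sampled points rather than to the continuous curve, the consensus contour should be sampled densely relative to the pixel size for the statistics to be meaningful.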
Differences in mass and volumes derived by new reader contours can be compared with Table 2. The CMR data are available on request from the Cardiac Atlas Project for the purpose of training new readers and benchmarking automated processing methods. Note the consensus is a fusion of a number of different core labs, and does not reflect the practice of any particular lab.

Limitations
Only fifteen cases of varying pathology were included in the study, due to the time-consuming nature of manual contouring. However, the purpose of the study was to provide a resource for assessing automated methods and facilitating training. We therefore chose increased depth of readers from different centers over breadth of subjects. This provides considerable power for analysis of differences [15,16]. Although resources are becoming available with many hundreds and even thousands of studies [5,17], these provide manual contours from a small number (typically one or two) of centers. Our study therefore provides a unique resource representing the largest number of expert readers to date.
The SCMR guidelines [11] allow either inclusion or exclusion of papillary muscles. Our study excluded papillary muscles from the LV mass, but papillary muscles are myocardial tissue, which ideally should be included. Forming a manual ground truth for papillary muscles from many expert readers would be very time-consuming, but very valuable for validating automated methods [18,19]. In the future, wider use of automated methods is likely to increase the number of centers quantifying papillary mass. Right ventricular and atrial consensus contours would also be useful to establish in the future.

Conclusion
We have estimated a set of consensus myocardial contours from SSFP CMR short-axis slices from expert readers representing experienced core labs around the world. The consensus contours agreed better with the readers in LV mass and volumes than the readers agreed with each other.