T1 mapping performance and measurement repeatability: results from the multi-national T1 mapping standardization phantom program (T1MES)

Background The T1 Mapping and Extracellular volume (ECV) Standardization (T1MES) program explored T1 mapping quality assurance using a purpose-developed phantom with Food and Drug Administration (FDA) and Conformité Européenne (CE) regulatory clearance. We report T1 measurement repeatability across centers describing sequence, magnet, and vendor performance. Methods Phantoms batch-manufactured in August 2015 underwent 2 years of structural imaging, B0 and B1, and “reference” slow T1 testing. Temperature dependency was evaluated by the United States National Institute of Standards and Technology and by the German Physikalisch-Technische Bundesanstalt. Center-specific T1 mapping repeatability (maximum one scan per week to minimum one per quarter year) was assessed over mean 358 (maximum 1161) days on 34 1.5 T and 22 3 T magnets using multiple T1 mapping sequences. Image and temperature data were analyzed semi-automatically. Repeatability of serial T1 was evaluated in terms of coefficient of variation (CoV), and linear mixed models were constructed to study the interplay of some of the known sources of T1 variation. Results Over 2 years, phantom gel integrity remained intact (no rips/tears), B0 and B1 homogenous, and “reference” T1 stable compared to baseline (% change at 1.5 T, 1.95 ± 1.39%; 3 T, 2.22 ± 1.44%). Per degrees Celsius, 1.5 T, T1 (MOLLI 5s(3s)3s) increased by 11.4 ms in long native blood tubes and decreased by 1.2 ms in short post-contrast myocardium tubes. Agreement of estimated T1 times with “reference” T1 was similar across Siemens and Philips CMR systems at both field strengths (adjusted R2 ranges for both field strengths, 0.99–1.00). Over 1 year, many 1.5 T and 3 T sequences/magnets were repeatable with mean CoVs < 1 and 2% respectively. Repeatability was narrower for 1.5 T over 3 T. Within T1MES repeatability for native T1 was narrow for several sequences, for example, at 1.5 T, Siemens MOLLI 5s(3s)3s prototype number 448B (mean CoV = 0.27%) and Philips modified Look-Locker inversion recovery (MOLLI) 3s(3s)5s (CoV 0.54%), and at 3 T, Philips MOLLI 3b(3s)5b (CoV 0.33%) and Siemens shortened MOLLI (ShMOLLI) prototype 780C (CoV 0.69%). After adjusting for temperature and field strength, it was found that the T1 mapping sequence and scanner software version (both P < 0.001 at 1.5 T and 3 T), and to a lesser extent the scanner model (P = 0.011, 1.5 T only), had the greatest influence on T1 across multiple centers. Conclusion The T1MES CE/FDA approved phantom is a robust quality assurance device. In a multi-center setting, T1 mapping had performance differences between field strengths, sequences, scanner software versions, and manufacturers. However, several specific combinations of field strength, sequence, and scanner are highly repeatable, and thus, have potential to provide standardized assessment of T1 times for clinical use, although temperature correction is required for native T1 tubes at least.


Introduction
T 1 mapping aids clinicians in the assessment and diagnosis of myocardial disease. However, measurement needs to be stable over time with transferable values. Knowledge of normal reference ranges would benefit from not requiring local healthy subject scanning, and the pooling of multiscanner datasets would have advantages such as increasing available sample sizes for the detection of subtle effects or subgroup analysis and increasing result robustness and generalizability, lowering the chance of unforeseen bias when compared to single-center data [1]. Combining results however introduces sequence, magnet, and field strength bias [2]. The field of T 1 mapping would therefore benefit from a "T 1 standard" to enable cross-center T 1 mapping data pooling and delivery [3]-like the international normalized ratio (INR) which makes it possible to adjust the dosing of vitamin K antagonists regardless of which laboratory has performed the test [4].
The T 1 Mapping and Extracellular volume (ECV) Standardization (T1MES) phantom program was established to explore T 1 mapping quality assurance at 1.5 T and 3 T and understand the feasibility of delivering a "T 1 standard" [5]. The first step toward the goal was development and mass-production of a phantom [5] and its European Union Conformité Européenne (CE) and United States Food and Drug Administration (FDA) regulatory clearance. In September 2015, cardiovascular magnetic resonance (CMR) centers worldwide joined the T1MES consortium and committed to submit a minimum of 12 months of center-specific T 1 mapping data. Data submitted have now been analyzed to explore phantom performance.
We report phantom data at 1 and 2 years using various T 1 mapping sequences, temperature sensitivity, and include platform performance, although we emphasize that comparison of different T 1 methods and systems was not the main aim, rather investigating long-term stability towards the "T 1 standard". This involved modeling some of the potential sources of the T 1 variation longitudinally and between T1MES centers to identify the most influential factors.

Methods
The development and description of the T1MES phantom ( Fig. 1) has been previously reported [5]. Briefly, the T1MES phantom was designed to be field-strength specific (i.e., separate 1.5 T and 3 T models). Each phantom contains four tubes representing human native blood/ myocardial T 1 and T 2 values (i.e., pre-gadolinium-based contrast agent [GBCA] values) and five tubes representing human post-GBCA blood/myocardial values. While the main aim of the present study was the collection and analysis of the multi-center data (see the "Methods part 2-Multi-center phantom testing" section), some other tests were applied to a small number of the phantoms during the 2 years to explore the utility of T1MES as a quality assurance device, and these tests are described here first ("Methods part 1-Evaluation of the phantom"). Imaging biomarker terms used follow the recommendations of the Quantitative Imaging Biomarkers Alliance (QIBA) of the Radiological Society of North America (RSNA) [6].

Methods part 1-Evaluation of the phantom Structural integrity
Gel integrity and aging were checked at each submission time point for participating sites through the manual inspection of localizers that formed part of the minimum dataset requirement for participation. In addition, a highresolution, isotropic, three-dimensional (3D) gradient echo sequence (0.42 mm 3 ) was run on four phantoms (three 1.5 T phantoms; one 3 T phantom) at baseline (October 2015) and at 2 years post manufacturing in each case using a 3 T MAGNETOM Skyra (Siemens Healthineers, Erlangen, Germany; software syngo MR D13C). The sequence acquired two overlapping slabs (due to scanner software constraints), each with two directions of phase encoding, a slow repetition time (repetition time, TR = 17 ms), and narrow sampling bandwidth (250 Hz/pixel) for better signalto-noise ratio (SNR). This sequence had weak T 1 and T 2 image contrast and was only for structural examination.
"Reference" rT 1 and rT 2 data Baseline (October 2015) "reference" T 1 and T 2 values (rT 1 , rT 2 ) were acquired at the Royal Brompton Hospital CMR Unit using basic single-slice TR = 10 s inversion recovery spin echo (IRSE, 8 inversion times [TI] from 25 to 3200 ms) and single-slice repetition time (TR) = 10 s SE (8 echo times [TE] from 10 to 640 ms) [5] respectively. These sequences were identically repeated at 2 years on the same three 1.5 T phantoms and on the same three 3 T phantoms sampled from the production batch. The identifying serial numbers of the three 1.5 T phantoms were 15E031, 15E033, and 15E034, and these phantoms were scanned on a 1.5 T MAGNETOM Avanto [Siemens Healthineers; software syngo MR B17A]. The three 3 T phantoms were 30E001, 30E017, and 30E018, and they were scanned on a 3 T MAGNETOM Skyra [Siemens Healthineers; software syngo MR D13C].

Temperature sensitivity
The following three methods were used: First, controlled-temperature experiments over the range 10-30°C were conducted at the United States National Institute of Standards and Technology (NIST) on six loose T1MES tubes at 1 year (Fig. 2i). T 1 and T 2 were measured at 10, 17, 20, 23, and 30°C on an VnmrJ4 30E017; 30E018 shown here) sampled out of the original batch in August 2017 (this is 2 years post manufacture) confirmed their structural integrity. The whole length of the phantom was imaged (three exemplar slices only shown here, represented by the green dashed lines). The internal tubes are labeled. Surrounding the tubes is a speckled pattern due to high density polyethylene (HDPE) macrobeads in an agarose/NiCl 2 mixture . Confluent bright patches between tubes represent patches of agarose/NiCl 2 mixture due to displacement of macrobeads. There were no signs of structural deterioration of the phantoms 2 years after manufacture. Middle panel: The nine tubes are supported on a translucent resin base composed of unsaturated polyester/styrene. The matrix fill is packed with compact HDPE pellets and agarose/NiCl 2 mixture. Right panel: The outer physical appearance (front and back surfaces) of a phantom (30E018) at 2 years post manufacture (the plastic packaging wrap around the bottle cap dates back to the time of manufacture). HDPE, high-density polyethylene; NiCl 2 , nickel chloride; PE, polyethylene; PVC, polyvinyl chloride small-bore scanner operating at 1.5 T (Varian Medical Systems, Palo Alto, California, USA) in a temperaturecontrolled environment using a fiber optic temperature probe. T 1 was measured by IRSE (TR = 10 s, TI = 50-3000 ms) and T 2 by SE (TR = 10 s, TE = 15-960 ms).
Third, for each T1MES phantom scan at all centers, temperature was measured using liquid crystal thermometers adhered to every phantom. These measurements were pooled and analyzed to derive temperaturecorrection algorithms (see Statistical Analysis).

B 0 and B 1 uniformity
These uniformities and the fundamental distortion of B 1 by water dielectric permittivity especially at 3 T had been tested at baseline (October 2015, previously reported [5]). These uniformities were mapped later to check against "cracking" of the gel and subsequent impact of air gaps on B 0 in particular, while potential "clumping" of the plastic beads over time might in theory affect the B 1 [5]. We therefore considered it prudent to check whether anything unexpected occurred over the long term.
B 0 uniformity was therefore mapped at 2 years in six phantoms, in the transverse slice, midway along the length of the tubes, using a multi-echo gradient echo sequence, based on the phase difference between known TEs [7]. A frequency range of ± 50 Hz across the phantom was considered acceptable, based on published T 1 mapping offresonance sensitivity [8]. B 1 homogeneity was similarly evaluated using flip angle (FA) maps (double angle method using FA 60°and 120°[θ1, 2 × θ1] with long TR [8 s], and 4 ms sinc [− 3π to + 3π] slice excitation profiles to minimize error due to FA variation through the slice).
Methods part 2-Multi-center phantom testing Serial, multi-center T 1 mapping data The T1MES user manual (https://doi.org/10.6084/m9.figshare.c.3610175_D1.v1) defined strict scanning instructions (scanning and shim volume strictly at isocenter, use of same supporting materials, etc.). Each contributed T1MES dataset (localizers, sets of inversion recovery images, and inline scanner-generated T 1 maps, Fig. 3) underwent initial quality assurance, checking orientation, and isocenter (through visual inspection of localizers and maps and semi-automatically by inspecting metadata contained in Digital Imaging and COmmunications in Medicine [DICOM] headers "ImagePo-sitionPatient" and "ImageOrientationPatient") and to exclude ii and iii indicate to which field-strength device the tested tubes belong. At NIST, T 1 and T 2 were measured on a Varian/Agilent small bore scanner operating at 1.5 T in a temperature-controlled environment. Temperatures were measured using a fiber optic probe. At PTB, T 1 and T 2 were measured on a 3 T MAGNETOM Verio scanner (Siemens Healthineers; software syngo MR B17A). The phantom was always stored, moved, and scanned while resting in a Styrofoam box to ensure that the temperatures picked at bottle hull reflects the tube temperature. At scan time, the box was placed in the head coil (12 ch) of the PTB 3 T scanner. Temperatures were measured using a Pt100 resistance thermometer. Similar to temperature dependency results immediately after phantom manufacture, [5] short-T 1 tubes (modeling post-GBCA myocardium and blood) are more stable with temperature than very long-T 1 tubes (native blood) where T 1 increases more significantly with temperature image artifacts. All Siemens sequences except MyoMaps product variant and all Philips (Philips Healthcare, Best, the Netherlands) sequences except CardiacQuant product variant were prototypes. Any tubes with artifacts detected by operator inspection of the submitted T 1 maps were excluded from the analysis. Software version changes were captured automatically from DICOM headers ("StationName" and "SoftwareVersion").
The T 1 measurements from T1MES datasets (directly using only the parametric maps submitted, not by any T 1 fitting applied centrally to the submitted sets of T 1 recovery images) were carried out using a bespoke MATLAB pipeline (The MathWorks Inc., Natick, Massachusetts, USA, R2012b) assembled in collaboration with the US National Institutes of Health. From the data, T 1 for each of the nine tubes was measured in identically sized regions of interest (ROI) occupying the central 50% by area of each tube (accommodating 40 independent pixels) and collated in a dedicated research electronic data capture instrument (REDCap [9,10]).

Methods part 3-Statistical analysis
Analysis was performed using R (version 3.0.1, R Foundation for Statistical Computing, Vienna, Austria). Descriptive data are expressed as mean ± standard deviation (SD) and standard error of the mean (SEM) as appropriate. Distribution of data were assessed on histograms and using the Shapiro-Wilk test.

Temperature sensitivity
Linear regression equations were used to relate temperature (predictor variable in degrees Celsius) and the response variable, phantom T 1 , by the formula: T1 = Intercept + (β * [Temperature − 21°C]), with β representing the temperature correction, and 21°C our arbitrarily chosen temperature for cross-center comparison.

Correlation with rT 1 times
Correlations between estimated and rT 1 times were derived using linear regression. Tests for significant intersequence and cross-vendor correlation differences (setting null value to 0.001) were conducted with alpha 0.01 and confidence level 0.95 [11].

T 1 repeatability
After considering the normal values for native myocardial T 1 reported in the published literature (e.g., in [12][13][14][15][16] as mean ± 1SD, though a 95% reference range is approximately ± 2SD), where 1 SD of the mean native myocardial T 1 is generally~20-30 ms at 1.5 T and~50 ms at 3 T, we arbitrarily pre-defined as repeatable (and suitable for clinical/research use), T 1 mapping approaches where the estimated variance of serial T 1 data did not exceed ½ of the above in vivo 1SD. For T 1 mapping at 1.5 T, this was ≤ 10 ms, i.e., CoV ≤ 1%; for T 1 mapping at 3 T ≤ 25 ms, i.e., CoV ≤ 2%.
The CoV between serial repeat T1MES scans was calculated as the ratio of the SD to the mean. We appraised CoV as a compound measure of all causes of change in the estimated T 1 of all nine tubes before and after temperature correction. We also appraised CoV after temperature correction separately for the four native and five post-GBCA tubes. Sequence-specific differences between the nine temperature-adjusted CoVs were calculated using paired t test with P value adjustment for multiple comparisons by the Bonferroni method (taking two-tailed P < 0.01 as significant).

Sources of T 1 variation
Using temperature-adjusted T 1 values of the "Medium" native myocardium tubes (tubes "F" and "M" respectively), we constructed linear mixed models to study the interplay of some known sources of T 1 variation in multi-center phantom data. We did this separately for 1.5 T and 3 T phantom data. Considering temperature-adjusted T 1 time as the response variable of interest, we examined the influence of phantom ID with and without the added effect of phantom age, as the combined fixed effect. With this, we then tested the following random effects: The response variable T 1 fitted a normal probability distribution, so we estimated model parameters using maximum likelihood. ANOVA function using a type II Wald chi-square test evaluated the significance of fixed effects in the model. To compare models, Akaike and Bayesian information criteria (AIC, BIC) with the "smaller-is-better" criterion as well as chi-square values from inter-model ANOVA tests were used. The formulas used for model fitting and more definitions of the applied statistical tests are provided in Table 3.

Software upgrades
To explore whether software upgrades resulted in an abrupt "step" change in the temperature-adjusted T 1 reads, we performed piece-wise linear regression to check for any segmented relationship between the covariates "scan day" and "tube T 1 " (considering tube "F" at 1.5 T and "M" at 3 T) [21]. For any broken-line relationship discovered, we defined slope parameters and break points where the linear relation/s changed and temporally correlated these with DICOM software metadata.

Results part 1-Evaluation of the phantom Structural integrity
Individual inspection of all the localizers submitted by sites with each phantom dataset revealed no visible gel rips or tears down any of the tubes. At baseline and at 2 years following batch manufacturing (Supplementary Movie 1), phantoms were free of air bubbles and susceptibility artifacts at both field strengths. High-resolution imaging showed no evidence of gel rips or tears down any of the tubes, and the gels were intact in the mid slice-the piloted location for serial mapping (Fig. 1, left panel). T 1 maps collected through the midline of the phantom, using the specified T1MES scan setup, were free from off-resonance artifacts.
"Reference" rT 1 and rT 2 data Phantom measurements (averaged across all tubes) collected at the Royal Brompton Hospital showed that temperature-corrected rT 1 and rT 2 values at 2 years were stable compared to baseline. At 1.5 T, the 2-year temperature-adjusted %T 1 change was 1.95 ± 1.39% SD, 0.37% SEM. At 3 T, the 2year temperature-adjusted %T 1 change was 2.22 ± 1.44% SD, 0.25% SEM. Three tesla measurements at PTB showed a 2year temperature-adjusted %T 1 change of 0.80 ± 0.49%, SEM 0.16% (Supplementary Figure S1A). Although not the aim of this work, all reference T 2 parameters remained stable with relative < 1% change at PTB (temperature-adjusted %T 2 change over 2 years at 3 T = 1.65 ± 0.26% | 0.09%, Supplementary Figure S1B) and < 2% change over 2 years at Royal Brompton Hospital.
Sequences from Siemens and Philips at 1.5 T collected at Royal Brompton Hospital showed strong correlation with rT 1 times with small offsets, as shown in Fig. 4  Sequences from Siemens and Philips at 3 T showed strong correlation with rT 1 times but with visibly imperfect absolute measurement of T 1 across tubes of longer T 1 times as shown in Fig. 4 (right panel). The correlations between absolute measurements of T 1 and rT 1 were similar for MOLLI and SASHA on Philips (Supplementary Table S2, panel B). On Siemens, the correlation between absolute measurements of T 1 and rT 1 was significantly stronger for SASHA when compared to MOLLI and ShMOLLI (Supplementary Table S2, panel A). The correlation between absolute measurements of T 1 and rT 1 for SASHA was significantly stronger on Siemens compared to Philips (Supplementary Table S2, panel C). Comparison of errors at 3 T relative to 1.5 T showed no significant difference (average SEM 5.49 vs. 4.04 respectively, P = 0.428).

Temperature sensitivity
Temperature experiments at both NIST and PTB using spin echo, long TR, and sequences ( Fig. 2i) consistently showed that short-T 1 tubes (modeling post-GBCA myocardium and blood) were more stable with temperature than very long-T 1 tubes (native blood), where T 1 increased more significantly with temperature. T 1 increased by~11 ms/°C for a typical MOLLI 5s(3s)3s 1.5 T dataset in tube "B" (normal native blood). Conversely, T 1 decreased by~1.2 ms/°C in tube "G" (short post-GBCA myocardium). Temperature corrections for each of the nine tubes at both field strengths from the pooled multi-center analysis are presented separately (Supplementary Table S3) from temperature sensitivity experiments of NIST and PTB due to the latter's use of bespoke standard operating procedures.
B 0 and B 1 uniformity B 0 uniformity at 2 years was delivered to within ± 9 Hz at 1.5 T and ± 10 Hz at 3 T ( Fig. 5a-b). At 2 years after manufacture, the B 1 field distortion caused by the phantom, continued to be adequately flattened to within 10% of the B 1 at the center of the phantom, at both 1.5 T and 3 T (Fig. 5c-d).

Results part 2-Multi-center phantom testing
Thirty-four of the magnet systems appraised were 1. Quality assurance of the contributed DICOM image files identified significant protocol deviations or sequence reconstruction failures in nine submissions, the majority of which took place at the start of the study (examples in Supplementary Fig. S2). These were excluded with subsequent advice sent to centers permitting later inclusion in most cases, except where a tube artifact affecting > 25% of a single center's series led to exclusion of that tube from pooled analysis (see Supplementary Data File).
Fifty-two magnets non-exclusively contributed MOLLI datasets, 16 ShMOLLI, and 12 SASHA, and 2 magnets contributed SMART T 1 maps. Center-, session-, and sequence-specific T 1 mapping contributions are detailed in Supplementary Data File. During the project, there were three software upgrades at centers: two Siemens and one GE (details in Supplementary Fig. S4). , saturation recovery single-shot acquisition (SASHA)) at 1.5 T (left) and 3 T (right) split according to vendor (Siemens, Philips). At 3 T, the contributed Philips MOLLI 5b(1b)1b(1b)1b was missing the iterative/data dropping steps in map creation as per Siemens ShMOLLI, so these data are not shown. Measured T 1 times by the 3 sequences (mean of multiple centers, no temperature correction) are represented by symbols. The line of identity, i.e., "reference" rT 1 times by slow inversion recovery is represented as a discontinuous gray line. For each sequence, correlations between absolute measurements of T 1 and rT 1 times are shown. Correlations for the much sparser GE data (MOLLI, ShMOLLI, SMART each n = 1) are not reported. Vertical error bars within each shape represent standard error of the mean. aR, adjusted R 2 ; GE, General Electric Healthcare

T 1 repeatability
For serial multi-center data, temperature-unadjusted CoV per tube, per phantom, per sequence, and per magnet are detailed in the Supplementary Data File. Temperature-adjusted CoV for the various native sequences together with inter-sequence and crossplatform differences, at 1.5 T and 3 T, are summarized in Tables 1 and 2 respectively and data for post-GBCA  in Supplementary Tables S5 and S6. Results of native and post-GBCA sequence/scanner software version repeatability are provided in Tables 1 and 2 and in Supplementary Tables S5 and S6 respectively. Based on these temperature-adjusted data, mean (x) CoV were generally higher (implying poorer repeatability) at 3 T than at 1.5 T and for the much sparser GE data over Siemens/Philips. Over 1 year, many 1.5 T and 3 T sequences/magnets were repeatable with x CoV < 1% and < 2% respectively. For sequences optimized for native T 1 mapping applied to T1MES native tubes, we observed excellent repeatability for several sequences, for example, Siemens MOLLI 5s(3s)3s 448B (x CoV = 0.27%) and Philips MOLLI 3s(3s)5s (x CoV 0.54%) at 1.5 T (Table 1) and Philips MOLLI 3b(3s)5b (x CoV 0.33%) and Siemens ShMOLLI (x CoV 0.69%) at 3 T (Table 2).
Among sequences optimized for post-GBCA T 1 mapping applied to T1MES post-GBCA tubes, excellent repeatability was observed for several sequences, for example, Siemens ShMOLLI 1041B (x CoV 0.21%) and Fig. 5 a B 0 field homogeneity at 2 years post manufacture across the nine phantom compartments as a measure of off-resonance in hertz at 1.5 T (blue, averaged for three phantoms) and 3 T (green, averaged for another three phantoms). These are extremely small shifts in frequency (e.g., 10 Hz = 0.08 ppm at 3 T) and should not be regarded as significantly different between the tube compartments. b Diagonal profiles of the B 1 field at 2 years post manufacture as per red discontinuous lines (right panels) in six phantoms: three at 1.5 T scanned on 1.5 T MAGNETOM Avanto (Siemens Healthineers; software syngo MR B17A); example field map (c) and three at 3 T scanned on 3 T MAGNETOM Skyra (Siemens Healthineers; software syngo MR D13C); example field map (d) The term sequence refers to either MOLLI, ShMOLLI, SASHA, or SMART MOLLI/ShMOLLI protocol nosology has the number of inversions per experiment as the total count of numbers outside brackets, image cycles are outside brackets, pause cycles are within brackets, and cycle lengths defined in terms of either heart beats (b) or seconds (s) CoV coefficient of variation, ID identity code, GBCA gadolinium-based contrast agent, GE General Electric, MOLLI modified Look-Locker inversion recovery, rT 1 "reference" slow inversion recovery T 1 , SASHA saturation-recovery single-shot acquisition, SD standard deviation, ShMOLLI shortened MOLLI, SMART saturation method using adaptive recovery times for cardiac T 1 mapping, T Tesla a Denotes the T 1 mapping sequence|software combination with lowest overall CoV% for a given vendor where multiples exist. P values for differences in CoV between sequences/vendors are reported for this highly repeatable sequence where multiples exist. Less favorable CoVs (> 1%, see the "T 1 repeatability" section) are in italics.
Post-GBCA tubes are not shown here as their data are reported separately in relation to post-GBCA sequences in Supplementary  Table S5).

Sources of T 1 variation
Linear mixed models which excluded centers that had experienced a software change during the time of data collection (Table 3 and Supplementary Tables S7 and  S8) indicated that temperature-adjusted T 1 differs significantly between sequences and software versions at both field strengths (P < 0.001 all) and between magnet models at 1.5 T (P = 0.011). Notably, phantom age had no significant effect on T 1 (model A2).

Software upgrades
The two software upgrades on Siemens both occurred towards the end of longitudinal data submissions (Supplementary Fig. S4A, 15E031; 4B, 30E017), so the paucity of data points post-upgrade events precluded statistical testing for significant T 1 shifts. For the software upgrade on GE, however, differences between pre-and post-linear regression slopes ( Supplementary Fig. S4C, 30E012) indicate a marginally significant T 1 shift event (P = 0.024).

Discussion
To our knowledge, this is the first multi-center study using a wide range of T 1 mapping methods to study the interplay of some of the known sources of measured T 1 variation. The basis of this work is the T1MES phantom, and a major aim of this work was to evaluate how this Less favorable CoVs (> 2%, see the "T 1 repeatability" section) are in italics. Post-GBCA tubes are not shown here as their data are reported separately in relation to post-GBCA sequences in Supplementary Table 6 sic = text is quoted exactly as it stands in the original, i.e., this is not a typo. Abbreviations as in Table 1 a Denotes the T 1 mapping sequence|software combination with lowest overall CoV% for a given vendor where multiples exist b Denotes the number of different magnets submitting that particular sequence from which the average CoVs were derived c In the absence of iterative/data dropping steps in map creation as per ShMOLLI x̅ = average CoV across the 4 native tubes for a given sequence x̅ ' = where more than one sequence type was submitted, individual CoVs were then averaged to derive x̅ ' CoV; while for single sequence submissions x̅ ' CoV is from the global mean T 1 ± SD for that one sequence performed in use at multiple centers. The phantom appears sufficiently robust for T 1 mapping quality assurance purposes. Based on our data, estimated phantom T 1 values may show substantial variation between different T 1 mapping sequences, scanner software versions, and potentially also scanner models, and temperature correction, at least for native T 1 tubes, is necessary to achieve the desired repeatability. Inspection of the Supplementary Table also suggests that there are occasional performance variations producing outliers even within a given T 1 mapping sequence prototype/software combination across different magnets. In spite of this variation, however, several specific combinations of field strength, sequence, and scanner generally exhibit excellent repeatability. Given the choices, the CMR community may prefer to standardize and use combinations with high repeatability for future clinical and research use, e.g., CoV < 1% at 1.5 T and < 2% at 3 T, while seeking to optimize less repeatable combinations. Less data were available for GE in T1MES, so the observed variability requires verification in a larger more representative sample.
Agreement with "reference" slow scanning, rT 1 times was slightly greater for SASHA compared to MOLLI or ShMOLLI, which both slightly underestimated T 1 , mainly from the known T 2 -related underestimation which is larger when measuring the myocardium [13][14][15], (Fig. 4, Supplementary Tables S1 and S2) and not due to magnetization transfer, which is negligible in these agar phantoms [22].
The phantom data presented here suggest that the "T 1 standard" framework remains possible, but the wide Table 3 Linear mixed models for 1.5 T and 3 T multi-center temperature-adjusted T 1 mapping data (normalized to 21°C, considering the "Medium" native myocardium tubes "F" and "M" respectively). The best model is "A5" at 1.5 T and "A3" at 3 T  The symbol (1|ID) in the model formulas refers to the random effect of individual phantoms (by identity number). At 1.5 T models, A5 and A6 have equal AIC/BIC but given the lack of statistically significant χ 2 improvement from A5 to A6 (P = 1.0); A5 is considered the most parsimonious model Age refers to phantom age at scanning since date of manufacture, AIC Akaike information criterion, BIC Bayesian information criterion, χ2 chi-square, ref reference, SD standard deviation a Complete list of β coefficients for model A5 at 1.5 T and A3 at 3 T are provided in Supplementary Tables 7 and 8 respectively variety of different sequence options, vendors, and field strengths distributed the data over too many categories for reliable modeling, unlike the "locked-down" approach mentioned below which does have that strong advantage. Further work will explore the transferability of clinical measurement based on in vitro phantom calibration. T1MES is but one of a number of phantom objects that have been used or proposed to support T 1 mapping quality assurance (elaborated in Table 4). Some alternatives that target native and post-GBCA myocardial and blood relaxation times in support of T 1 mapping work are (1) Brompton phantom by Vassiliou et al. [23], (2) Hypertrophic Cardiomyopathy Registry [24] [25].
At 1 year, CoV for T 1 tubes of the Vassiliou [26] phantom on a single Siemens scanner ranged from 1.0 to 3.6% using only the native MOLLI 5s(3s)3s sequence, compared to 0.27 to 3.0% in T1MES (for 1.5 T MOLLI 5s(3s)3s [448B] on Siemens to SMART on GE respectively). This within sequence precision heterogeneity, as already alluded to by Kellman et al. [27], linked to protocol modifications within MOLLI, may partly explain some of the higher CoV for specific MOLLI sequences compared to ShMOLLI where centers obligatorily scanned using a fixed 5b(1b)1b(1b)1b sequence with conditional fitting. As mentioned above for the dispersion of T1MES results, this "real-world" heterogeneity of T 1 mapping sequences highlights that while T 1 mapping research and innovation calls for multiple flexible prototypes, the resultant diversity poses a nontrivial challenge to multi-center standardization. Conversely, the more "locked-down" approach like that adopted for ShMOLLI, albeit less editable by external researchers, potentially facilitates standardization.
The design of phantoms for this purpose (although often seen as trivial) is challenging: to have a large sufficient ROI in each sample tube and a good number of sample tubes in the main jar of the phantom, the overall size of the phantom increases to the point that B 1 nonuniformity at 3 T due to the dielectric permittivity of any water-based phantom becomes problematic. While of course B 1 is non-uniform in vivo, one purpose of phantom tests is to eliminate uncertainty or at least enable 9 plastic T 1 /T 2 tubes in an amber PVC sealed jar with inter-tube gaps packed with a carefully specified agarose/ HDPE plastic macrobead/ NiCl 2 -doped fill that flattened the B 1 field and provided sufficient B 0 homogeneity to obviate the need for side phantoms.
A layer of 14 T 1 spheres and a separate layer of 14 T 2 spheres so no unified T 1 /T 2 compartment representing the relaxation parameters of the human heart.
Health and disease, but of the 14 T 1 spheres (ranging from 21 to 2038 ms), half are not useful to cardiac T 1 mapping as T 1 times are too short for either native or post-GBCA myocardium/blood. Half the T 2 spheres (ranging from 11 to 581 ms) are also anti-physiological for cardiac mapping. Also, the T 1 / T 2 ratio is not representative of cardiac tissue. controlled testing of factors such as B 1 rather than introduce their own errors. Conversely, if the phantom is made very small, then the truly acquired pixel size of in vivo T 1 mapping methods (which obviously must not be modified or adapted [28] for the phantom scans) becomes important compared to the sample tube inner diameters leading to questions over SNR [29] and the impact of Gibbs artifact, for example. Another obstacle for relaxation time phantoms (on top of basic water T 1 and T 2 temperature and pressure variations, gel instabilities [30,31], small leakages, dehydration, or ion migration effects over time) is the differing temperature dependence of each species of paramagnetic ion's relaxivity (r 1 , r 2 ) and furthermore frequency dispersion in all of these physical properties between 60 MHz and 120 MHz [32][33][34][35][36]. Efforts to use alternative chemistry have so far not provided the range of T 1 and T 2 needed for this application.
The results confirmed, as is widely known in the field, that the T 1 measured by even nominally identical mapping sequences can differ significantly between software revisions. Scanner software upgrades are not uncommon in centers, and we encountered three such events among contributing partners. There was a measurable shift on at least one system, with potentially important implications for the field as alluded to in the 2017 consensus statement on multi-parametric mapping by the Society for Cardiovascular Magnetic Resonance (SCMR) [37]. Future work using phantoms at scale, across vendors and over time, would ideally continue monitoring center-specific data before and after potential shift events to more fully understand their impact on T 1 allowing the community to set up mitigation approaches.
Clearly, sources of T 1 variance and drift over time and between centers are numerous. Some T 1 variance is related to differences between sequences and magnet platforms: to the scanner itself (equipment drift or equipment shifts from updates of software, replacement of parts, routine service recalibrations, etc.), to the environment (temperature, pressure, humidity), and to the operator (phantom positioning inside scanner bore). In a phantom study, some T 1 variance may additionally have been due to intra-phantom differences (liquid/agar status over time collectively contributing to phantom aging) and inter-phantom manufacturing variations. In T1MES, we mitigated potential inter-phantom manufacturing variations by exercising extreme caution in design and prototyping and by the strict batch manufacturing processes to medical device grade standards. Drifts in the system setup (long-term stability) would tend to be "adjusted" out by the shim and center frequency (CF) and transmitter B 1 reference (TxREF) adjustments that could otherwise bias estimated T 1 . Any receiver coil or gain changes would most likely cancel out of T 1 maps as constant across the series of input images to T 1 derivation (appearing in A and B maps instead). It is considered unlikely that gradient performance would drift significantly, and any significant sudden changes in that are also unlikely even when the eddy-current compensation is adjusted on routine service visits. However, we have shown how unforeseen software changes could potentially cause "step-wise" changes in the estimated T 1 records. To directly model the interplay between all these potential sources of T 1 variation in longitudinal phantom studies would be a daunting task, and we considered only a subset of these sources in our analyses.
Though B 0 and B 1 homogeneity tests in T1MES (Fig. 5) were stable, we did measure slightly greater field inhomogeneity at the phantom edge and corners compared to the core (phantom corners and edges are the zones which our T1MES data flagged up as areas of least B 0 /B 1 homogeneity). The tubes with the highest T 1 times are the tubes most susceptible to such field inhomogeneities, and this suggests that a future iteration of T1MES could be further optimized by rearrangement to position such tubes in the most uniform regions.
The results of this study are relevant to clinicians or researchers choosing sequences for T 1 mapping, and we highlight 3 take home messages as follows: i. Sequence/software combinations in Tables 1 and 2, with a CoV ≤ 1% and 2% respectively, are sufficiently repeatable for clinical/research use (see rationale in the "Methods part 3-Statistical analysis" and "T 1 repeatability" sections). For the native myocardium T1MES tube at 1.5 T (reference T 1~1 037 ms by MOLLI), a ≤ 1% CoV is well within ± 1SD of normal values for T 1 , meaning that small-magnitude biological changes (e.g., diffuse myocardial fibrosis), will be resolvable with high precision [37]. By contrast, CoVs of 3 or 6% at 1.5 T (variance of 31 and 62 ms respectively) will undermine a center's ability to resolve smallmagnitude changes, and they may also impact the resolution of large-magnitude biological changes [37] (like amyloid, iron overload, Fabry disease, acute myocardial injury). Our data indicate that many of the investigated sequences for Siemens and Philips at 1.5 T exhibited CoVs ≤ 1%. At 3 T, MOLLI (Siemens and Philips), ShMOLLI (Siemens), and SASHA (Siemens) exhibited CoVs ≤ 2%. The absence of GE from this list is due to limited participation. ii. For developers, the highlighting of sequence performance may encourage work to refine sequences in some cases. iii. The prospect of using phantom calibration to remove the need for local reference ranges [37] is potentially becoming more tangible than before, yet more work is needed to deliver such calibration.
These data may facilitate the use of T 1 mapping as a useful clinical biomarker. T 1 standardization will also be important to facilitate clinical research. At the present time, there are 25 registered clinical research studies using T 1 mapping (clinicaltrials.gov accessed January 2019).

Limitations
Presented results are not intended to compare the SNR obtained in the phantom setup in different machines but are aimed at studying long-term stability. In spite of our outreach to advertise enrollment at study start, GE centers are under-represented. The scarcity of multi-center data for many of the specific T 1 mapping sequence, software, and field-strength combinations has limited our ability to deliver a "T 1 standard". Eight centers enrolled, received a phantom, but then could not deliver the serial phantom scans, and four centers did not meet the 1 year minimum data submission requirement. Each center provided legitimate justifications for the missing data, and furthermore, we highlight that contributions to T1MES were entirely voluntary, with no center receiving remuneration for any staff or scanner time invested in the program.
In total, 3 magnets deviated from the prescribed nominal intervals in some of their initially submitted datasets. Phantom repeatability studies represent an in vitro experiment. Some sequences and scanners that performed well in vitro may exhibit greater variability in the face of biological tissue (magnetization transfer, flow, motion), patient characteristics (fast/slow heart rates, heart rate variability, breath-hold length, implanted metal devices), and scanning process (non-isocentering, scanning in multiple planes). The T1MES phantom does not capture magnetization transfer. The biomarker performance that was measured here accounts for T 1 mapping in vitro. While this is obviously related to diagnostic performance in vivo, cardiac motion, etc., introduce yet another degree of freedom, and different hardware/software combinations might deal with this differently. In addition, CoV as a single metric may not be the best way to pick a pulse sequence for repeatability when the data is only derived from a static phantom. A very precise (reproducible) T 1 sequence that requires long breatholds or is heart rate-variability-sensitive might perform very well with the T1MES phantom setup but perform poorly in patients with varying biology, e.g., that cannot hold their breath well or who are in atrial fibrillation. The phantom has no intracellular/extracellular component, so any researcher seeking to model "ECV" using pre-and post-GBCA T 1 mapping values derived from T1MES (see example data in Supplementary Table S9) must bear this in mind.
The phantom design, ROI size, and coil setup instructions all aimed to make the fundamental image noise SNR an irrelevant factor in the CoV results. The artificially high and "clean" SNR of a phantom setup (if without phantom-related distortions, as we have shown in the "B 0 and B 1 uniformity" section) is clearly not directly an indicator of satisfactory in vivo performance. It contains many sequence-related factors that may have different consequences in vivo [29], and such SNR comparisons are not the aim of this work. Ideally, participating sites would have proved this by sending multiple T 1 maps per session. However, contributions were already voluntary, and further demands on sites could not be supported. For the same reason, it was not feasible to demand long reference T 1 /T 2 data for each participating site, and this was never part of the original enrollment commitment. As some indication of the typical short-term thermal noise random jitter, the raw (temperature-unadjusted) T 1 values from some sites that did anyway submit several repeated T 1 maps each session are plotted in Supplementary Fig. S3.

Conclusion
The T1MES phantom developed to CE/FDA manufacturing standards for T 1 mapping is a robust quality assurance device. T 1 mapping can now be quality assured on a multi-center scale, fulfilling a key requirement for the use of T 1 mapping in clinical decision-making or as a surrogate endpoint in drug trials. A good number of T 1 measurement sequence/scanner model/vendor/field strength combinations are remarkably repeatable, but some combinations less so, with coefficients of variation exceeding 1-2% over 1 to 2 years. Given the alternatives, we recommend that combinations with poor repeatability are deprioritized for clinical and research use. In spite of the use of a large number of different sequence prototypes and products in this study, a number of agreements and encouraging results are reported. Phantom calibration of T 1 mapping which obviates the need for local reference ranges, enabling the establishment of a "T 1 standard" to facilitate multi-center T 1 studies, comes closer, with further work required to address this.