The results presented above are purely qualitative, based on the subjective assessments of many individuals. However, as is generally the case for registration problems in practice, there was no ground truth available to test the registrations against, which makes quantitative assessment of methods difficult.
In order to test the method more quantitatively, a comparative consistency test was performed. This test aims to measure the robustness, rather than the accuracy [West et al., 1997], of the registration method. Robustness is defined here as the ability to get close to the global minimum on all trials, whereas accuracy is the ability to precisely locate a (possibly local) minimum of the cost function. For example, one method might nearly always (say, in over 99.99% of cases) be between 0.2mm and 0.6mm from the best possible solution, compared to another method that was often less than 0.1mm from the best solution but would sometimes (say, in 5% of cases) fail to find the global minimum and get trapped in a local minimum in excess of 10mm from the best solution. In this case the former method would be considered more robust than the latter, while the latter would be more accurate but less robust. Ideally a registration method should be both.
The consistency test is designed to assess one necessary, but not sufficient, aspect of robustness: the ability to find the same solution regardless of the initial position. Any robust method, which always finds the global minimum, will give the same solution each time, whereas a non-robust method that can be trapped by a local minimum is likely to give different solutions depending on the initial position. However, this condition is not sufficient for determining robustness, as the same, consistent solution may just be a large local minimum rather than the global minimum. Therefore it is also necessary to check that the registration solution is acceptable to someone trained in neuroanatomy; this aspect was addressed in the trials described above.
More specifically, the consistency test for an individual image I involved taking the image and applying several pre-determined affine transformations, A_j, to it. All of these images (both transformed and un-transformed) were registered to a given reference image, I_r, giving transformations T_j for the transformed images and T for the un-transformed one. If the method is consistent, the composite transformations T_j A_j should all be the same as T, as illustrated in figure 6. Moreover, the RMS deviation between each composite registration and the registration from the un-transformed case allows the consistency to be quantified.
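The procedure above can be sketched in a few lines. This is an illustrative sketch, not the published implementation: `register` and `apply_affine` stand in for any registration method and resampler (here simulated with idealised 4x4 poses, an assumption for demonstration), and the RMS deviation between two affine transforms is computed in closed form over a sphere of assumed radius R = 80mm, using RMS² = tᵀt + (R²/5)·trace(MᵀM), where [M | t] is the affine part of the difference of the two transforms.

```python
import numpy as np

def rms_deviation(T1, T2, radius=80.0):
    """RMS difference (in mm) between two 4x4 affine transforms, averaged
    over a sphere of the given radius centred at the origin:
    RMS^2 = t.t + (R^2/5) * trace(M^T M), with [M | t] = T1 - T2."""
    D = T1 - T2
    M, t = D[:3, :3], D[:3, 3]
    return np.sqrt(t @ t + (radius ** 2 / 5.0) * np.trace(M.T @ M))

def consistency_rms(register, apply_affine, image, transforms):
    """Register the image and each pre-transformed copy A_j(image), then
    measure how far each composite T_j A_j is from the un-transformed
    registration T."""
    T = register(image)
    return [rms_deviation(register(apply_affine(A, image)) @ A, T)
            for A in transforms]

# Demo with an idealised registration: an "image" is modelled only by its
# true pose P (a 4x4 affine), and a perfect register() returns inv(pose).
P = np.eye(4)
P[:3, 3] = [5.0, -2.0, 1.0]
rotations = []
for angle in np.radians([-30, -20, -10, 10, 20, 30]):
    c, s = np.cos(angle), np.sin(angle)
    A = np.eye(4)
    A[0, 0], A[0, 2], A[2, 0], A[2, 2] = c, s, -s, c  # rotation about one axis
    rotations.append(A)

devs = consistency_rms(np.linalg.inv, lambda A, img: A @ img, P, rotations)
# a perfectly consistent method gives (numerically) zero deviation for every A_j
```

The closed-form RMS avoids sampling points inside the volume: the mean of |Mx + t|² over a uniform ball reduces to the expression above, since the coordinate covariance of the ball is (R²/5)I.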
The particular test used (also described in [Jenkinson and Smith, 1999]) took 18 different images as the floating images (like the one shown in figure 3a), all with the MNI 305 brain [Collins et al., 1994] as the reference image. The 18 images were all T2-weighted MRI images with voxel dimensions of 0.93mm by 0.93mm by 5mm, while the MNI 305 template is a T1-weighted MRI image with voxel dimensions of 1mm by 1mm by 1mm.
In addition to FLIRT, several other registration packages were tested: AIR [Woods et al., 1993], SPM [Friston et al., 1995], UMDS [Studholme et al., 1996] and MRITOTAL [Collins et al., 1994]. These methods were chosen because the authors' own implementations were available, making this a fairer test than using a re-implementation of a method described in a paper, where the lack of precise implementation details often makes it difficult to produce a good working method.
The results of such a test, using six different rotations about the anterior-posterior axis, are shown in figure 7. It can be seen that only FLIRT and MRITOTAL were consistent with this set of images. This indicates that the other methods (AIR, SPM and UMDS) get trapped in local minima more easily, and are not as robust. In particular, even small initial rotations sometimes resulted in large differences in the final registrations, showing how sensitive these methods are to initial position.
A further consistency test was then performed, comparing only MRITOTAL and FLIRT, this time using initial scalings rather than rotations. This is important because MRITOTAL uses a purely local optimisation method (gradient descent) and relies on initial pre-processing to provide a good starting position. This pre-processing works by finding the principal axes of both volumes and initially aligning them. However, this initial alignment gives no information about scaling and is dependent on the field of view (FOV), since truncation of the image at the edges of the volume can have a significant impact on the principal axes that are computed.
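The FOV sensitivity of principal-axes alignment is easy to demonstrate. The sketch below illustrates the general technique, not MRITOTAL's actual pre-processing: the principal axes are taken as the eigenvectors of the intensity-weighted covariance of voxel coordinates, and truncating the volume along one axis shifts the computed centre of mass and second moments (the synthetic blob and grid sizes are assumptions for illustration).

```python
import numpy as np

def principal_axes(volume, voxdims=(1.0, 1.0, 1.0)):
    """Centre of mass and principal axes of an intensity volume, via the
    eigen-decomposition of the intensity-weighted covariance (second
    central moment) matrix of the voxel coordinates (in mm)."""
    coords = np.indices(volume.shape).reshape(3, -1).astype(float)
    coords *= np.asarray(voxdims, dtype=float).reshape(3, 1)
    w = volume.reshape(-1).astype(float)
    w /= w.sum()
    com = coords @ w                      # intensity-weighted centre of mass
    centred = coords - com[:, None]
    cov = (centred * w) @ centred.T
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    return com, evals, evecs

# a synthetic ellipsoidal blob, elongated along the first (x) axis
x, y, z = np.meshgrid(*(np.arange(48),) * 3, indexing="ij")
vol = np.exp(-(((x - 24) / 12.0) ** 2
               + ((y - 24) / 5.0) ** 2
               + ((z - 24) / 5.0) ** 2))

com_full, _, axes_full = principal_axes(vol)
com_cut, _, axes_cut = principal_axes(vol[:30])  # FOV truncated in x
# truncating the FOV shifts the centre of mass and changes the moments,
# so an initial alignment based on them depends on how the volume was
# cropped; and the axes carry no scale information in either case
```

For the full volume the dominant eigenvector recovers the elongated x axis; after cropping, the centre of mass moves noticeably towards the remaining data, which is exactly the kind of FOV dependence described above.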
The results of the scaling consistency test are shown in figure 8. It can be seen that, although generally consistent, in three cases MRITOTAL produced registrations that deviated by more than 20mm (RMS) from each other. In contrast, FLIRT was consistent (less than 2mm RMS) for all images.