

Maximum Likelihood estimation

Throughout the remainder of this paper, we keep the parameter $\mu$ fixed at its maximum likelihood (ML) estimate:

$\displaystyle \mu_{\sf ML} = \langle x_i \rangle$

(assuming zero-mean sources) and assume that it has been removed from the data. The mean can always be reintroduced after model estimation using $x_i = A(s_i + W\mu_{\sf ML}) + \widetilde{\eta}_i$, where

$\displaystyle \widetilde{\eta}_i = \eta_i + (I - AW)\mu_{\sf ML}.$
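For concreteness, a minimal numerical sketch of this mean handling (not the paper's code; the shapes, the noise level and the choice $W = A^{+}$ are assumptions made purely for illustration):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
p, N, q = 100, 500, 3                      # time points, voxels, sources (hypothetical)
A = rng.standard_normal((p, q))            # hypothetical mixing matrix
S = rng.standard_normal((q, N))            # hypothetical unit-variance sources
mu = rng.standard_normal(p)                # mean time course
X = A @ S + mu[:, None] + 0.1 * rng.standard_normal((p, N))

mu_ml = X.mean(axis=1)                     # ML estimate: average over voxels
Xc = X - mu_ml[:, None]                    # demeaned data used for estimation

W = np.linalg.pinv(A)                      # assumed unmixing matrix W = A^+
X_signal = A @ (S + (W @ mu_ml)[:, None])  # mean reintroduced via A(s_i + W mu_ml);
# the remainder X - X_signal equals eta_tilde = eta + (I - A W) mu_ml
\end{verbatim}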

Without loss of generality we will also assume that the sources have unit variance, since we can freely exchange arbitrary scaling factors between the source signals and the associated columns of the mixing matrix $A$.

If the noise covariance $\Sigma_i$ were known, we could use its Cholesky decomposition $\Sigma_i = K_i K_i^{\sf t}$ to rewrite equation 2:

$\displaystyle K_i^{-1} x_i = K_i^{-1} A s_i + K_i^{-1} \eta_i,$

and obtain a new representation

$\displaystyle \bar{x}_i = \bar{A} s_i + \bar{\eta}_i$

where $\bar{\eta}_i = K_i^{-1}\eta_i \sim {\cal N}(0,\sigma^2 I)$, i.e. where the noise covariance is isotropic at every voxel location. In this step the data are pre-whitened with respect to the noise covariance; this is not to be confused with (spatial) pre-whitening of the data matrix using the singular value decomposition. In order to simplify further notation, we will therefore assume that the data contain isotropic noise and drop the bars.
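A minimal sketch of this noise pre-whitening step (function and variable names are assumptions; a symmetric positive-definite $\Sigma_i$ is assumed):

\begin{verbatim}
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def prewhiten(x_i, Sigma_i):
    """Return K^{-1} x_i, where Sigma_i = K K^t (lower Cholesky factor K)."""
    K = cholesky(Sigma_i, lower=True)
    # solve K z = x_i rather than forming K^{-1} explicitly
    return solve_triangular(K, x_i, lower=True)
\end{verbatim}

Applying the same transformation to $A$ and $\eta_i$ yields $\bar{A}$ and $\bar{\eta}_i$.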

Noise and signal are assumed to be uncorrelated and therefore

$\displaystyle R_{x} - \sigma^2 I = A A^{\sf t},$

where $R_{x} = \langle x_i x_i^{\sf t} \rangle$ denotes the covariance matrix of the observations. Let $X$ be the $p\times N$ matrix of voxel-wise pre-whitened data vectors and let $X = U (N\Lambda)^{\frac{1}{2}} V^{\sf t}$ be its singular value decomposition. Furthermore, let the rank of the mixing matrix $A$, i.e. the number of source processes $q$, be known. Then

$\displaystyle \widehat{A}_{\sf ML} = U_{q} (\Lambda_q - \sigma^2 I)^{\frac{1}{2}} Q^{\sf t},$ (5)

where $U_q$ and $\Lambda_q$ contain the first $q$ eigenvectors and eigenvalues from $U$ and $\Lambda$, and where $Q$ denotes a $q\times q$ orthogonal rotation matrix.
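A minimal sketch of equation 5 (not the paper's implementation; names are assumptions, $\sigma^2$ is taken as given, and the rotation left free by ML is set to $Q=I$ by default):

\begin{verbatim}
import numpy as np

def ml_mixing_matrix(X, q, sigma2, Q=None):
    """Equation 5: A_hat = U_q (Lambda_q - sigma2 I)^{1/2} Q^t.
    X is the p x N pre-whitened, demeaned data matrix."""
    N = X.shape[1]
    U, svals, _ = np.linalg.svd(X, full_matrices=False)  # X = U (N Lambda)^{1/2} V^t
    lam = svals**2 / N                                   # eigenvalues of R_x
    Q = np.eye(q) if Q is None else Q
    return U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2)) @ Q.T
\end{verbatim}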

Estimating the mixing matrix $A$ thus reduces to identifying the square matrix $Q$ after whitening the data with respect to the noise covariance $\Sigma_i$ and projecting the temporally whitened observations onto the space spanned by the $q$ eigenvectors of $R_{x}$ with largest eigenvalues. From $\widehat{A}$, the maximum likelihood source estimates are obtained using generalised least squares:

$\displaystyle \widehat{s}_{\sf ML} = (\widehat{A}^{\sf t}\widehat{A})^{-1}\widehat{A}^{\sf t} x,$ (6)

and the ML estimate of $ \sigma^2$ becomes

$\displaystyle \widehat{\sigma}^2_{\sf ML} = \frac{1}{p-q}\sum_{l=q+1}^{p}\lambda_l,$ (7)

i.e. the average of the eigenvalues in the minor subspace spanned by the eigenvectors associated with the $p-q$ smallest eigenvalues. Solving the model in the case of a full, unknown noise covariance can be achieved by iterating estimates for $\widehat{A}$ and $\widehat{s}$ and re-estimating the noise covariance from the residuals $\widehat{\eta}$. The estimation process in the presence of non-isotropic noise is computationally much more involved than estimation in the standard noise-free setting. The form of $\Sigma_i$ needs to be constrained; e.g. we can use the common approaches to FMRI noise modelling [Bullmore et al., 1996, Woolrich et al., 2001] and restrict ourselves to autoregressive noise. However, since the exploratory approach allows various sources of variability, e.g. temporally consistent physiological noise, to be modelled as part of the signal in equation 2, the noise model itself can be quite simplistic. Estimation of $\Sigma_i$ from residuals in the case of autocorrelated noise is discussed in detail in [Woolrich et al., 2001]; this is the approach used here.
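A minimal sketch of equations 6 and 7 under the isotropic-noise representation (names are assumptions; lam denotes the eigenvalues of $R_x$ as computed in the previous sketch):

\begin{verbatim}
import numpy as np

def ml_sources_and_noise(X, A_hat, lam, q):
    """Equation 6 (generalised least squares, which reduces to ordinary
    least squares after pre-whitening) and equation 7."""
    p = X.shape[0]                                        # assumes p <= N,
    S_hat = np.linalg.solve(A_hat.T @ A_hat, A_hat.T @ X)  # equation 6
    sigma2_hat = lam[q:].sum() / (p - q)                   # equation 7
    return S_hat, sigma2_hat
\end{verbatim}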

The maximum likelihood solutions given in equations 5-7 give important insight into the methodology. Firstly, in the case where $q$ and $\Sigma_i$ are known, the maximum likelihood solution for $\widehat{A}$ is contained in the principal eigenspace of $R_{x}$ of dimension $q$, i.e. the span of the first $q$ eigenvectors equals the span of the unknown mixing matrix $A$. Projecting the data onto the principal eigenvectors is not just a convenient technique for dealing with the high dimensionality of FMRI data but is part of the maximum likelihood solution under the sum-of-squares loss. Even if estimation techniques are employed that do not use an initial PCA step as part of the ICA estimation, the final solution under this model is necessarily contained in the principal subspace. Secondly, combining these results with the uniqueness results stated earlier, we see that only in the case where the analysis is performed in the appropriate lower-dimensional subspace of dimension $q$ are the source processes uniquely identifiable. Finally, equations 5-7 imply that the standard noise-free ICA approach with dimensionality reduction using PCA implicitly operates under an isotropic noise model.

The remainder of this paper illustrates that by making this specific noise model explicit in the modelling and estimation stages, we can address important questions of model order selection, estimation and inference in a consistent way.

An immediate consequence of operating under an isotropic noise model is that, as an initial pre-processing step, we normalise the original data time courses to zero mean and unit variance. This appears to be a sensible step: on the one hand we know that the voxel-wise standard deviation of resting-state data varies significantly over the brain, while on the other hand all voxels' time courses are assumed to be generated from the same noise process. This variance-normalisation pre-conditions the data under the 'null hypothesis' of purely Gaussian noise, i.e. in the absence of any signal: the data matrix $X$ is then identical, up to second-order statistics, to a simple set of realisations from a ${\cal N}(0,1)$ noise process. Any signal component contained in $X$ will have to reveal itself via its deviation from Gaussianity. This will turn out to be of prime importance both for the estimation of the number of sources and for the final inferential steps.
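A minimal sketch of this variance-normalisation step (the function name is an assumption; columns of $X$ are voxel time courses):

\begin{verbatim}
import numpy as np

def variance_normalise(X):
    """Scale each voxel's time course (a column of the p x N data
    matrix X) to zero mean and unit variance."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0)
\end{verbatim}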

After a voxel-wise normalisation of variance, two voxels with comparable noise levels that are modulated by the same signal time course, $a_j$ say, but by different amounts will have the same regression coefficient upon regression against $a_j$. The difference in the original amount of modulation is therefore contained in the standard deviation of the residual noise. Forming voxel-wise $Z$ statistics, i.e. dividing the PICA maps by the estimated standard deviation of $\eta$, is thus invariant under the initial variance-normalisation.
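A small numerical check of the invariance of the $Z$ statistic (all values are hypothetical; the two synthetic voxels share the same noise level but differ in modulation amount $c$):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(200)
a -= a.mean()                               # shared signal time course a_j
for c in (1.0, 3.0):                        # different modulation amounts
    y = c * a + rng.standard_normal(200)    # voxel time course, unit noise
    y -= y.mean()
    for v in (y, y / y.std()):              # raw vs variance-normalised
        beta = (v @ a) / (a @ a)            # regression against a_j
        z = beta / (v - beta * a).std()     # Z: coefficient / residual std.
        print(f"c={c}: Z={z:.3f}")          # Z identical within each pair
\end{verbatim}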


