
Model order selection

The maximum likelihood solutions given in equations 5-7 depend on knowledge of the latent dimensionality $ q$. In the noise free case this quantity can easily be deduced from the rank of the covariance of the observations

$R_x = \langle x\, x^t \rangle = A A^t,$

which is of rank $q$. In the presence of isotropic noise, however, the covariance matrix of the observations will be the sum of $A A^t$ and the noise covariance [Everson and Roberts, 2000]

$R_x = A A^t + \sigma^2 I_p,$ (8)

i.e. $R_x$ will be of full rank, the additional noise having the effect of raising all eigenvalues of the covariance matrix by $\sigma^2$. Inferring the latent dimensionality then amounts to identifying the number of identical (smallest) eigenvalues. In practice, however, the mismatch between the model assumptions and the structure of real FMRI data means that the observed eigenspectrum is less well behaved, and a statistical test is needed for the equality of eigenvalues beyond a certain threshold [Hansen et al., 1999]. Determining a cutoff value for the eigenvalues using simplistic criteria such as the reconstruction error or the predictive likelihood is of no help: these quantities improve monotonically with increasing dimensionality and therefore cannot be used to infer the latent dimensionality. Similarly, criteria such as retaining $99.9\%$ of the variability result in arbitrary threshold levels that bear no relevance to the problem of estimating the latent dimensionality correctly.
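The effect of isotropic noise on the eigenspectrum is easy to verify numerically. The following sketch (plain NumPy, with hypothetical values for $p$, $q$ and $\sigma^2$) builds the population covariance of equation 8 and confirms that its smallest $p-q$ eigenvalues all equal $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, sigma2 = 20, 3, 0.5        # hypothetical dimensions and noise level

A = rng.standard_normal((p, q))
# Covariance of x = A s + eta for unit-variance sources and isotropic
# noise (equation 8): R_x = A A^t + sigma^2 I_p.
R = A @ A.T + sigma2 * np.eye(p)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
# A A^t has rank q, so its smallest p - q eigenvalues are zero; adding
# the isotropic noise raises every eigenvalue by exactly sigma^2.
print(eigvals[q:])
```

With the exact population covariance, the trailing eigenvalues are perfectly flat; the difficulty described in the text arises only once $R_x$ must be estimated from finite data.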

Many other informal methods have been proposed, the most popular being the "scree plot", where one looks for a "knee" in the plot of ordered eigenvalues signifying the split between significant and presumably unimportant directions in the data. With real FMRI data, however, the choice of cutoff value is not obvious, and a decision based on simple visual inspection will be ambiguous (see figure 9(ii) for an example). This problem is intensified by the fact that the data set $X$ is finite, so that $R_x$ is estimated by the sample covariance $\widetilde{R}_X$ of the set of observations. Even in the absence of any source signals, i.e. when $X$ contains only a finite number of samples of purely Gaussian isotropic noise, the eigenspectrum of the sample covariance matrix is not identically $\sigma^2$ but is distributed around the true noise covariance: the eigenspectrum will depict an apparent difference in the significance of individual directions within the noise [Everson and Roberts, 2000].
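This finite-sample spreading is simple to reproduce. The sketch below (dimensions chosen for illustration only) draws pure isotropic Gaussian noise with $\sigma^2 = 1$ and shows that the sample covariance eigenvalues fall well above and below the true value, even though no signal is present:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 100, 200                   # hypothetical dimensions, p/n = 0.5

# Pure isotropic Gaussian noise with true covariance sigma^2 I, sigma^2 = 1
X = rng.standard_normal((p, n))
ev = np.sort(np.linalg.eigvalsh(X @ X.T / n))[::-1]

# The sample eigenvalues are far from flat: some directions look much
# more "significant" than others despite the absence of any signal.
print(ev.max(), ev.min())
```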


Figure 2: Estimation of the intrinsic dimensionality for 10 sources with non-Gaussian distribution embedded in a 180-dimensional space with different noise characteristics (see section 5) at different stages of the estimation process: (i) Gaussian white noise, (ii) AR(4) noise, (iii) AR(16) noise, (iv) resting-state FMRI noise; estimates from the original data (top), after voxel-wise variance normalisation (middle) and after additionally adjusting the eigenspectrum using the predictive cumulative distribution $G^{-1}(\nu)$ (bottom). Every graph shows the eigenspectrum of the data covariance matrix and 4 different estimates of the intrinsic dimensionality: Laplace approximation to the model evidence, BIC, MDL and AIC.

In the case of purely Gaussian noise, however, the sample covariance matrix $\widetilde{R}_X$ has a Wishart distribution, and we can utilise results from random matrix theory on the empirical distribution function $G_n(\nu)$ of the eigenvalues of the covariance matrix of a single random $p \times n$ matrix $\tilde{X}$ [Johnstone, 2000]. Suppose that $p/n \to \gamma$ as $n \to \infty$ with $0 < \gamma \leq 1$; then $G_n(\nu) \to G_\gamma(\nu)$ almost surely, where the limiting distribution has density

$\displaystyle g(\nu)=\frac{1}{2\pi\gamma\nu}\sqrt{(\nu-b_{-})(b_{+}-\nu)},\quad b_{-}\leq\nu\leq b_{+},$ (9)

and where $b_{\pm} = (1 \pm \sqrt{\gamma})^2$. This can be used to obtain a modification of the standard scree plot in which the eigenspectrum of the observations is compared against the quantiles of the predicted cumulative distribution $G^{-1}(\nu)$, i.e. against the expected eigenspectrum of a random Gaussian matrix. The predicted eigenspectrum of the noise is a function of $p/n$: the larger $p/n$, the more spread out the eigenspectrum. Note that equation 9 holds only for $0 < \gamma \leq 1$, i.e. when the number of samples is equal to or larger than the dimensionality of the problem at hand. This approach is similar to [Rowe, 2001], where an inverse Wishart prior is placed on the noise covariance matrix in a fully Bayesian source separation model.
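As a rough numerical check of equation 9, the sketch below (hypothetical choices of $p$ and $n$) compares the eigenvalues of a random Gaussian sample covariance against the predicted support $[b_-, b_+]$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 100, 400                   # hypothetical sizes; gamma = p/n = 0.25
gamma = p / n
b_minus, b_plus = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2

X = rng.standard_normal((p, n))
ev = np.linalg.eigvalsh(X @ X.T / n)

# Nearly all sample eigenvalues fall inside the predicted support
# [b_minus, b_plus] of the limiting density in equation 9.
frac_inside = np.mean((ev >= b_minus - 0.05) & (ev <= b_plus + 0.05))
print(frac_inside)
```

Sorting the sample eigenvalues and plotting them against the quantiles of $G^{-1}(\nu)$ gives the adjusted scree plot described in the text.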

If we assume that the source distributions $p(s)$ are Gaussian, the probabilistic ICA model (equation 2) reduces to the probabilistic PCA model [Tipping and Bishop, 1999]. In this case, we can use more sophisticated statistical criteria for model order selection. [Minka, 2000] placed PPCA in the Bayesian framework and presented a Laplace approximation to the posterior distribution of the model evidence that can be calculated efficiently from the eigenspectrum of the covariance matrix of the observations. When $q < \min(N,p)$, then

$p(X \vert q) \approx p(U) \left( \prod_{j=1}^q \lambda_j \right)^{-N/2} \sigma_{\mathrm{ML}}^{-N(p-q)} (2\pi)^{(m+q)/2} \vert A_z \vert^{-1/2} N^{-q/2},$ (10)

where $m = pq - q(q+1)/2$, $N = \vert\cal{V}\vert$ and $p(U)$ denotes a uniform prior over all eigenvector matrices

$p(U) = 2^{-q} \prod_{j=1}^q \Gamma((p-j+1)/2)\, \pi^{-(p-j+1)/2},$

and

$\vert A_z \vert = \prod_{i=1}^{q} \prod_{j=i+1}^{p} N (\hat{\lambda}_j^{-1} - \hat{\lambda}_i^{-1}) (\lambda_i - \lambda_j),$

where the $\lambda_l$ denote the entries of $\Lambda$, $\hat{\lambda}_l = \lambda_l$ for $l \leq q$ and $\hat{\lambda}_l = \sigma^2_{\mathrm{ML}}$ otherwise. As our estimate of the latent dimensionality of the data, we choose the value of $q$ that maximises the approximation to the model evidence $p(X \vert q)$.
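The evaluation of equation 10 from an observed eigenspectrum can be sketched as follows. This is an illustrative NumPy implementation of the log model evidence applied to toy data with three strong latent directions, not the exact code used here; the parameter values are invented for the example:

```python
import numpy as np
from math import lgamma, log, pi

def laplace_log_evidence(ev, N, q):
    """Laplace approximation to log p(X | q) (equation 10), evaluated
    from the eigenvalues ev (sorted descending) of the sample covariance
    of N observations in p dimensions."""
    p = len(ev)
    sigma2 = ev[q:].mean()                       # ML noise variance
    m = p * q - q * (q + 1) // 2                 # free parameters in U
    # log p(U): uniform prior over eigenvector matrices
    log_pU = -q * log(2.0) + sum(
        lgamma((p - j + 1) / 2.0) - (p - j + 1) / 2.0 * log(pi)
        for j in range(1, q + 1))
    lam_hat = np.concatenate([ev[:q], np.full(p - q, sigma2)])
    # log |A_z|: product over pairs of retained / replaced eigenvalues
    log_Az = 0.0
    for i in range(q):
        for j in range(i + 1, p):
            log_Az += (log(1.0 / lam_hat[j] - 1.0 / lam_hat[i])
                       + log(ev[i] - ev[j]) + log(N))
    return (log_pU
            - N / 2.0 * np.sum(np.log(ev[:q]))
            - N * (p - q) / 2.0 * log(sigma2)
            + (m + q) / 2.0 * log(2.0 * pi)
            - 0.5 * log_Az
            - q / 2.0 * log(N))

# toy data: 3 strong latent directions in 20 dimensions
rng = np.random.default_rng(3)
N, p, q_true = 500, 20, 3
A = rng.standard_normal((p, q_true))
X = A @ rng.standard_normal((q_true, N)) + 0.1 * rng.standard_normal((p, N))
ev = np.sort(np.linalg.eigvalsh(X @ X.T / N))[::-1]
q_hat = max(range(1, p), key=lambda q: laplace_log_evidence(ev, N, q))
print(q_hat)
```

Note that only the eigenspectrum is needed; the approximation can therefore be re-evaluated cheaply for every candidate $q$.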

In order to account for the limited amount of data, we combine this estimate with the predicted cumulative distribution and replace $\Lambda$ by the adjusted eigenspectrum $\Lambda / G^{-1}(\nu)$ prior to evaluating the model evidence. Other possible choices for model order selection for PPCA include the Bayesian Information Criterion (BIC, [Kass and Raftery, 1993]), the Akaike Information Criterion (AIC, [Akaike, 1969]) and the Minimum Description Length (MDL, [Rissanen, 1978]).
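For the information-theoretic criteria, one common concrete formulation computes AIC and MDL directly from the eigenspectrum (the classical Wax-Kailath eigenvalue form, assumed here purely for illustration; the toy data below are invented):

```python
import numpy as np

def aic_mdl_order(ev, N):
    """AIC and MDL estimates of the number of sources from the sorted
    (descending) eigenvalues of the sample covariance, in the classical
    Wax-Kailath eigenvalue form (one common concrete instantiation)."""
    p = len(ev)
    aic, mdl = [], []
    for q in range(p):          # candidate model orders q = 0 .. p-1
        tail = ev[q:]
        # sphericity of the trailing eigenvalues: log ratio of their
        # geometric mean to their arithmetic mean (<= 0, 0 iff equal)
        log_ratio = np.mean(np.log(tail)) - np.log(np.mean(tail))
        k = q * (2 * p - q)     # number of free parameters
        aic.append(-2 * N * (p - q) * log_ratio + 2 * k)
        mdl.append(-N * (p - q) * log_ratio + 0.5 * k * np.log(N))
    return int(np.argmin(aic)), int(np.argmin(mdl))

# toy example: 3 non-Gaussian sources in 20 dimensions plus white noise
rng = np.random.default_rng(4)
N, p, q_true = 500, 20, 3
A = rng.standard_normal((p, q_true))
S = rng.laplace(size=(q_true, N))          # non-Gaussian sources
X = A @ S + 0.1 * rng.standard_normal((p, N))
ev = np.sort(np.linalg.eigvalsh(X @ X.T / N))[::-1]
q_aic, q_mdl = aic_mdl_order(ev, N)
print(q_aic, q_mdl)
```

Both criteria test the sphericity of the trailing eigenvalues, which is exactly the property that finite-sample noise spreading degrades; this is why the eigenspectrum adjustment by $G^{-1}(\nu)$ helps all of the estimators.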

Note that the estimation of the model order in the case of the probabilistic PCA model is based on the assumption of a Gaussian source distribution. [Minka, 2000], however, provides some empirical evidence that the Laplace approximation works reasonably well when the source distributions are non-Gaussian. As an example, figure 2 shows the eigenspectrum and different estimators of the intrinsic dimensionality for different artificial data sets, where 10 latent sources with non-Gaussian distributions were introduced into simulated AR data (i.e. auto-regressive noise where the AR parameters were estimated from real resting-state FMRI data) and real FMRI resting-state noise, at peak levels of between $0.3\%$ and $1.6\%$ of the mean signal intensity. Note how an increase in AR order increases the estimates of the latent dimensionality, simply because more eigenvalues fail the sphericity assumption. Performing variance normalisation and adjusting the eigenspectrum using $G^{-1}(\nu)$ improves the estimation in all cases. In the case of Gaussian white noise the model assumptions are correct and the adjusted eigenspectrum exactly matches equation 8. In most cases, the different estimators give similar results once the data were variance normalised and the eigenspectrum was adjusted using $G^{-1}(\nu)$. Overall, the Laplace approximation and the Bayesian Information Criterion appear to give consistent estimates of the latent dimensionality even though the distributions of the embedded sources are non-Gaussian.


Christian F. Beckmann 2003-08-05