By Neal Fairley
Statistical techniques are described for efficiently reducing noise in data. An algorithm for partitioning the useful information from the noise within a set of measurements is presented, and its implications for XPS acquisition strategies are discussed.
When constructing an experiment, the aim is to acquire data for a sufficiently long period to permit the computation of reliable quantitative values. This precision requirement must be balanced against efficient use of instrument time, and acquisition times should therefore be optimal for the desired outcome. If an experiment can be constructed in which the acquisition time is split between multiple variables (energy, position, etch-time, etc.), all of which are useful to the analysis, then statistical techniques can be used to separate the signal from the noise en masse, rather than obtaining a given signal-to-noise ratio by devoting instrument time to each individual measurement.
A paradigm for such an experiment is one in which a set of XPS images is acquired over a sequence of energy settings. The result is an image where each pixel offers a spectrum of energy information. The acquisition time for such an experiment is relatively large as a whole; however, on an individual spectrum-at-a-pixel basis the acquisition time is small. As a result the signal-to-noise for individual spectra is poor and the data are therefore not suitable for the determination of quantitative results. Ignoring the spatial variation in these spectra by simply summing the energy channels across the image produces a spectrum with good signal-to-noise, but without the original spatial information. The problem is therefore to separate the signal from the noise in the pixel spectra without losing the spatial information originally present in the data set. Such analyses are the realm of multivariate statistics and linear algebra, since with the aid of these tools the effective noise in any one spectrum can be reduced to yield remarkable results.
To illustrate the discussion points, a set of synthetic spectra was constructed from five Gaussian line shapes, where spatial variation is simulated by adjusting the areas of the Gaussians using sinusoidal functions parametrically dependent on the pixel position. Figure 1 shows the relative positions of the five Gaussian peaks with respect to energy; all five synthetic peaks were assigned the same full-width-half-maximum. Using this basic structure, simulated noise consistent with XPS data was used to generate five additional image data sets, where the intensity per bin is scaled by a factor of two between successive data sets. Since the noise for XPS data varies as the square root of the counts per channel, the signal-to-noise ratio improves throughout the sequence of data sets (Figure 2).
These synthetic image data sets contain a 96 by 96 array of spectra in which each spectrum includes 100 energy channels.
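As an indication of how such a data set might be constructed, the following sketch generates a comparable synthetic set (the peak positions, modulation functions and count scaling are illustrative assumptions, not the exact parameters used for the figures):

```python
import numpy as np

n_pix, n_chan, n_peaks = 96, 100, 5
energy = np.arange(n_chan)
centres = np.linspace(15, 85, n_peaks)   # assumed peak positions
sigma = 6.0 / 2.3548                     # common FWHM of 6 channels, assumed

# Sinusoidal spatial modulation of each Gaussian area.
x, y = np.meshgrid(np.arange(n_pix), np.arange(n_pix), indexing="ij")
spectra = np.zeros((n_pix, n_pix, n_chan))
for k, c in enumerate(centres):
    area = 1.0 + 0.5 * np.sin(2 * np.pi * (k + 1) * x / n_pix) \
               * np.cos(2 * np.pi * (k + 1) * y / n_pix)
    spectra += area[..., None] * np.exp(-0.5 * ((energy - c) / sigma) ** 2)

# Counting noise: Poisson statistics, so the noise varies as the square
# root of the counts per channel; doubling `scale` improves signal-to-noise.
scale = 50.0
noisy = np.random.poisson(spectra * scale).astype(float)
```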
While the data shown in Figure 2 are intended to look like XPS spectra, the partitioning of such data into useful information and noise is performed using techniques from linear algebra. Therefore the spectra are viewed as vectors in an m-dimensional vector space, where m is equal to the number of bins per spectrum. The objective is to determine a set of vectors representative of the useful information, yet belonging to the same m-dimensional vector space as the original set of vectors, and to use this subset to model each and every vector in the original data set. The hope is that the complementary subset contains just the noise component in the data, so that by excluding these noise vectors when modelling the raw data, the resulting vectors will be a noise-free set of spectra. The partitioning of the vector space is almost never perfect, and so the procedure typically yields improved signal-to-noise statistics rather than a complete elimination of the noise.
The conventional method for constructing a set of vectors spanning the same subspace as the original vector space is an eigenanalysis based upon the covariance matrix (Equations 1 and 2). The covariance matrix is formed from the dot products of the original vectors and is therefore a real symmetric matrix. The result of an eigenanalysis is an orthogonal set of eigenvectors, where the information content of each eigenvector is ranked by its eigenvalue. This ranking of the eigenvectors allows the significant information to be collected into a relatively small number of vectors drawn from a much larger, but over-specified, set of vectors. These significant vectors, when constructed via the covariance matrix, are referred to as the principal components, and hence the decomposition into these new vectors is often called Principal Component Analysis (PCA). What is more, the decomposition of a set of vectors using the Singular Value Decomposition (SVD) shown in Equation 3 is directly related to the linear least squares principle, which explains why the data are transformed into an ordered set based on information content. Computing the improved set of spectra, with respect to noise content, is achieved by setting the wkk in Equation 3 equal to zero for all k greater than or equal to some chosen value, where the nonzero wkk below the chosen value of k correspond to the vectors containing the principal components.
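Continuing with the synthetic set above, a minimal sketch of this truncation using numpy (retaining five components matches the five Gaussians; in general the count is a choice):

```python
# Rows of D are the individual pixel spectra (9216 x 100 here).
D = noisy.reshape(-1, n_chan)

# Equation 3 in the form D = U W V^T; zeroing w_kk for k >= 5 keeps only
# the principal components when the spectra are rebuilt.
U, w, Vt = np.linalg.svd(D, full_matrices=False)
w[5:] = 0.0
D_enhanced = (U * w) @ Vt   # noise-reduced spectra, same shape as D
```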
The caveat in the previous discussion when applied to sets of spectra-at-pixels is the sheer quantity of spectra. Given the synthetic data set described above, the vector subspace would be of maximum dimension 100, that is, the number of energy channels per spectrum, while the total number of spectra involved in the calculation is 96², or 9216. Forming the covariance matrix from the data matrix prescribed in Equation 2 would result in a matrix of size 9216 by 9216. Alternatively, the covariance matrix could be formed by post-multiplication by the transpose of the data matrix, resulting in a covariance matrix of size 100 by 100. The success of any noise reduction is achieved by a comparison of information in a least squares sense and, as with all least squares calculations, the number of samples influences the precision of the result. Forming the larger of these two covariance matrices represents the situation where the least squares procedure acts on an energy-channel by energy-channel basis, while the smaller covariance matrix corresponds to the least squares criterion acting on a pixel-by-pixel basis. That is to say, the averaging effect of the least squares criterion has considerably more data with which to work when using the 9216 by 9216 covariance matrix. However, few computers are capable of performing a direct SVD on a problem of this size. Yet, if the original data are numerous poor signal-to-noise spectra, the use of large numbers of data points in the least squares procedure is highly desirable. In this sense, an alternative to direct PCA is required.
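The two formations, and the standard mapping between their eigenvectors, can be sketched as follows (shapes assume the synthetic set):

```python
# D is 9216 x 100.  D @ D.T would be 9216 x 9216 (~680 MB in double
# precision, before any eigenanalysis); D.T @ D is a trivial 100 x 100.
C_small = D.T @ D
evals, evecs = np.linalg.eigh(C_small)   # ascending order for symmetric matrices
v = evecs[:, -1]                         # most significant eigenvector

# Since (D D^T)(D v) = D (D^T D) v = eigenvalue * (D v), the eigenvectors
# of the large matrix follow from those of the small one.
u = D @ v
u /= np.linalg.norm(u)
```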
SVD sorting is an algorithm for moving significant vectors to the top of a list of vectors, where the loop invariant when scanning through the set of vectors is the following: the vectors all remain within the subspace spanned by the original set of vectors. The idea is to compare adjacent vectors by computing the eigenvectors and eigenvalues, transforming the vectors in the process, which effectively moves the most significant transformed vector up the list in a way analogous to the Bubble Sort algorithm for ordering keys. The result of stepping through the set of vectors once is that the most significant information is moved up the list in the form of a transformed vector. This procedure employs the same mechanism as SVD to transform the vectors en route; however, the size of the SVD is always two and therefore, for a single scan through the set of vectors, the computational complexity is O(n), where n is the number of vectors. The number of scans applied to the data set is then dependent on the number of components within the set of vectors and can be chosen for a given application. Bubble Sort is an O(n²) algorithm; similarly for SVD sorting, after n scans the vectors would be fully ordered according to this pairwise comparison. The difference between a full SVD-based PCA and SVD sorting is that the final set of vectors is not mutually orthogonal, and it is relaxing this mutual orthogonality constraint that yields the reduction in computation time. The fact that the resulting set of vectors is not necessarily mutually orthogonal is only a problem if there are no dominant vector directions, in which case the PCA would have been of little value anyway. Moving the variation in the vector set to the top of the list allows the useful vectors to be identified, and a full SVD can then be performed on these data to yield the PCA quantities. Since, in general, the number of significant factors is small compared to the overall data set size, the use of a selective PCA via an SVD is time efficient.
The essential steps used in an SVD scan are illustrated in Algorithm 1 and sketched in the code below. Repeated application of these steps orders the information content in a set of vectors until, finally, the vectors important to the description of the subspace appear at the top of the vector list. Isolating these vectors and applying a standard SVD to only these significant vectors is equivalent to generating the matrices in Equation 3, but with the unwanted wkk already set to zero.
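A minimal sketch of a single scan, as one reading of the description above (an illustration rather than the author's implementation):

```python
import numpy as np

def svd_sort_pass(V):
    """One SVD-sort scan over the rows of V, modified in place.

    Each adjacent pair is replaced by orthogonal combinations of itself,
    so every row remains within the span of the original set (the loop
    invariant), while the more significant combination moves up the list.
    """
    for i in range(len(V) - 1, 0, -1):
        pair = V[i - 1:i + 1]
        c = pair @ pair.T              # 2 x 2 covariance of the pair
        _, evecs = np.linalg.eigh(c)   # eigenvalues in ascending order
        rot = evecs[:, ::-1].T         # dominant direction first
        V[i - 1:i + 1] = rot @ pair    # the size-two SVD step
    return V
```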
Apart from achieving the same result as a massive SVD in a shorter time, the real significance is that the least squares criterion has been applied in the most advantageous order for removing the noise on an energy-channel basis. The consequence is that abstract factors containing minor adjustments to the overall description of the spectra are less likely to include major features due to noise. Thus, more abstract factors can be included in the reconstruction step without reintroducing a significant noise component, and so a better approximation to the original data is achieved. Figure 3 illustrates the results of applying the following combination of steps (sketched in code after the list):
1. SVD sorting, followed by
2. a PCA on a small number of sorted vectors and
3. finally reconstructing the spectra from the five most significant abstract factors.
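A sketch of the three steps combined, reusing svd_sort_pass from above (the scan count and the 20-vector cut-off are assumptions chosen for illustration):

```python
V = D.copy()                  # the 9216 x 100 data matrix from the earlier sketch
for _ in range(10):           # number of scans chosen for the application
    svd_sort_pass(V)

top = V[:20]                  # significant vectors gathered at the top of the list
_, _, Vt = np.linalg.svd(top, full_matrices=False)
P = Vt[:5]                    # the five most significant abstract factors
D_recon = (D @ P.T) @ P       # rebuild every pixel spectrum from the factors
spectra_enhanced = D_recon.reshape(n_pix, n_pix, n_chan)
```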
These steps have been applied to the image set in which the signal-to-noise is the worst of those shown in Figure 2. Once reconstructed spectra are computed for each pixel, simply placing integration regions on the spectra and calculating images using the intensity from these five regions (labelled P1 through P5 in Figure 1) yields a set of spatially resolved quantitative values (Figure 4). The images in Figure 4 are computed for each of the five sets of data in which the signal-to-noise varies, plus the PCA-enhanced spectra as well as the exact spectra (Figure 1). Visual inspection shows the PCA-enhanced image is equivalent to the image simulated with four times the intensity of the data used in the PCA procedure.
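The quantification step can be sketched as follows, with illustrative channel limits standing in for the regions P1 through P5:

```python
# Sum each reconstructed pixel spectrum over five integration windows,
# one per Gaussian peak, giving five quantitative images (cf. Figure 4).
regions = [(9, 22), (26, 39), (43, 56), (61, 74), (78, 91)]
maps = np.stack([spectra_enhanced[:, :, lo:hi].sum(axis=2)
                 for lo, hi in regions])
# maps[k] is the 96 x 96 image for region P(k+1)
```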
For XPS there are two major implications. Firstly, the acquisition time for a given precision can be reduced by roughly a factor of four (provided sufficient information is still present in the data set as a whole) and, secondly, the PCA-enhanced spectra offer a superior definition of the background to the spectra, and consequently the resultant images are free from artefacts often associated with background variations. What is more, the improvements in the background also open up the opportunity to use background information to extract the thickness of layered structures on the sample surface. Application of the simple techniques proposed by Tougaard, in cases where the background above and below a peak is well defined, offers alternative image processing resulting in topographical maps.
Figure 1: Synthetic data envelope constructed from Gaussians.
Figure 2: These spectra correspond to the synthetic envelope in Figure 1, where random noise is added and the intensity per channel is scaled by a factor of two between successive members of the sequence, bottom to top.