Multivariate data analysis is about separating the signal from the noise in data with many variables, and presenting the results as easily interpretable plots. Even a large, complex table of data can be transformed into intuitive plots that summarize the essential information. The following methods are all based on mathematical projection but have evolved to meet different needs.

Principal Components Analysis (PCA) provides a concise overview of a dataset and is usually the first step in any analysis. It is very powerful at recognizing patterns in data: outliers, trends, groups, etc.

With Projections to Latent Structures (PLS), the aim is to establish relationships between input and output variables, creating predictive models.

PLS-Discriminant Analysis (PLS-DA) and SIMCA are two powerful methods for classification. Again, the aim is to create a predictive model, but one which can accurately classify future unknown samples.

Let's continue with a demonstration of the principles and application of PCA.

Suppose we are investigating food consumption patterns in different European countries and have collected some data. The numbers in the table refer to the percentage of households consuming a particular type of food.
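As a sketch of what such a table looks like in code, the following builds a small stand-in matrix with one row per country and one column per food type. The country names, food names, and all of the numbers below are invented for illustration; the actual survey data is not reproduced here.

```python
import numpy as np

# Hypothetical illustration: rows are countries, columns are food types,
# values are the percentage of households consuming each food.
# (All numbers are invented, not the original survey data.)
countries = ["Sweden", "Norway", "Italy", "Spain"]
foods = ["crisp_bread", "frozen_fish", "garlic", "olive_oil"]

X = np.array([
    [93, 58,  9,  5],   # Sweden
    [92, 62, 11,  4],   # Norway
    [13,  3, 89, 91],   # Italy
    [22,  7, 86, 88],   # Spain
], dtype=float)

print(X.shape)  # one row per country, one column per food
```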

By just eyeballing the data, it is not possible to understand the relationships among the countries and the foods. The human brain has not evolved to interpret tables of numbers in this way; we are much better at absorbing pictorial information.

By using multivariate projection methods, however, we are able to transform the numbers into pictures.

The first plot is analogous to a map of Europe showing how the countries relate to each other based on their food consumption profiles. Countries with similar profiles lie close to each other in the map while countries with different profiles lie far apart. For example, the Scandinavian countries cluster together in the top right quadrant while the Mediterranean countries form a cluster to the left of the plot.

The co-ordinates of the map are the first two principal components calculated from all of the data. We call this map a score plot. Later in this tutorial we will see how this plot is created.

Having examined the score plot, it is natural to ask why the Scandinavian and Mediterranean countries cluster together. This information is revealed by the second plot which we call a loading plot.

In the top right quadrant (direction of the Scandinavian cluster) we find crisp bread, frozen fish and frozen vegetables, all of which epitomize Scandinavian eating habits. In contrast, the Mediterranean countries are characterized by garlic and olive oil.

To transform a table of numbers into score and loading plots we use a mathematical technique called projection.

We start with some observations (objects, individuals, molecules, time points, …) and variables (measurements, spectra, facts, …) in a data table.

If we look at just one variable, we can plot the observations on a single scale to visualize them. In this figure we have also reduced the number of observations to one – the yellow dot.

With two variables, we simply add a second, orthogonal axis.

And with three variables, we get a three-dimensional space.

Beyond three dimensions, we find it very difficult to visualize what's going on. In mathematics, however, there is no such limit, and we can go on adding more and more variables.

In our example, each of the 20 types of food is represented as a co-ordinate axis in 20-dimensional space.

If we now plot all the observations (countries in our example), we get a swarm of points lying in this 20-dimensional space (only three dimensions are shown in our simplified picture).

The positions of the countries are determined solely by their values on the variable axes. Hence, countries with similar food consumption profiles will appear close together in the 20-dimensional space. Countries that are very different will be far apart.
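This notion of "close together" can be made concrete with the Euclidean distance between profile vectors. The sketch below uses invented consumption percentages for three hypothetical country profiles: two similar Scandinavian-style profiles and one different Mediterranean-style profile.

```python
import numpy as np

# Invented consumption profiles (% of households) over four foods --
# purely illustrative numbers, not the original survey data.
sweden = np.array([93, 58,  9,  5], dtype=float)
norway = np.array([92, 62, 11,  4], dtype=float)
italy  = np.array([13,  3, 89, 91], dtype=float)

# Countries with similar profiles are close in variable space ...
d_similar = np.linalg.norm(sweden - norway)
# ... while countries with different profiles are far apart.
d_different = np.linalg.norm(sweden - italy)

print(d_similar < d_different)  # True
```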

But we are still not able to see the patterns because we cannot visualize 20-dimensional space.

The first step in multivariate projection is to draw a new co-ordinate axis representing the direction of maximum variation through the data (line of best fit).

This is known as the first principal component (PC1 in the picture). All the observations (countries) are projected down onto this new axis and the score values are read off.
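This step can be sketched numerically. Assuming mean-centered data, the direction of maximum variation is the first right singular vector of the data matrix, and projecting each observation onto it gives the score values. The tiny matrix below is invented for illustration.

```python
import numpy as np

# A tiny invented data matrix: 4 observations x 3 variables.
X = np.array([
    [2.0,  4.0, 1.0],
    [4.0,  8.0, 2.5],
    [6.0, 12.0, 3.5],
    [8.0, 16.0, 5.0],
])

# Mean-center each variable: PCA finds directions of maximum
# variation around the data's centroid.
Xc = X - X.mean(axis=0)

# The first right singular vector of the centered matrix is the
# direction of maximum variation -- the first principal component.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# Projecting every observation onto PC1 gives its score value.
scores_pc1 = Xc @ pc1
print(scores_pc1)
```

Because the data was mean-centered first, the scores are centered too: they sum to zero.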

This first principal component explains, as well as possible, the patterns of the 16 countries in 20-dimensional space. In our foods example, it explains 32% of the original variation.

We now add a second principal component – PC2 in the picture. After PC1, this defines the next best direction for approximating the original data and is orthogonal (at right angles) to PC1.

The observations (countries) are projected down onto the plane defined by these two principal components to create the score plot that we saw in picture 3. This plot represents the best possible two-dimensional window into our original 20-dimensional data, accounting for 51% of the original variation.
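Extending the sketch to two components gives the score-plot coordinates and the explained variation. Since the original survey table is not reproduced here, the example below uses a random stand-in matrix of the same shape (16 countries by 20 foods), so the explained-variance figure will not match the 32% and 51% quoted for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 16 "countries" x 20 "foods" (random numbers, since
# the original survey table is not reproduced here).
X = rng.random((16, 20))
Xc = X - X.mean(axis=0)

# SVD gives all principal components at once; keep the first two.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T          # 2-D score-plot coordinates, 16 x 2

# Fraction of the original variation each component explains.
explained = s**2 / np.sum(s**2)
print(scores.shape)             # (16, 2)
print(explained[:2].sum())      # variation captured by the 2-D window
```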

Another way of thinking about it is to imagine the shadow cast by a three-dimensional (or multi-dimensional) swarm of points onto a wall.

With the light source positioned optimally, the underlying structure of the data is revealed and the effective dimensionality of the data dramatically reduced.

But what happened to our original variables during the projection?

The new axes PC1 and PC2 replace the original variables, but we can still relate back to those variables by measuring the angles between their axes and the principal components.

A small angle indicates that the variable has a large impact because it is almost aligned with a principal component. A large angle indicates less influence. For example, a variable lying at 90° to a principal component has no influence on that component whatsoever. The influence of the variables is summarized in the loading plot.
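The angle interpretation can be sketched directly: each original variable is a coordinate axis (a unit vector), so the cosine of the angle between that axis and a unit-length principal component is simply the variable's loading on that component. The component direction below is invented for illustration.

```python
import numpy as np

# An invented unit-length first principal component in a
# 3-variable space (0.8**2 + 0.6**2 = 1).
pc1 = np.array([0.8, 0.6, 0.0])

# Each original variable is a coordinate axis: a unit vector.
var_axes = np.eye(3)

# The cosine of the angle between a variable axis and the PC is
# that variable's loading -- its coordinate in pc1.
cosines = var_axes @ pc1
angles_deg = np.degrees(np.arccos(cosines))

print(cosines)     # the loadings
print(angles_deg)  # the third variable lies at 90 degrees: no influence
```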

Multivariate methods facilitate analysis and visualization of large complex datasets and, unlike univariate approaches, provide a holistic summary of the data:

  • Represent correlations and trends pictorially
  • Separate systematic behaviour from noise
  • Handle missing data
  • Detect outliers
  • Highlight clusters and patterns

The quickest way to get started with multivariate data analysis is to take a basic, three-day training course with MKS Data Analytics Solutions.

Look up our Course Calendar to find a convenient location and date.

You can also find more reading about multivariate data analysis, or order literature for self-training.

Our state-of-the-art software for multivariate data analysis is SIMCA. Read more about it here.