Centering and Scaling: When and When Not?

The pre-processing phase of data analytics involves a number of steps, and prudent pre-processing can make the difference between no model at all and a useful model. Attend this webinar for a first introduction to the secrets of proper data pre-processing.

For many types of data, centering and scaling are intertwined. Centering corresponds to subtracting a reference vector (often the mean values of the variables, or the set-point values). Scaling corresponds to an element-wise multiplication by a scaling vector, one weight per variable. The choice of scaling vector is crucial. In many areas, such as QSAR, multivariate design and sensory analysis, where variables of different origin and numerical range are encountered, the scaling vector is chosen as the inverse spread (reciprocal standard deviation) of the variables. In other situations, for example with process data, the scaling vector is defined relative to a tolerable spread in the variables.
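As a rough illustration of the paragraph above, here is a minimal Python sketch of centering followed by scaling for a single variable. The function name `center_and_scale`, the example values, the set point of 325 and the tolerable spread of 5 are all hypothetical, and the sample standard deviation (n - 1 in the denominator) is an assumption, not something prescribed by the webinar.

```python
from statistics import mean, stdev

def center_and_scale(column, reference=None, spread=None):
    """Subtract a reference value, then multiply by the reciprocal of a spread.

    Defaults (assumed here): reference = column mean, spread = sample
    standard deviation, which gives unit-variance (UV) scaling.
    """
    ref = mean(column) if reference is None else reference
    s = stdev(column) if spread is None else spread
    return [(x - ref) / s for x in column]

# Hypothetical variables with very different numerical ranges.
temperature = [310.0, 320.0, 330.0, 340.0]
ph = [6.8, 7.0, 7.2, 7.4]

# UV scaling: after this step both variables have mean 0 and variance 1.
temperature_uv = center_and_scale(temperature)
ph_uv = center_and_scale(ph)

# Process-data variant: center on a set point and scale by a tolerable
# spread (both values are made up for the example).
temperature_proc = center_and_scale(temperature, reference=325.0, spread=5.0)
```

After UV scaling the two variables carry equal weight regardless of their original units, whereas in the process-data variant a value of 1 simply means "one tolerable spread away from the set point".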

Sometimes, however, mean-centering without any scaling is the preferred treatment of the data. This option is usually chosen when all variables are expressed in the same unit, as with spectroscopic data. Moreover, in recent years an alternative technique called Pareto scaling has become more common. Pareto scaling divides each centered variable by the square root of its standard deviation, which gives each variable a variance numerically equal to its initial standard deviation instead of unit variance. Pareto scaling is the default choice in many omics applications.
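The variance property of Pareto scaling can be checked with a short Python sketch. The function names and the intensity values are hypothetical, and the sample standard deviation (n - 1 in the denominator) is again an assumption.

```python
from math import sqrt
from statistics import mean, stdev, variance

def mean_center(column):
    """Center only: subtract the mean and leave the spread untouched."""
    m = mean(column)
    return [x - m for x in column]

def pareto_scale(column):
    """Center, then divide by the square root of the standard deviation."""
    m = mean(column)
    s = stdev(column)
    return [(x - m) / sqrt(s) for x in column]

# Hypothetical signal intensities for one variable.
intensity = [120.0, 150.0, 90.0, 200.0, 140.0]

centered = mean_center(intensity)
pareto = pareto_scale(intensity)

# The variance after Pareto scaling equals the original standard
# deviation, because var / (sqrt(sd))**2 = sd**2 / sd = sd.
print(variance(pareto), stdev(intensity))
```

Pareto scaling therefore sits between the two extremes: large variables are damped relative to centering alone, but they are not forced down to the same weight as small ones, as they would be under unit-variance scaling.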

Topics of this webinar

  • Centering, Pareto and UV (unit variance) scaling: how they are related
  • When to center the data only
  • When to scale the data only (without centering)
  • Battery-scaling, or block-scaling, of data
  • Trimming and winsorizing as tools for enhancing the effect of centering and scaling