Visualize Missing Data Patterns in ProMV

January 28th, 2015 | Multivariate News


When was the last time you received a perfect set of data? As problem solvers in manufacturing, we are constantly analyzing data that contains correlated variables, missing values, and other messy problems that aren’t handled well (or at all) by traditional regression methods.

ProMV has always handled missing data implicitly, but our latest version gives you a Missing Data Map so you can spot the missing data patterns at a glance.

Why visualize missing data patterns?

ProMV’s Missing Data Map first helps you catch mistakes. An all-white map indicates you have no missing data at all, and the darker shades of grey show high percentages of missing data for given observations (rows) in given data blocks (columns). If you have several sources of data to import separately and you omit one by mistake, it will be obvious. You can go back and import more data before continuing. If the Missing Data Map still shows big blocks of missing data, then you need to find out why.

Missing data comes in two flavours

Multivariate analysis can handle data that is missing at random, but no algorithm can truly compensate for missing data that is not missing at random.

  1. Missing at Random

    In manufacturing data, missing values occur for a number of reasons – sensor malfunctions, database connectivity issues, and the simple fact that some things are measured more frequently than others. For these causes of missing data, the data is missing at random. That is, the data is missing because of a reason unrelated to what its value would be if the value was known.

    Some QA lab results may be available infrequently (e.g. once per hour) while the process data is being collected every few minutes. When you stitch the two data sets together, the QA data appears to be missing in a non-random manner, i.e. the QA values are missing for every row in the process data except one per hour. However, for analysis purposes, this QA data is still considered to be missing at random – again, it is missing because of a reason unrelated to what its value would be if the value was known.

  2. Not Missing at Random

    On the other hand, suppose you have a QA test for which it is sometimes impossible to report a value because it is below a detectable limit but not zero. Or perhaps you have a non-homogeneous sample for which the viscosity can’t accurately be measured. In rubber manufacturing, if the rubber is very elastic, the Mooney viscosity cannot be measured with some instruments, simply because they don’t have enough range to stretch the rubber until it breaks. These are examples of missing data that is not missing at random. The reason the data is missing is related to what its value would be if the value was known.
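The difference between the two flavours can be illustrated with a small simulation. This is only a sketch on made-up numbers: the measurement distribution, the 30% dropout rate, and the detection limit of 8 are all assumptions chosen for illustration, not values from any real process.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical QA measurement: true values centred on 10 with a spread of 2.
true_values = [random.gauss(10.0, 2.0) for _ in range(10_000)]
true_mean = sum(true_values) / len(true_values)

# Case 1: missing at random. Each value is lost with 30% probability,
# regardless of what the value is (e.g. a sensor or database dropout).
mar_observed = [v for v in true_values if random.random() > 0.3]
mar_mean = sum(mar_observed) / len(mar_observed)

# Case 2: not missing at random. Values below an assumed detection limit
# are never reported, so whether a value is missing depends on the value.
DETECTION_LIMIT = 8.0
nmar_observed = [v for v in true_values if v >= DETECTION_LIMIT]
nmar_mean = sum(nmar_observed) / len(nmar_observed)

print(f"true mean:                    {true_mean:.2f}")
print(f"mean, missing at random:      {mar_mean:.2f}")
print(f"mean, below-limit values cut: {nmar_mean:.2f}")
```

In the first case the observed values remain a fair sample of the full set, so the mean stays close to the true mean. In the second case the low values are systematically absent, so the observed mean is biased upward, and nothing in the remaining data reveals by how much.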

ProMV handles missing data implicitly

One of the key benefits of using multivariate analysis is that you don’t have to throw away entire observations just because some of the variable values are missing. ProMV handles missing data implicitly – it uses the correlations among the variables to impute the missing values. Still, no algorithm can effectively handle data that is not missing at random, so it’s important to recognize when your data set has this type of missing data.
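To give a feel for what "using the correlations among the variables" means, here is a deliberately simplified sketch with just two variables. It is not ProMV's algorithm, which this article does not describe; it fits a least-squares line on the complete observations and uses it to estimate a missing value. The function name `impute_from_correlated` and the temperature and viscosity figures are hypothetical.

```python
def impute_from_correlated(x, y):
    """Fill None entries in y using a least-squares line fitted on complete pairs.

    Illustrative only: a two-variable stand-in for the idea of estimating
    a missing value from variables it is correlated with.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Keep observed values as they are; estimate only the missing ones.
    return [yi if yi is not None else intercept + slope * xi
            for xi, yi in zip(x, y)]


# Hypothetical data: the third viscosity reading is missing.
temperature = [150.0, 152.0, 154.0, 156.0, 158.0]
viscosity = [30.0, 31.0, None, 33.0, 34.0]

filled = impute_from_correlated(temperature, viscosity)
print(filled)
```

The observation with the missing viscosity is kept rather than discarded, and its estimate comes from the relationship seen in the complete rows. This only works when the complete rows are representative, which is exactly what fails when data is not missing at random.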

How to view the Missing Data Map in ProMV

In the New Model dialog, select the Observations tab. In the Display Options box on the right, check the “Show Missing Data Map” checkbox. If you have several Secondary IDs in your data set, you may also wish to uncheck the “Show Secondary/Class IDs” box. The Missing Data Map has one row for each observation and one column for each data block. An all-white map means you have no missing data. Shades of grey show what percentage of values are missing for each observation in each data block.
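The numbers behind such a map are simple to compute. The sketch below shows one way to get the percentage of missing values per observation and per data block outside of ProMV. The function name `missing_data_map`, the block names, and the column names are hypothetical, and missing values are assumed to be stored as `None`.

```python
def missing_data_map(rows, blocks):
    """Return, for each observation, the percent of missing values per block.

    rows   -- list of dicts, one per observation (column name -> value or None)
    blocks -- dict mapping a block name to the list of its column names
    """
    result = []
    for row in rows:
        result.append({
            name: 100.0 * sum(row.get(col) is None for col in cols) / len(cols)
            for name, cols in blocks.items()
        })
    return result


# Hypothetical data set with a process block and a QA block.
blocks = {
    "process": ["temp", "pressure", "flow"],
    "qa": ["viscosity", "moisture"],
}
rows = [
    {"temp": 150.0, "pressure": 2.1, "flow": 12.0, "viscosity": 31.0, "moisture": 0.4},
    {"temp": 151.0, "pressure": None, "flow": 12.2, "viscosity": None, "moisture": None},
    {"temp": 152.0, "pressure": 2.0, "flow": 12.1, "viscosity": None, "moisture": 0.5},
]

percent_missing = missing_data_map(rows, blocks)
for i, entry in enumerate(percent_missing, start=1):
    print(i, {name: round(pct, 1) for name, pct in entry.items()})
```

A value of 0 corresponds to a white cell and 100 to the darkest shade. A block that shows 100 for every observation is the kind of pattern that suggests a data source was left out at import.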

To take advantage of ProMV’s Missing Data Map, download the latest version. To learn more about using multivariate analysis to solve manufacturing problems, sign up for one of our upcoming courses.

About the Author:

Emily Nichols
Emily was a project leader at ProSensus from 2011-2015. Emily holds a Bachelor’s degree in Engineering Systems & Computing and a Master’s Degree in Applied Science from McMaster University. During her time at ProSensus, she was involved in many client projects using multivariate analysis to solve challenging problems.