3 Ways to Organize Data for Multivariate Analysis

By |2014-05-27T09:54:57-04:00May 27, 2014||

In our courses and previous blog posts, we’ve described the benefits of dividing your variables into logical groups and building multi-block multivariate models. But did you know there are 3 ways you can organize your data in Excel to prepare for multi-block modeling in ProMV?

We added flexibility to ProMV’s data import process in version 13.08 to give you these options. This 15-minute video tutorial demonstrates the 3 different ways of organizing your data and how each strategy affects the import process. Be sure to watch it in full screen mode to see all of the details.

Download ProMV and Register for a multivariate analysis course

Full Video Transcript:

Hi, this is Emily Nichols at ProSensus and today we’re going to talk about 2 things:

  1. How to organize your industrial process data
  2. How to import it into ProMV for multivariate analysis.

Industrial process data usually has problems that aren’t handled very well by traditional statistical methods – problems like multi-collinearity and missing data. So if you’d like to harness your industrial process data to monitor, predict & improve product quality, multivariate analysis methods like PCA and PLS are good choices.

ProMV enables you to visualize and interact with your industrial process data in a way that helps you learn about how the process has operated in the past. You can see what changed between periods of good and bad operation, you can see which variables have the biggest impact on product quality,and you can build models for real time monitoring and prediction. But in order to take advantage of these capabilities, you first need to organize your data. ProMV is flexible so I’m going to show you a few different options.

Here is a very basic Excel spreadsheet containing data from a low density polyethylene process. Each variable is a column and each row is an observation. Notice that this worksheet includes process variables like temperatures, pressures and flows as well as product quality data like molecular weights. Putting all of these variables together in one table is fine. You will have the chance to assign them to blocks as you import them into ProMV. What do I mean by blocks?

Well for starters, if you plan to build a PLS model you will need to divide the variables into X’s and Y’s. Furthermore, ProMV is built to handle multi-block models. This allows you to evaluate the importance of groups of variables. In this low density polyethylene data we have some variables from reactor 1, some from reactor 2 and we have some that are not specific to either reactor. So we have three groups plus we have the product quality variables, which we will want to use as a Y block during analysis.

Introducing the 3 Ways to Organize Your Data

We could import this data exactly as it is shown here and manually tell ProMV which variables belong to which blocks. Or we could add a Block ID row, like this. Later, I’ll show you how a Block ID row is useful during data import. Another way to organize your blocks of variables is to put each block into its own worksheet, like this. Our process variables are located in the first three worksheets and our product quality variables are in the fourth one. You might be wondering

“Is it necessary to group my variables into blocks at all?”

For a PCA model, no. For a PLS model you need at least two blocks. One to hold your X variables and one to hold your Y variables. Further subdividing the variables into blocks won’t impact the calculation of your multivariate models but it can be very helpful during interpretation. The choice is up to you.

1. Importing a Simple Data Table

Now let’s look at data import. When you open ProMV, the Data Manager will appear automatically. If you’ve accidentally dismissed this window, find it in the Edit menu. Notice that there are two tabs – one for standard data and one for batch data. We’ll limit this discussion to standard data because our time ordered observations come from a continuous manufacturing process for low-density polyethylene.

This list of blocks is empty because we haven’t imported data yet so go ahead and click the Add/Edit Data button. It’s a spreadsheet type interface that allows you to paste or type in data directly but these methods are prone to human error and should be used with caution. To import our data from Microsoft Excel click the Import from File button in the top left. This brings up the standard Microsoft Windows File Browser for your operating system.

First we’ll look at the simple case where all of the data is in one worksheet and we don’t have a block ID row. I select the file and click open. We are now in the import data from file window. Notice the row at the top. Because all the values in the first row of my spreadsheet are unique, ProMV automatically identifies the first row as the variable ID row and makes it yellow. Likewise, ProMV identifies the first column as the observation ID column because all of the values in it are also unique. In this simple case all I need to do is click OK and we are returned to the standard data specification window. Notice that the list of blocks has been updated. We have a block called LDPE which was the name of the worksheet we imported, and a block called Unassigned which is there by default.

Now if we click OK at this point all the variables we imported will be in one block. If we eventually want to build a PLS model we need to divide the variables into X’s and Y’s. In this dataset, the last five variables on the right are the Y variables so I highlight them and I right-click. When I right-click, I choose new block and now I give the new block a name. I call it product quality. Notice that my list of blocks on the left is now updated. If I click on a block I can see a list of which variables are assigned to that block.

If I want to rename the process variables block I double click LDPE, rename it process variables and click OK. If I want to further subdivide the process variables into three blocks reactor 1, reactor 2, and common I select the appropriate variable columns and then I right-click them and choose new block but I’m actually not going to do that here because I’m going to go on to show you two easier ways of subdividing the variables. So now I click OK and I return to the data manager.

Now you can see my list of blocks is populated and I have one more change that I would like to make. The product quality block will usually be used as a Y block for this project so I set the default block type to Y. Now this is the default for my overall project but I can change this setting on a per-model basis, later. When I click OK on the Data Manager, I see the Observation Summary window. In our case here we don’t need to do anything with this window. It becomes very useful later on if you need to reorder your observations after import.

But usually with a simple data set we don’t to do anything here. Project data preview is the next step and this is a chance for you to inspect your data one variable at a time to look for obvious outliers and other data problems. Finally, you need to give your project a file name. I’m going to call it LDPE_onetab, click save, and then the new model dialogue comes up automatically so you can start right into building your first model.

So far we’ve looked at importing the simplest excel file with just one block of data. Now here’s a worksheet with a block ID row – a second row text labels below the variable ID’s.

2. Importing Data with a Block ID Row

Now let’s open ProMV – we’re going to go to File, New Project. Again, we’re importing the same data as before so it’s a standard block. Click the Add/Edit Data button and that brings up the Standard Data Specification window. So we’d like to import this data from a file so we click that button, select the file that has data with a block ID row, click open, and again like before ProMV has identified that the first row are the variable ID’s. It has identified that the first column are the observation ID’s, and again that’s because both contain a list of unique values.

You’ll notice that the second row where we’ve identified the block names…right now the entire row is red which means that ProMV thinks these values are missing data. So what we need to do is tell ProMV that actually we don’t want to import this row as a data row we would like to import it as a Block ID row. To do that we simply select the row and then click the Block ID button which makes the row orange. Everything looks good so we can click OK and return to the Standard Data Specification Window.

Okay now you’ll notice if you look at the left hand side we have this block list. So because we have a block ID row our variables are automatically assigned to different blocks. We have the Common block, the Quality block and we have separate blocks for Reactor 1 and Reactor 2. When we click OK it takes us back to the Data Manager, we can see our list of blocks, the number of variables in them, and we’re going to change the quality block to be a Y block by default. I’m going to click OK there.

The next thing is the Observation Summary so just a reminder the point of this window is so that you can move observations up or down in the order if you need to change something. We’re not going to change anything here so just click OK and this is the Project Data Preview which is a univariate view of your data so that you can check for any obvious outliers, look at the histograms, etc. We also have a trim and windsorize tool but for this simple data we’re not going to use any of those things. Next thing is to name your project. The new model dialogue will come up automatically so that you can start right into building your first model.

3. Importing Multiple Worksheets as Blocks

So to recap we looked at a very simple Excel spreadsheet that had all the data in one block in one worksheet called LDPE and we looked at this improved case where we include a block ID row and again a block ID row is row of text labels placed underneath the variable ID row and finally we have the method where you put each block of variables into its own worksheet.

The names of these worksheets will become the names the blocks as you import them. So once again we’re in ProMV, we’re going to go to File, New Project and that brings up the Data Manager. We are importing standard blocks, the same as we did before so we click Add/Edit Data Now I just wanted to remind you that you can copy and paste data into this window if you like just by right-clicking, you can do insert, or paste but actually we’re going to import our data from an Excel sheet. We go to Import from File and that brings up the Import Data from File browser. Find the file with the four worksheets. Now here is a window we haven’t seen before.

It is going to show up anytime we choose an Excel file with multiple worksheets and we can choose from this list which worksheets we’d like to import. In our case we want them all. We click OK now I’d like to take a minute to show you the difference between the Import Data from File window, which you’re going to see four times, once for each block and the Standard Data Specification behind it which is going to have a block list when we’re finished.

So in the Import Data from File window we don’t need to do anything ProMV has automatically identified the first row and column as being the variable ID row and the observation ID column and again it’s going to do that anytime those columns and rows have unique values. So we just basically have to click OK once for each data block. Since we don’t have any changes to make with this data. When we’re finished, we return to the Standard Data Specification window and the block list is populated with the block names and again if you click on a block name it’s going to show us the variables that are assigned to that block. Now the rest of the steps in the data import from here on are going to look exactly the same as they did the previous time when we had the Block ID row.

Final Recap

So, final recap: we talked about the simplest way to organize your data where everything is in one Excel worksheet and you don’t have any block information that means more work when you’re importing into ProMV because you have to tell it manually which variables belong to which blocks.

Or you can add a Block ID row, which is a second row of text labels underneath your variable ID row and that makes the import a little faster or you can do it third way which involves putting each of your groups of variables, each block, into its own Excel worksheet and this may be a benefit to you depending on how you like to work with your data while it’s still in Excel, although I will say you need to be careful to make sure that your list of observation ID’s is the same across all of the different worksheets.

And that brings us to the end of our tutorial on data organization and import.

About the Author: Emily Nichols

Emily Nichols
Emily Nichols, MASc
Project Leader
Emily was a project leader at ProSensus from 2011-2015. Emily holds a Bachelor’s degree in Engineering Systems & Computing and a Master’s Degree in Applied Science from McMaster University. During her time at ProSensus, she was involved in many client projects using multivariate analysis to solve challenging problems.