Most real-world problems involve multiple variables. Before these variables can be fed to a machine learning framework, we need to explore the data analytically. A fast and easy way to start is bivariate analysis, in which we simply compare two variables against each other, for example with simple two-dimensional plots or t-tests.
However, comparing only two variables at a time does not give deep insights into the nature of variables and how they interact with each other. This is where the need to understand and implement multivariate analysis techniques comes in.
For example, NASA launched the Curiosity rover to explore the mineral-rich Gale Crater region of Mars, and a key to analyzing the chemical composition of the rocks and soil there is laser-induced breakdown spectroscopy (LIBS). LIBS data, with over 6000 variables per sample, are highly multivariate.
We now look at some of these techniques in detail.
Multiple Regression Analysis
Regression is one of the simplest yet most powerful techniques for analyzing data. While simple regression maps one variable as a function of another, multiple regression maps one variable (called the dependent variable) as a function of several other variables (called independent variables or predictors). Such an analysis gives us an equation of the form
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
where α is the intercept, βi are the coefficients, y is the dependent variable, and xi are the predictors. We can read this equation as: for every unit increase in xi, the value of y changes by βi units, with the other predictors held constant. Thus, this equation shows how the behavior of the dependent variable changes with respect to the other variables.
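As a minimal sketch of how the intercept α and the coefficients βi can be estimated, the example below fits a multiple regression by ordinary least squares with NumPy. The data are synthetic and the "true" values (2, 3 and -1.5) are assumptions chosen only so that the recovered coefficients can be checked.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1.5*x2 + noise
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 2))            # two predictors, x1 and x2
noise = rng.normal(scale=0.1, size=n_samples)
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + noise

# Prepend a column of ones so the first fitted coefficient is the intercept alpha
design = np.column_stack([np.ones(n_samples), X])

# Ordinary least squares: minimise ||design @ coef - y||^2
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha, betas = coef[0], coef[1:]

print("intercept:", round(alpha, 2))
print("coefficients:", np.round(betas, 2))
```

The fitted values should land close to the assumed ones; on real data, the coefficients would of course be unknown in advance.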
Logistic Regression Analysis
This is similar to multiple linear regression, with the difference that, instead of predicting the absolute value of a certain metric, we compute the probability of a binary event happening. Thus, it is used when we expect a binary outcome such as 'good/bad' or 'yes/no'. So, if we want to predict the volume of sales from a marketing campaign, multiple regression would be the suitable method, whereas if we want to predict the likelihood of a customer becoming delinquent, logistic regression would be more apt. This likelihood or probability is given by the formula:
$$p = \frac{1}{1 + e^{-(\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$
where p is the probability of the event, e is Euler's number (the base of the natural logarithm), and the meanings of the other symbols remain the same.
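To make the probability calculation concrete, here is a minimal sketch that assumes the standard logistic (sigmoid) form, p = 1 / (1 + e^-(α + Σ βi xi)). The coefficient values and the two predictor names are hypothetical and are not taken from any fitted model.

```python
import numpy as np

def logistic_probability(x, alpha, betas):
    """Return P(event) = 1 / (1 + exp(-(alpha + betas . x)))."""
    z = alpha + np.dot(x, betas)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a delinquency model (illustrative values only)
alpha = -1.0
betas = np.array([0.8, -0.5])   # e.g. missed payments, years as a customer

customers = np.array([
    [0.0, 0.0],   # linear score = -1.0
    [3.0, 1.0],   # linear score = -1.0 + 2.4 - 0.5 = 0.9
])
probs = logistic_probability(customers, alpha, betas)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 for a yes/no decision

print(np.round(probs, 3), labels)
```

The 0.5 threshold is a common default rather than a rule; in practice it is often tuned to the relative cost of the two kinds of error.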
Discriminant analysis is used to classify observations into two or more groups and to differentiate among those groups. It is best suited to cases where the dependent variable is categorical and the independent variables are metric. Discriminant analysis develops discriminant functions, which are linear combinations of the independent variables. These functions help in distinguishing between the categories of the dependent variable, and they enable the analyst to quickly see whether the differences between the groups are significant.
For example, it can help distinguish between heavy, moderate and low spenders depending upon customer attributes like age, gender, income, etc.
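The sketch below illustrates a discriminant function for the simplest case of two groups, using Fisher's linear discriminant computed with NumPy. The customer attributes, group means and spreads are all synthetic assumptions; a real analysis with three or more groups would typically use a library implementation instead.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic customer attributes (assumed): columns are [age, income in thousands]
low = rng.normal(loc=[30.0, 40.0], scale=5.0, size=(100, 2))
heavy = rng.normal(loc=[45.0, 90.0], scale=5.0, size=(100, 2))

mu_low, mu_heavy = low.mean(axis=0), heavy.mean(axis=0)

# Pooled within-class scatter matrix
S_w = (low - mu_low).T @ (low - mu_low) + (heavy - mu_heavy).T @ (heavy - mu_heavy)

# Fisher's discriminant direction: w is proportional to S_w^{-1} (mu_heavy - mu_low)
w = np.linalg.solve(S_w, mu_heavy - mu_low)

# Classify by comparing the projection with the midpoint of the projected means
threshold = w @ (mu_low + mu_heavy) / 2.0

def predict_heavy(x):
    """Return True where the discriminant score falls on the 'heavy' side."""
    return x @ w > threshold

accuracy = (np.sum(~predict_heavy(low)) + np.sum(predict_heavy(heavy))) / 200
print("training accuracy:", accuracy)
```

The vector `w` is the linear combination of the independent variables that the text refers to: a single score per customer that separates the two groups as well as a linear function can.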
MANOVA (Multivariate Analysis of Variance)
This technique is best suited for use when we have one or more categorical independent variables and two or more metric dependent variables. While a simple ANOVA (Analysis of Variance) examines the difference between groups on a single dependent variable, using a t-test for two means and an F-test otherwise, MANOVA assesses the differences between groups across the whole set of dependent variables at once. For example, this technique is suitable when we want to compare two or more dishes in a restaurant against each other in terms of the level of spiciness, the time taken to cook, value for money, and so on.
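One common MANOVA test statistic is Wilks' lambda, the ratio of the determinant of the within-group scatter matrix to that of the total scatter matrix. The sketch below computes it with NumPy for three hypothetical dishes; the ratings are synthetic, and the choice of Wilks' lambda (rather than, say, Pillai's trace) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic ratings (assumed): columns are [spiciness, cooking time, value for money]
dishes = {
    "dish_a": rng.normal(loc=[3.0, 20.0, 7.0], scale=1.0, size=(30, 3)),
    "dish_b": rng.normal(loc=[6.0, 25.0, 6.0], scale=1.0, size=(30, 3)),
    "dish_c": rng.normal(loc=[4.0, 15.0, 8.0], scale=1.0, size=(30, 3)),
}

all_obs = np.vstack(list(dishes.values()))
grand_mean = all_obs.mean(axis=0)

# Within-group SSCP matrix W: scatter of observations around their own group mean
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in dishes.values())

# Total SSCP matrix T: scatter of all observations around the grand mean
T = (all_obs - grand_mean).T @ (all_obs - grand_mean)

# Wilks' lambda = det(W) / det(T); values near 0 suggest the group mean vectors differ
wilks_lambda = np.linalg.det(W) / np.linalg.det(T)
print("Wilks' lambda:", round(wilks_lambda, 4))
```

This only produces the statistic; turning it into a p-value requires an F or chi-square approximation, which statistical packages normally handle.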
Principal Component Analysis and Factor Analysis
Although machine learning is a game of predicting the result given multiple predictors, there can be times when the number of these predictors is too large. Not only is such a data set difficult to analyze, but the models built from it are also susceptible to overfitting. Therefore, it makes sense to reduce the number of variables. Principal component analysis (PCA) and factor analysis are two of the common techniques used to perform such a dimension reduction. PCA replaces the existing variables with a smaller set of new variables (the principal components, which are linear combinations of the originals) that captures most of the total variance present in the existing set. It is a technique applied to, for example, multispectral and airborne hyperspectral remotely sensed data.
PCA is therefore a powerful tool for analysts: they have a much smaller feature set to deal with, while most of the information that was initially present is preserved. While PCA extracts components based on the total variance, factor analysis extracts factors based on the variance shared among the variables. By focusing on this shared variance, factor analysis enables data scientists to examine the underlying trends in the data.
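A minimal sketch of PCA, computed from the singular value decomposition of mean-centred data with NumPy, is shown below. The data set is synthetic (five observed variables generated from two hidden signals), and the 95% variance cut-off is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data (assumed): 5 observed variables driven by 2 hidden signals plus small noise
signals = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = signals @ mixing + 0.05 * rng.normal(size=(300, 5))

# Centre each variable; PCA is defined on mean-centred data
X_centred = X - X.mean(axis=0)

# SVD of the centred data: the rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)
explained_ratio = s**2 / np.sum(s**2)

# Keep the smallest number of components explaining at least 95% of the variance
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95) + 1)
X_reduced = X_centred @ Vt[:k].T

print("explained variance ratio:", np.round(explained_ratio, 3))
print("components kept:", k, "reduced shape:", X_reduced.shape)
```

Because the five variables were built from only two signals, a couple of components should account for nearly all of the variance, which is the situation in which dimension reduction pays off most.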
In many business scenarios, the data belongs to many different types of entities, and fitting all of them into a single model might not be the best thing to do. For example, in a bank dataset, the customers might belong to multiple income groups, which leads to different spending behaviors. If we put the data for all these customers into a single model, we would be comparing apples to oranges. Clustering gives analysts a good way to segment their data and therefore avoid this problem. K-means clustering is a well-known approach used by many data analysts and scientists. It partitions the data points into k clusters by assigning each point to its nearest cluster center, so that the within-cluster variance is minimized. In effect, points within a cluster are similar to one another, while points in different clusters are comparatively dissimilar. Other popular approaches to clustering include hierarchical clustering, DBSCAN, and Partitioning Around Medoids (PAM).
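The following is a minimal NumPy sketch of k-means (Lloyd's algorithm) on synthetic bank customers. The income and spend figures are invented, and the farthest-first initialisation is a simplification chosen to keep the example deterministic; library implementations usually default to k-means++ with several restarts.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-first initialisation (a simplification)."""
    # Initialise: start from the first point, then repeatedly add the point
    # farthest from all centroids chosen so far
    centroids = [points[0]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: move each centroid to the mean of its members
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(4)
# Synthetic bank customers (assumed): columns are [income, monthly spend], in thousands
groups = [rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in ([20, 2], [60, 6], [120, 15])]
customers = np.vstack(groups)

labels, centroids = kmeans(customers, k=3)
# Within-cluster sum of squares: the quantity k-means tries to minimise
wcss = sum(np.sum((customers[labels == j] - centroids[j]) ** 2) for j in range(3))
print("cluster sizes:", np.bincount(labels), "WCSS:", round(float(wcss), 1))
```

Here the three income groups are far apart, so the segments are easy to recover; with overlapping groups, the result depends much more on the initialisation and on the choice of k.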
Conjoint analysis, also known as trade-off analysis, is a very important tool used in marketing. It helps identify how much customers value the different attributes of a product or service, and which features they prefer over others. Smartphone companies often use this analysis to understand the combination of attributes such as features, color, price and dimensions that customers favor. They use the results of such analyses in their strategies to drive profitability.
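One simple way to carry out such an analysis is a ratings-based approach: respondents rate product profiles, and a regression on dummy-coded attribute levels yields a "part-worth" utility for each level. The sketch below assumes that approach; the attributes, levels and ratings are all hypothetical, and the ratings are generated from known part-worths so the estimates can be checked.

```python
import itertools
import numpy as np

# Hypothetical smartphone attributes, each with two levels (illustrative only)
attributes = {
    "storage": ["64GB", "128GB"],
    "price": ["low", "high"],
    "color": ["black", "blue"],
}

# Full-factorial design: every combination of levels, dummy-coded as 0/1
# (0 = first level listed, 1 = second level)
profiles = np.array(list(itertools.product([0, 1], repeat=len(attributes))), dtype=float)

# Made-up ratings from one respondent, generated from known part-worths
# so the recovered estimates can be checked by eye
ratings = 5.0 + 2.0 * profiles[:, 0] - 3.0 * profiles[:, 1] + 0.5 * profiles[:, 2]

# Estimate part-worths by least squares (intercept + one dummy per attribute)
design = np.column_stack([np.ones(len(profiles)), profiles])
coef, *_ = np.linalg.lstsq(design, ratings, rcond=None)
part_worths = dict(zip(attributes, coef[1:]))

# Relative importance: each attribute's utility range as a share of the total
ranges = np.abs(coef[1:])
importance = dict(zip(attributes, ranges / ranges.sum()))

for name in attributes:
    print(f"{name}: part-worth of second level = {part_worths[name]:+.2f}, "
          f"importance = {importance[name]:.0%}")
```

In this made-up case price carries the largest utility range, so it would be read as the attribute this respondent trades off most heavily.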
It is often difficult to visualize multi-dimensional data on a single computer screen. This is where pairwise plots come into the picture. As shown in the figure below, they allow analysts to view every pair of variables, each in its own two-dimensional plot. In this way, they can see the pairwise relations and interactions among the variables on one single screen.
Often, data sets contain variables that are either related to each other or derived from each other. In statistical terms, correlation can be defined as the degree to which a pair of variables are linearly related. In some cases it is easy for the analyst to see that the variables are related, but in most cases it is not. Thus, performing a correlation analysis is critical when examining any data. Furthermore, feeding a model data whose variables are strongly correlated with one another is not good statistical practice, since the same information is effectively given extra weight. Correlation analysis helps detect and prevent such issues.
The figure below shows a correlation heatmap for some hypothetical data. The scale represents the strength and direction of the correlation between each pair of variables. A correlation of +1 between two variables means they are perfectly linearly related: when one increases, the other increases proportionally. A correlation of -1 is similar, with the difference being that an increase in one variable implies a proportional decrease in the other.
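The numbers behind such a heatmap are simply a matrix of pairwise Pearson correlations. The sketch below computes one with NumPy for four synthetic variables and flags strongly correlated pairs; the variables and the 0.9 cut-off are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
# Synthetic variables (assumed): b is an exact linear function of a,
# c moves exactly opposite to a, and d is unrelated noise
a = rng.normal(size=n)
b = 2.0 * a + 1.0
c = -0.5 * a
d = rng.normal(size=n)

# np.corrcoef treats each row as one variable and returns Pearson correlations
corr = np.corrcoef(np.vstack([a, b, c, d]))
print(np.round(corr, 2))

# Flag variable pairs whose absolute correlation exceeds a chosen cut-off
names = ["a", "b", "c", "d"]
cutoff = 0.9  # an arbitrary threshold for illustration
redundant = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > cutoff
]
print("highly correlated pairs:", redundant)
```

Pairs flagged this way are candidates for dropping or combining before modelling, although the right cut-off depends on the data and the model.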
While there are various ways of visualizing multi-dimensional data, spider plots are one of the easiest ways to decipher the meaning of data. From the figure below, we can see how easily we can compare three mobile phones based on attributes such as their speed, screen, camera, memory and apps.
Summing up, we have handpicked the top multivariate analysis techniques used in the data science industry. It is no surprise that data analysis and data processing comprise the majority of the work that goes into the development of a machine learning model. In that regard, the techniques explained in this article are a go-to reference for all data analysts, engineers and scientists out there. Further, to understand the management of multivariate models through A/B testing for live inference and batch tasks, please visit Datatron.