The tutorial explains the basics of correlation in Excel, shows how to calculate a correlation coefficient, build a correlation matrix and interpret the results.
One of the simplest statistical calculations that you can do in Excel is correlation. Though simple, it is very useful in understanding the relations between two or more variables. Microsoft Excel provides all the necessary tools to run correlation analysis, you just need to know how to use them.
Correlation is a measure that describes the strength and direction of a relationship between two variables. It is commonly used in statistics, economics and social sciences for budgets, business plans and the like.
The method used to study how closely the variables are related is called correlation analysis.
Here are a couple of examples of strong correlation:
And here the examples of data that have weak or no correlation:
An essential thing to understand about correlation is that it only shows how closely related two variables are. Correlation, however, does not imply causation. The fact that changes in one variable are associated with changes in the other variable does not mean that one variable actually causes the other to change.
If you're interested to learn causality and make predictions, take a step forward and perform linear regression analysis.
The numerical measure of the degree of association between two continuous variables is called the correlation coefficient (r).
The coefficient value is always between -1 and 1 and it measures both the strength and direction of the linear relationship between the variables.
The larger the absolute value of the coefficient, the stronger the relationship:
The coefficient sign (plus or minus) indicates the direction of the relationship.
For better understanding, please take a look at the following correlation graphs:
In statistics, they measure several types of correlation depending on type of the data you are working with. In this tutorial, we will focus on the most common one.
Pearson Correlation, the full name is the Pearson Product Moment Correlation (PPMC), is used to evaluate linear relationships between data when a change in one variable is associated with a proportional change in the other variable. In simple terms, the Pearson Correlation answers the question: Can the data be represented on a line?
In statistics, it is the most popular correlation type, and if you are dealing with a "correlation coefficient" without further qualification, it's most likely to be the Pearson.
Here's the most commonly used formula to find the Pearson correlation coefficient, also called Pearson's R:
At times, you may come across two other formulas for calculating the sample correlation coefficient (r) and the population correlation coefficient (ρ).
Calculating the Pearson correlation coefficient by hand involves quite a lot of math. Luckily, Microsoft Excel has made things very simple. Depending on your data set and your goal, you are free to use one of the following techniques:
To compute a correlation coefficient by hand, you'd have to use this lengthy formula. To find correlation coefficient in Excel, leverage the CORREL or PEARSON function and get the result in a fraction of a second.
The CORREL function returns the Pearson correlation coefficient for two sets of values. Its syntax is very easy and straightforward:
The two arrays should have equal length.
Assuming we have a set of independent variables (x) in B2:B13 and dependent variables (y) in C2:C13, our correlation coefficient formula goes as follows:
Or, we could swap the ranges and still get the same result:
Either way, the formula shows a strong negative correlation (about -0.97) between the average monthly temperature and the number of heaters sold:
To calculate the correlation coefficient in Excel successfully, please keep in mind these 3 simple facts:
The PEARSON function in Excel does the same thing - calculates the Pearson Product Moment Correlation coefficient.
Because PEARSON and CORREL both compute the Pearson linear correlation coefficient, their results should agree, and they generally do in recent versions of Excel 2007 through Excel 2019.
In Excel 2003 and earlier versions, however, the PEARSON function may display some rounding errors. Therefore, in older versions, it is recommended to use CORREL in preference to PEARSON.
On our sample data set, both functions exhibit the same results:
When you need to test interrelations between more than two variables, it makes sense to construct a correlation matrix, which is sometimes called multiple correlation coefficient.
The correlation matrix is a table that shows the correlation coefficients between the variables at the intersection of the corresponding rows and columns.
The correlation matrix in Excel is built using the Correlation tool from the Analysis ToolPak add-in. This add-in is available in all versions of Excel 2003 through Excel 2019, but is not enabled by default. If you have not activated it yet, please do this now by following the steps described in How to enable Data Analysis ToolPak in Excel.
With the Data Analysis tools added to your Excel ribbon, you are prepared to run correlation analysis:
When done, click the OK button:
Your matrix of correlation coefficients is done and should look something like shown in the next section.
In your Excel correlation matrix, you can find the coefficients at the intersection of rows and columns. If the column and row coordinates are the same, the value 1 is output.
In the above example, we are interested to know the correlation between the dependent variable (number of heaters sold) and two independent variables (average monthly temperature and advertising costs). So, we look only at the numbers at the intersection of these rows and columns, which are highlighted in the screenshot below:
The negative coefficient of -0.97 (rounded to 2 decimal places) shows a strong inverse correlation between the monthly temperature and heater sales - as the temperature grows higher, fewer heaters are sold.
The positive coefficient of 0.97 (rounded to 2 decimal places) indicates a strong direct connection between the advertising budget and sales - the more money you spend on advertising, the higher the sales.
Building the correlation table with the Data Analysis tool is easy. However, that matrix is static, meaning you will need to run correlation analysis anew every time the source data change.
The good news is that you can easily build a similar correlation table yourself, and that matrix will update automatically with each change in the source values.
To have it done, use this generic formula:
Important note! For the formula to work, you should lock the first variable range by using absolute cell references.
In our case, the first variable range is $B$2:$B$13 (please notice the $ sign that locks the reference), and our correlation formula takes this shape:
=CORREL(OFFSET($B$2:$B$13, 0, ROWS($1:1)-1), OFFSET($B$2:$B$13, 0, COLUMNS($A:A)-1))
With the formula ready, let's construct a correlation matrix:
As the result, we've got the following matrix with multiple correlation coefficients. Please notice that the coefficients returned by our formula are exactly the same as output by Excel in the previous example (the relevant ones are highlighted):
As you already know, the Excel CORREL function returns the correlation coefficient for two sets of variables that you specify. The main challenge is to supply the appropriate ranges in the corresponding cells of the matrix. For this, you enter only the first variable range in the formula and use the following functions to make the necessary adjustments:
To better understand the logic, let's see how the formula calculates the coefficients highlighted in the screenshot above.
First, let's examine the formula in B18, which finds correlation between the monthly temperature (B2:B13) and heaters sold (D2:D13):
=CORREL(OFFSET($B$2:$B$13, 0, ROWS($1:3)-1), OFFSET($B$2:$B$13, 0, COLUMNS($A:A)-1))
In the first OFFSET function, ROWS($1:1) has transformed to ROWS($1:3) because the second coordinate is relative, so it changes based on the relative position of the row where the formula is copied (2 rows down). Thus, ROWS() returns 3, from which we subtract 1, and get a range that is 2 columns to the right of the source range, i.e. $D$2:$D$13 (heater sales).
The second OFFSET does not change the specified range $B$2:$B$13 (temperature) because COLUMNS($A:A)-1 returns zero.
As the result, our long formula turns into a simple CORREL($D$2:$D$13, $B$2:$B$13) and returns exactly the coefficient we want.
The formula in C18 that calculates a correlation coefficient for advertising cost (C2:C13) and sales (D2:D13) works in a similar manner:
=CORREL(OFFSET($B$2:$B$13, 0, ROWS($1:3)-1), OFFSET($B$2:$B$13, 0, COLUMNS($A:B)-1))
The first OFFSET function is absolutely the same as describe above, returning the range of $D$2:$D$13 (heater sales).
In the second OFFSET, COLUMNS($A:A)-1 changes to COLUMNS($A:B)-1 because we've copied the formula 1 column to the right. Consequently, OFFSET gets a range that is 1 column to the right of the source range, i.e. $C$2:$C$13 (advertising cost).
When doing correlation in Excel, the best way to get a visual representation of the relations between your data is to draw a scatter plot with a trendline. Here's how:
For the detailed step-by-step instructions, please see:
For our sample data set, the correlation graphs look like shown in the image below. Additionally, we displayed R-squared value, also called the Coefficient of Determination. This value indicates how well the trendline corresponds to the data - the closer R2 to 1, the better the fit.
From the R2 value displayed on your scatterplot, you can easily calculate the correlation coefficient:
For example, the R2 value in the second graph is 0.9174339392. So, you can find the correlation coefficient for Advertising and Heaters sold with one of these formulas:
As you can make sure, the coefficients calculated in this way are perfectly in line with the correlation coefficients found in the previous examples, except the sign:
The Pearson Product Moment Correlation only reveals a linear relationship between the two variables. Meaning, your variables may be strongly related in another, curvilinear, way and still have the correlation coefficient equal to or close to zero.
The Pearson correlation is not able to distinguish dependent and independent variables. For example, when using the CORREL function to find the association between an average monthly temperature and the number of heaters sold, we got a coefficient of -0.97, which indicates a high negative correlation. However, you could switch around the variables and get the same result. So, someone may conclude that higher heater sales cause temperature to fall, which obviously makes no sense. Therefore, when running correlation analysis in Excel, be aware of the data you are supplying.
Besides, the Pearson correlation is very sensitive to outliers. If you have one or more data points that differ greatly from the rest of the data, you may get a distorted picture of the relationship between the variables. In this case, you'd be wise to use the Spearman rank correlation instead.
That's how to do correlation in Excel. To have a closer look at the examples discussed in this tutorial, you are welcome to download our sample workbook below. I thank you for reading and hope to see you on our blog next week!
Calculate Correlation in Excel (.xlsx file)
Table of contents