The tutorial explains the basics of regression analysis and shows a few different ways to do linear regression in Excel.
Imagine this: you are provided with a whole lot of different data and are asked to predict next year's sales numbers for your company. You have discovered dozens, perhaps even hundreds, of factors that can possibly affect the numbers. But how do you know which ones are really important? Run regression analysis in Excel. It will give you an answer to this and many more questions: Which factors matter and which can be ignored? How closely are these factors related to each other? And how certain can you be about the predictions?
In statistical modeling, regression analysis is used to estimate the relationships between two or more variables:
Dependent variable (aka criterion variable) is the main factor you are trying to understand and predict.
Independent variables (aka explanatory variables, or predictors) are the factors that might influence the dependent variable.
Regression analysis helps you understand how the dependent variable changes when one of the independent variables varies and allows to mathematically determine which of those variables really has an impact.
Technically, a regression analysis model is based on the sum of squares, which is a mathematical way to find the dispersion of data points. The goal of a model is to get the smallest possible sum of squares and draw a line that comes closest to the data.
In statistics, they differentiate between a simple and multiple linear regression. Simple linear regression models the relationship between a dependent variable and one independent variables using a linear function. If you use two or more explanatory variables to predict the dependent variable, you deal with multiple linear regression. If the dependent variable is modeled as a non-linear function because the data relationships do not follow a straight line, use nonlinear regression instead. The focus of this tutorial will be on a simple linear regression.
As an example, let's take sales numbers for umbrellas for the last 24 months and find out the average monthly rainfall for the same period. Plot this information on a chart, and the regression line will demonstrate the relationship between the independent variable (rainfall) and dependent variable (umbrella sales):
Mathematically, a linear regression is defined by this equation:
Where:
The linear regression equation always has an error term because, in real life, predictors are never perfectly precise. However, some programs, including Excel, do the error term calculation behind the scenes. So, in Excel, you do linear regression using the least squares method and seek coefficients a and b such that:
For our example, the linear regression equation takes the following shape:
Umbrellas sold = b * rainfall + a
There exist a handful of different ways to find a and b. The three main methods to perform linear regression analysis in Excel are:
Below you will find the detailed instructions on using each method.
This example shows how to run regression in Excel by using a special tool included with the Analysis ToolPak add-in.
Analysis ToolPak is available in all versions of Excel 2019 to 2003 but is not enabled by default. So, you need to turn it on manually. Here's how:
This will add the Data Analysis tools to the Data tab of your Excel ribbon.
In this example, we are going to do a simple linear regression in Excel. What we have is a list of average monthly rainfall for the last 24 months in column B, which is our independent variable (predictor), and the number of umbrellas sold in column C, which is the dependent variable. Of course, there are many other factors that can affect sales, but for now we focus only on these two variables:
With Analysis Toolpak added enabled, carry out these steps to perform regression analysis in Excel:
If you are building a multiple regression model, select two or more adjacent columns with different independent variables.
As you have just seen, running regression in Excel is easy because all calculations are preformed automatically. The interpretation of the results is a bit trickier because you need to know what is behind each number. Below you will find a breakdown of 4 major parts of the regression analysis output.
This part tells you how well the calculated linear regression equation fits your source data.
Here's what each piece of information means:
Multiple R. It is the Correlation Coefficient that measures the strength of a linear relationship between two variables. The correlation coefficient can be any value between -1 and 1, and its absolute value indicates the relationship strength. The larger the absolute value, the stronger the relationship:
R Square. It is the Coefficient of Determination, which is used as an indicator of the goodness of fit. It shows how many points fall on the regression line. The R2 value is calculated from the total sum of squares, more precisely, it is the sum of the squared deviations of the original data from the mean.
In our example, R2 is 0.91 (rounded to 2 digits), which is fairy good. It means that 91% of our values fit the regression analysis model. In other words, 91% of the dependent variables (y-values) are explained by the independent variables (x-values). Generally, R Squared of 95% or more is considered a good fit.
Adjusted R Square. It is the R square adjusted for the number of independent variable in the model. You will want to use this value instead of R square for multiple regression analysis.
Standard Error. It is another goodness-of-fit measure that shows the precision of your regression analysis - the smaller the number, the more certain you can be about your regression equation. While R2 represents the percentage of the dependent variables variance that is explained by the model, Standard Error is an absolute measure that shows the average distance that the data points fall from the regression line.
Observations. It is simply the number of observations in your model.
The second part of the output is Analysis of Variance (ANOVA):
Basically, it splits the sum of squares into individual components that give information about the levels of variability within your regression model:
The ANOVA part is rarely used for a simple linear regression analysis in Excel, but you should definitely have a close look at the last component. The Significance F value gives an idea of how reliable (statistically significant) your results are. If Significance F is less than 0.05 (5%), your model is OK. If it is greater than 0.05, you'd probably better choose another independent variable.
This section provides specific information about the components of your analysis:
The most useful component in this section is Coefficients. It enables you to build a linear regression equation in Excel:
For our data set, where y is the number of umbrellas sold and x is an average monthly rainfall, our linear regression formula goes as follows:
Y = Rainfall Coefficient * x + Intercept
Equipped with a and b values rounded to three decimal places, it turns into:
Y=0.45*x-19.074
For example, with the average monthly rainfall equal to 82 mm, the umbrella sales would be approximately 17.8:
0.45*82-19.074=17.8
In a similar manner, you can find out how many umbrellas are going to be sold with any other monthly rainfall (x variable) you specify.
If you compare the estimated and actual number of sold umbrellas corresponding to the monthly rainfall of 82 mm, you will see that these numbers are slightly different:
Why's the difference? Because independent variables are never perfect predictors of the dependent variables. And the residuals can help you understand how far away the actual values are from the predicted values:
For the first data point (rainfall of 82 mm), the residual is approximately -2.8. So, we add this number to the predicted value, and get the actual value: 17.8 - 2.8 = 15.
If you need to quickly visualize the relationship between the two variables, draw a linear regression chart. That's very easy! Here's how:
This will insert a scatter plot in your worksheet, which will resemble this one:
As you may notice, the regression equation Excel has created for us is the same as the linear regression formula we built based on the Coefficients output.
At this point, your chart already looks like a decent regression graph:
Still, you may want to make a few more improvements:
And this is how our improved regression graph looks like:
Microsoft Excel has a few statistical functions that can help you to do linear regression analysis such as LINEST, SLOPE, INTERCPET, and CORREL.
The LINEST function uses the least squares regression method to calculate a straight line that best explains the relationship between your variables and returns an array describing that line. You can find the detailed explanation of the function's syntax in this tutorial. For now, let's just make a formula for our sample dataset:
=LINEST(C2:C25, B2:B25)
Because the LINEST function returns an array of values, you must enter it as an array formula. Select two adjacent cells in the same row, E2:F2 in our case, type the formula, and press Ctrl + Shift + Enter to complete it.
The formula returns the b coefficient (E1) and the a constant (F1) for the already familiar linear regression equation:
y = bx + a
If you avoid using array formulas in your worksheets, you can calculate a and b individually with regular formulas:
Get the Y-intercept (a):
=INTERCEPT(C2:C25, B2:B25)
Get the slope (b):
=SLOPE(C2:C25, B2:B25)
Additionally, you can find the correlation coefficient (Multiple R in the regression analysis summary output) that indicates how strongly the two variables are related to each other:
=CORREL(B2:B25,C2:C25)
The following screenshot shows all these Excel regression formulas in action:
That's how you do linear regression in Excel. That said, please keep in mind that Microsoft Excel is not a statistical program. If you need to perform regression analysis at the professional level, you may want to use targeted software such as XLSTAT, RegressIt, etc.
To have a closer look at our linear regression formulas and other techniques discussed in this tutorial, you are welcome to download our sample Regression Analysis in Excel workbook.
96 responses to "Linear regression analysis in Excel"
this was so helpful. hoping to learn more from this site
So helpful!! Thank you!
Thanks! Good easy to follow guide.
For the regression analysis..where do I down load the date for the 'Ranfall-Umbrella" example..? Please advise..Thanks
Hi Shankar,
You can find all the source data in our Regression Analysis Sample Workbook.
Thanks Ms.Svetlana for the quick response. Your tutorial was very easy to understand as it went step by step, hand holding a novice..which is very important..! I really appreciate your effort in making complex issues simple. Do you have similar tutorial on Multiple regression, Pricing optimization ,Price bundling etc( in Excel), Decision tree Analysis etc.
Would appreciate your advise.
Regards
Shankar
The article is very nice keep posting
hi! the article is very helpful, just a quick question, why is it that i cannot add a trendline? :( pls help
thank you ms. for your help
very useful guide.thanks
So I completed the regression analysis for my assignment. So how do I figure out what actually goes on the graph? I tried two different sets of data, one produced two points on the graph and the other did not produce anything close to what your examples were. SOS
Wow! Thank you. I had to learn this for my statistics homework.
Very good notes
Why does # of observations equal # data points minus 1? It is not immediately apparent in your example because you include the title rows in your input ranges.
Besides this one issue the article is great and extremely informative
Hi Braden,
It's a very good question.
I included the header row in the input ranges to make it easier to interpret the regression analysis output, based on the column headers. The number of observations equals the number of data points (24 observations, rows 2-25); the header row is not counted because the Labels box was selected.
do you have any post about what is difference between standardized versus non-standardized coefficient? i don't get it
so helpful! will be using this site more often
Hi Svetlana,
Your article is very nice and its self explanatory for beginners like me.
Highly informative. I love that. God blesses you.
that was really really helpful.
Am definitely getting an A with the explanation of these awesome work. Thank you
Very clear, helped me a ton. Thank you so much :)
my Significance F value is 6.07596E-31
what does it mean?
Hi Ali,
It is a scientific notation that replaces part of the number with E+n, where E (exponent) multiplies the preceding number by 10 to the nth power. That is, 6.07596E-31 equals 6.07596x10^-31 (6.07596 times 10 to the -31st power).
The Significance F value measures the reliability of the results. If it is less than 0.05 (5%), your model is OK. In your case, Significance F is far less than 5%, so your results are statistically significant.
Hey! Can you also provide this data?
Around 500 enteries if you have! please upload that too
Hey Mona, what data hey ! You send data 15000.
Thanks a lot, it was too useful!
Thanks a lot! It`s very interesting and useful!
Really helpful and easy to understand. Thanks.
Mrs. Svetlana,
Congratulations for great work on this topic.
You made it easy to understand in short time.
BRAVO
Thanks a lot.
Great explanation, much appreciated Excel Functions there!
Thank you so much
Please also send the link. How to do multiple regression, non-linear regression
With your best Regards
Tanveer
Very helpful. Detailed and clear explanation.
I have two variables say time and one predictor variable. However, the predictor variable is classified into regions. Is there a possibility of having that captured in a regression output as a grouping variable in excel?
This was so useful.. Thanks a ton
THIS IS EXCELLENT, IT HAS HELP ME DOING MY PROJECT.
So helpful. Thanks!
Hi Svetlana,
Your post is amazing; it must have helped millions including me.
One simple confusion; now i can i forecast using this regression concept. So if have to predict future, i can use this concept.can you take one example.
Many thanks in advance.
Hi Syed,
Microsoft Excel has special functions and features to predict future values, and we have a couple of tutorials on those too:
How to forecast in Excel: linear and non-linear forecasting methods
Excel FORECAST and other forecasting functions with formula examples
Thanks a lot ma'am
Hi Svetlana,
Thanks for your detailed and well-written article on regression in the Analysis Toolpak and also for mentioning RegressIt as a professional-grade alternative. I'd like your readers to know that RegressIt is free (unlike XLSTAT) and has many features that are designed to help users learn and apply best practices of regression modeling. Variables are selected from a list of names (rather than by entering coordinates of cell ranges), and there are tools for testing model assumptions, comparing models side by side, and sharing results with others in presentation quality format. The graphs and tables that it produces are far superior to what you get with the Analysis Toolpak, both in terms of design and in terms of the set of options that are available. RegressIt also includes very detailed built-in teaching notes that can be embedded in the model worksheets, and it has features that help instructors to grade and verify the originality of work submitted by students. Also, unlike the Analysis Toolpak, it has the capability to forecast from a regression model (including an option for interactive confidence limits on forecast charts), which addresses the question raised by one of your respondents (Syed, post #34). And one more thing: it includes a user-friendly interface with R that allows users to run both linear and logistic regression models in R without writing any code. This feature allows more sophisticated model testing to be carried out and provides a gentle introduction to the R environment for those without programming experience. I encourage you to mention some of these features to your readers, either in this list of comments or a separate post.
Cheers,
--Bob
The tutorial was easy to understand and was also helpful. I am no longer a novice as far as regression analysis is concern. Thanks.
Thank you so much @ Svetlana. It's been really helpful
Hi Mam,
I find it extremely helpful for my M.Tech project where i have to perform regression analysis. But can you please tell me if we can use this LINEST function can be implemented for lognormal equations as ln(y)=a+bln(x)....
Thank you very much for the information . My assignment became very easy and understandable through the information provided here. It gave me the conceptual clarity.
Heaps of thanks!
Thanks for this.
Any link for Logistic Regression?
Very helpful. Thanks for that.
Very helpful, so clear to understand though I've already studied statistic for several weeks. I've already saved it. Thanks very much!
Thanks a lot, so easy to understand
Indeed very detailed and helpful. Thanks so much
Awesome!
Very well explained
It was very helpfuk to me and easy to learn method
Wrong information given here:
Written 'If you use two or more explanatory variables to predict the independent variable, you deal with multiple linear regression.'
It is Not the independent variable you predict. It is the dependent variable.
Hi!
Of course, you predict the dependent variable. Sorry for that typo, fixed.
Very helpful for uni thank you very much - so much information which really helped and explained things I could not find the answers to anywhere but here, many thanks :)
bob the bacteria
Simple and well explained!
Hi Svetlana,
I really like your explanations for linear regression, but I am confused about your explanation on Significance F value. In the example you provided, you explained that If Significance F is less than 0.05 (5%), your model is OK. If it is greater than 0.05, you'd probably better choose another independent variable.
But When responding to Ali's question whose Significance F value is 6.07596E-31, you said " in your case, Significance F is far less than 5%, so your results are statistically significant." So between (0.05) and (6.07), which one is greater than the other one?
I have run my regression and my Significance F value is 0.005590647. Is this value (0.005) greater or less than (0.05)?
Please, I am confused.
Hi!
6.07 is of course greater than 0.05. However, 6.07596E-31 is not 6.07, it is 0.000000000000000000000000000000607596! 6.07596E-31 is a special format (Scientific notation) used by Excel to display very large and very small numbers in a compact way. When responding to Ali's question I briefly explained about the Scientific format, you can find more info here: Scientific notation format in Excel.
In your case, Significance F (0.005590647) is also less than 0.05 - the more zeros after the decimal point the smaller the number.
Wow, first excel tutorial I read that is clear from A to Z...nice!
I agree with Andre above. this is the clearest tutorial ever. So easy to follow.
Further question, how do you deal with blank spaces in the data? If I chart it, I can still get a trend line and equation but if I try to work directly from the data it balks.
Thank you very much
This write-up has really helped me, but I'm left with one other question:
If I am using linear regression on a standard curve (say response over concentration) to obtain an equation that I can use to determine the concentration of unknown samples, how do I determine the uncertainty of the concentration value that this equation yields?
Thank you for this tutorial, it was a lifesaver!
Hi!
Also for me it was really helpful!
Unfortunately I cannot produce a graph for a multiple linear regression. Is Excel not able to show it or do you have a tutorial about it, which could help me?
Thanks!
Very well explained! Thanks.
thanks a lot. really helped.
Best explanation of regression ever. Very simple, clear and easy to understand. Thank you so much and will keep a tab on your tutorials.
Hi Svetlana. I never place my comments but your tutorial is worth it! All other pages either just show how its done or explain it very "statistically". But you explained it like teaching to a child. Thank you again for this!
Thank you very much. The best explanation I've found. Wonderful!!!
Thanks a lot....best explanation ever.
I just want to say thank you for this well descriptive explanation of how to add in the data analysis, as well as explaining what to type in for the x and y ranges and what I could learn from the analysis once it was completed. This has helped me tremendously. Thank you!!!!
required to chart a linear regression line, but it makes creating statistics tables simpler. To verify if installed, select Data from the toolbar. If Data Analysis is an option, the feature is installed and ready to use. If not installed, you can request this option by clicking on the Office button and selecting Excel options .
This was an amazing explanation, thank you very much ! :)
Thank you so much for this very clear and helpful tutorial! It was really excellent.
Hello,
I would like to know the mathematical formulas that Excel uses to calculate the linear regression coefficients. Can you help me?
Umar
Very nice explanation. I have data in row wise instead of column wise. I am unable to run the regression. Please can you help me out in performing regression with row wise data
1.True or false: In simple regression analysis, if the intercept is negative, then there is a negative
correlation between the dependent variable and the independent variable .
2. True or false: The range of values that indicates that there is a significant difference between the
value of the sample statistic and the proposed parameter value is called the rejection
region.
well presented ,made so simple to get complete idea!
Thanks Ms, great introduction specially for excel regression model
thank you so much very helpful!
thanks very much, it was very very helpful. it aided me complete my assignment.
This is the best clarification I have ever received in recent times. Thank you very much as this has just assisted with my data analysis for my MSc dissertation. I was at a crossroads until I saw this.
Thanks very much you are a saving grace.
Hi
How can I run the Data Analysis regression and ignore data inputs such as "-" in the middle of the data table?
Great job on equation types To .be used in régressions.
However in computing coefficient values, I cannot find the numeric equivalence of ^(1,2) in the formula, say,
LINEST(C2:C13,B2:B13^(1,2),1).
In other words, replacing the above two vactors by their row correspondance how does the above formula computes the coefficient values.
Thank you
very interesting and precise guide.
Simple and easy to understand. Well explained. Thank you!
Very user friendly resource to understand.
How can you calculate the average monthly rainfall in here ..
Thnak you for sharing this knowledge with us.
Thanks for sharing the ARTICLE. Very helpful
Super helpful! Thanks!
Mrs. Svetlana,
Thank you for this post, it really helped me. Now if you do not mind I would like to ask a further question. If the alpha value of this equation was 0.05, should I accept or reject the null hypothesis? Any help or guidance in this matter would be much appreciated. Thank you for your time.
Best Regards,
Justin
very helpful
Thank you!! Svetlana Cheusheva
You are an absolute beast! Thank you so much. This helped a ton. Excel is so powerful!
Thank you very much. I wish you more wins in all your affairs. Regards from Nigeria. 🙏