
Statistics - Day 9

  • Writer: supriyamalla
  • Jul 19, 2021
  • 2 min read

Today we'll learn about Regression Analysis!


It is the most common method used for prediction.


Rule #1: Correlation doesn't always imply Causation


Linear regression is a linear approximation of the causal relationship between two or more variables: it models how changing one variable affects another. For a simple regression, the line is ŷ = b₀ + b₁x.


Differences between correlation and regression:

  1. Correlation is symmetric (the correlation of x with y equals that of y with x), while regression runs one way: y is predicted from x

  2. Regression gives a best-fit line, while correlation is a single number (a point)


There are three sums of squares used to assess a regression:

  1. Sum of squares total (SST, TSS): measures the total variability of the dataset, SST = Σ(yᵢ − ȳ)²

  2. Sum of squares regression (SSR): measures how well your line fits the data, SSR = Σ(ŷᵢ − ȳ)²

  3. Sum of squares error (SSE, also called RSS, residual sum of squares): measures the difference between the observed and predicted values, SSE = Σ(yᵢ − ŷᵢ)²

SST = SSR + SSE
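
To make this concrete, here is a minimal numpy sketch on a made-up toy dataset (the numbers are invented purely for illustration) that verifies the identity above:

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

# Fit a simple regression line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variability
ssr = np.sum((y_hat - y_bar) ** 2)  # variability explained by the line
sse = np.sum((y - y_hat) ** 2)      # unexplained (residual) variability

print(sst, ssr + sse)  # the two numbers match: SST = SSR + SSE
```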


Now, how do we know our regression is good enough?

Luckily - we have a metric called "R squared". It measures the proportion of the variability in y that your regression explains, with values ranging from 0 to 1.

R squared = SSR / SST

For example, an R squared value of 0.41 means x explains 41% of the variability of y.
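
Computing R squared is a one-liner once you have the sums of squares. A small sketch (same made-up numbers as above); for a simple regression with one x, R squared also equals the squared correlation of x and y, which makes a handy cross-check:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R squared = SSR / SST
r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"x explains {r_squared:.0%} of the variability of y")

# Cross-check (simple regression only): R squared = corr(x, y) ** 2
print(np.corrcoef(x, y)[0, 1] ** 2)
```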


OLS (ordinary least squares) is the most common way of estimating the linear regression equation.

It picks the coefficients that minimize SSE, so it denotes the line with the minimum error.
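
Here is a sketch of OLS from scratch with numpy: build a design matrix with an intercept column and solve the least-squares problem directly (toy numbers again):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

# Design matrix: a column of ones (intercept) next to x
X = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes SSE = sum((y - X @ b) ** 2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = b
print(f"y_hat = {b0:.3f} + {b1:.3f} * x")
```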


Other methods of regression estimation:

  1. Generalized least squares

  2. Maximum likelihood estimation

  3. Bayesian regression

  4. Kernel regression

  5. Gaussian process regression

In the regression (ANOVA) output, if a coefficient's p-value is below 0.05, you can treat that coefficient as statistically significant.
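
A sketch of reading coefficient p-values, assuming the statsmodels library is available; the data here is simulated, so the numbers are not from any real study:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y genuinely depends on x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=100)

X = sm.add_constant(x)       # add the intercept column
model = sm.OLS(y, X).fit()

# One p-value per coefficient; below 0.05 -> statistically significant
print(model.pvalues)
```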


For multivariate regression, we have something called "adjusted R square". It is almost always lower than R square, because it penalizes the equation for using multiple variables.
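
The usual formula is adjusted R squared = 1 − (1 − R squared)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A tiny sketch with hypothetical numbers:

```python
# Hypothetical values, purely for illustration
r_squared = 0.41  # plain R squared
n = 100           # number of observations
p = 3             # number of predictors

# Adjusted R squared penalizes every extra predictor
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(adj_r_squared)  # ~0.392, slightly below the plain R squared
```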


However, before beginning with any kind of regression, the following assumptions about the variables need to be considered:

1. Linearity - a linear relation exists between x and y

If the data doesn't follow a linear relationship, here are some fixes (sketched after this list):

a. Run a non-linear regression

b. Exponential transformation

c. Log transformation
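
As a sketch of the log-transformation fix, here is made-up data with an exponential shape; a straight line fits log(y) against x well even though it would fit y poorly:

```python
import numpy as np

# Made-up data that grows roughly like e ** x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

# Fitting a line to (x, log(y)) recovers the exponential relationship
b1, b0 = np.polyfit(x, np.log(y), 1)
print(f"log(y) = {b0:.2f} + {b1:.2f} * x")  # slope ~1, i.e. y ~ e ** x
```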

2. No endogeneity - the errors should not be correlated with the predictors. The classic violation is omitted variable bias; basically, make sure you have included all the relevant variables in the model.


3. Normality and homoscedasticity - the error term is assumed to be normally distributed (normality) and to have equal variance across observations (homoscedasticity).

Fixes:

a. Check for omitted variable bias

b. Look for outliers

c. Log transformation


4. No autocorrelation between the error terms.

Autocorrelation is rarely an issue in cross-sectional data. You usually spot it in time series data, which is a subset of panel data.

Fixes:

a. There is no direct fix within linear regression; use autoregressive (AR) or moving average (MA) models instead.
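
One simple diagnostic (a sketch, not the only one) is the lag-1 correlation of the residuals; the two simulated series below stand in for residuals from a fitted model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for residuals: one independent series, one autocorrelated one
independent = rng.normal(size=200)
autocorrelated = np.cumsum(rng.normal(size=200))  # a random walk

def lag1_autocorr(resid):
    # Correlation between residuals and their one-step-lagged values
    return np.corrcoef(resid[:-1], resid[1:])[0, 1]

print(lag1_autocorr(independent))     # near 0: assumption holds
print(lag1_autocorr(autocorrelated))  # near 1: autocorrelation present
```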

5. No multicollinearity - no two (or more) predictor variables should be highly correlated with each other (see the sketch after the fixes below)

Fixes:

a. Drop one of the two variables

b. Transform them into one
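
A sketch of spotting the problem: inspect the pairwise correlations between predictors. The data is simulated, with x2 deliberately built as a near-copy of x1:

```python
import numpy as np

rng = np.random.default_rng(2)

x1 = rng.normal(size=100)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
x3 = rng.normal(size=100)

# Pairwise correlations; rows and columns are x1, x2, x3
corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 2))
# The x1-x2 entry is close to 1 -> drop one of the two,
# or transform the pair into a single feature (e.g. their average)
```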


How to include categorical data in regression?

The most common way is to assign a numerical value to each category, typically by creating dummy (0/1) variables.


This way you can predict effectively.
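
For instance, a minimal pandas sketch on hypothetical salary/degree data, using dummy (0/1) variables; drop_first=True keeps k-1 dummies so the intercept doesn't become perfectly collinear with them:

```python
import pandas as pd

# Hypothetical data: salary (in thousands) and a categorical degree column
df = pd.DataFrame({
    "salary": [50, 65, 80, 55, 90],
    "degree": ["BSc", "MSc", "PhD", "BSc", "PhD"],
})

# One dummy (0/1) column per category except the first
dummies = pd.get_dummies(df["degree"], drop_first=True).astype(int)
X = pd.concat([df.drop(columns="degree"), dummies], axis=1)
print(X)  # salary plus MSc/PhD indicator columns, ready for regression
```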


Alright, let's do one hands-on exercise and we are ready to take on the world! :)





