
Statistics - Day 9

  • Writer: supriyamalla
  • Jul 19, 2021
  • 2 min read

Today we'll learn about Regression Analysis!


It is the most common method used for prediction.


Rule #1: Correlation doesn't always imply Causation


Linear regression is a linear approximation of the causal relationship between two or more variables: it models how changing one variable affects another. For a simple regression, the line is ŷ = b₀ + b₁x.


Differences between correlation and regression:

  1. Correlation is symmetric (the correlation of x with y equals that of y with x), while regression runs one way: y is predicted from x

  2. Regression gives a best-fit line, while correlation is a single number (a point)


There are three sums of squares used to assess a regression:

  1. Sum of squares total (SST, TSS): measures the total variability of the dataset, SST = Σ(yᵢ − ȳ)²

  2. Sum of squares regression (SSR): measures how well your line fits the data, SSR = Σ(ŷᵢ − ȳ)²

  3. Sum of squares error (SSE, also called RSS, residual sum of squares): measures the difference between the observed and predicted values, SSE = Σ(yᵢ − ŷᵢ)²

SST = SSR + SSE
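
To make this concrete, here is a minimal numpy sketch on a made-up toy dataset (the numbers are invented purely for illustration) that verifies the identity above:

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

# Fit a simple regression line y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variability
ssr = np.sum((y_hat - y_bar) ** 2)  # variability explained by the line
sse = np.sum((y - y_hat) ** 2)      # unexplained (residual) variability

print(sst, ssr + sse)  # the two numbers match: SST = SSR + SSE
```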


Now, how do we know our regression is good enough?

Luckily - we have a metric called "R squared". It measures the proportion of the variability in y that your regression explains, with values ranging from 0 to 1.

R squared = SSR / SST

For example, an R squared value of 0.41 means x explains 41% of the variability of y.
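
Computing R squared is a one-liner once you have the sums of squares. A small sketch (same made-up numbers as above); for a simple regression with one x, R squared also equals the squared correlation of x and y, which makes a handy cross-check:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R squared = SSR / SST
r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"x explains {r_squared:.0%} of the variability of y")

# Cross-check (simple regression only): R squared = corr(x, y) ** 2
print(np.corrcoef(x, y)[0, 1] ** 2)
```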


OLS (ordinary least squares) is the most common way of estimating the linear regression equation.

It picks the coefficients that minimize SSE, so it denotes the line with the minimum error.
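
Here is a sketch of OLS from scratch with numpy: build a design matrix with an intercept column and solve the least-squares problem directly (toy numbers again):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2])

# Design matrix: a column of ones (intercept) next to x
X = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes SSE = sum((y - X @ b) ** 2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = b
print(f"y_hat = {b0:.3f} + {b1:.3f} * x")
```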


Other methods of regression estimation:

  1. Generalized least squares

  2. Maximum likelihood estimation

  3. Bayesian regression

  4. Kernel regression

  5. Gaussian process regression

In the regression (ANOVA) output, if a coefficient's p-value is below 0.05, you can treat that coefficient as statistically significant.
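
A sketch of reading coefficient p-values, assuming the statsmodels library is available; the data here is simulated, so the numbers are not from any real study:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y genuinely depends on x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=100)

X = sm.add_constant(x)       # add the intercept column
model = sm.OLS(y, X).fit()

# One p-value per coefficient; below 0.05 -> statistically significant
print(model.pvalues)
```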


For multivariate regression, we have something called "adjusted R square". It is almost always lower than R square, because it penalizes the equation for using multiple variables.
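
The usual formula is adjusted R squared = 1 − (1 − R squared)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A tiny sketch with hypothetical numbers:

```python
# Hypothetical values, purely for illustration
r_squared = 0.41  # plain R squared
n = 100           # number of observations
p = 3             # number of predictors

# Adjusted R squared penalizes every extra predictor
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(adj_r_squared)  # ~0.392, slightly below the plain R squared
```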


However, before beginning with any kind of regression, the following assumptions about the variables need to be considered:

1. Linearity - a linear relation exists between x and y

If the data doesn't follow a linear relationship, here are some fixes (sketched after this list):

a. Run a non-linear regression

b. Exponential transformation

c. Log transformation
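
As a sketch of the log-transformation fix, here is made-up data with an exponential shape; a straight line fits log(y) against x well even though it would fit y poorly:

```python
import numpy as np

# Made-up data that grows roughly like e ** x
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

# Fitting a line to (x, log(y)) recovers the exponential relationship
b1, b0 = np.polyfit(x, np.log(y), 1)
print(f"log(y) = {b0:.2f} + {b1:.2f} * x")  # slope ~1, i.e. y ~ e ** x
```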

2. No endogeneity - the errors should not be correlated with the predictors. The classic violation is omitted variable bias; basically, make sure you have included all the relevant variables in the model.


3. Normality and homoscedasticity - the error term is assumed to be normally distributed (normality) and to have equal variance across observations (homoscedasticity).

Fixes:

a. Check for omitted variable bias

b. Look for outliers

c. Log transformation


4. No autocorrelation between the error terms.

Autocorrelation is rarely an issue in cross-sectional data. You usually spot it in time series data, which is a subset of panel data.

Fixes:

a. There is no direct fix within linear regression; use autoregressive (AR) or moving average (MA) models instead.
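
One simple diagnostic (a sketch, not the only one) is the lag-1 correlation of the residuals; the two simulated series below stand in for residuals from a fitted model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for residuals: one independent series, one autocorrelated one
independent = rng.normal(size=200)
autocorrelated = np.cumsum(rng.normal(size=200))  # a random walk

def lag1_autocorr(resid):
    # Correlation between residuals and their one-step-lagged values
    return np.corrcoef(resid[:-1], resid[1:])[0, 1]

print(lag1_autocorr(independent))     # near 0: assumption holds
print(lag1_autocorr(autocorrelated))  # near 1: autocorrelation present
```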

5. No multicollinearity - no two (or more) predictor variables should be highly correlated with each other (see the sketch after the fixes below)

Fixes:

a. Drop one of the two variables

b. Transform them into one
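
A sketch of spotting the problem: inspect the pairwise correlations between predictors. The data is simulated, with x2 deliberately built as a near-copy of x1:

```python
import numpy as np

rng = np.random.default_rng(2)

x1 = rng.normal(size=100)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
x3 = rng.normal(size=100)

# Pairwise correlations; rows and columns are x1, x2, x3
corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 2))
# The x1-x2 entry is close to 1 -> drop one of the two,
# or transform the pair into a single feature (e.g. their average)
```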


How to include categorical data in regression?

The most common way is to assign a numerical value to each category, typically by creating dummy (0/1) variables.


This way you can predict effectively.
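
For instance, a minimal pandas sketch on hypothetical salary/degree data, using dummy (0/1) variables; drop_first=True keeps k-1 dummies so the intercept doesn't become perfectly collinear with them:

```python
import pandas as pd

# Hypothetical data: salary (in thousands) and a categorical degree column
df = pd.DataFrame({
    "salary": [50, 65, 80, 55, 90],
    "degree": ["BSc", "MSc", "PhD", "BSc", "PhD"],
})

# One dummy (0/1) column per category except the first
dummies = pd.get_dummies(df["degree"], drop_first=True).astype(int)
X = pd.concat([df.drop(columns="degree"), dummies], axis=1)
print(X)  # salary plus MSc/PhD indicator columns, ready for regression
```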


Alright, let's do one hands-on exercise and we are ready to take on the world! :)





