Statistics - Day 4

supriyamalla
Jun 29, 2021
2 min read

Using a frequency distribution table for numerical variables: raw numbers rarely repeat, so it is usually best to group them into bins (intervals) and count how many values fall into each one. You can also compute the "relative frequency" (frequency / total frequency) to see what percentage of the total each interval accounts for.
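Here is a minimal Python sketch of such a table. The `ages` data, the bin width of 10 and the helper name `frequency_table` are all my own assumptions for illustration; they are not from the exercises.

```python
from collections import Counter


def frequency_table(values, bin_width, start):
    """Return rows of (low, high, frequency, relative_frequency).

    Each bin is the half-open interval [low, high). Empty bins are skipped.
    """
    # Map every value to the index of the bin it falls into
    counts = Counter((v - start) // bin_width for v in values)
    total = len(values)
    rows = []
    for index in sorted(counts):
        low = start + index * bin_width
        high = low + bin_width
        freq = counts[index]
        rows.append((low, high, freq, freq / total))
    return rows


# Hypothetical sample data (not from the post)
ages = [21, 22, 25, 31, 34, 35, 38, 42, 47, 55]
table = frequency_table(ages, bin_width=10, start=20)

for low, high, freq, rel in table:
    print(f"[{low}, {high}): frequency={freq}, relative frequency={rel:.0%}")
```

The relative frequencies always add up to 1 (100%), which is a handy sanity check.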

Learnt how to create histograms, bar charts, pie charts, scatter plots, etc., and also did some hands-on exercises.

Now, on to the elephant in the room - Measures of central tendency!

Mean - simple average; easily affected by outliers; not enough on its own to draw conclusions

Median - middle value of the sorted dataset. If there are 10 items, take the mean of the 5th and 6th items; not affected by outliers

Mode - the value with the highest frequency; if every item appears only once, there is no mode.

So, which measure of central tendency is best? There is no single best one, and relying on only one is definitely the worst option.
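To see the three measures side by side, here is a small sketch using Python's built-in `statistics` module. The salary figures are made up, with one deliberate outlier at the end.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an outlier
salaries = [30, 32, 35, 35, 38, 40, 41, 43, 45, 300]

mean_value = statistics.mean(salaries)      # pulled up by the outlier
median_value = statistics.median(salaries)  # mean of the 5th and 6th sorted values
modes = statistics.multimode(salaries)      # all values tied for the highest frequency

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

The mean lands far above most of the salaries while the median stays in the middle of them, which is why looking at only one measure can mislead. One caveat: when every value appears exactly once, `multimode` returns all the values rather than an empty list, so the "no mode" convention from the notes above has to be checked separately.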

Calculating Skewness:

Skewness indicates whether the data is concentrated on one side.

Left skew/negative skew - tail of the curve is towards the left => Mean < Median, because the outliers are on the left and pull the mean down

No skew - symmetric distribution (e.g. the normal distribution) => Mean = Median

Right skew/positive skew - tail of the curve is towards the right => Mean > Median, because the outliers are on the right and, as we know, the mean is drawn towards the outliers

You can measure skewness in Excel with the SKEW() function. You can also see skewness in a histogram, but with only a few bins it can be hard to spot. Hence, use the SKEW() formula to measure it accurately.
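Outside Excel, the same idea can be sketched in plain Python. The function below uses the adjusted Fisher-Pearson formula, which as far as I know is the one behind Excel's SKEW(); the function name and the datasets are my own assumptions.

```python
import statistics


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three data points")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical datasets (not from the post)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail to the right
left_skewed = [-x for x in right_skewed]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, "skew:", round(sample_skewness(data), 3),
          "mean:", statistics.mean(data), "median:", statistics.median(data))
```

The sign of the result gives the direction of the skew, and the printed mean and median let you check the Mean vs Median rules above on each dataset.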

Variance: measures the dispersion of a set of data points around their mean

Q1: Why do we square the difference between a data point and the mean?

So that all values are positive (distance is always positive) and negatives and positives don't cancel out
To amplify large differences, so that points far from the mean count for more

Here's a really important video on this: https://www.youtube.com/watch?v=sHRBg6BhKjI&ab_channel=StatQuestwithJoshStarmer
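To convince myself of the first reason, here is a quick sketch with made-up numbers: the raw deviations from the mean always sum to zero, while the squared ones do not.

```python
import statistics

# Hypothetical data (not from the post)
data = [2, 4, 4, 6, 9]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)              # positives and negatives cancel out
squared_deviations = [d ** 2 for d in deviations]

# Population variance divides by n; sample variance divides by n - 1
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

print("sum of deviations:", sum_of_deviations)
print("population variance:", population_variance)
print("sample variance:", sample_variance)
print("statistics.variance:", statistics.variance(data))  # sample variance, for comparison
```

Notice that the single point farthest from the mean (9) contributes more than half of the total squared deviation, which is the "amplifying" effect in action.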

Variance, since it's a squared value, is difficult to compare and comprehend. That is why we often use the standard deviation instead, which is in the same units as the data.

Standard deviation: most common measure of variability for a SINGLE dataset (square root of the variance)

Coefficient of variation: used for comparing multiple datasets (standard deviation / mean)
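A small sketch of why the coefficient of variation helps when comparing datasets. The prices and the exchange rate of 75 are assumptions I made up for the example: the same prices expressed in two currencies have very different standard deviations but the same coefficient of variation.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (the mean must not be zero)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / mean


# Hypothetical prices in one currency, then converted at a made-up rate of 75
prices_a = [1, 2, 3, 4, 5]
prices_b = [p * 75 for p in prices_a]

std_a = statistics.stdev(prices_a)
std_b = statistics.stdev(prices_b)
cv_a = coefficient_of_variation(prices_a)
cv_b = coefficient_of_variation(prices_b)

print("std dev:", std_a, "vs", std_b)
print("coefficient of variation:", cv_a, "vs", cv_b)
```

The standard deviations differ by the conversion factor, so comparing them directly says nothing useful, while the unit-free coefficient of variation shows the two datasets are equally variable.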

Measures of relationship between variables:

Covariance: measure of how two variables vary together. Can be positive (they move in the same direction), 0 (no linear relationship) or negative (they move in opposite directions).

How to calculate? Sample covariance = sum of (x - mean of x) * (y - mean of y) over all pairs, divided by (n - 1)

Correlation coefficient: Covariance / (std dev of dataset 1 * std dev of dataset 2) -> ranges from -1 to 1
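Both measures can be sketched in a few lines of Python. This assumes the sample versions (dividing by n - 1); the function names and the hours/scores data are made up for illustration.

```python
import math


def sample_covariance(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over all pairs, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long datasets with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)


# Hypothetical data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

cov = sample_covariance(hours, scores)
r = correlation(hours, scores)
print("covariance:", cov)
print("correlation coefficient:", r)
```

The covariance depends on the units of both variables, so its size is hard to interpret. Dividing by the two standard deviations rescales it into the -1 to 1 range, which is what makes the correlation coefficient comparable across datasets.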

CORRELATION DOESN'T IMPLY CAUSATION!

Stay Tuned for more information!

AND it's a wrap for today! See you tomorrow :)
