Correlation vs Causation in Data Science
Let’s jump into it right away.
Correlation is a relationship or association between two variables: a movement in one variable is associated with a movement in the other. For example, ice-cream sales go up as the weather turns hot.
A positive correlation means the variables move in the same direction (left plot); a negative correlation means they move in opposite directions (middle plot). The rightmost plot shows the case where there is no correlation between the variables.
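To make the three cases concrete, here is a minimal sketch in R using simulated data. The variable names (temperature, ice_cream_sales, hot_drink_sales, unrelated) and the noise levels are assumptions made up for illustration, not real measurements.

```r
# Simulated data only; every number here is an illustrative assumption.
set.seed(42)
n <- 200
temperature <- rnorm(n, mean = 25, sd = 5)

# Rises with temperature, so the correlation should be positive
ice_cream_sales <- 10 * temperature + rnorm(n, sd = 20)
# Falls as temperature rises, so the correlation should be negative
hot_drink_sales <- -8 * temperature + rnorm(n, sd = 20)
# Generated independently of temperature, so the correlation should be near zero
unrelated <- rnorm(n)

# cor() returns the Pearson correlation coefficient by default
cor(temperature, ice_cream_sales)
cor(temperature, hot_drink_sales)
cor(temperature, unrelated)
```

The first value should come out clearly positive, the second clearly negative, and the third close to zero.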
Causation means that a change in one variable causes a change in another, so one variable depends on the other. It is also called cause and effect. For example, as the weather gets hot, people experience more sunburns. In this case, the hot weather is the cause and sunburn is the effect.
Correlation vs Causation Difference
Let’s try another example with this visualization. Your computer running out of battery causes it to shut down. It also causes the video player to shut down. The computer shutting down and the video player shutting down are correlated events, but the actual cause of both is the battery running out.
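The battery example can be sketched as a small simulation in R. This is a toy model under stated assumptions: the probabilities are invented, and the two shutdown events are generated so that each depends only on the battery, never on each other.

```r
# Toy simulation; all probabilities are illustrative assumptions.
set.seed(1)
n <- 1000

# Hypothetical common cause: 1 when the battery has run out, 0 otherwise
battery_dead <- rbinom(n, size = 1, prob = 0.3)

# Each event depends only on the battery state, not on the other event
computer_off <- rbinom(n, size = 1, prob = ifelse(battery_dead == 1, 0.95, 0.05))
player_off   <- rbinom(n, size = 1, prob = ifelse(battery_dead == 1, 0.95, 0.05))

# Across all observations the two events are strongly correlated
overall <- cor(computer_off, player_off)

# Within a fixed battery state the correlation should largely disappear
within_dead <- cor(computer_off[battery_dead == 1], player_off[battery_dead == 1])
within_ok   <- cor(computer_off[battery_dead == 0], player_off[battery_dead == 0])

c(overall = overall, within_dead = within_dead, within_ok = within_ok)
```

Once the battery state is held fixed, the two shutdown events carry almost no information about each other, which is what you would expect when a shared cause drives both.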
Why is this important in data science?
How many times have you seen studies that imply A causes B? For example, a study might claim that going to the gym results in higher productivity and focus. Is this really causation?
As a data scientist, you should not let correlation bias you, because it can lead to faulty feature engineering and incorrect conclusions.
Correlation does not imply causation.
If you were to build a machine learning model for the relationship between gym attendance and productivity, then instead of focusing on features that are merely correlated (going to the gym), you should focus on the actual causes of high performance (hard work, perseverance, routine, etc.) to validate cause and effect.
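Here is a hedged sketch of how a correlated feature can mislead a model. It is a simulation, not evidence about real gyms: by construction, a hypothetical hidden variable called routine drives both gym_visits and productivity, and gym visits have no direct effect at all.

```r
# Simulated data; the effect sizes are assumptions chosen for illustration.
set.seed(7)
n <- 500

# Hypothetical hidden cause: how disciplined a person's routine is
routine <- rnorm(n)
# routine drives both variables; gym_visits has NO direct effect on productivity here
gym_visits   <- 2 * routine + rnorm(n)
productivity <- 3 * routine + rnorm(n)

# Naive model: only the correlated feature is included
naive <- lm(productivity ~ gym_visits)
# Adjusted model: the actual cause is included as well
adjusted <- lm(productivity ~ gym_visits + routine)

coef(naive)["gym_visits"]     # clearly positive, although gym has no effect
coef(adjusted)["gym_visits"]  # close to zero once routine is accounted for
```

In this simulated setup, the gym coefficient looks important only while the real cause is left out of the model.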
Correlation in R
Let’s say you have a dataset and you want to evaluate whether certain features in it are correlated. I am using the mtcars dataset, one of the built-in datasets in R.
# load the ggcorrplot package for the correlation plot
library(ggcorrplot)
# read mtcars, one of the built-in datasets in R
data(mtcars)
# use the cor function to get the correlation matrix
corr <- cor(mtcars)
# build the correlation plot
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE)
Try it yourself. Copy and paste the above code into R (if the ggcorrplot package is not installed yet, install it first with install.packages("ggcorrplot")).
When you run the code, you should get a correlation plot labeled with the correlation values. A value closer to +1 indicates a strong positive correlation, and a value closer to -1 indicates a strong negative correlation. In the above example, you can observe that disp and wt have a positive correlation of +0.89, whereas mpg and cyl have a negative correlation of -0.85.
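If you only need the numbers rather than the plot, you can look up individual pairs in the correlation matrix by name. The sketch below uses only base R; the cor.test call is an optional extra step, not part of the original example, that checks whether a correlation is statistically distinguishable from zero.

```r
# Base R only; no extra packages are needed for this part
data(mtcars)
corr <- cor(mtcars)

# Look up individual pairs by row and column name
disp_wt <- round(corr["disp", "wt"], 2)   # 0.89
mpg_cyl <- round(corr["mpg", "cyl"], 2)   # -0.85
disp_wt
mpg_cyl

# Optional: test whether the mpg/cyl correlation differs from zero
test <- cor.test(mtcars$mpg, mtcars$cyl)
test$estimate
test$p.value
```

Keep in mind that a small p-value here only says the correlation is unlikely to be zero; it says nothing about which variable, if either, causes the other.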
Causal Impact Methods
Causation is harder to establish than correlation, but it is possible. One of the most common ways to determine causal impact is through experimentation and incremental studies.
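As a rough illustration of experimentation, here is a minimal simulated A/B test in R. Everything in it is an assumption for the sake of the sketch: the sample size, the baseline of 10, the true effect of +0.5, and the noise level. The key idea is that random assignment makes the two groups comparable, so a difference in outcomes can be attributed to the treatment.

```r
# Simulated experiment; the baseline, effect size, and noise are assumptions.
set.seed(123)
n <- 2000

# Randomly assign each user to control (0) or treatment (1)
treatment <- rbinom(n, size = 1, prob = 0.5)

# Hypothetical outcome: baseline of 10 plus a true treatment effect of 0.5
outcome <- 10 + 0.5 * treatment + rnorm(n, sd = 2)

# Estimated effect: difference in mean outcome between the two groups
effect <- mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])
effect

# Welch two-sample t-test; note the reported interval is control minus treatment
result <- t.test(outcome ~ treatment)
result$p.value
result$conf.int
```

Because assignment is random, hidden factors should be balanced across the groups on average, which is what separates this from simply comparing users who happened to behave differently.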
Continue learning about causal impact methods with this video. It covers causal impact methodologies, specifically digital experimentation (A/B testing) and randomization techniques, with real-world examples.
Thanks for Reading!