**Correlation Coefficient** is a unit-less measure of [[Correlation]] between two datasets. It ranges from -1 (perfectly _inversely_ correlated) through 0 (no correlation) to +1 (perfectly correlated). You can think of the correlation coefficient as showing how close to _linear_ the relationship is. One way of defining it, per [[Thinking Fast and Slow]], is the proportion of the determining factors that the two measures have in common.

Correlation coefficient is usually denoted with `r`. In [[Scatter Chart]]s, you usually see `r^2`, which is called the **coefficient of determination**.

> [!tip] Usage
> A correlation coefficient of `r = 0.5` would have a coefficient of determination of `r^2 = 0.5^2 = 0.25`. This means 25% of the variability in set `B` is explained by variation in set `A`. As a rough intuition (not a rigorous calculation), if `A` happens and `B` normally has a 50/50 chance, you might nudge that estimate up toward **62.5%**.

The typical cutoff for 'predictive power' is `r = 0.7`, where ~50% of variation is shared (`0.7^2 = 0.49`).

![[Pasted image 20250118114402.png]]

# Real-world Examples

For positive correlation examples, at least.

## Perfect

- Measurements in imperial & measurements in metric - **exactly 1.0, perfectly correlated**

## Strong ~ 0.8 to 0.999

- Time spent running & calories burned
- Height & weight
- Height & shoe size

## Moderate ~ 0.4 to 0.8

- Years of education & income
- Temperature & ice cream sales
- Exercise amount & health outcomes
- Time spent studying & test results

## Weak ~ 0.1 to 0.4

- Shoe size & IQ
- Number of books read & vocabulary size
- [[Income-Fulfillment Curve]] → money & happiness

## None

- Height & phone number
- Favorite color & (most things)

# Calculation

## Manual

This would be tedious. It involves calculating the mean of each dataset, then calculating the deviation-from-the-mean of every datapoint in each dataset.
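Sketched in plain [[Python]], the whole by-hand calculation looks something like this (a minimal Pearson `r` implementation; `pearson_r` is my own name for it):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviation of each datapoint from its dataset's mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum the pairwise products of deviations, then normalize by
    # the spread of each dataset so the result lands in [-1, 1]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariance / spread

print(pearson_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```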
Then you go pairwise: multiply the two deviations together for each pair of terms, sum those products, and divide by the product of the two datasets' spreads. Frankly no part of the algorithm is *hard*, it's just a lot. In essence, if you're comparing X & Y, you're asking whether the terms of X that sit above the mean of X line up with terms of Y that sit above (or, for inverse correlation, below) the mean of Y.

## Practical

- Excel's `CORREL()` function.
- [[Python]]'s pandas library has a function for it:

```python
import pandas as pd

# Sample DataFrame
data = {'col1': [1, 2, 3, 4, 5],
        'col2': [5, 4, 3, 2, 1],
        'col3': [2, 4, 6, 8, 10]}
df = pd.DataFrame(data)

# Calculate correlation between 'col1' and 'col2'
correlation = df['col1'].corr(df['col2'])
print(f"Correlation between col1 and col2: {correlation}")
```

****
# More

## Source

- [[Myself]]
- [[Thinking Fast and Slow]]
- Some examples from Gemini
- ChatGPT for the table

## Related

-