So far, our discussion of statistics has been limited to the analysis of a single population. Often, though, we want to compare two or more populations. In polling, for example, a common problem is to compare the answers given to two or more questions. In this post, we’ll discuss how to quantify the relationship between two random variables.
In the post on Expectation, we defined the variance of a random variable X as:

Var(X) = E[(X − μX)²]

where μX is the population mean of X. Loosely speaking, Var(X) is a measure of the spread of the values of X around the mean. If we defined a second random variable Y, we could perform a similar calculation for Var(Y).
Now let’s create a measure that compares X and Y: the covariance between the two.

Cov(X, Y) = E[(X − μX)(Y − μY)]
But what exactly are we measuring? Let’s assume that X and Y are random variables such that, when X is far from its mean, so is Y. As a result, (X − μX)(Y − μY) is large. But if X is far from its mean when Y is close to its own, (X − μX)(Y − μY) is small. Cov(X, Y) is therefore a measure of how well X and Y track each other, relative to their means.
This form of Cov(X, Y) is intuitive, but difficult to calculate. However, a little algebra can fix that. Expanding the product and applying the linearity of expectation:

Cov(X, Y) = E[XY − X·μY − Y·μX + μX·μY] = E[XY] − μX·μY
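As a quick numerical check, the expanded form agrees with the direct definition. This sketch uses arbitrary made-up sample values (not data from the post) and computes the population covariance both ways:

```python
# Numerical check that E[XY] - mu_x*mu_y matches E[(X - mu_x)(Y - mu_y)].
# The sample values are arbitrary illustrative data.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 7.0]
n = len(x)

mu_x = sum(x) / n
mu_y = sum(y) / n

# Direct definition: average of the products of deviations
cov_direct = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n

# Expanded form: E[XY] - mu_x * mu_y
cov_expanded = sum(xi * yi for xi, yi in zip(x, y)) / n - mu_x * mu_y

print(cov_direct, cov_expanded)  # both 4.25 for this data
```

Both expressions produce the same number; the expanded form is usually easier to compute because E[XY] and the two means can each be tallied in a single pass.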
Drawing Cards from a Deck
Even in this form, calculating covariances is tricky. Let’s illustrate this with an example. Consider drawing 5 cards from a well-shuffled poker deck without replacement. Let J be the number of Jacks drawn and S be the number of Spades. To calculate E[JS] we need to fill out the Joint Probability Mass Function as a table with all possible values of J (0–4) and S (0–5).
For example, using what we know of Combinatorics, the probability of not drawing any Spades or Jacks, P(S=0, J=0), is:

P(S=0, J=0) = C(36, 5) / C(52, 5) ≈ 0.145

since there are 36 cards in the deck that are neither Spades nor Jacks. Similarly, we can compute the probability of drawing 3 Spades and no Jacks, as well as no Spades and 3 Jacks, as:

P(S=3, J=0) = C(12, 3) · C(36, 2) / C(52, 5) ≈ 0.0533
P(S=0, J=3) = C(3, 3) · C(36, 2) / C(52, 5) ≈ 0.000242
since there are 12 Spades in the deck (that aren’t Jacks) and 3 Jacks in the deck (that aren’t Spades). It gets even trickier when computing the probabilities of drawing multiple Spades and Jacks, since one card, the Jack of Spades, falls into both categories. For example:

P(S=1, J=3) = [C(12, 1) · C(3, 3) · C(36, 1) + C(3, 2) · C(36, 2)] / C(52, 5) ≈ 0.000893

since this combination can be achieved with either 1 non-Jack Spade, all 3 non-Spade Jacks, and 1 other card, or the Jack of Spades, no additional Spades, 2 additional Jacks, and 2 other cards. Still other combinations are trivial, such as:
P(S=5, J=4) = 0

since it’s not possible to draw 5 Spades if 4 of the 5 cards are Jacks (at most one Jack, the Jack of Spades, is also a Spade). Following this logic, we can fill out the complete joint distribution table.
We can have some confidence in these results by noting that the sums of the individual rows and columns equal the probabilities from the one-dimensional marginal PMFs. For example:

P(J=1) = C(4, 1) · C(48, 4) / C(52, 5) ≈ 0.299
P(S=4) = C(13, 4) · C(39, 1) / C(52, 5) ≈ 0.0107

which match the sums of the 2nd column and 5th row, respectively.
So after all that, we can now compute the covariance between J and S. First we need the expected values:

E[J] = 5 · (4/52) = 5/13 ≈ 0.385
E[S] = 5 · (13/52) = 5/4 = 1.25
E[JS] = Σ s · j · P(S=s, J=j) = 25/52 ≈ 0.481

so that Cov(J, S) = E[JS] − E[J] · E[S] = 25/52 − (5/13)(5/4) = 0.
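The whole joint table and the resulting covariance can be sanity-checked in a few lines of Python. This is a sketch (the function and variable names are mine); it splits each cell on whether the Jack of Spades, the one card in both categories, is in the hand:

```python
from math import comb

def c(n, k):
    # comb() that returns 0 for out-of-range k, so impossible cases vanish
    return comb(n, k) if 0 <= k <= n else 0

DECK = comb(52, 5)

def p(s, j):
    """P(S = s Spades and J = j Jacks in a 5-card hand).

    12 Spades aren't Jacks, 3 Jacks aren't Spades, 36 cards are neither,
    and 1 card (the Jack of Spades) is both.
    """
    without_js = c(12, s) * c(3, j) * c(36, 5 - s - j)          # Jack of Spades not drawn
    with_js = c(12, s - 1) * c(3, j - 1) * c(36, 6 - s - j)     # Jack of Spades drawn
    return (without_js + with_js) / DECK

cells = [(s, j) for s in range(6) for j in range(5)]
e_j = sum(j * p(s, j) for s, j in cells)
e_s = sum(s * p(s, j) for s, j in cells)
e_js = sum(s * j * p(s, j) for s, j in cells)

cov = e_js - e_j * e_s
print(e_j, e_s, cov)  # 0.3846..., 1.25, ~0.0
```

Carrying the exact fractions through, the covariance comes out to essentially zero; any tiny nonzero value in a hand-rounded table is numerical noise.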
The covariance turns out to be essentially zero. In fact, because suits and ranks divide the deck independently (exactly one of the four Jacks is a Spade, matching 4 · 13/52 = 1), the number of Jacks and the number of Spades are exactly uncorrelated; any small negative value that appears when working from a rounded table is numerical noise, not a real relationship. This also highlights a weakness of the raw covariance: it’s unbounded, so it’s hard to judge the significance of any particular magnitude on its own.
It can be shown that the covariance is bounded in magnitude by the product of the standard deviations of the marginal distributions, so we can use this to make a more meaningful measure of correlation. The Correlation Coefficient, ρ, is defined as:

ρ = Cov(X, Y) / (σX · σY)
As with the covariance, the sign of the Correlation Coefficient indicates the ‘direction’ of the correlation, and a value of 0 implies there is no linear relationship between X and Y. The magnitude of ρ is bounded by 1; |ρ| = 1 indicates a perfect linear relationship between the two variables. The best way to illustrate the effect of the Correlation Coefficient on two random variables is to plot them.
As the magnitude of the Correlation Coefficient approaches 1, the scatter plot of the two values forms a more distinct line. A negative coefficient indicates a line with negative slope.
Returning to our card example:

ρJS = Cov(J, S) / (σJ · σS) = 0

An interesting result indeed. Intuition suggests a negative relationship, since the more Jacks you draw, the fewer Spades you’d expect, but that effect cancels out exactly, and J and S are uncorrelated despite the fact that they are dependent (drawing all 4 Jacks, for instance, forces at least 1 Spade into your hand). Correlation only captures linear association, and here there is none.
As we learned in Sampling a Population, it’s rare that we have complete knowledge of the probabilities of all outcomes of a population. Instead, our knowledge is usually limited to a small sampling of the total population. From that sample, we must make (hopefully) unbiased estimates of important parameters such as mean and variance for the entire population.
The Sample Correlation Coefficient, r, estimates ρ from data (it is a consistent, though not exactly unbiased, estimator). Using the sample standard deviations of two random variables X and Y:

sX = √( Σ(xi − x̄)² / (n − 1) ),  sY = √( Σ(yi − ȳ)² / (n − 1) )

and the Sample Covariance:

sXY = Σ(xi − x̄)(yi − ȳ) / (n − 1)

we get r = sXY / (sX · sY).
As with the Correlation Coefficient for the entire population, r ranges from -1 to 1 with 0 indicating that the two values are completely uncorrelated.
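The formula translates directly into code. This sketch uses made-up illustrative numbers (heights and weights, not the census data discussed below); note that the (n − 1) factors cancel, so they can be dropped:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r = s_xy / (s_x * s_y).

    The (n - 1) denominators in the sample covariance and sample
    standard deviations cancel, so the sums are used directly.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sxy / (sx * sy)

# Made-up illustrative data: r close to +1, a strong positive correlation
heights = [160, 165, 170, 175, 180]
weights = [55, 60, 62, 70, 75]
print(round(pearson_r(heights, weights), 3))
```

For real work, `statistics.correlation` (Python 3.10+) computes the same quantity.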
Let’s conduct a simple experiment. We would like to study the divorce rates by U.S. state against education level. A good source of data like this is the U.S. Census Bureau’s Current Population Survey (CPS). The CPS is a continuously updated collection of statistics about the U.S. population that can be categorized in any combination. From the CPS, we can create a table that compares these two figures by state.
Source: census.gov/CPS for data through 2014
Equivalently, we can plot this data in a scatter diagram, with each point representing one of the 25 states randomly selected in the data:
Since the data points in this plot form a fairly straight line, we can reasonably expect a strong correlation between education level and divorce rate. Calculating the Sample Correlation Coefficient for the 25 randomly selected states:
confirms this conclusion.
The Golden Rule
When I was learning this material for the first time, my professors would tell me that, if I only learn one thing in their class, it’s this:
Correlation does not imply causation.
That is, a high Correlation Coefficient does not imply that changes in one variable cause changes in the other. Our divorce vs. education example above is a classic case. While we’ve determined there’s a strong correlation between the two, it would be nonsense to conclude that graduating college causes you to get divorced. As such, we can’t control the number of divorces by regulating the number of college graduates, but we can certainly predict the divorce rate of a 26th state, within some tolerance, when given its graduation rate.
Unfortunately, this leap from correlation to causation is made all the time. In a recent NewsDay article, researchers announced:
“Investigators recently tested the preserved samples for 25-hydroxyvitamin D, a compound produced by the liver from the vitamin [D], and found that patients with high levels had a lower-than-average risk of developing colorectal tumors.”
Causation is established not through epidemiological research like this, but through controlled clinical studies backed up by experimentation. Researchers at the Dana-Farber Cancer Institute in Boston understand this and are gathering experimental data to back up these statistics, but that is not always the case.
We’ve also shown that the opposite can be true; two variables can be dependent yet uncorrelated. In our card-drawing experiment, the number of Spades drawn clearly depends on the number of Jacks drawn, and vice versa, yet the correlation between the two is essentially zero.
To review, the test for dependency is often more subjective than that for correlation. In many cases, common sense dictates: the outcomes of rolling two dice are naturally independent, for example. Mathematically, X and Y are independent exactly when:

P(X = x, Y = y) = P(X = x) · P(Y = y)

for all values of x and y; if this equality fails for any pair, the variables are dependent.
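For the two-dice example, the independence condition can be checked exhaustively. A minimal sketch, using exact fractions so no rounding can hide a violation:

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two fair dice: each ordered pair has probability 1/36
joint = {(a, b): Fraction(1, 36) for a, b in product(range(1, 7), repeat=2)}

# Marginal distributions, obtained by summing the joint over the other die
p_a = {a: sum(p for (x, _), p in joint.items() if x == a) for a in range(1, 7)}
p_b = {b: sum(p for (_, y), p in joint.items() if y == b) for b in range(1, 7)}

# Independence: P(A=a, B=b) == P(A=a) * P(B=b) for every (a, b)
independent = all(joint[(a, b)] == p_a[a] * p_b[b] for a, b in joint)
print(independent)  # True
```

The same check applied to the Jacks-and-Spades table would return False, since (for example) P(S=0, J=4) = 0 while P(S=0) · P(J=4) > 0.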
We’ve covered a lot of ground today. While we’ll likely come back to many of these issues in later posts, it’s time to begin using these tools to test our hypotheses on collected data.
As I’ve suggested before, Combinatorics problems like our card example are tricky, and frankly not one of my strengths. Fortunately, some folks at TalkStats were kind enough to help me with my counting. Check out the threads in the Probability forum for more insight on these calculations.