What is a CDF?


Let's start with a histogram:

A histogram is a bar graph which shows us what proportion of our data falls within each of a set of data ranges (sometimes called 'bins'). For example, we could easily get a sense of how our observations are distributed in terms of age, if we type:


histogram age, bin(50)


You can see that many of our observations are at younger ages (below 40).
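By default Stata scales the bars of a histogram as densities, so the heights are not proportions themselves. To read each bar's height directly as a share of the sample, the fraction option can be added. A minimal sketch; the stand-in data at the top is hypothetical and only there so the lines run on their own:

```stata
* Hypothetical stand-in data; skip these four lines if the tutorial data set is already in memory
clear
set obs 1000
set seed 12345
generate age = floor(100 * rbeta(2, 5))

* Same histogram as above, with bar heights scaled as proportions of the sample
histogram age, bin(50) fraction
```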

With continuous variables (age is the closest thing to one that we have in our data set), we can draw a smoothed version of the histogram, scaled so that the total area under the curve equals 1.
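One way to draw such a smoothed curve in Stata is a kernel density estimate. This is a sketch rather than the exact command behind the figure here: the kernel and bandwidth are left at Stata's defaults, and the stand-in data is hypothetical.

```stata
* Hypothetical stand-in data; skip these four lines if the tutorial data set is already in memory
clear
set obs 1000
set seed 12345
generate age = floor(100 * rbeta(2, 5))

* Kernel density estimate of age: a smoothed version of the histogram
kdensity age

* Overlay the smoothed curve on the original histogram for comparison
histogram age, bin(50) kdensity
```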



This smoother histogram is known as a probability density function (PDF). Most of us will have seen the familiar bell-shaped curve of the normal PDF in an introductory statistics class. Recall that the height of the PDF at a point shows how densely probability is concentrated around that value; the probability of falling within a range of values is the area under the curve over that range. If we cumulatively add up the area under the PDF from left to right, we get the cumulative distribution function (CDF), as represented below:

 

 

A point on the CDF may be interpreted as follows: looking at the vertical line (xline) plotted at 40 years of age, we can tell that about 80% of our sample is aged 40 or younger.
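A graph like this can be built from the empirical CDF, which Stata computes with cumul. The sketch below is one possible way to do it, not necessarily how the figure here was produced; the variable names cdf_age and at_or_below_40 and the stand-in data are made up for illustration.

```stata
* Hypothetical stand-in data; skip these four lines if the tutorial data set is already in memory
clear
set obs 1000
set seed 12345
generate age = floor(100 * rbeta(2, 5))

* Empirical CDF: for each observation, the share of the sample at or below its age
* (equal gives observations with the same age the same CDF value)
cumul age, generate(cdf_age) equal

* Plot the CDF with a vertical reference line at age 40
line cdf_age age, sort xline(40)

* Read off the CDF at 40 directly: the share of non-missing ages that are <= 40
generate byte at_or_below_40 = age <= 40 if !missing(age)
summarize at_or_below_40
display "Share at or below 40: " r(mean)
```

The mean of the 0/1 indicator is the height of the CDF at 40, so it should match where the curve crosses the vertical line.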

Notice 2 properties of this CDF:

1. its Y-values (or probability values) always lie between 0 and 1: the CDF starts at 0 and rises to 1, because the total area under the PDF sums to 1

2. the function is monotonic: it never decreases. This is because we are keeping a running sum of the area under the PDF to generate the graph, and area can never be negative, so the CDF cannot turn down. The increase is not linear though; here the curve climbs steeply at younger ages, where most of our observations lie, and rises more and more slowly as age increases to 60+.
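Both properties can be written out in symbols. The notation is introduced here for illustration and is not used elsewhere in this module: f is the PDF, F is the CDF, and a < b are any two values of the variable.

```latex
\begin{align*}
F(x) &= \int_{-\infty}^{x} f(t)\,dt,
  \qquad f(t) \ge 0,
  \qquad \int_{-\infty}^{\infty} f(t)\,dt = 1 \\
F(b) - F(a) &= \int_{a}^{b} f(t)\,dt \;\ge\; 0
  \quad \text{for } a < b \\
\lim_{x \to -\infty} F(x) &= 0,
  \qquad \lim_{x \to \infty} F(x) = 1
\end{align*}
```

The second line is the monotonicity argument: the change in F between a and b is an area under a non-negative curve, so it cannot be negative. The third line is the bound between 0 and 1.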

These 2 properties hold not only for the normal distribution, but for the CDF of every other continuous variable as well. The important thing to remember about CDFs is that they are functions which map X-values into probability numbers, and so are bounded between 0 and 1.
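Both properties can also be checked on an empirical CDF with Stata's assert, which stops with an error if a condition fails for any observation. A minimal sketch; the variable name cdf_check and the stand-in data are hypothetical.

```stata
* Hypothetical stand-in data; skip these four lines if the tutorial data set is already in memory
clear
set obs 1000
set seed 12345
generate age = floor(100 * rbeta(2, 5))

* Empirical CDF of age; equal gives tied ages the same CDF value
cumul age, generate(cdf_check) equal

* Property 1: every CDF value lies between 0 and 1
assert cdf_check >= 0 & cdf_check <= 1 if !missing(cdf_check)

* Property 2: sorted by age, the CDF never decreases from one observation to the next
sort age
assert cdf_check >= cdf_check[_n-1] if _n > 1 & !missing(cdf_check)
```

If both assert lines run silently, the properties hold for this sample.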


 
