Zipf's Law
I’ve run across the Zipfian distribution several times recently. Zipf’s law states that for many kinds of data, word counts in particular, the frequency of an item is inversely proportional to its frequency rank. So, for example, the most common word is used about twice as often as the second most common word, three times as often as the third most common word, and so on. This means that if you plot rank vs. count on a log-log plot, you will see a roughly straight line with a slope near -1.
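To make the rank-versus-count idea concrete, here is a minimal Python sketch. It is not the code behind the chart; the function names `rank_frequency` and `loglog_slope` and the simple tokenizing pattern are assumptions made for illustration. It counts words, sorts the counts by rank, and fits a least-squares line to the log-log data.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts ordered from most to least common (rank 1 first)."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [count for _, count in Counter(words).most_common()]


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two positive counts; a slope near -1 is what
    Zipf's law predicts.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic sanity check: counts that follow 1/rank exactly.
ideal_counts = [1200 / rank for rank in range(1, 51)]
print(loglog_slope(ideal_counts))
```

For a real text, one would read the file into a string, pass it to `rank_frequency`, and hand the result to `loglog_slope`. Real word counts only approximate the ideal line, and the fit tends to be worst at the very top and the long tail of the ranking.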
I decided to try it on several of the most common religious texts. The key requirement was that I could download them from Project Gutenberg, one of my favorite websites. Another thing these texts have in common is that they are rarely studied or understood by many of their professing followers. I also tried several other texts, but left them off the chart for simplicity.
This rule also works well for problems outside of linguistics, such as city populations and website visits. I plotted the frequency of hits to different pages on my website and the sizes of cities in the United States, and found fairly good fits in both cases.
Zipf’s law and its later extensions are common and useful enough that there is an R package for dealing with them. The package developers, Stefan Evert and Marco Baroni, posted a good introduction to Zipf’s law and its uses on the zipfR project page. Their presentation materials are also a good example of slide design.