Bas Groothedde

Benford's Law applied to data analysis


Benford's Law applied to actual data


A while ago, I watched a video on Benford's Law by Numberphile (Brady Haran). I was fascinated that there is a law to predict the occurrences of the first digit in numbers.

In a nutshell, Benford's Law states that in a large set of numbers distributed over several orders of magnitude, the probability of a number starting with digit d equals log10(1 + 1 / d). I wanted to put this law to the test on several datasets and see how each one plots on a graph next to the distribution that Benford's Law predicts.

Warning: I wrote this up to document it for myself. It might not be as interesting to you as it is to me, but I think this law is quite cool.

So yeah, what about this law?

As per usual, I shall cite Wikipedia here:

Benford's law, also called the First-Digit Law, is a phenomenological law about the frequency distribution of leading digits in many (but not all) real-life sets of numerical data. That law states that in many naturally occurring collections of numbers the small digits occur disproportionately often as leading significant digits. For example, in sets which obey the law the number 1 would appear as the most significant digit about 30% of the time, while larger digits would occur in that position less frequently: 9 would appear less than 5% of the time. If all digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.

Mathematical probability formula

You can calculate the probability of a first digit with a very simple formula. This formula is easily translated to JavaScript, because it only needs a base-10 logarithm. Let's call the probability p and the digit d; then p(d) = log10(1 + 1 / d). The same formula also gives the probability of a number starting with several given digits, but that's not relevant in my implementation. For example, p(123) is the probability of a number starting with 123 in a large set distributed over many orders of magnitude.
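To make the multi-digit case concrete, here is a minimal sketch. The function name leadingProbability is made up for this example and is not part of my implementation. It also checks a nice consequence of the formula: the probabilities of the prefixes 10 through 19 add up to the probability of the single leading digit 1.

```javascript
// Probability that a number's leading digits are exactly `prefix`
// (e.g. 1, 12 or 123), using the same formula as for a single digit.
// `leadingProbability` is a hypothetical name used for illustration.
function leadingProbability(prefix) {
    if (!Number.isInteger(prefix) || prefix < 1) {
        throw new RangeError("prefix must be a positive integer");
    }
    return Math.log10(1 + 1 / prefix);
}

// Sum the probabilities of the ten two-digit prefixes 10..19.
// Together they cover every number that starts with a 1.
let startsWithOne = 0;
for (let prefix = 10; prefix <= 19; prefix++) {
    startsWithOne += leadingProbability(prefix);
}

// A three-digit prefix is very specific, so its probability is small.
console.log("p(123) =", leadingProbability(123));

// These two values agree up to floating-point rounding.
console.log("sum of p(10..19) =", startsWithOne);
console.log("p(1) =", leadingProbability(1));
```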

In JavaScript

In JS you simply translate this formula into a function, and since the Math object provides log10 (Math.log10), that is not difficult. The pen below contains two functions: p(d) is the probability formula and returns a fraction between 0 and 1, and pp(d) returns the same value as a percentage. The pen also displays the results of pp(d) for d from 1 through 9.

Check out this pen.
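In case the pen does not load, the two functions could look roughly like this. The names p and pp come from the description above; the bodies are a sketch of the formula and are not copied from the pen.

```javascript
// p(d): probability (between 0 and 1) that the first digit is d.
function p(d) {
    return Math.log10(1 + 1 / d);
}

// pp(d): the same probability expressed as a percentage (0 to 100).
function pp(d) {
    return p(d) * 100;
}

// Print the expected share of each first digit, rounded to one decimal.
for (let d = 1; d <= 9; d++) {
    console.log(d + ": " + pp(d).toFixed(1) + "%");
}
```

The nine probabilities add up to 1, because every number has to start with one of the digits 1 through 9.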

Real-data put to the test

All this information about Benford's Law is quite interesting, but how does it apply to data collected from real-life situations? The numbers do have to be distributed over several orders of magnitude, so you can't simply test a list of numbers within the range 1 to 10. I have tested some interesting datasets; the results are displayed in the pen at the bottom of this post.
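Testing a dataset comes down to tallying first digits. The sketch below shows one way to do that; firstDigit and firstDigitShares are names invented for this example, the sample values are made up, and the code in the pen may work differently.

```javascript
// First significant digit of a number, ignoring the sign and any leading zeros.
// Returns null for 0, NaN and Infinity, which have no usable first digit.
function firstDigit(value) {
    const n = Math.abs(value);
    if (!Number.isFinite(n) || n === 0) return null;
    // Exponential notation always starts with the first significant digit,
    // e.g. 1234 becomes "1.234e+3" and 0.042 becomes "4.2e-2".
    return Number(n.toExponential()[0]);
}

// Percentage of values starting with each digit.
// The result has 10 entries; index 0 is unused so that index d matches digit d.
function firstDigitShares(values) {
    const counts = new Array(10).fill(0);
    let total = 0;
    for (const value of values) {
        const digit = firstDigit(value);
        if (digit === null) continue; // skip values without a first digit
        counts[digit]++;
        total++;
    }
    return counts.map(count => (total === 0 ? 0 : (count / total) * 100));
}

// Made-up sample values, only to show the call.
const sample = [1024, 187, 23, 4096, 150000, 9, 0.042, 12];
console.log(firstDigitShares(sample));
```

A real test needs far more values than this sample, spread over several orders of magnitude.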

If you want to skip my opinions and descriptions of the results, just go to the graphs! They might be more interesting than my own weird interpretation of things.

Population data

Of all datasets analysed, this one had the smallest average difference from the Benford's Law probability. The data consists of the population counts in 2014. It is quite interesting to see that the law applies to some numbers at all!
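One plausible way to compute such an "average difference" is the mean absolute difference, in percentage points, between the observed and the expected share of each digit. The sketch below assumes that definition; the exact metric used for the charts may differ, and the names benfordPercent and averageDifference are invented here.

```javascript
// Expected share (in percent) of first digit d according to Benford's Law.
function benfordPercent(d) {
    return Math.log10(1 + 1 / d) * 100;
}

// Mean absolute difference, in percentage points, between observed first-digit
// shares and Benford's Law. `observed` holds 9 percentages for the digits 1..9
// (index 0 is digit 1). This definition of "average difference" is an assumption.
function averageDifference(observed) {
    if (observed.length !== 9) {
        throw new RangeError("expected 9 percentages, one per digit 1..9");
    }
    let sum = 0;
    for (let d = 1; d <= 9; d++) {
        sum += Math.abs(observed[d - 1] - benfordPercent(d));
    }
    return sum / 9;
}

// A perfectly uniform distribution (about 11.1% per digit) for comparison.
const uniform = new Array(9).fill(100 / 9);
console.log(averageDifference(uniform));
```

The lower the value, the closer a dataset follows the law; a dataset matching it exactly would score 0.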

File sizes of arbitrary directory

This dataset had the second smallest average difference from the Benford's Law probability. This data consists of the file sizes of a large arbitrary directory on my system (e.g. Pictures, Documents, Projects) in bytes. I was quite surprised that this set scored so well on the analysis.

The distance from earth to 119,613 stars in parsecs

This data was processed from a database, and only the first digit of each distance was stored, as that is all we really need. If you need the actual data, check out the source. Once again, the distribution is amazing, and Benford's Law applies here very well.

PostModernJukebox's YouTube views

I wrote a little script to fetch the videos page from the YouTube channel of the PostModernJukebox project and process all the video view counts. The data is from the 29th of June, 2015. This dataset is less diverse and its numbers are a little less widely distributed. You can see this in the chart, as it does not completely follow Benford's Law, but it's still quite close.

Website's blog pageload count

This is a per-country pageload count on blog posts from an arbitrary website. There was a lot of diverse data in this one, and all of it was distributed over many orders of magnitude. I'm not sure why this set didn't quite follow Benford's Law, but I think search engines loading pages quite often might be involved (those loads were included in the data). You could see this as 'fraud', and Benford's Law is occasionally used to detect fraud.

Prime numbers up to 1,234,567

I can be very brief on this: we do not fully understand the distribution of prime numbers yet. We're never sure where the next one will be on the number line, so I can't quite explain this chart. It still follows Benford's Law slightly, but the result really depends on the upper limit of the prime sieve I used.
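The dependence on the upper limit is easy to try out. The sketch below uses a plain Sieve of Eratosthenes, which is an assumption; the sieve used for the chart may be implemented differently. The names primesUpTo and firstDigitCounts are invented for this example.

```javascript
// Sieve of Eratosthenes: returns all primes <= limit.
function primesUpTo(limit) {
    const isComposite = new Uint8Array(limit + 1);
    const primes = [];
    for (let n = 2; n <= limit; n++) {
        if (isComposite[n]) continue;
        primes.push(n);
        // Mark the multiples of n, starting at n * n, as composite.
        for (let m = n * n; m <= limit; m += n) {
            isComposite[m] = 1;
        }
    }
    return primes;
}

// Count how many of the given positive integers start with each digit 1..9.
// The result has 9 entries; index 0 is digit 1.
function firstDigitCounts(numbers) {
    const counts = new Array(9).fill(0);
    for (const n of numbers) {
        counts[Number(String(n)[0]) - 1]++;
    }
    return counts;
}

// Compare the first-digit counts for a few different upper limits.
for (const limit of [1000, 2000, 1234567]) {
    console.log(limit, firstDigitCounts(primesUpTo(limit)));
}
```

With a limit of 2000, for example, every prime between 1000 and 2000 starts with a 1, which inflates that digit compared to a limit of 1000.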

Disrupt code

All I can say is that I'm surprised the code follows the law this closely. Disrupt is a game I contributed a little to (Steam API and some rendering stuff), and it has been greenlit on Steam. Development is steady but slow; Northrock wants to deliver quality.

Image pixel counts

This didn't work out quite as well as I thought. Perhaps because a pixel count is an area, the 2s tend to come up more; I can't quite explain it. I also believe the values are not distributed over several orders of magnitude: pictures tend to follow a specific dimension, and my Pictures library contains plenty of my own camera shots. The average difference is quite large on this one.

CodePen views

Probably the same reason as the one above: the values are not distributed over enough orders of magnitude.

Random numbers

The random numbers generated by ISAAC, SmallPRNG and the browser's PRNG are used as a control group. When random numbers are generated within a certain range (rather than taken from the raw random data), they shouldn't follow Benford's Law, and the probability should be spread evenly across the digits.
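A control group like this can be sketched with the browser's PRNG alone. The example below uses only Math.random as a stand-in (ISAAC and SmallPRNG are not included), and the range 1 to 999,999 and the sample count are arbitrary choices for illustration.

```javascript
// Uniformly distributed random integer in [min, max], using Math.random.
function randomInteger(min, max) {
    return min + Math.floor(Math.random() * (max - min + 1));
}

// Draw `sampleCount` integers from [1, 999999] and return the percentage of
// samples starting with each digit 1..9 (index 0 is digit 1).
function randomFirstDigitShares(sampleCount) {
    const counts = new Array(9).fill(0);
    for (let i = 0; i < sampleCount; i++) {
        const n = randomInteger(1, 999999);
        counts[Number(String(n)[0]) - 1]++;
    }
    return counts.map(count => (count / sampleCount) * 100);
}

// Each share should land close to 100 / 9 (about 11.1%), not on Benford's curve.
console.log(randomFirstDigitShares(90000).map(share => share.toFixed(1) + "%"));
```

The range matters here: 1 to 999,999 contains exactly as many numbers starting with each digit, so a uniform draw gives every digit the same chance.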

Graphs


Check out this pen.


If this was interesting to you, hooray! If not, that's okay. Thanks for reading!

