In the past two weeks, I've been completing a data mining project in Python. In the project, I implemented Naive Bayes in addition to a number of preprocessing algorithms. As this has been my first deep dive into data mining, I have found many of the math equations difficult to intuitively understand, so here's a simple guide to one of my favorite parts of the project, entropy based discretization.
What's the issue here?
Data is messy. But we want to know things about our data. It's tough.
We have this problem of trying to clean it up and put it into usable formats for our fancy data mining algorithms. One of the issues is that many algorithms, like Decision Trees, only accept categorical variables, so if you have some age attribute or another continuous variable, the algorithm cannot make sense of it.
In other words, we need to take this continuous data and "bin" it into categories. Sweet. So we can just randomly choose where to cut our data, right? Nope. Well, you could, but it's a bad idea. Here's why.
Let's say you split at the wrong point and have uneven data. For example, you're trying to determine something like risk of alzheimers disease and you split age data at age 16, age 24, and age 30. Your bins look something like this:
 <=16
 16...24
 24...30
 30+
Now you have a giant bin of people older than 30, where most Alzheimers patients are, and multiple bins split at lower values, where you're not really getting much information.
Because of this issue, we want to make meaningful splits in our continuous variables.
That's where entropy based discretization comes in. It helps us split our data at points where we will gain the most insights once we give it all to our data mining algorithms.
What is this entropy thing?
Entropy. It's also called Expected Information. That's what we call this value which essentially describes how consistently a potential split will match up with a classifier.
For example, let's say we're looking at everyone below the age of 25. Out of that group, how many people can we expect to have an income above 50K or below 50K? Lower entropy is better, and a 0 value for entropy is the best. For example, of 10 people below 25, let's say that we have 6 that make above 50K each year and 4 that make below 50K each year. It looks something like this:
Income<=50K  Income > 50K  

Age <25  4  6 
This would have a high entropy value(closer to 1).
If we're trying to determine income based on age, we ask ourselves:
Q: Would knowing someone is below 25 give us good information on their income?
A: Probably not. Our training data indicates that 6/10 people below 25 make more than 50K, and the rest make below. This tells us nothing.
Now let's change the game a bit. Let's assume that, of those people below 25, 9 make below 50K and the last one makes above 50K. Now, our simple histogram looks like this:
Income<=50K  Income > 50K  

Age <25  9  1 
This would give us a lower entropy value because it gives us more information on what we're trying to discover(income).
Now that we understand the basic idea of lower entropy good, higher entropy bad, let's throw in a little bit of math. Here is the equation for entropy:
So what does this mean? I'll break it down step by step. So we have a summation that goes from 1 > m, where m is the number of classifier values we have. In this case, m is 2 because we have 2 options: income <=50K and income >50K.
Next, we have the p value. This is the probability of getting a specific classifier given the bin you're looking at. So if you have 10 rows of data, and 5 of them fall under income <=50, the probability of that would be 5/10 or 1/2. You can just count.
So that's all we need for this part. Let's take another look at our prior examples and calculate the entropy for each.
In our first example, we had 4 samples below 50 and 6 above. Let's calculate entropy:
Income<=50K  Income > 50K  

Age <25  4  6 
Now, let's see the second example:
Income<=50K  Income > 50K  

Age <25  9  1 
Wow! We went from .971 to .469. HUGE improvement. Even better, if we had 10 in one category and 0 in the other, we have perfect entropy0.
Now, let's talk about how entropy fits into the broader scheme of entropybased discretization.
Entropy Gain: The Foundation of Discretization
At a broad level, entropybased discretization performs the following algorithm:
 Calculate Entropy for your data.
 For each potential split in your data...
 Calculate Entropy in each potential bin
 Find the net entropy for your split
 Calculate entropy gain
 Select the split with the highest entropy gain
 Recursively (or iteratively in some cases) perform the partition on each split until a termination criteria is met
 Terminate once you reach a specified number of bins
 Terminate once entropy gain falls below a certain threshold.
If that sounds overwhelming, don't worry! we're gonna walk through it all now.
One thing you may notice is that there is a new term here called "entropy gain." In case you didn't already guess, that basically refer to how much entropy you gain by splitting a data set into two bins.
As we mentioned, lower entropy is better, so Entropy gain is calculated as follows:
If this is confusing you, just think of it this way. We want to perform splits that improve the information we get from our data. Accordingly, we want to perform splits that maximize the improvement to the information we get from our data.
Entropy gain measures that.
So, now you know that we need to find and maximize entropy gain to perform our splits, but how do we find the net entropy between two different data bins and compare it with only one value?(the value is the new E up above) Example time.
Entropy across multiple bins
Let's go back to our example where age predicts income. We have continuous age values and we want to split them. Let's assume again that we have created two bins, split at 25. Let's create a histogram for this as follows:
Income<=50K  Income > 50K  

Age <25  9  1 
Age >=25  4  6 
I used the following numbers because we've already calculated entropy for both data sets!
In this case, our information(using the calculations above) for age <25 will be .469 and our information for age >=25 will be .971. To combine these two values into one net entropy, we simply take the proportion of each in our overall age bin.
Overall, we have 20 samples. Conveniently, 10 fall in the age <25 category and the other 10 fall in the age >25 category. This means that, as a proportion of total entropy, age <25 counts for 10/20 and age >25 accounts for 10/20. This would apply if your bins were unequal sizes as well. In general, we can state the net entropy with the following equation:
In other words, our information across two bins will be equal to the proportion of the bin's size multiplied by that bin's entropy.
For our example, that looks like this:
That will give us a single value that we can compare to our initial value. Now that we've established entropy and entropy gain, let's bring it together with an example.
Example
To bring this together, let's look at an example of this in practice. For the sake of proving that this doesn't need to just predict income based on age, let's do a new example. This time we'll do hours studied as our continuous variable and A on the test as our classifier. The data looks like this(presorted for your convenience):
Hours Studied  A on Test 

4  N 
5  Y 
8  N 
12  Y 
15  Y 
In this case, we have data relating the number of hours various students studied in an attempt to determine its effect on their test performance. We want to discretize this data, so let's start by calculating entropy of the data set itself:
A on Test  Lower than A  

Overall  3  2 
For all of our samples, the entropy to beat is .971! Now, let's iterate through and see which splits give us the maximum entropy gain. To find a split, we average two neighboring values in the list.
Split 1: 4.5
Starting off, we split at 4.5 ((5+4)/2). Now we get two bins, as follows:
A on Test  Lower than A  

<=4.5  0  1 
>4.5  3  1 
Now, we calculate entropy for each bin and find the information gain of this split:
Now net entropy is:
And our gain is:
Looking good! Our maximum entropy gain is now .322. That's good, but we still need to go through the rest.
Split 2: 6.5
Average our next two values, and we get 6.5. Now we repeat the process for this split:
A on Test  Lower than A  

<=6.5  1  1 
>6.5  2  1 
Again, calculate the entropy for each bin:
Wow! This one is looking bad already, but let's finish it.
This is less gain than we had before, so our best split point is still at 4.5. Let's check the next one now.
Split 3: 10
Average our next two values, and we get 10. Our split now looks like this:
A on Test  Lower than A  

<=10  1  2 
>10  2  0 
Finally, we calculate gain once more.
This is the clear winner at this point!
Split 4: 13.5
And for our final potential split, we split at 13.5. Now our splits look like this! For the sake of brevity, I won't calculate this one, but we can infer based on the data below that it would have poor entropy gain because the first bin has a 50/50 distribution.
A on Test  Lower than A  

<=13.5  2  2 
>13.5  1  0 
Choose the split
After calculating everything, we find that our best split is split 3, which gives us the best information gain of .421. We will partition the data there!
According to the algorithm, we now can further bin our attributes in the bins we just created. This process will continue until we satisfy a termination criteria.
When to Terminate
There are two popular options for stopping the algorithm:

Terminate when a specified number of bins has been reached. This makes sense when you value the degree to which you understand your data. A data set with 3 bins is much easier to think about than one with 12.

Terminate when information gain falls below a certain threshold. This is another common method. If information gain drops below a certain value, the algorithm can terminate early. There are a number of ways to calculate what the minimum threshold should be, and it can also be determined empirically through testing.
Oftentimes, both of these criteria will be used in conjunction to yield optimal results.
Summary
Entropy is a fundamental concept in Data Mining that is used far beyond simple discretization of data. These approaches are also used for decision trees and rulebased classifiers, so understanding it is definitely a useful tool to have in your toolbelt.
That's all for now!