Data Mining Discrimination: A Hypothetical Case Study

Since I wrote my last post on algorithmic discrimination, I've been asked how these discriminatory patterns actually appear in a computer system. The processes that underlie discrimination and technological exclusion are varied, but in the realm of data mining, some clear pathways exist. I'd like to introduce a hypothetical case study to demonstrate how discriminatory and exclusionary policies can be embedded into a technological strategy.

To see how this happens, I'll take a cue from the work of Solon Barocas and Andrew Selbst, who, in “Big Data’s Disparate Impact,” outlined how discrimination can arise at three stages of the data mining process:

  1. Defining target variables
  2. Constructing training data
  3. Feature selection

The Case

You're a marketing executive at Salads and More, a hypothetical salad shop with three branches throughout Washington, DC. Revenues have climbed over the past three months, but you've noticed that new customer acquisition has slowed lately.

Suddenly, you're hit with a stroke of genius: you'll send coupons to people throughout the city in packs of three, each allowing the recipient to buy one salad for 50 percent off. From an external data set, you've obtained demographic data on people across the city, and you want to send coupons to the people most likely to spend money at the store over their lifetime.

How could we guess who will spend the most money in the long run if they start coming to our business? Easy! We already have data about our current customers, including their demographics and how much they've spent. Obviously, we won't know how much money new customers will spend, but we can use data mining to predict who will be in the top 20 percent of spenders.

Our Approach: Target Variables

We want to find potential customers who would be likely to spend large amounts of money at our salad shop. To do this, we should look at our current data and see who is in the top 20% of spenders at the store. The system should then predict whether a potential customer would, in the long term, spend enough money to warrant receiving a coupon. To keep it simple, let's say that our data mining system will be given a customer's demographic data and will decide whether they receive a coupon.

That would look something like this:

| Name | Zip Code | Gender | Age | Receive Coupon (Y/N) |
| --- | --- | --- | --- | --- |
| Kevin Meurer | 20007 | Male | 21 | ? |

After the data mining model processes this record, it decides whether I receive a coupon:

| Name | Zip Code | Gender | Age | Receive Coupon (Y/N) |
| --- | --- | --- | --- | --- |
| Kevin Meurer | 20007 | Male | 21 | No |

If I were predicted to be in the top 20% of spenders, I would receive one.
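
To make this concrete, here's a minimal sketch of what that decision step could look like in Python. The `predict_coupon` function is a hypothetical stand-in for a trained model; the next two sections cover how such a model would actually be built.

```python
def predict_coupon(record: dict) -> str:
    """Given a prospect's demographics, return 'Y' or 'N':
    whether the model predicts a top-20% spender.

    A real model would be trained on labeled customer data;
    this stub just returns a fixed answer as a placeholder.
    """
    return "N"

# The demographic data we have for one potential customer.
prospect = {"name": "Kevin Meurer", "zip_code": "20007",
            "gender": "Male", "age": 21}

print(predict_coupon(prospect))  # -> N, so no coupon
```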

Discrimination and Exclusion Introduced at this Stage

Believe it or not, we've already introduced a form of exclusion simply through what we're looking for. Ultimately, we'll give coupons to the people predicted to be in the top percentage of spenders. While that choice may not feel discriminatory on its face, it means the coupons will go to people who already have money to spend at an expensive salad shop like Salads and More. We're excluding poorer people from the very start.

Our Approach: Training Data

As mentioned earlier, we can build training data from our existing customers. Certain customers use a smartphone app when they buy their salads, and from the app we can collect data about them: name, zip code, gender, and birthday (so we know each customer's age). Because they pay with the app, we can also track how much they've spent. That would look something like this:

| Name | Zip Code | Gender | Age | Money Spent ($) |
| --- | --- | --- | --- | --- |
| Customer 1 | 20007 | Female | 28 | 960.56 |
| Customer 2 | 20007 | Female | 19 | 400.32 |
| Customer 3 | 20057 | Male | 24 | 23.75 |
| Customer 4 | 20024 | Female | 25 | 91.24 |
| Customer 5 | 20047 | Male | 37 | 30.45 |
| Customer 6 | 20036 | Male | 57 | 692.46 |
| Customer 7 | 20007 | Female | 24 | 884.87 |
| Customer 8 | 20003 | Male | 63 | 596.24 |
| Customer 9 | 20024 | Male | 27 | 342.69 |
| Customer 10 | 20036 | Male | 54 | 584.62 |

Because this is training data, we have to label it in order for it to be effective. Remember, we're trying to train our model to predict whether or not a potential customer will be in the top 20% of spenders. As a result, the previous data will be coded as follows:

| Name | Zip Code | Gender | Age | Money Spent ($) | Receives Coupon (Y/N) |
| --- | --- | --- | --- | --- | --- |
| Customer 1 | 20007 | Female | 28 | 960.56 | Y |
| Customer 2 | 20007 | Female | 19 | 400.32 | N |
| Customer 3 | 20057 | Male | 24 | 23.75 | N |
| Customer 4 | 20024 | Female | 25 | 91.24 | N |
| Customer 5 | 20047 | Male | 37 | 30.45 | N |
| Customer 6 | 20036 | Male | 57 | 692.46 | N |
| Customer 7 | 20007 | Female | 24 | 884.87 | Y |
| Customer 8 | 20003 | Male | 63 | 596.24 | N |
| Customer 9 | 20024 | Male | 27 | 342.69 | N |
| Customer 10 | 20036 | Male | 54 | 584.62 | N |

As you can see, we've already decided that only Customer 1 and Customer 7 would be eligible, so their characteristics will help define who gets a coupon.
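
For concreteness, here's a minimal sketch of that labeling step in pandas, under the assumption that "top 20%" means spending at or above the 80th-percentile level among current customers. The column names are my own invention; the numbers come straight from the table above.

```python
import pandas as pd

# The training data from the table above.
customers = pd.DataFrame({
    "zip_code": ["20007", "20007", "20057", "20024", "20047",
                 "20036", "20007", "20003", "20024", "20036"],
    "gender":   ["Female", "Female", "Male", "Female", "Male",
                 "Male", "Female", "Male", "Male", "Male"],
    "age":      [28, 19, 24, 25, 37, 57, 24, 63, 27, 54],
    "money_spent": [960.56, 400.32, 23.75, 91.24, 30.45,
                    692.46, 884.87, 596.24, 342.69, 584.62],
})

# Label the top 20% of spenders: anyone at or above the
# 80th-percentile spending level gets a coupon ("Y").
threshold = customers["money_spent"].quantile(0.8)
customers["receives_coupon"] = customers["money_spent"].apply(
    lambda spent: "Y" if spent >= threshold else "N")

print(customers["receives_coupon"].tolist())
# ['Y', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'N', 'N']
```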

Discrimination and Exclusion Introduced at this Stage

Our decision to tag training data based on the top 20% of spenders at our store, rather than some other feature (such as customer loyalty, measured in years as a customer), has introduced discrimination, as mentioned above. Because it is based on total spending, it is also disadvantageous to the business, which might actually benefit from a different criterion.

What's more, our training data is derived only from people who actually use the smartphone app. As Jonas Lerman described in "Big Data and Its Exclusions," data collection that depends on smartphone ownership tends to exclude poorer people, who are less likely to own such devices.

Our Approach: Feature Selection

The final consideration in the design of our data mining algorithm is feature selection. This is one of the most essential and potentially most discriminatory parts of the data mining process. Assuming we are looking at demographic data with all sorts of potential features, we want to determine which of these features will be most predictive in this case.

These features will serve as the basis for training our data mining model, which will be used to predict whether a customer will receive a coupon (based on whether they are likely to spend large amounts of money). For simplicity, let's say we have a fairly limited data set, so we select basic features: age, gender, and zip code. Note that such algorithms could take a variety of other features, many of which would not necessarily be predictive.

These features are fed to our algorithm, which will then use them to predict spending levels for a person who is not yet a customer. To get a sense of how these features will affect the actual output, let's take a look at the customers in our training data who would receive a coupon (because they are in the top 20% of spenders).

| Zip Code | Gender | Age | Money Spent ($) | Receives Coupon (Y/N) |
| --- | --- | --- | --- | --- |
| 20007 | Female | 28 | 960.56 | Y |
| 20007 | Female | 24 | 884.87 | Y |

We can already see some similarities here. Both customers are female and in their twenties, despite the fact that the majority of our customers are men. They also both happen to be from the 20007 zip code, which sits right in the middle of Georgetown, a wealthy neighborhood. There's already a problem with these features.

Let's take a look at the rest of our customers now.

| Name | Zip Code | Gender | Age | Money Spent ($) | Receives Coupon (Y/N) |
| --- | --- | --- | --- | --- | --- |
| Customer 2 | 20007 | Female | 19 | 400.32 | N |
| Customer 3 | 20057 | Male | 24 | 23.75 | N |
| Customer 4 | 20024 | Female | 25 | 91.24 | N |
| Customer 5 | 20047 | Male | 37 | 30.45 | N |
| Customer 6 | 20036 | Male | 57 | 692.46 | N |
| Customer 8 | 20003 | Male | 63 | 596.24 | N |
| Customer 9 | 20024 | Male | 27 | 342.69 | N |
| Customer 10 | 20036 | Male | 54 | 584.62 | N |

Notice that most of the customers in this group are male, and that they have spent, on average, a sizable amount of money at the store; yet none of them will receive a coupon. Their zip codes are naturally more varied.
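
Putting the pieces together, here's a minimal sketch of the training and prediction steps using scikit-learn, building on the labeled `customers` DataFrame from the earlier sketch. The decision tree and the one-hot encoding are my own choices for illustration; any common classifier would pick up the same skewed pattern from this data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Reuses the labeled 'customers' DataFrame from the earlier sketch.
features = customers[["zip_code", "gender", "age"]]
labels = customers["receives_coupon"]

# One-hot encode the categorical features, pass age through
# unchanged, and fit a small decision tree on the ten customers.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"),
          ["zip_code", "gender"])],
        remainder="passthrough")),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
model.fit(features, labels)

# Score a hypothetical prospect. With training data this skewed,
# the tree effectively learns "woman in her twenties in 20007"
# as the rule for "Y" -- exactly the pattern described below.
prospect = pd.DataFrame([{"zip_code": "20007",
                          "gender": "Female", "age": 26}])
print(model.predict(prospect))  # most likely ['Y']
```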

Discrimination and Exclusion Introduced at this Stage

Now we have introduced even more discriminatory behavior into our model. Selecting gender as a feature has made the coupons far more likely to go to women than to men. Similarly, we have isolated a single zip code as the marker of high spending. Given that neighborhood is a predictor not only of income but also of race, the results will almost certainly be discriminatory.

The Result

As you have likely guessed, the result of this simple example will be a highly discriminatory system. The coupons, which offer significant value, will go predominantly to a single demographic: women in their twenties who live in the Georgetown neighborhood. Notably excluded are poorer residents, men (who actually make up the majority of customers in this case), and anyone not in their twenties.

To dig a bit deeper, let's look at the core issues at play here.

Responses: Exclusion, Due Process, and Disparate Impact

In the case study outlined above, the coupon system selects people for a distinct benefit based on demographic characteristics, and in doing so it excludes people along every one of those same dimensions: income, gender, age, and neighborhood.

As Jonas Lerman argues through his concept of "data antisubordination," institutions should work to minimize the exclusionary and discriminatory effects of their data practices, and legislation on this subject would have to extend to the private sector to be truly effective. Privacy legislation alone, however, isn't enough to ensure that nobody gets excluded.

Solon Barocas and Andrew Selbst note similar legislative challenges in "Big Data's Disparate Impact." Their close analysis of Title VII suggests that, for most big data analysis engines, employers are not technically liable for discriminatory systems. There are also challenges in identifying which features will lead to discriminatory outcomes: at what point is zip code a discriminatory feature? They describe these decisions as a determination of whether a value is "sensitive" or objective.

A final difficulty arises with due process. We, as humans, have a tendency to accept the output of a computer system through what Danielle Keats Citron calls "automation bias." As she outlines in "Technological Due Process," automated systems can lead to flawed decision-making, especially when they are set up so that no human checks the results. In the case above, would we, as the marketing executive, ever realize that the coupon recipients overwhelmingly fit a single profile?

Discrimination or Good Business?

Let's imagine that the model we outlined in this case performed as well as a model possibly could: customers who received the coupons consistently became long-term revenue generators for the business. Would we think of this as discrimination, or just good business?

Perhaps sending coupons to different groups would capture a new customer base; it is difficult to tell. These considerations warrant further thought, however, as data mining becomes more accessible to business owners large and small. I think this tension was captured best by Solon Barocas and Andrew Selbst:

> The major justification for reliance on formal disparate treatment is that prejudice is simply irrational and thus unfair. But if an employer knows that his model has a disparate impact, but it is also his most predictive, the argument that the discrimination is irrational loses any force. Thus, data mining may require us to reevaluate why and whether we care about not discriminating.

The difficulties posed by discrimination of this sort represent serious challenges for our society as a whole. We need to decide at what point the exclusion and discrimination inherent to big data outweigh business utility.
