Thursday, April 10, 2014

Disparate impact and data science

In several recent blog posts (for example here), I have discussed the doctrine of disparate impact. Wikipedia defines disparate impact as follows:

    In United States employment law, the doctrine of disparate impact holds that employment practices may be considered discriminatory and illegal if they have a disproportionate "adverse impact" on members of a minority group. Under the doctrine, a violation of Title VII of the 1964 Civil Rights Act may be proven by showing that an employment practice or policy has a disproportionately adverse effect on members of the protected class as compared with non-members of the protected class.

Of course, the doctrine of disparate impact is applied in areas well beyond employment practices. For example, as reported in Bank News, it is often applied to lending practices:

    Under the act, it is unlawful for a creditor to discriminate against any protected class on the basis of race, color, religion, national origin, sex or marital status, age or source of income. The disparate impact theory enables enforcement agencies to prove lender discrimination via a regression analysis of statistical variations in loan terms between borrowers as evidence that a lender illegally facially discriminated against a protected class, even without a showing of discriminatory underwriting criteria.
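
To see concretely what such a regression analysis might look like, here is a minimal, hypothetical sketch: regress a loan term (the interest rate) on an indicator of protected-class membership plus underwriting controls, and examine the coefficient on the indicator. The synthetic data, variable names, and use of statsmodels below are assumptions chosen purely for illustration, not a description of any agency's actual procedure.

    # Hypothetical sketch of a disparate-impact regression test on loan terms.
    # All data is synthetic; the variable names and effect sizes are invented.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 2000

    protected = rng.integers(0, 2, n)          # 1 = member of a protected class
    credit_score = rng.normal(680, 50, n)      # underwriting control
    loan_to_value = rng.uniform(0.5, 1.0, n)   # underwriting control

    # Synthetic interest rate driven only by the underwriting controls here.
    rate = 8.0 - 0.005 * credit_score + 2.0 * loan_to_value + rng.normal(0, 0.3, n)

    X = sm.add_constant(np.column_stack([protected, credit_score, loan_to_value]))
    result = sm.OLS(rate, X).fit()

    # A statistically significant coefficient on `protected` is the kind of
    # statistical variation the quoted theory treats as evidence.
    print(result.summary(xname=["const", "protected", "credit_score", "loan_to_value"]))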

Disparate impact analysis is on a collision course with data science and big data processing.

Suppose, for example, that you have an algorithm that mines data in an attempt to determine the set of features that best predicts whether a prospective borrower will default on a loan or commit some kind of fraud. (Such algorithms are a well-known part of machine learning and statistics and are referred to under the general heading of feature selection.) Suppose further that the algorithm, presumably operating quite impartially and without human intervention, discovers that the features most predictive of default are gender, race, and marital status. That is, the algorithm may find that if you are a single black mother, your predicted probability of defaulting or committing fraud is high, whereas if you are a married white male, it is low. What, then, are you supposed to do with the model?
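
As a concrete (and purely hypothetical) sketch of what such a feature-selection step looks like, the following scores candidate features by their mutual information with the default label. The synthetic data, the feature names, and the scikit-learn usage are illustrative assumptions, not real lending data.

    # Hypothetical feature-selection sketch: rank candidate features by how much
    # information each carries about the target label (loan default).
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    n = 5000

    # Invented applicant features, all encoded as small integers.
    X = np.column_stack([
        rng.integers(0, 2, n),   # gender
        rng.integers(0, 2, n),   # marital_status
        rng.integers(0, 3, n),   # race
        rng.integers(0, 5, n),   # income_bracket
        rng.integers(0, 4, n),   # credit_history_length (binned)
    ])
    feature_names = ["gender", "marital_status", "race",
                     "income_bracket", "credit_history_length"]

    # Synthetic default label, correlated with a few features purely for illustration.
    logits = 0.8 * X[:, 0] + 0.6 * X[:, 1] - 0.4 * X[:, 3] - 1.0
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

    # Score each candidate feature by its mutual information with the default label.
    scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    for name, score in sorted(zip(feature_names, scores), key=lambda t: -t[1]):
        print(f"{name:>24}: {score:.4f}")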

I have actually heard professional data scientists say that they would suppress consideration of these features in the predictive model. In other words, data scientists are being forced to pervert software algorithms so that they produce corrupt results, simply because they are afraid that the uncorrupted results will be politically unacceptable and subject them to attacks from the race and gender Stasi. There are even some misguided data scientists who willingly embrace the corruption of their science as the price that must be paid for the advancement of certain "protected groups."
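
In practice, the "suppression" described above often amounts to nothing more than deleting the protected attributes from the training data before a model is fit. The following is a hypothetical sketch with invented column names.

    # Hypothetical illustration of suppressing protected attributes: the columns
    # are simply removed before any model ever sees them.
    import pandas as pd

    applicants = pd.DataFrame({
        "gender":         ["F", "M", "F", "M"],
        "race":           ["B", "W", "W", "B"],
        "marital_status": ["single", "married", "married", "single"],
        "credit_score":   [640, 720, 700, 650],
        "income":         [38000, 82000, 67000, 41000],
    })

    protected = ["gender", "race", "marital_status"]
    training_features = applicants.drop(columns=protected)
    print(training_features.columns.tolist())   # ['credit_score', 'income']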

This is the corrupt state of affairs we have arrived at because of policies like disparate impact, which promote race- and gender-based prejudice over scientific understanding. Imagine what would happen in a field like physics if scientists distorted their results to achieve such political ends.
