The Emerging Theory of Algorithmic Fairness

>>Hey, welcome everyone. I’m delighted this
afternoon to introduce our AI distinguished
lecturer Cynthia Dwork, who many of you know. Cynthia is one of the most
accomplished technical women who I’ve ever been around. She has four fancy titles. Right now, she’s
the Gordon MacKay Professor at Harvard in the Computer
Science department. She’s also the Radcliffe
Alumna Professor at the Radcliffe Institute
for Advanced Studies, and she’s an Affiliate Faculty
at the Harvard Law School. In addition to that, she’s a distinguished Scientist
at Microsoft Research. She’s was one of
the inaugural class of distinguished scientists
here about a decade or so ago, maybe
a little bit more. Cynthia received her PhD at
Cornell with John Hopcroft. From there she did a couple of year Postdoc with
Nancy Lynch at MIT, and then went on to
spend a fair amount of time at IBM Almaden, where she did a lot of foundational work that
I was not all that familiar with till earlier today. The more I knew about
that she did there was something on
rank aggregation, which was very important
in web search. She also did work on the
cryptography foundations, as well as proof of work, which is one of the underlying
foundations of Bitcoin. At MSR, she is well known for
two other strands of work. One is on differential privacy, putting privacy on
theoretical firm ground, and it’s so incredibly
important today. What she’ll talk about
with us this afternoon is more nascent work on
algorithmic fairness. This work, and her lifelong
accomplishments have been acknowledged by more prizes, and I name, more prizes, and the fancy titles I
mentioned at the outset. She’s obviously
a fellow of the ACM. She won the Dijkstra Prize. More than a decade ago, she won the Go del prize. She’s one of the very few
people who are members of both the National
Academy of Science, and National Academy
of Engineering. She’s an amazingly
a depth scientist, and one who picks problems that have real practical impact, which I think is
a really unique skill. So, with that introduction, I’d like to introduce Cynthia, who will tell us about some of her work on algorithmic fairness.>>Thank you very much for
that gracious introduction. Thank you for inviting me, and I’m really very
happy to be here. I have a couple of questions before I proceed with the talk. So, first of all,
how many people here are working in Machine
Learning and AI? Okay. How many people here have worked on
fairness particularly? All right, and any theoretical
computer scientists? Great. All right, good. That helps me figure out
what I should be saying. This is a research effort that actually began at Microsoft, now defunct Silicon Valley Lab. Started in approximately 2010. I should mention that all
of the good slides in the talk are due to my student
Cristina Elemental. Here’s our problem. We’re interested in
algorithmic fairness, will be more specific in
a minute. We have a population. Our population is diverse
in many different ways, ethnic diversity,
religious diversity, geographic, medical, and so on. Well, whatever it
is we want to do, we want to be sure that somehow or other
we’re being fair. Typical examples would
be something like this. We have a bank that receives detailed user
information about you, before giving you
your online experience. You surf on over to some bank, and before they
show you anything, they consult with
tracking networks, and find out stuff about you. The concern, and this was
a concern that was raised in a newspaper article in
the Wall Street Journal, in a series that they ran in 2010 called, “What they know.” It was mostly about privacy, but it also raised
the following fairness concern. The concern was that of
the illegal practice of steering minorities to credit card offerings of
less desirable terms. The financial industry
is more heavily regulated than
many other industries, and so this problem
really was a legal one, not just “a fairness one.” Just to dispense with
some obvious things, one suggestion is
of course to hide the sensitive information
from the classifier. Whether you’re
a member of a minority group or not would be hidden. That falls victim to
a couple of things. First of all, there’s red lining. Who knows the term red lining? Great. We’ll see a picture
of it in a minute, and on the web
there’s web lining. This is basically say
when zip code is used as a redundant encoding of race. It’s illegal to
discriminate based on race, it’s not illegal to
discriminate based on zip code. We have this issue now in
general of redundant encoding. Sensitive attributes
can be embedded holographically in
your other attributes. The other is a Birds of
a Feather phenomenon. Has anybody seen
this picture before? This was an undergraduate
research project at a class at MIT. This is not a
full-fledged studying, but essentially what the students found was the following. If you’re male,
and something like five percent of your friends on Facebook are
self-described as gay. Then you’re very
likely to be gay. You may not choose to release this information
about yourself, but from this Birds of a Feather flock
together phenomenon, it can be determined from
the attributes of your friends. Hiding the sensitive attribute
is a little dicing. Now, there’s another problem with hiding the
sensitive information. That is that in some sense culturally aware
algorithms can be more accurate than algorithms
that survive. If you have a culturally
aware algorithm, it can make very good use of this sensitive information
to be more accurate. In a typical example,
and by the way, let me introduce some populations that I’m going to be
using in the talk. We all like food or
most of us like food. Some people like to have their food flavored
with the herb sage, and the rest prefer to have their food flavored
with the herb thyme. So, my groups are going to be the thyme eaters and
the sage eaters. Generally, sage will
be the minority group, but not everything
will require that. I’m going to be
calling them S and T. Maybe you have a situation where hearing voices is a very common religious
experience among the sage eaters, but among the thyme eaters it’s a diagnostic criterion
for schizophrenia. This is a real medical example. So, knowing whether or not
one is a sage eater or a thyme eater permits better classification, more
accurate classification.>>Now, as you all know, in machine learning we
train on historical data. Historical data is often biased, and so we can’t just say train
on the data because we’ll imbibe the biases in the
mother’s milk of training data. This brings up
a really important problem, which is that in general, there is no general source of ground truth for the things
that we’re trying to do. So, we have to figure out how to extract fairness
when we possibly can, keeping this in mind. We can’t just go to
the training data. Now, who knows what
this is a picture of? Okay. All right. This is a picture from
the Sistine Chapel, it’s Michelangelo’s depiction
of the last judgment. So, we have Jesus in the middle, and on his left are the people who are being condemned to hell, and on the right are people who are going
to go to heaven. So, this is a binary
classification situation. You can have a ternary
classification situation. So, here, we have some slides
from brain tissue. Over on the left, this is cancer, on the right, is healthy brain tissue, and in the middle we have
cancerous cells on the left. These cells, they’re not cancerous but they’re
not normal either, they’re affected by the tumor. So, we might want a ternary, a more complicated classification system for something like this. Here’s another
classification problem. I’m calling this a dependent
classification problem. So, let’s say that you’re
trying to hire people. You’re job is not just to distinguish the qualified
from the unqualified, or even gradations among that, you have a more
complicated problem. Maybe there are hundreds of qualified people and
well-qualified people, but you only want to
interview 10 of them because, after all, you have
finite resources. So, you want to fairly select a cohort of
people to interview. So, that’s a classification
problem where you can’t just classify each person individually as to whether they’re
good enough to interview, you have to somehow constrain it. Another kind of classification
problem is scoring. This is something for
evaluating arrested children to determine whether or
not they should be eligible for
detention or release, whether they should be
detained or released. So, it’s a scoring system. You can see sorts of things like, there are points for
the various kinds of offenses that they committed
this particular time. Then, there are
other sorts of points for their prior offense history. There are negative points that come from
mitigating factors. These scores are then converted
into a binary decision, whether to release or to detain. Depending on the definition
that you’re using, fairness in scoring
will or will not result in fairness in the decision. I still haven’t defined anything. Now, we want to have a mathematically
rigorous approach to algorithmic fairness. Historically, there’s
a way of trying to take these big goals and translate them into
a research program that should lead to
some practicable outcome. So, the first item in the
basic paradigm is definitions. What are you trying to do? In our case, this is going to be, what do we mean by fair. So, one way of doing this is that you can define
fairness by thinking about what it is you’re
trying to prevent. So, in a good
cryptographic tradition, we imagine an adversary, and the adversary
is trying to force unfairness and you want
to defend against this. So, what are the various
behaviors for goals that the adversary may have that
you want to defend against? So, you define fairness. Then, what you want to do, start with a definition, then you want to create algorithms, and you want that
your algorithms will enforce whatever this thing
is that you’ve defined. So, once we have
the definition of fairness, then we want to build a
classification algorithm that we’ll be fair according
to that definition. Finally, we need a
composition theorem. So, think about the systems that you might use
that classify people. Let’s say, you’re
determining whether they’re appropriate for
this ad or that ad. You don’t just
classify them once, you classify them over, and over, and over again for all sorts
of different kinds of ads, and many different times of day, and all sorts of
things like that. Similarly, if you’re doing a privacy preserving
data analysis, nobody just wants one privacy
preserving statistic, they want a whole
ton of statistics. Cryptography, you don’t just see one message or one does
digital signature, you’ve got digital signatures on tons of different messages, and encryptions of tons
of different messages. So, you need to prove some sort
of theorem that talks about how these things
combine in practice, and those are called
composition theorems. So, here, what we would want is we would want
to say that if you have a system based on you
know made a fair pieces, then the system somehow
is fair in total. So, this is a general paradigm
that has been successful, particularly in cryptography
and in privacy. So, can we apply
this to fairness? The short answer is, “I think we probably can
but it’s way harder in my experience than in
the other scenarios.” Defining fairness is very, very difficult and very,
very controversial. I don’t believe it’s
going to be possible to come up with a single definition that the technical community can come up with
a single definition that the population as a whole
would say, “Yes that’s it.” I don’t think that’s
going to happen.>>Building algorithms
that’s our bread and butter once we have definitions. There’s been a lot
of work already on building algorithms, some of it very recent
and very exciting. So I’ll talk
a little bit about that, and composition turns
out to be very tricky and so we’ll see a little bit
about that as well. So let’s talk about
defining fairness. I said that we think about what it is
an adversary is trying to do. So think about for example,
digital signature schemes. What do you want from
a digital signature scheme? You’re all going to say
the same thing right away. Nobody should be able to forge, then you have to get into
some details about all of the right quantifiers
about what does it mean to say nobody
should be able to forge, but forgeries are what
you’re trying to prevent. So, what are we trying
to prevent here? So one of our starting scenarios was actually in the context
of advertising. So, advertisers might be biased and we want to
build a platform that will mitigate the biases of
the advertisers and be fair. So, we literally sat
around and said, “Okay, what are some of
the kinds of things, a really nasty ill-intentioned
advertiser might do?” So red lining we talked
about and that’s real. Red lining in loans
was absolutely real. Housing loans for example
in the United States. There’s a fabulous book
you should read by Richard Rothstein
called, The Color of Law, which talks about the effect of the US government
in these biases. There’s reverse tokenism. So let’s say for example, you turn me down
for a loan I say, “Oh you turned me down
because I have curly hair,” and they point
to Sergei and they say, “Obviously, he’s way more
qualified than you are. He has straight hair and
we turned him down too.” The thing is they’re using him as the example for all of
the curly haired people. So that’s a reverse tokenism. Then there’s what
I’m going to call deliberately targeting
the wrong subset of the sage eaters and
I’m going to actually postpone that example
for a slid or two. So we literally just
sat down and wrote a whole bunch of negative things that we wanted to try to prevent. Is this a reasonable approach? Well, I don’t know another one and one of
the advantages with this sort of approach is
let’s say that you build a system that protects against all these things
that we’ve listed. That’s nice. Then somebody
comes along and says, “You missed one,”
you say, “Great.” Now you redesign your system to protect against
the new one as well. In the meantime,
you’ve been protecting against these first
behaviors anyway, so something happened that
was better than nothing. When most people think about fairness the way
I’ve introduced it, they’ll immediately think
in terms of groups and in fact the way I introduced
it with suggestive of that. So what are group fair? Group fairness properties
are basically statistical requirements about the treatment
of different groups. So statistical parity,
which is also known as demographic
parenting, says that, let’s say if we’re just
doing binary classification positive and negative, it says that the demographics
of the people with positive classification
should be the same as the demographics of
the general population. So the same proportions are being positively
classified for sage eaters and time eaters as
in the population as a whole and also the same for the negative classifications. Now, this is nice. When is it really meaningful? As we’ll see, it’s mostly
meaningful in the breach. If you see some
incredible departure from statistical parity, that’s a red flag. Doesn’t mean that things have
been unfair necessarily, but it certainly is a nice place
to go looking to see, why did this happen? So, it can be meaningful in the breach but it
permits what we call, as I said targeting
the wrong subset of S. So I have a spa, it’s a luxurious spa and I really don’t want
those sage eaters at my spa, but I have to
advertise to them in the same proportion as their proportion in
the population as a whole. So what do I do?
How do I be biased? I advertise to the members of S who can’t afford
to come to my spa. So I’ve deliberately targeting the ones who can’t come and therefore I’m ensuring that my spa isn’t gaining
new sage eaters. But among the time eaters, I advertise to the wealthy
and they can come. So demographic
parody, oops it’s not enough and you have to look
at it much more carefully. There’s also this question
of the intersectional cases, which leads to what’s colorfully called
fairness gerrymandering. So, we have our population. I’ve divide it into the
sage eaters and the time eaters, but they’re also the tea drinkers and the coffee drinkers. What we maybe really want is that if we look at the four different
intersection categories, we should have
proportional representation for each of these categories. It would be very unfair
if we only showed the ad to sage eating
coffee drinkers. So you can fill in
demographic groups for where you really clearly want
this intersectional parody. Statistical parody by itself does not protect against this kind
of fairness gerrymandering. So statistical parody was one definition of
fairness for groups, there are other
definitions of fairness for groups and then we have to face the fact that some group fairness properties
are mutually incompatible. So how many of you
know this story? So many of you do but
most of you don’t. So, I have a classifier. So let’s say that we are classifying people
and we are predicting. So we’re classifying as to whether we think you’re
going to do something, we think you’re
going to recidivate versus we don’t think
you’re going to recidivate. So importantly here, there’s information that’s
going to be missing, like the future hasn’t happened yet and whether
someone released from jail recidivates may
depend on many things that happened to them during
their experiences. So one thing you might want reasonably is
that you would have an equal false positive rates across your different
demographic groups. Another might be
that you would want an equal false negative rates across the different
demographic groups. You might also want equal positive predictive
value which is among those who are
predicted to recidivate, how many of them actually do? That’s the positive
predictive value. I chose these three,
these aren’t my only three but these
are very famous, because it’s known that no imperfect classifier
can simultaneously ensure equal false positive
rates across groups and equal false negative rates and equal positive
predictive value, unless the base rate is
the same across groups. So unless the base rate of recidivating is the
same across groups. The reason for this, as Chouldechova points
out in her paper, is that these quantities are
related by this equation. So we could write down for the sage eaters
the false positive rate, the false negative rate and the positive predictive value
and they would satisfy this equation where P would be the base rate among
the sage eaters. We would then write down the same equation
for the time eaters. Now, requiring
equal false positive rates across the two groups means that this term would have to be the
same in the two equations. So then these blue guys have to be the same and that blue term would have to be the same and the only way
this could happen is if the base rate P is the
same in the two groups. When the future is unknown, so therefore classification
would be imperfect, no algorithm can
simultaneously ensure these three very natural to desiderata unless
the base rate is the same. Nobody cares how
that algorithm is implemented. So if you believe in
free choice or randomness, you get the same thing
whether the algorithm is a machine or whether
it’s some other form.>>So, it does not make sense to say let’s solve this problem by bringing a human
into the loop. Does nobody could bring into the loop that would
address this problem. It’s just the math. So as you can probably tell, I find a group fairness notions dissatisfying for
a number of reasons. I’m sharing some of
these reasons with you. Now this brings us to
a very famous video. How many people have seen this? Not all, great.
I mean not even half. So, this is an experiment with Capuchin Monkeys,
they’re paired up. The monkey has to
do a task which is to hand you a rock and then
the monkey gets a reward. So, let’s see, sorry.>>So, she gives a rock
to us that’s the task. We give our piece of
Cucumber and she eats it. The other one needs
to give a rock to us, and that’s what she does. She gets a Grape. If the other one sees that, she gives a rock to us
now gets again Cucumber.>>Okay. So that brings us to the notion that I prefer
which is Individual Fairness. Individual Fairness
says intuitively that, “For a given classification task, if two people are
similar with respect to that classification task then they should be
treated similarly.” Now to make any sense
out of this, we need the right
notion of what it means to be similar
or dissimilar. We need some kind of
a distance notion. So, a classic example is in credit scoring where
the difference in credit scoring literally gives you
a metric for how dissimilar a pair of
people are considered to be for credit worthiness. Okay. Of course, again, I have to emphasize this
is a Task-Specific Notion. So, Sergei and I might be in fact comparably qualified for a loan but for recommending hair products who would
be completely different. So we’re similar
for one purpose but totally different
for another purpose. So, it’s very important
that when we talk about these metrics they’re
Task-Specific. Now, thinking about
credit scores, maybe people in
general are sort of smushed uniformly from highly qualified to very unqualified. So, when you’re trying to decide whether to
give a loan based on whether somebody
crosses a threshold, you may have two people
who are very close and you give one of them the loan and you deny it to the other one, right? If the scores are in fact sort of densely packed
along this line, then indeed that has to happen. So, you sort of have to
draw the line somewhere, that’s the general worry. Then you would have
very similar people who are being treated
very dissimilarly. So, what do the theoreticians
say that you should do in this case? You randomize. So, instead of deciding yes definitely for you and
definitely not for you, you assign to people
a probability of getting the loan or
a probability of getting an A depending on
their creditworthiness. You arrange that if
two people are very similar, then the probability
distribution on outcomes that they see is
very similar. Yes?>>Two policemen were just
fired for tossing a coin to decide whether to arrest
someone for speeding ticket. So, they may have
heard your lecture.>>No. I say is there are circumstances under which it might be
the right thing to do. Police have huge amounts
of discretion, and if they treat similar people similarly
maybe that’s fine.>>But maybe instead
of randomizing, one could just scale the loans. So instead of having
a fixed amount that you then, you could change the amount
of loan until reaches zero.>>You could certainly
do something like that. I would have to think through
all of the ramifications. But, there are cases where you must make
a binary decision. So, we’re going to take this as our
definition of Fairness. So, we have a set of outcomes. In this case, it’s
yes or no that’s O. This is the probability
distributions on outcomes. So, we want to find
a classifier that maps individuals in our universe to probability distributions on outcomes with
the following condition. For some appropriate measure of the difference of
two probability distributions, for example total
variation distance, we want that the difference in these distributions will be bounded by the distance
of these two individuals. So, it’s a Lipschitz constraint, the distance under the appropriate
tasks specific metric. Now, of course this immediately says where does
the metric come from? How are you going to get
your hands on the metric. I don’t have
a great answer to that, but I also believe that if we actually had a classification system that
people thought was fair, then we could extract
from it a metric.>>So, how do we ensure
that the metric is fair?>>That’s exactly what
I just said, we don’t know where to get
our hands on the metrics. So, when we did
this work initially, we said that at the very
best it was going to be societies best guess at the moment in clearly
what we would need. I’m now understanding
more and more that these things are
going to be handled by political processes, that there’s no other way
of getting these things other than combining stakeholders and measurements
to get somewhere. But we don’t have
a straight mathematical answer for where we’re going
to get the metric. Yes?>>Is the sort of tension between>>Who watches their
watchmen as an argument or the best person to decide
if I’m going to offend you, is you not me. It doesn’t mean the rule
of fairness is it to be a case that in society
will dictate that, we’ll go back into
the group fairness. So, for individual fairness, the toughest nut to crack
is that the source of the metric is the subject to which the decision
will be applied. As in the case of the monkey, the monkey was the one
decided, “This is not fair.” So, any hands as to how to integral and other societal and
individual fairness or separate them properly?>>So, I don’t have a
mathematical answer to that yet. What I’m starting to engage
in very heavily now is, what is the right way of
thinking about it politically. I don’t mean
political correctness, I mean political
theory and ethics. But, it’s an unknown, and work on it guys. It really makes sense, it’s an important problem. Maybe ask me more
about it at the end.>>Are we right people
to work on it?>>So, not by yourselves, unless you have
a whole very broad education. Inevitably, there’s going to be legislation and regulation. We have to be sure that
what gets legislated and regulated is not
technically stupid, that it’s not infeasible. People will say, I think there’s a real risk of very bad law
and regulation. So, I think that
the technologists have an important role to play in helping to
steer away from that. Also, I think that
technologists sort of hearing more about
the concerns and the issues, they have some of
the tools needed to try to move those and act
on those concerns. So, we are sort of, but not by ourselves. Yes?>>How do we know that
the metric is even computable? For example, if a brain is
a universal Turing machine, then they can exist. The computer will function
but not serving in extreme machine
[inaudible] machine to, a computer will number,
so it’s just a metric.>>So, I’m hearing
about two-thirds of your words, but you seem to be saying, how do we know that it’s
computable in the sense of how do we know that
an information is available? How do we get our hands
on the information, which would give us the metric?>>If you suppose that our brain is
a universal Turing machine, and it got eschewed
any art training machine, the Nell computer will
also can map from a universal Turing machine to a metric, computable metric.>>So, I’m not sure what decision problem you’re
trying to work in here like whether the brain stops on itself when
it gets itself, I don’t know exactly. Maybe asked me after, I’m sorry. I’m going to go on. Let’s talk a little bit about algorithms. So, first, as I said, these are challenges
that we are familiar. So, in the ad setting, we imagined the situation where we actually knew
our universe of individuals, and we even gave the advertiser the opportunity
to say for each individual and
each potential outcome how bad is it for this individual to be
mapped to that outcome. So, what’s a loss
that’s associated incurred by mapping
this individual to outcome, o. It’s just something that we
are adding optionally to allow the vendors to
weigh in on things. Then, we assemble
the ingredients of the science fiction metric and the vendor’s loss functions. We want to minimize the vendor’s loss subject
to the fairness conditions. So, we’re viewing the
loss functions as soft constraints and the theorists conditions
as hard constraints. This is just a linear program. So, in theory, this
can be solved, and in fact, there’s
an interesting observation, which is sort of evocative
of those theorems about what you can’t do when
the base rates are unequal. If you solve the fairness
linear program, you get a classifier that gives
you a statistical parity, if and only if, intuitively that
the distributions are the same. So, the Earth mover
distance between the uniform distribution on the sage eaters and
the time eaters is zero. So, that was something where we assumed that we knew
the whole universe, very recently Rothblum
and Yona have shown how to relax this a little bit
to what they’re calling probably
approximate fairness, and then they get
generalizability. We didn’t generalize
in our initial result. So, that’s a 2018 result. Now, I mentioned
that there’s been a lot of progress
in recent years. So, for example, Hardt, Price, and Srebro took
equal false-negative rate and equal false-positive rate, and optionally also equal
false-positive rate as a fairness requirement, and they showed how to convert unfair predictors
to fair predictors. Not for individual fairness, their notion of fairness,
these group fairness. Something that I think that
you guys might appreciate is, they did this all
in post-processing. So, as you know, if you’re trying to
get a product group to use one of your tools, it’s much better
if you don’t have to interfere in their pipeline, and you can stick something
at the beginning or the end. So, that’s something they did. Then, in other work,
Joseph Kearns, Morgan Stern, and Roth looked
at a different problem. They were looking at
a bandit setting. So, you have basically one arm for each demographic group. So, in our example, there
would be two groups, but you could be more general. They’re worried about the fact that you may have a lack of training data for
your minority group. So, for the sage leader, you don’t have a lot
of training data. What they require is
that at each round, let’s say, think about
this in terms of loan. At each round, one person from each demographic group comes and you have to choose
one to give a loan to, or at most one to give a loan to. Maybe what you’ll do is
you’ll actually choose a probability
distribution and you will then choose somebody
from that distribution. The requirement is that
a worse applicant will never be favored in terms of
these probabilities over a better one
despite uncertainty. So, they study how to minimize the regret and learn
a fair policy. So, they look at
the difference between the regret that you get when you enforce this kind of
fairness condition, which basically costs you while you were learning about
the minority group. Once you know
the minority group well, you can predict with confidence, but when you don’t, can’t. So, they show that there’s a gap, that it’s more expensive. You have higher regret
when you’re learning a fair policy versus regret when fairness is not a concern.
There was a question?>>Yes. So, what is the fairness
that are used here?>>They’re using individual
fairness at each round. Among the people who have
shown up at that round.>>The probability that you choose usually shouldn’t differ.>>Right, shouldn’t be. Actually, no sorry.
There’s something a little bit more worse. Just a worse applicant
is not favored over a better applicant. Yes, sorry. I misremembered. That’s the one for this paper. So, it was 2016, 2017 was
a really interesting year. So, two independent groups of researchers started looking at some of these similar questions. So, now, we’re getting into the intersectional case
that I mentioned before. So, imagine that you have, for the first paper, this is where the gerrymandering
nomenclature came from. Imagine that you have a couple
of sensitive bits for now. Lets say, K sensitive bits. So, sage versus time, and coffee versus tea, and a couple of
other sensitive bits. This gives you two
to the K subsets.>>Suppose all of
those subsets are still large, they’re not tiny relative to
the population as a whole. So, this is sort of large
intersections of large set. All right, so, they’re studying the intersectional case and they want to learn a classifier
that will essentially ensure, well they looked at
a few different notions, statistical parody, and equalizing the false
positive or false negative rates across these intersectional sets. So, I described it as being
defined by say K bits, all of the subsets
to those K bits. But you can have other ways of describing your
collection of sets, like a set of all the sets that can be described
by small circuits. So, as long as the set is
these sets are large enough, they want to try to guarantee these notions of fairness
across the groups. They also showed that the general problem
of what they call fairness auditing is hard. So I give you a classifier and these descriptions and perhaps
implicit of these groups, how do you determine
whether there is a group to whom
you’re being unfair? When one of these
intersectional things to whom you’re being unfair, and they showed hardness by reduction from agnostic learning. So, the reason they
can have an algorithm as well as these negative
results is as you all know machine learning often does interesting things even when
in theory they’re hard. Very similar questions were
looked at by Hebert-Johnson, Kim, Reingold and Roth. Wait I made a mistake
this is Roth Blom. The other one is Roth
this is Roth Blom. They studied calibration. So suppose my predictor
is now giving a score, and so I’m going to rate people some number between zero and one. Think about this as
my prediction about your probability of
recidivating, for example. So, the system is calibrated if among those that are rated v, the fraction who truly
will end up with a positive label
is close to v. So, in other words it’s requiring
correctness on average for those rated v for all v. So, they got results that
were very similar but they’re using
this kind of notion of correctness as fairness through the calibration version. If you’re familiar
with the Kleinberg, Mullainathan, and
Raghavan result, this is the kind of
calibration notion there or related
calibration notion. They argue that this makes more sense than these other
group fairness notions but other than that they
get very similar results. So, for every set defined by a circuit in
a predetermined set C, they get this approximate
calibration simultaneously. Subsequent work by Michael Kim
and James Zoe has shown that they’ve used this to improve some medical predictors. They’ve gotten some nice results. Then, how long is my talk?>>[inaudible].>>Okay, I’ll talk for
just a little more then. Okay, so, the next batch
of results deals with algorithmic work
that tries to bridge the gap between group fairness and
individual fairness by assuming that we have some sort of oracle that tells us for we can ask about certain collections
of edges and say, “What are the distances
on these edges?” And then it leverages that to do powerful things without requiring you to ask about all pairs. In the deep learning community, there’s a whole different
approach which is adversarial learning or learning a fair representations and
adversarial version of that. So, this is a group
fairness notion. Oh shoot. Sorry, all right let
me skip this slide. Okay, so the general idea here is we start with
instances in some set X and we’re going to learn a new representation for people that in some sense retains as much information as
possible while suppressing some key attributes such
as whether somebody is or is not a member
of a sage eater, whether they’re saying
eater or a time eater. And they’re using
gens and so we’ve got an adversary that is trying to distinguish
given a z whether it came from the sage eater X
or a time eater X. The key point in this intuition is the predictor is going to operate on
the representation Z. If the predictor is not so powerful as
the adversary who is trying to distinguish sage eaters from time eaters and who has trained the encoder to make that representation not
yield distinguishability, the predictor is no more
powerful than A than in some sense its predictions
can’t be biased. This is really,
really intriguing. It is a group notion
of fairness and I would love to see
individual fairness worked in here but this is another very interesting
direction of research. So, in the last five minutes I’m going to talk
about composition. Because this is really
interesting and this is new. So, our intuition is, if we have a system in
which we have a bunch of individually fair
pieces then when we look at the system that’s
made of these as a whole, it’s still at least
relatively fair. And the answer is,
it’s complicated. So, here’s a very simple example. Suppose let’s make the example really simple you’re looking at the New York Times
website and there is a single slot for a banner ad
at the top of the page. That’s going to be our scenario. As you know, there’s an auction
run for your attention. Advertisers are competing
to get that slot. So, let’s look at the case where we have two kinds of advertisers. One is advertising tech jobs and the other one is advertising
a grocery delivery service. So, the tech jobs advertiser and the grocery delivery
service advertiser are competing for your attention. Ideally if our classifier
for determining who is a a good bet for the tech
jobs ad is fair, and if our classifier
for determining who’s a good bet for
the grocery ad is fair, then somehow the system
will be fair as a whole. But we want to require that
in this complex system, it’s still the case
that people who are similarly qualified for tech jobs should be similarly likely to see tech job ads and same
thing for groceries. So, suppose that
the grocery advertiser routinely outbids
the tech jobs advertiser. It’s an auction,
the outbidder wins. So we can even think of it as just the outbidder goes first. The grocery advertiser
is going first or its bidding more, all right? And so the tech company is now essentially being
reduced to the leftovers, the ones who weren’t grabbed
by the groceries advertiser. This is true again whether
it’s done temporally first one than the other or whether it’s just
done by the money bids. So, this is a major problem. It says that no matter how qualified you are
for a tech jobs ad, if you’re very qualified
for groceries, you’re not going to
see the tech jobs ads. Now who is qualified for
the grocery service, the grocery delivery service? New parents. Advertisers drool over the things they can start advertising to new parents and those tend not to be the tech
jobs advertisers, and in fact, the tech jobs
advertisers you might say shouldn’t touch
the parenthood bid with a 10 foot pole. So, even if they are really appropriately not looking
at parenthood status, you would think it was
appropriate not to be looking at parenthood status
in determining who should see the ad there’re competitors for your attention are looking at that
and that’s affecting what happens under
the tech jobs advertiser. Okay and there are versions of this particular problem both for individual fairness and
for group fairness. So, this problem doesn’t go away. No time for these things. There is a way out here though
which I will mention which is we could fix a probability distribution
on the advertising tasks. Any probability
distribution we want. Then we pick a task according
to that distribution, and then we run
the classifier just for the chosen task and the proof that this gives us individual fairness
is very, very simple. But it leaves money on
the table because there are some people for whom we are just not going to be showing ads. So, that raises a number of very interesting
economics questions for those who are interested in the intersection of
algorithms and economics. Because of time, I
will skip the rest.>>Final comments. First of all, Interpretability
and Causality. There are big literatures on interpretability and
causality for fairness. Why is this slide left blank? First of all, I have
tried very hard to make headway with
interpretability, and I have failed. I don’t understand
what’s being asked, and how it can be achieved. There’s a very interesting
article called the Mythos of Interpretability
by Zachary Lipton. He explains, he catalogs, these are some things
that people could want, and here are some of the issues. It’s very, very thoughtful, and I’m not done
thinking about this, but I’m stumped on this one. As far as causality, it’s problematic for a number
of reasons that I can see. So, first of all,
if you’re using, say Pearl’s theory
of causal reasoning, you need a generative model. Now, maybe you’re familiar with the saying that all
models are wrong, but some models are useful. So, if your definition of fairness ties the
classifier to the model, and you need those two together to determine
if something is fair, and your model is wrong, you may very well come up
with the wrong conclusion. So, that’s a glimpse into what some of
the difficulties I’ve encountered with causality are. Another brief comment as we’ve
talked about the metric, we’ve talked about the issues of where do you get the metric from, there’s still problems. So, here, suppose we have
news organization, a nice news. Your news organization
has an environment, there’s randomness
in the environment, you have an individual x, there’s randomness in x, but you can talk about, although you can’t
get your hands on, the probability that
this individual x will succeed at nice news. It’s well-defined
once you specify the coins for this randomness and the coins for this randomness. Again, you can’t get
your hands on it, but at least it makes
sense mathematically. Now, what you might require is that if two people
are similar, meaning they have similar
probabilities now in success. And success can be
very crisply defined, like you stay in the company
for at least three years, and you have
at least one promotion during that time, okay? You might say that if you
have two individuals who are so equally likely to succeed, then they should
have very similar probabilities of being hired. That seems to make some sense, so the metric then is capturing the
probability of success. But, what happens if
this is not nice news, but it’s nasty news, which has a history of being
very hostile to sage eaters. So, sage eaters can’t succeed no matter how
talented they are, or they’ll have
a real uphill struggle. Should the metric, in fact, be capturing whether people are equally likely to succeed? Should it be capturing whether
they’re equally talented? What’s the right
thing to do here? It’s not clear.
So, final remarks, truth is elusive,
it’s not remedy. This problem is not remedied
by having a computer. They’ve argued, I really believed that metrics are really at the heart of everything even if it may take us
a long time to find them, but we advocate always
sunshine for the metric. The metric should
not be kept secret, it has to be open for debate and discussion
and refinement. Computers are not necessarily
worse than humans, they may be more accurate, especially on the easy cases, they may be more easily tested, they don’t get tired
when you keep throwing examples at them to see
what they have to do. That’s it. Thank you.>>There’s a question.>>Sure. Go ahead. Sorry, one behind, and then you.>>So, I’m really
[inaudible] a lot. I’m sure you thought
about this and might even mentioned it in
the previous slide, when you have curious
figure of two thing, so in considering
the individual there, if we’re looking at time eaters, time eaters always
teach their kids, and there is this terrible
sage brush infection, anyone who eats
sage get very ill, also there is this terrible
historical injustice right off the sage eaters. They just did weigh less
good, apparently everything, they’re so focused on
handling this illness. So, it just don’t happens via this historical coincidence, the sage eaters are almost never like a similarly
situated individual, they compete with
the time eaters. It seems like on
that individual fairness metric, we would just say,
okay no problem, and if we’re designing jobs, and it’s the time eaters
they get all of them, whereas we’re not
considering metric of group aspect and I think somewhere along [inaudible] societies
[inaudible] historical justice.>>That’s, of course,
an excellent point, and in the paper, we described what we call
Fair Affirmative Action, which is a method for
dealing with this, where we break certain of the Lipschitz constraints
for exactly this reason, but it’s a principled
approach to this thing. I should mention that it’s not so unlike what the state
of California does, and the state of Texas does
with respect to college. That if you’re in the top
10 percent of your class, then you’re admitted to one
of the state in California, it’s one of the UCs,
and in Texas, it’s one of the state
colleges, I guess. The point is that schools in different neighborhoods
may have very, very different
performances on tests, and so on, and so forth, but
it’s the same basic idea. So, we add a bit to that, it’s not quite as course as
taking the top 10 percent. There’s a very interesting work
by John Roemer at Yale, who is very concerned with theories of
distributive justice, and he thinks a lot
about education. He thinks that you must
invest, and invest, and invest in children
in order to compensate for differences of situation, and that as you go through
the schooling process, you do less and less of
this compensating because the children having been given a good foundation or more,
theoretically, I guess, more responsible for how they
make use of what they have, but in particular, I remember he said the following in the context of
college admissions. Stratify the students according to the education level
of the mother, and within each stratum, look at the cumulative
distribution function describing how many hours a week the students
spend on homework, and view people in
the various percentiles in that CDF from the different
strata, as being comparable. One of the points he makes, which I found extremely
compelling is that, if you’re brought up in a home
where nobody is educated, then you may not
even think that it’s possible to spend 10 or 15 hours
a week on homework. It just doesn’t cross
your mind to do it, or you might be working, and unable to do it. I think that these sorts of insights from the social
scientists are very important in what we are doing
is in some sense analogous to that. It’s
a very good point.>>So, that whole conversation is basically related
to this nasty news, choosing the metric thing, and it comes down to questions of what the condition on
mediating variables and stuff. So, basically, the rest of the talk didn’t address
those types of issues, like how to choose
the metric fairly. Is there any work that uses graphical models or other ideas
for causal inference to, because we have
this intuitive idea, oh, well we should condition on these things we want
to make a fair metric, but we definitely don’t want to condition on these other things. Is there any work that
tries to formalize that?>>There is work. It’s very interesting, and if you’re interested, I can point you to
a bunch of papers. The one difficulty with some
of this work is that you can have the same
observable distribution on data that’s consistent with very different
causal models, and so that just causes problems.>>[inaudible] models
is just really hard.>>So, I don’t know what
to say about that yet. I was really excited when I saw that there was
a counterfactual fairness paper. I thought, “Yes,
this is it, this is it,” but then we found two different models
that gave rise to the same distribution, and the classifier, and the
thing was fair under one, and not under the other.
So, what do you do?>>Yes.>>Suppose I want to get
a chess team for my school.>>You want to get what?>>I’m sorry, select
the chess team of three people to represent
my school at some company. What I’d probably do
to try and measure how good the people are and pick the best three by measurement, and see if they fit
your definition of individual fairness.>>Yes.>>The more accurately
I can measure people, the less randomness there is, and the less fair it is.>>You introduced the randomness.>>There is some random.>>You say that the algorithm
introduces the randomness.>>I mean, if you just take
what people typically do, they’re not going to introduce randomness out of the fact
but there will be randomness and because some
people win, some people lose. There’s randomness
in how well you can measure people’s skill. It seemed like by
your definition, the more accurately we can
measure people’s skill, the less randomness there is, the less fair
the selection process.>>If you’re saying that you
can get very, very, very.>>Maybe, wait a minute, Y is a completely random process more fair than a meritocratic Y.>>Right. Okay. This is the kind of problem
that shows up, in fairness, all the time. I mean, from
this particular example, you’d hit the nail on the head. In the definition of
individual fairness, buried in there is the idea
that if you simply flipped coins all the time,
that would be fair. If all you did was decide whether someone could be on the chess
team by flipping a coin, certainly similar people
would be treated similarly. What we’re not ensuring is that dissimilar people are
treated dissimilarly. This is where we want to throw in they will minimize
the loss subject to the fairness conditions
and we want to incorporate some notion of
loss in the decision process. We want to stay fair but we also want to maximize utility.>>I think it’s not
fair to be [inaudible] a fairness constraint has to
be random. That’s simple.>>I know what you’re saying, and you can probably come up with even better examples of where randomness is unfair or where treating everybody
the same way is unfair. Often, it’s possible to reframe the problem so
that you’re pushing those concerned into
the objective function rather than the fairness. Yes. By the way, we thought that we
were so original with similar treated
similarly but of course, Aristotle says the same thing, and Aristotle says dissimilar should be treated dissimilarly. Since we were thinking
initially about advertising, and I was thinking
about how when you look at a physical newspaper, everybody gets the same ads. There’s no targeting. That seems pretty fair. In terms of exposing young people to ads
for luxury goods, does that help form
aspirations or not? Should you be saying
that poor people should not see poor children, shouldn’t see nice things
or pictures of nice things? There are people who work on these kinds of questions from the psychological and
sociological point of view and they think that yes, the ads are very tied in with the formulation
of the self and that having a real difference in
what people see is harmful. So, I don’t know. Yes.>>When hearing the conversation of individual fairness metrics, I was thinking about
drug treatment effects and where you’re doing these randomized
controlled assignments. You’re thinking about,
here’s the population, let me find someone from this other population
that matters well, see the treatment effect when different treatments
are assigned. Then really, I can imagine a different sort of problem
here where you really want the classifier to behave
like a placebo for these matched individuals rather than as a drug with
some treatment effect. That actually boils down not to the individual fairness where you’re assigning this metric on every pair of individuals in the population but only
like across groups.>>Right.>>Then you do allow meritocracy
within a group but you want placeboness across groups. It’s good to address this.>>This is really interesting
and I don’t know. I mentioned that in
the causality literature, there’s been work sort
of in the Pearl model. In the Rubin model,
I don’t know of any. Maybe there, but I
don’t know of any, and I want to look at that. I think this is definitely
worth thinking through, so very good question.
Wait, someone, yes?>>I was wondering
about something. This title is about
this going towards the theory of modeling fairness. I was thinking, have people tried to turn it around
and say that you’re going to accept that
things are going to be unfair and we want to actually quantify the degree of
how unfair things are. In other words, instead of treating them as
fairness constraints, try to put them into
maybe equivalence classes of the way people study program classes and say B and D. Do you have any thoughts
in that direction?>>I only know of one theoretical result
along those lines, which when I was talking about when individual fairness
implies statistical parody. In fact, there’s a tighter
characterization of how unfair it is in terms of the move or distance
between groups. I don’t know of, I guess, some of the very recent works
do look at what happens if you relax
things a little bit like that probably
approximately fair notion. You’re saying, “Okay,
we’ll introduce a little bit of slack, what can we get from it?” There is some of that, but I don’t know of
wholesale saying, “Okay. Well, we’re just going to
treat these people this way.” I don’t know.>>I was wondering, in
differential privacy, there’s this trade
off of, let’s say, the epsilon parameter you said to get somewhere on that curve. You give up privacy
to get more accuracy.>>Yes. If you think about it, the definition of
fairness is very similar in flavor to
the definition of differential privacy but it doesn’t behave the same way
under composition. Yes. There is
one result also about using a particular algorithm from differential privacy
to ensure fairness, where the trade-off
that you talk about would perhaps be relevant. Yes. Anyone else? Yes.>>I heard a question
about the practice. This problem is complicated. You’ve given that very
beautiful summary of the related work and talking about
how it’s going to take a long time to
reach maybe consensus, maybe not, but
some legislation on this. At the same time,
computers are not necessarily worse than humans and people are trying
to put mission and models into the world
in different domains.>>RIght.>>What is your prescription about what needs to
be done right now? Do we stop and wait until we really understand
these things better? Do we apply some
reasonable approach because something is
better than nothing? What is the practitioner’s guide of what needs to be done?>>I feel wildly unqualified to answer that question
but I’ll try anyway. Even if I thought, and I don’t, but even if I thought it
was appropriate to wait, that’s a nonstarter in
the business world. Facebook is not going to stop. You just cannot stop industry
where it is, I think. Just, the political forces
are way too strong. Now, what you can do, of course, is start creating test sets, test sets where there were at
least the people who create the sets believe they know
what the outcomes should be. For example, algorithmic assistance in bail determination. It’s happening in many states. I can’t remember the number, I’m sure it’s at
least 12 but I can’t remember why that number
isn’t in my head. It’s happening in
a lot of states. One thing that we would want, I don’t know how useful
it is to see the code because code can be hard to
understand, but certainly, you would want that
somehow appropriate people and organizations should
be able to throw test cases at the algorithm
and see what it does. I would say really intensive
monitoring with the best golden sets
you can come up with and lots of people doing it, I think that’s where
we are for now. Yes, okay. Because
monitoring, you can do.>>Yeah.>>So, if you just do what people are doing and
trying to maximize your accuracy and release
these algorithms in the wild, are there any instances where it’s clear
that it went bad?>>Where it’s clear that
the algorithm has gone bad?>>Yes.>>Lots of cases.>>Like what?>>Speed bump. Is that
what it was called? Street bump? I forget. So, pothole detectors that will report to the
municipality about potholes. But this was an app on an iPhone, and so the only areas that got the reports of the potholes
for the wealthy areas.>>But it seems like,
that’s an issue where it’s not as accurate
as he would like. So, the question is->>It is certainly
an unfair impact->>Inaccurate. Then okay, that’s probably like
you’d like to fit, but the people making
that app would like to be more accurate and that’s what
they’re trying to do.>>No. No, I don’t understand. If the app isn’t being brought
over the right potholes, if it’s not make penetrating
into the poor areas of town, it’s just not giving you
the information you want. The system as a whole is unfair. So, you’re saying, is there
a classification algorithm?>>Lucy, that’s like
the minority group, in which it has less data
so it’s less accurate. Is roughly->>It had, I mean, I’m not sure that it was
inaccurate where it was, it just wasn’t coverage. There was lack of coverage.>>Maybe you don’t
understand because like->>Okay. Do you want to go
ahead and provide an example?>>Yes. So compass, so if you define ->>Wait, so you’re going
to start with compass. The problem with compass is
that they were test fair.>>Right, for those criteria
but they weren’t for- or positive and false
positive, false negative.>>Yes. But they were test
fair and it’s impossible to simultaneously be
test fair and have equal false positive rates
and have equal false negative rates. So->>I’m looking for one
where it’s clear that it’s unfair by if you talk to
a random person on the street, comes very clearly unfair.>>So->>College admissions?>>Huh?>>College admissions and wealth is probably
a really good example.>>So affirmative action
is controversial, right?>>This is part of the issue. When I began the talk by saying that there’s
consensus at least, there’s more consensus on privacy than there
is in fairness. So is affirmative action fair? What do you think? There will be a range of opinions here as to
whether it’s fair or not. Have there been
examples historically that news classification
algorithms that were very biased? Yes. The answer is yes.>>When we talk about
fairness and classification, it seems like some
classification problems are less important than others. For example, if you’re trying to decide what ads to show some that sort of
caused people problems, but if a car is trying to decide which person to
kill its classification->>So here’s what I think
we should do about cars. Is that your question or not?>>Well, I mean, what
are you going to say?>>Oh, thank you. All right. So for a while, I
thought “Oh my God, we can’t do anything
until we figure out what’s right to do
in all situations.” Then thinking about the cars, I thought well suppose
you have an algorithm that what you do is you learn
what do average people do. Maybe you used
the Moral Machine for this. Maybe you have some other way of figuring out statistically which you learn what sort of
a random person would do. And then your car
driving algorithm should do the following, when it gets into one of
these really touchy situations, it should do what
a random person does, and the rest of
the time, it should just do it automated driving. So this would be much better than humans because in
all the usual situation, it’ll be more accurate. In the really wacky ones, it’ll be no worse than humans. So altogether, it’s better.>>And you say, oh what
a random person would do, what does that really mean? I mean, I think one way, one intuition is that everyone is equal so there’s no way-.>>So what a random
person would do. So with some probability
distribution and you just draw from that distribution and you deal with
that person does.>>Is a shifting
down the problem, like when do you switch mode? Is it really the shifting
the decision there that?>>So I was thinking that it would not be difficult to know when to switch modes, that most of the time was
totally clear what to do and then sometimes there
was a moral question, and then you see you
go to your database of what would people do
and you consult that.>>It would also seems like
it would just reinforce societal biases if you’re sampling from a bias sample
whereas people are biased. Even if you pick a
random, that’s going to be a biased decision.>>Okay.>>I have a question
about the- sorry.>>Last question. Go ahead.>>Okay. I have a question
about the data part. It seems like a lot
of this classroom just that the asset because these kind of problems that
fairness problems are figure any bias like
recidivism the assets, that people tend to be arrested at higher
rates, you know, even what kind of
people the base rate really starts off being
bias their sense. So maybe, like
income example for instance, like maybe, you don’t see so many women CEOs and therefore, if you are trying to do a model where you’re predicting pay, it tend to be fewer highly
paid women historically, so you don’t even have the
data in the first place.>>Right.>>So it seems to me that there should be some kind of work, like just working
in the data space, trying to maybe hallucinate what fair training data
will look like, something.>>There are efforts that involve what I would call
massaging training data, exactly as you’re suggesting, I don’t know how to put them on a sort of
firm theoretical foundation, but people have thought about
those sorts of things, yes. As you say, they sort of, you could hallucinate some data to change your training set. I don’t know a lot
of work on doing that and how you would
do it, but I guess, I mean now that you mentioned it, Statisticians do this sort
of thing all the time when they weight
their samples. So something maybe
could be doable, but then you have to decide how are you going
to do that waiting.>>The last comment is like the Rubin counterfactual
fairness approach?>>What was it?>>Rubin’s style
counterfactual fairness. So not [inaudible]
, Rubin’s style. You would be trying to
weigh up the women CEO because she’s very rare sample
compared to CEO. So when you actually train on it, you way up the woman
CEO a little bit more, and you pass these courses
to find that way, yeah.>>Yeah. So as I said, I think it’s a really good idea
to try to pursue that direction and
see what it gives.>>There was a nice
similar paper about that.>>Cool. Awesome, yeah.>>Okay, good. All right.>>Thank you very much.>>Thank you all.

Maurice Vega

3 Responses

  1. that guy at 0:07:00 hit the nail on the freaking head. 100 points. random guessing is most "fair" over the variables but not with regards to the actual task and the performance of the model.

  2. The best part of video is at 1:04:53 This theory suggests that randomness is more fair than meritocracy. How can this be even considered to be fair?

Leave a Reply

Your email address will not be published. Required fields are marked *

Post comment