The Uncertainty Bias in AI and How to Tackle It
Bias in AI is a formidable topic for any data scientist. If you are reading this, you probably know that artificial intelligence systems have a bias problem. While true, that thought is misleading. AI systems themselves inherently have no bias. However, if a system is trained on biased data, or the people running it do not correct for that bias, it can return faulty, biased information.
But you may not know that the same AI systems, even those we would consider free of bias, can present a different and no less concerning outcome. By favoring the most common, normative or expected data, AI can subject unusual or outlier data to uncertainty.
It’s been well established that AI systems will replicate and often exacerbate the bias inherent in their training data. However, even when measures are taken to level the playing field, a subtle but equally undesirable result may occur because of prediction uncertainty.
Uncertainty, and especially unfairness in uncertainty, can be a complicated idea. Think about comparing two GPS navigation apps. Both give you similar expected travel times, but the first app is always within a minute or two of the actual time, while the second app’s routes produce actual travel times that can be 10 minutes faster or 10 minutes slower than expected. Which one would you use? And why does this situation arise in the first place?
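To make the comparison concrete, here is a minimal Python sketch. The travel times are made-up numbers chosen only to illustrate the idea, not measurements from any real app, and the function name is hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical data: both apps predicted 30 minutes for each of five trips.
predicted = 30
app_a_actual = [29, 31, 30, 28, 32]  # always within a minute or two
app_b_actual = [20, 40, 30, 22, 38]  # up to 10 minutes faster or slower

def summarize(actual, predicted):
    """Return (average error, spread of the errors) in minutes."""
    errors = [a - predicted for a in actual]
    return mean(errors), pstdev(errors)

bias_a, spread_a = summarize(app_a_actual, predicted)
bias_b, spread_b = summarize(app_b_actual, predicted)

print(f"App A: average error {bias_a:+.1f} min, spread {spread_a:.1f} min")
print(f"App B: average error {bias_b:+.1f} min, spread {spread_b:.1f} min")
```

In this toy data both apps are right on average, so neither looks biased. The difference shows up only in the spread, which is several times larger for the second app.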
An AI system’s certainty and accuracy in making predictions tend to increase with the amount of training data it sees. In our data-rich world, it’s usually not an issue to collect more data. However, while some groups of people are well represented in commonly used datasets, other, marginalized groups are under-represented. When AI systems are asked to make predictions for marginalized groups, the answers they provide will be less predictable, accurate, or relevant than for a situation that’s well represented in the training data.
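One simplified way to see why less data means more uncertainty is the textbook standard error of an average, which shrinks with the square root of the number of examples. The sketch below uses that formula as a stand-in for a model's prediction uncertainty. The group sizes and the score spread are assumptions made up for illustration.

```python
import math

def standard_error(spread, n_examples):
    """Standard error of a mean estimated from n_examples samples."""
    return spread / math.sqrt(n_examples)

# Assumed standard deviation of scores, taken to be the same for both groups.
score_spread = 1.0

# Hypothetical numbers of training examples per group.
group_sizes = {"well_represented": 10_000, "under_represented": 100}

for group, n in group_sizes.items():
    print(f"{group}: n={n}, standard error={standard_error(score_spread, n):.3f}")
```

Under these assumptions, the group with 100 times fewer examples ends up with 10 times the uncertainty, even though nothing else about the two groups differs.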
Arguably, an uncertain, unpredictable system is worse in some respects than one that’s predictably biased.
A biased system isn’t a good thing, but if the bias is known and quantified beforehand, adjustments are possible and people using its predictions can compensate. In contrast, the problem with uncertainty is that you don’t know what is going to happen.
Consider this example.
If you are not sure what is going to happen when you turn in a homework assignment or an essay in your class, you become less prepared to make adjustments or plan for the next assignment or essay than someone who knows with greater confidence what the outcome will be.
When aggregated across the huge number of decisions made by and about each person every day, even small differences in uncertainty can have enormous consequences.
To be clear, this isn’t an argument against AI in society, but rather a call to action to recognize that its enormous potential to improve the lives of all people comes with important considerations that shouldn’t be ignored.
In truth, I am more than a believer. I spend my days building and refining AI systems for my company. If you are not familiar with us, students, educators, and schools use our software to uphold academic integrity. We give students step-by-step, personalized guidance on writing techniques and source citation. We provide data to help teachers and schools distinguish authentic work from plagiarized, copied, recycled, or otherwise fake work. We also help teachers cut grading time and more efficiently give feedback to students.
We increasingly do these things with AI and algorithms. We use existing information to make assessments and strong, calculated guesses about the source of written materials or whether one error is sufficiently like another error to merit the same response.
This context is important, because it is a good example for discussing the unpredictability of AI.
One of the most hotly researched areas of AI is automatic feedback and grading of long-form writing such as essays and reports. This form of writing is not only commonly used, it’s also enormously time-consuming to grade. Unlike math problems or computer code, writing blends freedom of abstract, stylistic self-expression with the need to convey concrete ideas.
Building AI that provides feedback and scoring capabilities requires collecting [at least] thousands of human-scored essays, feeding them into a specifically designed natural language AI formulation, and allowing the model to learn associations between human-generated scores and co-occurrences of words, phrases, syntax and punctuation, which it stores as mathematical parameterizations. Given enough training essays, the model can learn to mimic – and in some ways, exceed – human scoring performance on previously unseen writing that is statistically similar to the writing of the training set. It does so by using the stored parameterizations to perform a set of mathematical operations on the new data, which yields an “answer” that we refer to as a prediction.
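The learn-then-predict loop described above can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not how any production scoring system works: it assumes plain word counts as features, a linear model fit by gradient descent, and four invented essays with invented scores. All function names and the vocabulary are hypothetical.

```python
from collections import Counter

def featurize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def predict(weights, bias, features):
    """Apply the stored parameters to new data to produce a predicted score."""
    return bias + sum(w * x for w, x in zip(weights, features))

def train(essays, scores, vocabulary, epochs=5000, learning_rate=0.1):
    """Fit the parameters by gradient descent on squared error."""
    weights = [0.0] * len(vocabulary)
    bias = 0.0
    rows = [featurize(essay, vocabulary) for essay in essays]
    for _ in range(epochs):
        for features, target in zip(rows, scores):
            error = predict(weights, bias, features) - target
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias

# Invented training set: human-assigned scores on a 1-5 scale.
essays = [
    "the evidence clearly supports the argument",
    "the argument is supported by strong evidence",
    "stuff happened and then more stuff happened",
    "it was good and stuff",
]
scores = [5.0, 5.0, 2.0, 2.0]
vocabulary = ["evidence", "argument", "stuff"]

weights, bias = train(essays, scores, vocabulary)

# Writing that resembles the training set versus writing unlike anything in it.
familiar = predict(weights, bias,
                   featurize("strong evidence supports this argument", vocabulary))
unseen = predict(weights, bias,
                 featurize("Definitions belong to the definers, not the defined.",
                           vocabulary))
print(f"familiar essay: {familiar:.2f}, unseen style: {unseen:.2f}")
```

On this toy data the familiar-sounding essay scores close to 5. The unseen sentence contains none of the words the model learned, so the model has nothing to go on and simply returns its baseline value of about 2.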
We’ve been developing technology to do this work for almost a decade, and we are the leaders in the field. We’ve also been judicious about deploying this technology because we understand its limitations.
As an example, consider the sentence from Toni Morrison’s Beloved: “Definitions belong to the definers, not the defined.” Show this extraordinary sequence of words to an essay grading AI that’s only been trained on typical writing by English-fluent middle school students, and it’s as likely to deem the sentence remarkable as it is to call it repetitive and nonsensical. The particular mathematical parameterization of this AI is unable to make sense of the power of this sentence – it’s simply never seen anything like it before.
Of course, most writers aren’t Toni Morrison; however, the underlying issue still persists. AI models that are not shown enough representations of speech and writing patterns of writers from different ethnic, cultural and regional backgrounds begin to perform unpredictably when shown writing from those groups, while at the same time performing with high accuracy and low unpredictability for those in well-represented groups. The definers of the AI are the majority group, and the definitions that the AI is operating with are not being defined by everyone equally.
Since the AI that my team builds is designed to help students, I think of a student whose writing or composition background and style are unique – not bad, just different from the norm present in the training data. And I think about the stress that the unpredictability of AI assessment must cause. Simply not knowing how a system based on predictable norms will handle the non-norm must be a terrible way to engage the process of teaching and learning – or anything else for that matter.
And although it is not a technology issue per se, I also wonder what systems based on making good guesses within established boundaries teach people whose inputs are unusual, in this example people with unusual writing styles.
Are we inadvertently telling them to wrap up and tuck away their creativity and individuality?
Are we teaching them to write boring? To be “normal”?
We believe that learning should help each individual become more of who they are by helping them fulfill their own potential, with their own style, voice, and direction. How can we build AI that helps accomplish this?
The good news is that there are two ways we can minimize those risks and tamp down the unpredictability penalty. Yet, as one might expect, neither is easy.
One way to get AI to do better at assessing outlying information is to be diligent about human review. When AI says the next William Faulkner is gibberish, a human needs to be in the oversight pathway to make the right determination. The AI needs to be constantly told what is what – this is actually good, that is actually not.
This approach is also useful for mitigating many of the harmful effects of bias in AI – people can spot it and override or counteract the result, reducing not only the adverse outcome but also the possibility of reinforcing it in future, similar cases. This requires close cooperation between AI teams and product teams to build AI-enabled experiences and products that give context to potential bias, highlight areas of low confidence in AI predictions, and specifically bring in human experts to oversee and, if necessary, correct those predictions.
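One possible shape for such an oversight pathway is a simple routing rule: predictions the model is unsure about go to a person instead of being accepted automatically. The sketch below assumes the model reports an uncertainty value alongside each score. The class, the function, the essay IDs and the threshold of 0.5 are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    essay_id: str
    score: float
    uncertainty: float  # assumed: the model's estimated spread around its score

def route(prediction, max_uncertainty=0.5):
    """Send low-confidence predictions to a human reviewer."""
    if prediction.uncertainty > max_uncertainty:
        return "human_review"
    return "auto_accept"

# Invented predictions: one confident, one not.
predictions = [
    Prediction("essay-001", score=4.2, uncertainty=0.2),
    Prediction("essay-002", score=3.1, uncertainty=1.4),
]

decisions = {p.essay_id: route(p) for p in predictions}
print(decisions)
```

Where to set the threshold is a product decision as much as a technical one. A lower value sends more work to human reviewers but leaves fewer unusual essays to the model alone.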
The second way of addressing unequal AI uncertainty is to improve the representation of marginalized groups in training datasets. On the surface, this sounds like the old adage “add more data,” but what I mean is that we need to add specific data that captures the enormous and wonderful tapestry of learners. Additionally, we need to make sure that the data’s labels (grades, tags, etc.) are carefully vetted by people whose cultural and lived experiences are relevant to the source of the data. This allows us to train AI that encodes that context, AI that’s context aware in ways most AI isn’t today.
Over the past few years, the power and peril of embedding AI into every aspect of our lives have become a mainstream topic – and I’m glad to see our society begin to grapple with these important questions. The way AI can actively propagate societal biases is now well understood, and efforts are already underway to mitigate their harmful impacts. We need to add unequal uncertainty to the conversation around AI fairness. Creating AI that works “better” for some groups and “worse” for others – even if on average the AI is fair – is still unfair and does not live up to our ideals.