What it takes to develop AI responsibly
AI is in a new hype cycle. A lot of that hype is warranted—the wild evolution of AI in recent years is certainly cause for excitement—but the downside is that all the excitement is drowning out the voices talking about risks and potential harm.
Don’t worry: I’m not here to rain on this parade. In fact, you might even find what I have to share gives you some hope!
I certainly feel hopeful and excited as I work with the incredible Textio team on our unique approach to AI.
Below, I review some of Textio’s practices and protocols for designing AI responsibly. I’m delighted to share this information more broadly. I feel so proud that Textio is a leading voice in responsible AI development, and I hope that externalizing our process inspires leaders and other builders (and consumers!) to insist on higher standards and settle for nothing less than carefully designed, ethical AI.
Why it matters to develop AI responsibly
Let’s talk first about why a responsible approach to AI is so important in the first place. When an AI isn’t built with bias and purpose in mind from the ground up—or, at minimum, trained with carefully curated data—it’s highly likely to produce biased outcomes. Release tech like that to the public and scale it out to millions, and we’ve got significant, unwanted societal implications.
Imagine the impact of thousands of engineering job posts skewed to attract young men and thousands of HR jobs optimized to draw in older women. Imagine men being consistently told in their performance reviews that they’re “analytical” and “confident” while women are told they’re “bubbly” and “opinionated.” Imagine Harvard graduates being coached, year after year, to take on more leadership opportunities, while Howard grads are encouraged to get along better with others.
We’re already living this reality to some degree. These biases are baked into training data and widely available AI tools because they already exist in the real world. It’s so important that we course-correct and get this right.
Unfortunately, with investors, business leaders, customers—practically everyone—sprinting to realize the big gains that new generative AI capabilities promise, a flurry of rushed AI features is coming to market. Many are not designed to protect against these harmful outcomes, nor are they specialized enough to deliver the productivity boost they’re touted for.
The encouraging part is that when we do get it right, we have a chance to have big, positive effects on our work, lives, and the world. Textio customers hire and develop diverse teams by using AI to show them in real time how to avoid bias and support growth. They’re building thriving workplaces by incorporating responsible AI into their HR and talent management functions.
How we build AI responsibly
We’ve developed a premier approach to building AI tools, and it has been consistent in everything Textio has ever built: details, safety, and quality matter to us, because the real-world outcomes matter to us. Technology can create, perpetuate, protect against, or even reverse harm. It can also save time and headaches and keep you out of legal trouble—or make work unnecessarily complex and even land you in court. We do the work to make sure anything we provide is on the right side of those possible outcomes.
This is why our incredible partners choose Textio. Here’s a little more about the ways we ensure safety and intended results with AI.
Designing for a specific purpose, and with bias in mind
We create “purpose-built” tools at Textio, meaning we design a technology from the ground up for a specific purpose, as opposed to a general-purpose technology. So while Textio sets out to build a performance feedback tool, another provider may build a general writing tool.
This matters because it dictates what sort of training data the builder uses. Not only do we mitigate bias this way, but we can also measure the performance of the AI: we know specifically what problematic feedback looks like versus high-quality feedback, so we can make sure what our AI produces is relevant, optimized, and appropriate for the use case. In short, we’re able to measure and monitor the quality of our AI’s output because we know what it’s being used for.
If you think about how you’d train a general writing tool, you’d give it lots of examples of writing. Most people are familiar with how ChatGPT was trained: It was fed tons of text from the internet to “learn” how to generate text itself. Sometimes it nails it, sometimes it hallucinates, sometimes it’s frustrating, sometimes it’s just plain weird. This is unsurprising when you consider it learned from what’s on the internet. Quite the mixed bag.
Building a purpose-built tool means using only very specific training data—pieces of performance feedback, to use the Textio example. Then when you ask the model—the trained software program—to evaluate performance feedback text or write some on its own, it’s referencing only examples of feedback. You get a lot fewer head-scratchers this way. And for use in a business context, it’s the only way to ensure you’re getting usable content quickly.
In addition to building for a specific use case, at Textio we’re also evaluating for and safeguarding against potential harm from the very first step of development. We have several layers of bias mitigation and quality assurance built into our approach. I’ll go into some of these now.
Measuring bias in training data
We account for bias in training data in two ways:
1. Diversity and representation in the dataset: It’s not unusual for a dataset to be, for example, heavily weighted toward men in their 30s. That’s bias in the representation of the dataset. We mitigate this by balancing our datasets across demographics (gender, race, etc.). The datasets we use at Textio come from large, multinational organizations and are provided by customers who opt to share their data with Textio. Textio datasets include demographic information such as gender, race, and age along with text, so we can do our best to make sure our data is evenly representative of different groups before it’s used to train our models.
2. Protecting against bias of human labelers: Before data can be used for AI training purposes, it must be “labeled” or annotated to show the model how it should interpret the data. An example is labeling a certain phrase in a performance review as personality feedback.
The problem is that the people labeling data can have biases of their own. We guard against this by having multiple people annotate the same text and by using a diverse group of experts to label our data. We measure how consistently the annotators agree with each other, corrected for chance agreement, using Cohen’s Kappa statistic. High agreement (in the 75-80% range) indicates better-quality data and helps ensure individual bias is not introduced.
In cases where annotators disagree, a separate annotator breaks the tie before the data is used for model training. If agreement is not at least 75%, we retrain the annotators and re-annotate the data, as sketched below.
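To make these two checks concrete, here is a minimal sketch of how (1) demographic balance and (2) inter-annotator agreement with Cohen’s Kappa might be computed before a batch of labeled data is accepted. The column names, labels, sample data, and 0.75 threshold are illustrative assumptions, not Textio’s actual schema or pipeline.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# (1) Representation: how evenly is this (hypothetical) dataset spread across a demographic field?
dataset = pd.DataFrame({
    "text": [
        "She is a bubbly person",
        "He shipped the release early",
        "They led the planning meeting",
        "She is very analytical",
    ],
    "gender": ["female", "male", "nonbinary", "female"],
})
print(dataset["gender"].value_counts(normalize=True))  # surfaces skew before training

# (2) Agreement: do two annotators label the same phrases the same way?
annotator_a = ["personality", "work_output", "work_output", "personality"]
annotator_b = ["personality", "work_output", "personality", "personality"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
AGREEMENT_THRESHOLD = 0.75  # "at least 75%" per the process described above

if kappa < AGREEMENT_THRESHOLD:
    print(f"kappa={kappa:.2f}: retrain annotators and re-annotate this batch")
else:
    # Acceptable agreement; remaining disagreements go to a tie-break annotator
    disagreements = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
    print(f"kappa={kappa:.2f}: acceptable; {len(disagreements)} items need a tie-break")
```

In practice, a check like this would run per labeling batch, with disagreements routed to a tie-break annotator as described above.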
Measuring bias in classification
We use classification—a machine learning (ML) method—to determine the context of a phrase and whether it should be “flagged” according to what you’re writing (employee feedback, a performance review, a job post, etc.). For example, is “bubbly” talking about a person or a soda in the sentence “She brings a bubbly presence to work”?
We test for bias in our classification models using a common ML metric called the F-score, a statistic that combines how many of the model’s flags are correct with how many correct answers it misses. If we test our model over the same data with different names (“Sally is a bubbly person” and “Bob is a bubbly person”), we should see consistent F-scores. If the F-scores are not consistent, there is bias. When we find bias, we do the following (a sketch of this check appears after the list):
- Analyze to understand the extent and nature of the bias
- Identify the source of the bias, mitigate appropriately, and retest. Some things we consider:
  - Is the data balanced? Do we need more representative data?
  - Do we need to choose a different model? Or do we need to re-train the model with different parameters?
  - Is there bias in the data we use to measure the performance of the model?
- Consider incorporating “in-processing fairness techniques” that influence how the model learns
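As a rough illustration of the name-swap comparison described above, the sketch below scores a stand-in classifier on the same labeled phrases with only the names varied and compares the resulting F-scores. The `classify_phrase` function, name lists, and labels are hypothetical placeholders, not Textio’s model or data.

```python
from sklearn.metrics import f1_score

def classify_phrase(text: str) -> str:
    """Hypothetical stand-in for the classifier under test."""
    return "personality" if "bubbly" in text.lower() else "other"

# The same labeled phrases, with only the name varied
test_set = [
    ("{name} is a bubbly person", "personality"),
    ("{name} shipped the release on time", "other"),
    ("{name} brings a bubbly presence to work", "personality"),
]

names_by_group = {"female_names": ["Sally", "Maria"], "male_names": ["Bob", "Wei"]}

for group, names in names_by_group.items():
    y_true, y_pred = [], []
    for name in names:
        for template, label in test_set:
            y_true.append(label)
            y_pred.append(classify_phrase(template.format(name=name)))
    score = f1_score(y_true, y_pred, pos_label="personality")
    print(f"{group}: F1 = {score:.3f}")
```

Consistent scores across groups are the expected outcome; a gap between them would trigger the investigation steps listed above.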
Measuring bias in generative AI
Textio’s generative AI features take input text or answers to prompts and can write, rewrite, or expand the message. One way we measure bias in these features is to vary names in the input text and test if the generated content is different. For example, we’d test “Rewrite: Sally is a bubbly person” and “Rewrite: Bob is a bubbly person” and compare the results.
The challenge is that generative AI models will give different answers to the same text each time you ask (just ask ChatGPT the same question twice!). So how do we know that the variation in the output we’re seeing is because of demographic biases (age, race, etc.)?
To determine whether the differences are meaningful across demographic groups, we collect generative AI outputs for each variation (for example, male vs. female names) at a large scale. We then run a paired t-test to compare the distribution of words across each of these groups. If there is a significant difference in the language used for one group over another (defined by the p-value, where p < .05), we can confidently say the output of the generative AI model is biased. If so, we would then take the steps below (a rough sketch of the paired comparison follows the list):
- Do a qualitative analysis of the bias to identify the themes and characteristics of the differences
- Iterate on the prompt strategy and add hard-coded rules (if necessary) to correct the behaviors of the AI
- Remeasure
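Here is a minimal sketch of what that paired comparison could look like in practice. The `generate_rewrite` stub, the personality-word metric, the sample prompts, and the sample size are all illustrative assumptions; any per-output measure that pairs naturally across name variants could stand in for the word counts used here.

```python
import random
from scipy.stats import ttest_rel

PERSONALITY_WORDS = {"bubbly", "opinionated", "confident", "analytical"}

def generate_rewrite(prompt: str) -> str:
    """Hypothetical stand-in for a call to the generative model being audited."""
    # A real implementation would send `prompt` to the model; this placeholder
    # returns canned text so the sketch runs end to end.
    return random.choice([
        "They are a bubbly, confident person.",
        "They lead the weekly planning meeting effectively.",
    ])

def personality_word_count(text: str) -> int:
    """One possible per-output metric: how many personality descriptors appear."""
    return sum(word.strip(".,").lower() in PERSONALITY_WORDS for word in text.split())

# Paired prompts: identical except for the name
templates = [
    "Rewrite: {name} is a bubbly person",
    "Rewrite: {name} leads the weekly planning meeting",
]

female_scores, male_scores = [], []
for template in templates:
    for _ in range(100):  # collect outputs at scale, as described above
        female_scores.append(personality_word_count(generate_rewrite(template.format(name="Sally"))))
        male_scores.append(personality_word_count(generate_rewrite(template.format(name="Bob"))))

t_stat, p_value = ttest_rel(female_scores, male_scores)
if p_value < 0.05:
    print(f"Significant difference (p={p_value:.4f}): analyze themes, adjust prompts, remeasure.")
else:
    print(f"No significant difference detected (p={p_value:.4f}).")
```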
For our generative AI models, we also mitigate bias by masking proper names from the input. This neutralizes any potential gender- or race-based biases the model might produce because of someone’s name. In the case above, instead of “Rewrite: Sally is a bubbly person”, we send the input as “Rewrite: PERSON is a bubbly person”. Even after masking, we still run the bias measurement described above to double-check whether our generative AI produces biased outcomes, in case pronouns or other language in the text lead the model to biased outputs.
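One common way to implement this kind of masking is with an off-the-shelf named-entity recognizer. The sketch below uses spaCy’s small English model (`en_core_web_sm`, assumed installed) purely as an illustration, not as a statement of how Textio detects names internally.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def mask_person_names(text: str, mask: str = "PERSON") -> str:
    """Replace detected person names with a neutral placeholder token."""
    doc = nlp(text)
    masked = text
    # Replace entities from the end of the string so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            masked = masked[:ent.start_char] + mask + masked[ent.end_char:]
    return masked

print(mask_person_names("Rewrite: Sally is a bubbly person"))
# Expected output (if the model tags "Sally" as a person): "Rewrite: PERSON is a bubbly person"
```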
Ongoing monitoring and evaluation
We monitor the health of our models and mitigate the potential risk of “drift”—changes in the model’s expected performance—by tracking the appearance and removal rates of “highlights” (which is how phrases are “flagged” in Textio), as well as the acceptance rates for our suggestions.
If we have a feature with a low removal rate, we label a new dataset and evaluate how well the existing model performs over this new data (using the F-score statistic). We can improve the model further by collecting and labeling new data and re-training the model.
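As a simplified illustration of this kind of monitoring, the sketch below computes per-feature removal rates from a hypothetical event log and flags features that fall below an alert threshold. The event schema, feature names, and threshold are assumptions for the example, not Textio’s production telemetry.

```python
from collections import defaultdict

# Hypothetical event log: (feature, highlight_shown, highlight_removed)
events = [
    ("personality_feedback", True, True),
    ("personality_feedback", True, False),
    ("personality_feedback", True, True),
    ("age_coded_language", True, False),
    ("age_coded_language", True, False),
]

REMOVAL_RATE_ALERT = 0.4  # illustrative alert threshold, not a Textio number

shown = defaultdict(int)
removed = defaultdict(int)
for feature, was_shown, was_removed in events:
    if was_shown:
        shown[feature] += 1
        removed[feature] += int(was_removed)

for feature, count in shown.items():
    rate = removed[feature] / count
    if rate < REMOVAL_RATE_ALERT:
        # Low removal rate: label a fresh dataset and re-check the model's F-score on it
        print(f"{feature}: removal rate {rate:.0%} -- queue re-evaluation")
    else:
        print(f"{feature}: removal rate {rate:.0%} -- healthy")
```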
Why Textio is the team to lead responsible AI development
Many people know Textio for our long-time work in bias in language. But even those who know us don’t often realize the unique expertise we have on our team. They also don’t typically see the lengths we go to in terms of representation, inclusion, and belonging—and why DEI on a team creates safer, better products.
I believe Textio is leading the way here as well. If you’re looking to use or build ethical AI, look into the team behind the tool. Here are some specifics on our team:
Expertise: Many Textios are experts in their fields. We have PhDs and master’s degrees in areas like Computational Linguistics, Linguistics and Cognitive Science, Speech and Language Processing, and Information and Data Science. We also have language experts in-house who annotate training data.
Additionally, Textio’s Chief Scientist Emeritus and Co-Founder, Kieran Snyder, has a Ph.D. in Linguistics and Cognitive Science and a long career in exploring language, bias, and technology. She is a member of a working group for the National Institute of Standards and Technology (NIST), focused on setting best practices for “developing and deploying safe, secure and trustworthy AI systems,” as recently ordered by President Biden. She has also advised the Congressional Caucus for AI on national policy.
Representation: We’re a diverse set of people consistently working to hire and develop an even more diverse team. Our executive level is 70% women. We invest in DEIB internally with headcount, programming, training, and of course technology. Just like having diversity in a training dataset matters, having a diversity of perspectives, backgrounds, and experiences on a team is necessary to create the best, safest, smartest products.
Commitment: Textios care deeply about equity, opportunity, and creating fair and thriving workplaces. We’ve made inclusion an organizational principle and created an internal Inclusion Council. Everyone receives ongoing training in DEIB in the workplace on topics like inclusive interviewing. Managers and senior leadership are provided with monthly DEIB educational programming, and the executive team engages in additional deep-dive topics, like working through the Racial Healing Handbook in a facilitated group setting. Textios use our performance management and recruiting products to create unbiased and effective performance feedback and job posts. Fairness and inclusion are baked into every process and decision at Textio, and that informs our approach to AI and everything else.
See the difference responsible AI makes
It’s not easy to develop AI this way. And there are certainly faster ways. But fast and easy are not what powerful tools like today’s AI require. What is required is a deliberate and careful approach. A responsible approach.
Whether you’re a user, builder, or leader, I invite you to join Textio in insisting on responsible AI. With better, safer AI tools, imagine all the good we can do!