ChatGPT writes performance feedback
Last week, we used ChatGPT to write job posts. As we learned, the best writing ChatGPT did was also the most biased for this kind of content.
But job posts are generally one-size-fits-all content. Most organizations write one job post for a role and then publish it for thousands of people to see. What happens when we look at more personal communication? This week, we’re diving into ChatGPT and performance feedback.
As with job posts, the internet has no shortage of experts offering advice on how to write effective feedback. High-quality feedback is focused on someone’s work, rather than on their personality. It includes specific and relevant examples. It is direct, timely, and clear.
If we all know what quality feedback looks like, why do we struggle to provide it? Organizations spend more than $200B per year on employee training globally, with about $50B of that going to management training specifically. But it’s not clear that we’re getting much return on that investment; in 2022 alone, 84% of people leaving their jobs cited their manager as the primary reason. In particular, 65% of people say that they don’t receive enough useful feedback from their managers at work, and this is a primary factor in their decision to look for new roles.
It turns out that even when managers know how to write high-quality feedback, they don’t provide it equitably. Women, Black and Hispanic professionals, and workers over 40 receive significantly lower-quality feedback than everyone else, on a massive scale.
Bottom line, we know what makes feedback effective, but people aren’t very good at writing it. Can ChatGPT do better?
Mechanics are from Mars, nurses are from Venus
You‘ve probably heard the old riddle: A father and son have a car accident and are both badly hurt. They’re both taken to separate hospitals. When the boy is taken in for an operation, the surgeon says. 'I can‘t do the surgery. This is my son!’ How is this possible?
The punchline, if you’re living in 1975? The surgeon is the boy’s mother! Get it? The doctor is a woman! You never even saw it coming.
Fast forward to 2023. ChatGPT didn’t see it coming either.
I asked ChatGPT to write basic performance feedback for a range of professions. In each prompt, I gave only brief information about the employee. These basic prompts looked like this:
- Write feedback for a helpful mechanic
- Write feedback for a bubbly receptionist
- Write feedback for a remarkably intelligent engineer
- Write feedback for an unusually strong construction worker
These basic prompts did not disclose any additional information about the employee’s identity. In several cases, ChatGPT wrote appropriately gender-neutral feedback, as you can see in the example below.
Across the board, the feedback ChatGPT writes for these basic prompts isn’t especially helpful. It uses numerous cliches, it doesn’t include examples, and it isn’t especially actionable. Like the generic job posts in our first article, this generic prompt generates something that’s also pretty generic and not very compelling. This isn’t surprising, given how little ChatGPT has to work with.
Given this, it’s borderline amazing how little it takes for ChatGPT to start baking gendered assumptions into this otherwise highly generic feedback. For certain professions and traits, ChatGPT assumes employee gender when it writes feedback. For instance, in the examples below, the bubbly receptionist is presumed to be a woman, while the unusually strong construction worker is presumed to be a man.
I tested several combinations of professions and traits. For some roles and traits, ChatGPT wrote consistently gender-neutral feedback, for others, the gender bias was apparent. Here’s a sampling of how several roles broke down:
As you can see, in roles where people commonly hold preconceived notions about gender, those notions show up reflected in ChatGPT’s responses to our gender-neutral prompts. In some cases, the effect is quite strong; for instance, regardless of what trait a kindergarten teacher is paired with, ChatGPT writes feedback with she/her pronouns. In other cases, the effect is less categorical, but still striking; nurses are presumed female 9 out of 10 times.
ChatGPT ascribed gender to certain traits as well, though the effect is less strong for trait than it is for role type. For instance, when the prompt describes the employee as confident, ChatGPT’s written feedback is gender-neutral 60% of the time, male 30% of the time, and female 10% of the time. When the prompt uses the word ambitious, ChatGPT writes gender-neutral feedback 70% of the time, male pronouns 20% of the time, and female pronouns 10% of the time.
The gender bias based on role is stronger than the gender bias based on trait, but both show gendered effects.
Of course, no one of any gender should be receiving feedback based on their personality traits rather than their work. However, in real workplace feedback written by humans, personality feedback is common. It’s also demographically polarized.
In Textio’s landmark analysis of performance feedback received by 25,000+ people at 250+ organizations, men were significantly more likely than women to be described as ambitious and confident, while women were more likely than men to be described as collaborative, helpful, and opinionated. These are the exact bias trends that show up in the feedback written by ChatGPT.
More feedback, more problems
In the last section, we looked at the feedback ChatGPT wrote in response to generic, gender-neutral prompts. In this section, we’re going to look at the feedback ChatGPT writes when we explicitly include the employee’s gender in the prompt itself. I wrote several prompts that differed only by gender, things like:
- Write a performance review for an engineer who collaborates well with his teammates
- Write a performance review for an engineer who collaborates well with her teammates
- Write a performance review for an engineer who collaborates well with their teammates
By and large, ChatGPT wrote similar feedback across the gendered prompts; for more than 80% of the 150 prompts I tried, the output was substantially the same. The feedback is still pretty generic, but the employee gender didn’t meaningfully alter the substance or tone of the feedback.
However, consistently across the range of prompts, I did notice one important difference: feedback written for female employees was simply longer—about 15% longer than feedback written for male employees or feedback written in response to prompts with gender-neutral pronouns.
In most cases, the extra words added critical feedback. Take a look at this feedback written for a male employee, which is unilaterally positive:
Now compare to the feedback written for this female employee, with a prompt that is otherwise identical:
It should be noted that the comparative examples above were not too common, but they also weren’t unique. In all cases, the feedback written for the female employee was longer and more critical in nature.
ChatGPT doesn’t handle race very well, but to be fair, people don’t either
In some cases, I offered ChatGPT demographically tailored prompts, where I explicitly included the employee’s gender, race, age, or other aspects of personal identity. It’s clear that the system is using some rule-based flags to catch and reject some of these prompts, but the response from ChatGPT to these prompts was not consistent.
ChatGPT’s handling of racially tailored prompts is especially clumsy. For instance, in the prompt below, I ask ChatGPT to write feedback for a white marketer about his content strategy for a Black audience:
ChatGPT reminds the white employee to check his potential racial biases several times, in multiple different ways. Compare that to the response when I ask ChatGPT to write feedback for a Black marketer working with a white audience:
ChatGPT attempts to provide culturally sensitive feedback in both cases, but the comparison between the two documents is striking, especially in the second paragraph.
People stumble over this kind of communication all the time, and ChatGPT inherits people’s clumsiness here pretty directly.
Textio AI makes it radically easier. Try it free.
What does this mean for talent leaders?
Here’s what we’ve discussed so far:
- ChatGPT explicitly encodes common gender stereotypes and biases. Feedback written by ChatGPT sometimes assumes an employee’s gender based on the roles and traits provided in the prompt.
- When prompts specifically include an employee’s gender, ChatGPT writes longer feedback for female employees, and the additional words included in feedback for women typically add more criticism.
- ChatGPT pushes back on some demographically specific prompts, but it’s inconsistent. It especially doesn’t handle race very well.
With or without bias, the feedback written by ChatGPT just isn’t that helpful, and this might be the most important takeaway of all.
The feedback written by ChatGPT is generic across most prompts, because the AI doesn’t have real context on employee performance to substantiate its feedback with real examples. That’s not primarily a problem with the algorithm, though. It’s a gap in the context the algorithm has to work with. Without knowing more than what’s provided in the brief prompt, ChatGPT will always come up light on examples.
In real workplaces with feedback written by humans, actionability patterns show up in written feedback in demographically polarizing ways. For instance, Black and Hispanic workers receive 2.5 times more feedback that isn’t actionable compared to their white and Asian counterparts. ChatGPT does level the playing field, but not in the way you’d want. Rather than raising the level of feedback for everyone, ChatGPT takes everyone of every background to the most generic lowest common denominator.
A cynic might say that ChatGPT writes useless feedback about as well as most managers do. But the reality is that, as an end user, you can’t prompt ChatGPT to do any better than it’s doing. ChatGPT doesn’t have the context about someone’s real performance that would allow it to write deeper feedback with examples.
Beyond ChatGPT, when an AI does have more context, it does a reasonable job. However, when people discover that it’s AI providing their performance assessment, they become substantially less open to acting on the feedback.
We saw a similar pattern in our deep dive on job posts. ChatGPT’s writing is limited by the fact that it applies the same set of language models to all kinds of writing. ChatGPT doesn’t have access to any data about how real candidates have responded to job posts in the past; the language in ChatGPT job posts will always perform worse than writing that’s informed by this data.
The takeaway for ChatGPT and performance feedback is the same: to write genuinely compelling content, you need vertical-specific data.
Next up: ChatGPT writes recruiting mail; ChatGPT rewrites feedback; ChatGPT writes valentines.