Feedback is one of the most powerful mechanisms for improving students’ writing and guiding students’ learning. Effective feedback is personalized, attending to individual students’ cognitive and socio-psychological states. It’s no surprise, then, that a new generation of AI writing tools powered by large language models (LLMs) is rapidly making its way into classrooms. These tools promise to save teachers’ time and provide students with instant, personalized feedback at scale.
But what kind of personalization do these systems actually provide?
In our recent study examining the feedback given by four popular LLMs, we found that when the models are given information about a student’s identity, learning needs, or motivation, the feedback changes in troubling ways, even when the student’s essay stays exactly the same. LLMs systematically shift the substance, tone, and expectations of their feedback according to social stereotypes and assumptions, indicating that these tools are not neutral. They adopt different instructional stances based on who they presume the students to be. We dubbed these instructional orientations “Marked Pedagogies,” and they raise concerns about LLM-generated feedback and about the biases underlying the personalization that makes the technology so promising to educators.
The Experiment
To test how LLMs respond to student characteristics, we used 600 eighth-grade persuasive essays from The Learning Agency’s PERSUADE dataset and asked four widely used language models to generate inline feedback. For each essay, we ran a series of prompts through each model. In the baseline condition, the model was instructed only to give feedback that would help the student revise. In subsequent conditions, the same prompt was supplemented with a short descriptor about a student’s learning needs, identity, or socio-psychological state: low achievement, English language learner designation, learning disability designation, White, Black / African American, Hispanic / Latino, Asian, female, male, motivated, and unmotivated.
In other words, the essays did not change, but the prompts describing the student did.
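The design can be sketched in a few lines of Python. This is only an illustration: the descriptor labels come from the list above, but the prompt wording, the `BASELINE` text, and the function names are hypothetical and are not the prompts used in the study.

```python
# Descriptor labels taken from the article; everything else here is hypothetical.
DESCRIPTORS = [
    "low achievement",
    "English language learner designation",
    "learning disability designation",
    "White",
    "Black / African American",
    "Hispanic / Latino",
    "Asian",
    "female",
    "male",
    "motivated",
    "unmotivated",
]

# Hypothetical baseline instruction (the study's actual wording is not given here).
BASELINE = "Give inline feedback that will help the student revise this essay."


def build_prompt(essay, descriptor=None):
    """Return the prompt for one essay under one condition (None means baseline)."""
    parts = [BASELINE]
    if descriptor is not None:
        # The only thing that varies across conditions is this one line.
        parts.append(f"Student description: {descriptor}.")
    parts.append(f"Essay:\n{essay}")
    return "\n\n".join(parts)


def build_conditions(essays):
    """Yield (essay_index, condition_name, prompt) for every essay and condition."""
    for i, essay in enumerate(essays):
        yield i, "baseline", build_prompt(essay)
        for descriptor in DESCRIPTORS:
            yield i, descriptor, build_prompt(essay, descriptor)
```

Under this layout, each essay produces one baseline prompt plus one prompt per descriptor, so 600 essays and the eleven descriptors listed above would yield 7,200 prompts per model, all sharing identical essay text within each essay's group.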
We then compared the language of feedback across conditions using a log-odds analysis. For each condition, we pooled the generated feedback and counted how often each word appeared. Log-odds identifies which words are unusually characteristic of one condition versus another. Very common words such as “the” or “a” tend to disappear because they occur often in both groups, while words that are especially concentrated in one condition stand out. The highest-scoring words can be interpreted as the words most representative of that condition’s feedback language. In this way, we can describe the words that students would see more often when LLMs are prompted, for example, with the information that they are an English language learner.
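To make the idea concrete, here is a minimal sketch of a smoothed log-odds comparison between two pools of feedback. It is an assumption-laden illustration, not the study's implementation: the simple tokenizer, the add-`alpha` smoothing, and the function names are all choices made for this example, and the actual analysis may use a different weighting or prior.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def log_odds(target_texts, reference_texts, alpha=1.0):
    """Score each word by its smoothed log-odds ratio, target vs. reference.

    Positive scores mark words concentrated in the target pool; scores near
    zero mark words used at similar rates in both pools.
    """
    target = Counter(w for t in target_texts for w in tokenize(t))
    reference = Counter(w for t in reference_texts for w in tokenize(t))
    vocab = set(target) | set(reference)
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    v = len(vocab)

    scores = {}
    for w in vocab:
        # Add-alpha smoothing keeps words absent from one pool finite.
        p_t = (target[w] + alpha) / (n_target + alpha * v)
        p_r = (reference[w] + alpha) / (n_reference + alpha * v)
        scores[w] = math.log(p_t / (1 - p_t)) - math.log(p_r / (1 - p_r))
    return scores


# Toy usage: invented feedback snippets, not data from the study.
marked = ["Check the grammar and the spelling.", "Fix the grammar here."]
baseline = ["The argument is compelling.", "The claim and the evidence are thoughtful."]
scores = log_odds(marked, baseline)
top_words = sorted(scores, key=scores.get, reverse=True)[:5]
```

In the toy example, "the" appears equally often in both pools and scores close to zero, while a word concentrated in one pool, such as "grammar," rises toward the top of the ranking.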
Marked Pedagogies of Disapproval, Language Inadequacy, and Lowered Expectations
For the exact same essay, simply describing the student as high achieving is more likely to lead to feedback that sees the writing as thoughtful, compelling, and full of potential. Describing the student as low achieving or having an English language learner designation is more likely to lead to feedback that sees the writing as unclear, error-prone, and in need of correction. In fact, for students described as English language learners, the LLM-generated feedback focused heavily on grammar, mechanics, and judgements about the student’s grasp of language, at the expense of engaging with the essay’s ideas and arguments.
For students described as having a learning disability, models leaned on simplified vocabulary and a patronizing warmth. The feedback often used “we,” “let’s,” and “try” as if guiding someone through something difficult, and called for writing to become “shorter” and “easier,” as if the student needed simplification rather than substantive writing instruction.
Marked Pedagogies of Cultural Stereotypes, Emotional Connection, and Enthusiasm
When students were described by race or ethnicity, models generated feedback that reflected stereotyped assumptions about identity. Feedback for students described as Black more frequently referenced collective identity, power, and social change. Feedback for students described as Hispanic more often assumed limited English ability and invoked cultural frames around family and responsibility. Feedback for students described as Asian was more likely to emphasize academic responsibility and respect. Across all three attributes, feedback more often called for the writing to sound more “polished.”
Feedback for students described as female was warmer, more personal, and more emotionally framed, with words like “love,” “empathy,” and “wonderful.” Feedback for students described as male, by contrast, was more direct and task-focused, treating the writer as someone capable of handling critique.
For students described as unmotivated, models became cheerleaders, leaning on praise, exclamation points, and inclusive pronouns to soften requests to “try.” All the while, the feedback more often nudged students toward basic task completion rather than refining their arguments or pushing their thinking forward.
Why These Embedded Stereotypes Matter
Good teaching often requires responsiveness to students’ identities, histories, and needs. A teacher who knows a student well might reasonably choose to lead with encouragement, or to focus on language mechanics, or to speak to a student’s cultural background. But a teacher exercises that discretion and professional judgement case by case, drawing on genuine knowledge of the individual. The Marked Pedagogies we find in LLMs, by contrast, are systematic and consistent. LLMs apply these patterns wholesale, across every student who shares a label, with effects that can outweigh the actual quality of the student’s writing.
These findings carry a lesson for how we work with models more broadly. When we deploy LLMs, we typically prompt them to complete a task and produce an artifact. The orientations, values, and beliefs underlying that artifact are left unspecified. The model fills that gap on its own terms, and pedagogy becomes an afterthought. But defining how a model ought to respond to students (and to what end) is as important as defining the next task it can automate. As AI becomes more embedded in writing instruction, developers and educators need stronger ways of evaluating not just whether feedback is personalized, but how it is personalized.
Language models can tailor their responses to students’ personal attributes, but are the pedagogical values driving those results the ones we are willing to accept?
