Large language models (LLMs) have become increasingly capable of writing essays, answering questions, and even generating images. But a new role has emerged: LLMs acting as judges, often evaluating the outputs of other LLMs. This method, known as LLM-as-a-Judge, leverages advanced language processing to automate evaluations that were once handled exclusively by humans.
The R&D team at The Learning Agency has used this approach and seen both its promise and its pitfalls. This article offers an inside look at how The Learning Agency and others have used LLM-as-a-Judge, where it works, and where caution is warranted.
What Is LLM-As-A-Judge?
LLM-as-a-Judge refers to using language models to evaluate outputs or performance on complex tasks by ranking responses, rating helpfulness, and scoring accuracy. This method can be applied in a range of fields, including education, finance, and law.
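In practice, the judge is often just a carefully prompted model. The minimal sketch below shows one way a pairwise-comparison judge could be set up with the OpenAI Python SDK; the model name, prompt wording, and output format are illustrative assumptions rather than a recommended configuration.

```python
# Minimal sketch of a pairwise LLM-as-a-Judge call (illustrative only).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name, prompt wording, and answer format are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user's question and decide which is more helpful and accurate.
Answer with exactly one of "A", "B", or "Tie", followed by a one-sentence reason.

Question: {question}

Response A: {response_a}

Response B: {response_b}
"""

def judge_pair(question: str, response_a: str, response_b: str) -> str:
    """Ask the judge model to pick the better of two candidate responses."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in whatever you use
        temperature=0,        # keep judgments as repeatable as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response_a=response_a, response_b=response_b)}],
    )
    return completion.choices[0].message.content
```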
Its popularity has grown because it is fast and performs comparably to human evaluators. A 2023 study found that trained LLM judges agreed with human raters over 89 percent of the time, roughly the same rate at which human raters agree with one another. Using LLM agents as judges enables quicker feedback and faster iteration during the early stages of development for LLM-powered applications.
However, the evaluation quality and accuracy can be impacted by variables such as prompt design, biases, model understanding, choice of model, and complexity of the task.
How The Learning Agency And Others Are Using LLM-As-A-Judge
While the promise of LLM-as-a-Judge is compelling, its real-world value is best seen through practical applications. These highlight both its strengths, like accelerating workflows, and its limitations, including inconsistent outputs.
Use Case 1: Career Navigation Project
As part of a broader initiative to develop high-quality datasets for AI-driven career guidance, The Learning Agency explored the use of LLM-as-a-Judge to assist in labeling user-submitted career questions.
Manually labeling large datasets is time-consuming and costly. To accelerate the process and improve scalability, The Learning Agency used SetFit, a prompt-free framework for few-shot fine-tuning of text classifiers. The model was used to assign between one and four career scenario labels (e.g., resume optimization, career advancement) to 200 short, diverse user queries like “What careers can I pursue with a biology degree?”
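For context, the sketch below shows what few-shot, multi-label fine-tuning with SetFit can look like. The base model, label names, training examples, and hyperparameters are illustrative assumptions, and the exact trainer API varies somewhat across setfit versions.

```python
# Illustrative SetFit multi-label sketch; not the project's actual setup.
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Assumed scenario taxonomy; the real project used its own label definitions.
LABELS = ["resume_optimization", "career_advancement", "education_path", "job_search"]

# Tiny hand-labeled set: each query gets a multi-hot vector over LABELS.
train_ds = Dataset.from_dict({
    "text": [
        "What careers can I pursue with a biology degree?",
        "How do I make my resume stand out for data analyst roles?",
    ],
    "label": [
        [0, 0, 1, 1],  # education_path + job_search
        [1, 0, 0, 1],  # resume_optimization + job_search
    ],
})

# A one-vs-rest head lets SetFit assign multiple scenario labels per query.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    multi_target_strategy="one-vs-rest",
)

args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()

preds = model.predict(["Should I get an MBA to move into management?"])
# preds is a 0/1 row per query; map it back to names with LABELS for readability.
```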
However, the results highlighted some key challenges. There was limited agreement between model-generated and human-provided labels, and the model often struggled to apply labels consistently.
That said, the experience also surfaced valuable insights. Clean, well-structured inputs can lead to noticeably better label accuracy, and longer or more context-rich text showed potential to improve performance; for example, including longer, synthetically generated questions yielded a slight gain. Most importantly, the process underscored the value of robust label definitions and high-quality training data, both essential to building meaningful AI evaluation pipelines.
While SetFit showed some utility, a more significant improvement in scenario labeling came from using the data with OpenAI models. A function-calling setup with OpenAI’s GPT-4.1 delivered much stronger labeling performance (macro F1 of 0.76, versus 0.22 for SetFit), and a fine-tuned GPT-4.1-mini model performed better still (macro F1: 0.81; weighted F1: 0.94).
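A rough sketch of the function-calling approach is shown below: the tool schema constrains the model to return one to four labels from a fixed list. The label set, tool name, and model identifier are assumptions for illustration, not the project’s exact configuration.

```python
# Illustrative function-calling labeler using the OpenAI SDK (assumed setup).
import json
from openai import OpenAI

client = OpenAI()

# Assumed label set; the real project used its own scenario taxonomy.
SCENARIOS = ["resume_optimization", "career_advancement", "education_path", "job_search"]

label_tool = {
    "type": "function",
    "function": {
        "name": "assign_scenarios",
        "description": "Assign 1-4 career scenario labels to a user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "labels": {
                    "type": "array",
                    "items": {"type": "string", "enum": SCENARIOS},
                    "minItems": 1,
                    "maxItems": 4,
                },
            },
            "required": ["labels"],
        },
    },
}

def label_query(query: str) -> list[str]:
    """Force the model to answer through the tool so the output is structured."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # assumed; a fine-tuned gpt-4.1-mini did even better here
        messages=[{"role": "user", "content": f"Label this career question: {query}"}],
        tools=[label_tool],
        tool_choice={"type": "function", "function": {"name": "assign_scenarios"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)["labels"]
```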
Overall, fine-tuning OpenAI’s GPT models proved more effective at assigning multiple labels to short questions than fine-tuning SetFit. However, model choice comes with considerations beyond performance: SetFit is open-source and free to use, whereas fine-tuning OpenAI models requires payment, and costs grow with the amount of data.
Ultimately, this project revealed that while LLM-as-a-Judge can help accelerate workflows, it is most effective when tightly coupled with thoughtful human oversight and clean data practices with sufficient context. Even then, its success also depends on the specific model being used and how it is used.
Use Case 2: LEVI Projects
The Learning Engineering Virtual Institute (LEVI) has explored LLM-as-a-Judge in AI-powered math tutoring.
Researcher Thomas Christie, in collaboration with the University of Colorado Boulder, used LLMs to identify instructional “Talk Moves” (conversational strategies that support learning, such as restating or pressing for accuracy) in transcripts of math tutoring sessions. The team found that fine-tuning GPT-3.5-turbo on a small set of labeled examples produced better results than a state-of-the-art RoBERTa-based model. This approach improves not only the speed but also the quality of AI-generated feedback provided to tutors. Christie also used LLMs to evaluate feature additions to LiveHint AI, Carnegie Learning’s interactive math assistant. He used LLM judges to evaluate conversations between LiveHint AI and simulated students, which provided a quality feedback signal during early feature development.
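To make the setup concrete, the sketch below shows the general shape of such a fine-tuning run with the OpenAI API: a small chat-format JSONL file of labeled utterances is uploaded and a fine-tuning job is started. The label names, example utterances, and file path are illustrative assumptions, not the study’s actual data.

```python
# Illustrative fine-tuning sketch; labels and examples are assumptions.
import json
from openai import OpenAI

client = OpenAI()

# Chat-format training examples: each tutor utterance is paired with an
# assumed "talk move" label such as "restating" or "pressing_for_accuracy".
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the tutor utterance with a talk move label."},
        {"role": "user", "content": "So you're saying the denominator stays the same?"},
        {"role": "assistant", "content": "restating"},
    ]},
    {"messages": [
        {"role": "system", "content": "Classify the tutor utterance with a talk move label."},
        {"role": "user", "content": "How do you know that 3/4 is bigger than 2/3?"},
        {"role": "assistant", "content": "pressing_for_accuracy"},
    ]},
]

# Write the examples to a JSONL file (one example per line).
with open("talk_moves_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the file and start the fine-tuning job on gpt-3.5-turbo.
train_file = client.files.create(file=open("talk_moves_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")
print(job.id)
```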
In a related study, The Learning Agency assessed OpenAI’s GPT-4o-mini with rubric-based prompts to evaluate student-tutor interactions. Most outputs were rated positively, but some ratings hinged on irrelevant student behavior, and the rationale for labels was sometimes unclear.
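A simplified sketch of a rubric-based judging prompt of this kind is shown below; the rubric criteria, rating scale, and JSON output format are assumptions for illustration rather than the rubric used in the study.

```python
# Illustrative rubric-based judge; criteria and scale are assumed.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC_PROMPT = """You are evaluating a tutor's response in a math tutoring session.
Rate the response on each criterion as "Excellent", "Good", or "Poor":
1. Mathematical correctness
2. Addresses the student's actual question
3. Encourages the student to reason rather than giving away the answer

Return JSON: {{"ratings": {{"correctness": "...", "relevance": "...", "encouragement": "..."}},
"rationale": "one or two sentences citing specific parts of the exchange"}}

Conversation:
{conversation}
"""

def grade_interaction(conversation: str) -> dict:
    """Ask the judge for rubric ratings plus a rationale grounded in the transcript."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},  # keeps the output parseable
        messages=[{"role": "user", "content": RUBRIC_PROMPT.format(conversation=conversation)}],
    )
    return json.loads(completion.choices[0].message.content)
```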
These findings reinforce a key insight: LLM-as-a-Judge can be effective in evaluating math tutoring systems, but success depends on clear rubrics and validation against human judgments. Careful rubric development is essential, whether the graders are LLMs or humans. Using LLMs doesn’t reduce the need for high-quality rubrics; it reinforces it. The quality of a rubric is often reflected in the inter-rater reliability of those who use it, which highlights the importance of clarity and consistency for both human and AI evaluators.
Use Case 3: External Projects
Researchers across academia and industry are also experimenting with LLM-as-a-Judge in math reasoning and educational feedback.
A 2025 study examined how eight LLMs judged multi-step problem-solving. The models were asked to compare two answers and determine which was correct. The LLM judges often favored answers from higher-quality models, even when those answers were incorrect. This highlights a key limitation: judging model quality is a complex task, and over-reliance on surface-level signals like final-answer accuracy can obscure more fundamental issues, such as flawed reasoning or conceptual misunderstandings. These deeper failures are especially critical to address when the goal is to support students’ mathematical understanding.
A 2024 study used GPT-4 to evaluate human and AI-generated feedback on incorrect answers to math multiple-choice questions. They designed a rubric assessing five aspects: Correct, Revealing, Suggestion, Diagnostic, and Positive. GPT-4’s evaluations aligned with human ratings 76 percent of the time across both AI and human-generated feedback, showing promise for rubric-based assessments.
Together, these studies highlight a recurring theme: while LLMs can approximate human judgment, strong design and validation are key to achieving useful results.
Limitations & Criticism Of LLM-As-A-Judge
Beyond the challenges surfaced in the specific use cases above, additional limitations emerge when using large language models as judges.
One issue is a tendency toward positive evaluations. For instance, as noted earlier by The Learning Agency, GPT-4o-mini frequently rated AI tutor-student interactions as “Excellent”.
Another limitation is the lack of explainability. In The Learning Agency’s evaluations, ChatGPT occasionally offered vague or unsupported justifications when assigning ratings like “Good,” undermining the interpretability and trustworthiness of its assessments.
LLMs also show limited understanding of interaction dynamics. They may misidentify the end of a conversation or misinterpret minimal student prompting. For instance, in The Learning Agency’s AI tutor-student interactions, ChatGPT rated an exchange as “Good” because the tutor did not continue the discussion, even though the exchange had simply ended with the student giving a final answer; the model mistook the lack of follow-up for appropriate support rather than recognizing a conversational cut-off.
Bias and inconsistency also remain concerns. GPT-4 has been shown to favor its own outputs over those from other models, like Claude. LLMs also exhibit position bias, favoring responses based on where they appear in the prompt rather than on their content. These models sometimes lack the very skills they are tasked with evaluating, leading to flawed judgments. In MRBench taxonomy experiments, for example, GPT-4 could often detect that an answer was incorrect but failed to identify the specific conceptual or procedural error, resulting in inaccurate step-level assessments.
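A common mitigation for position bias, sketched below, is to run each pairwise comparison twice with the answer order swapped and accept a winner only when the two verdicts agree; the judge_pair helper is assumed to behave like the one in the earlier sketch.

```python
# Position-bias check: only trust verdicts that survive swapping the order.
# Assumes judge_pair(question, a, b) returns a string beginning "A", "B", or "Tie".
def debiased_judgment(question: str, answer_1: str, answer_2: str) -> str:
    """Run the judge in both orders; keep a winner only if the verdicts agree."""
    first = judge_pair(question, answer_1, answer_2)   # answer_1 shown as "A"
    second = judge_pair(question, answer_2, answer_1)  # order swapped

    if first.startswith("A") and second.startswith("B"):
        return "answer_1"
    if first.startswith("B") and second.startswith("A"):
        return "answer_2"
    return "tie_or_inconsistent"  # position-sensitive verdicts are not trusted
```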
LLM judges are highly sensitive to prompting methods. Different evaluation designs, such as interactive, multi-turn dialogues versus single-step prompts, can produce significantly different results, as noted in a 2025 study. Notably, newer models do not always outperform older ones across all tasks or evaluation types, revealing inconsistencies in progress.
Finally, cost and accessibility must be considered. While a particular model may be higher performing or have reduced limitations, hardware or financial requirements may make the model inaccessible for a project. These requirements, model performance, and model availability are continuously changing, so these aspects should be re-evaluated before each project.
Best Practices For Using LLM-As-A-Judge
To address these limitations, several best practices can help improve the quality and reliability of LLM-based evaluations.
Clear evaluation rubrics and well-defined label criteria are essential, as vague or inconsistent guidelines often lead to unreliable results. As discussed previously, whether evaluations are conducted by humans or LLMs, well-crafted rubrics remain fundamental. High-quality rubrics guide evaluations; without clear guidelines, inter-rater reliability and consistency suffer, and it becomes difficult to trust the quality of the evaluations.
Incorporating human oversight, particularly by validating a sample of model-generated labels through a human-in-the-loop process, helps identify inconsistencies and refine evaluation strategies. While LLM-as-a-Judge methods are particularly useful in early development stages for surfacing broad issues, alignment with expert human review becomes essential in later stages to ensure evaluations meet higher standards of accuracy and reliability.
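One lightweight way to operationalize that oversight, sketched below with scikit-learn, is to draw a random sample of model-labeled items, have a human relabel them, and track agreement metrics over time. The sample size, metrics, and label_fn callback (a stand-in for however the human labels are collected) are illustrative choices.

```python
# Spot-check sketch: compare model labels against a human-relabeled sample.
import random
from sklearn.metrics import cohen_kappa_score, f1_score

def audit_sample(texts, model_labels, label_fn, sample_size=50, seed=0):
    """Relabel a random sample via label_fn (human) and measure agreement."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(texts)), k=min(sample_size, len(texts)))

    human = [label_fn(texts[i]) for i in idx]   # human-provided labels
    model = [model_labels[i] for i in idx]      # model-provided labels

    return {
        "cohen_kappa": cohen_kappa_score(human, model),
        "macro_f1": f1_score(human, model, average="macro"),
    }
```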
Structured, clean, and informative input data also improves outcomes, as LLMs are highly sensitive to the clarity and completeness of the information they’re given. Using benchmark datasets with ground-truth labels further strengthens evaluation quality by anchoring expectations and reducing the drift or variability seen in open-ended assessments.
Prompting should include detailed, specific instructions to reduce ambiguity and variability in model responses. Additionally, requesting explanations alongside outputs can improve transparency and offer insights into model reasoning, helping to demystify what is often viewed as a black-box process.
Following these practices helps mitigate common issues and ensures that LLM-based evaluation contributes meaningfully to the development and fine-tuning process.
Judging With Judgment
As the field pushes the boundaries of assessing model usefulness, it continues to refine how it evaluates, combining the scale of LLMs with the discernment of human judgment. LLM-as-a-Judge could become a powerful tool.
Experience shows it is most effective when paired with clean, detailed, well-structured data and either clear prompting or fine-tuned models. However, further investigation is needed to improve it. The future of trustworthy AI depends not just on what models say, but on how users judge what’s worth saying.
