Managing Subjectivity in Data Annotation: Best Practices for Reliable Likert-Scale Evaluation

The Cutting Ed
  • March 23, 2026
Jules King, L Burleigh, Kennedy Smith

Educational tools and research depend on high-quality datasets to drive innovation and improve outputs. A critical step in building these datasets is data annotation, the process of assigning meaningful labels to data. However, not all data is easy to annotate, especially when labels are influenced by human judgment rather than objective facts. 

The Learning Agency has developed numerous benchmark datasets, with our most recent work conducted in partnership with Renaissance Philanthropy as part of CareerNet. In this project, annotation focused on the quality of career navigation responses along the dimensions of completeness, coherence, and correctness. During this process, we encountered challenges when asking domain experts to annotate responses using Likert scales, as individual judgment inevitably influenced how the criteria were interpreted and applied.

This article explores how subjectivity can affect annotation, particularly when considering Likert scales, and shares best practices that emerged from our experience to help other teams navigate similar challenges.

How Human Judgment and Rating Scales Affect Annotation

Even when a clearly defined rubric is provided, some annotation schemas, such as sentiment, do not always have a single correct answer, and annotators may reasonably assign different codes to the same data point. These constructs are applied through human judgment, based on a conceptual understanding. As a result, variability can emerge not because the labels themselves are poorly specified, but because annotators may interpret the rubric criteria differently for each item or bring different life experiences that shape their judgments. For example, when rating math learning videos for engagement and quality, annotators may interpret criteria such as “engaging” or “high quality” differently and draw on their own perceptions in applying the rubric.

This variability can appear even when annotators are domain experts. While their deep knowledge can produce accurate and reliable annotations, it can also lead to differences in how criteria are applied. For example, some experts may consider additional information they would personally include, while others may apply the rubric more strictly. These differences in interpretation can result in inconsistent labels, even among well-qualified experts acting in good faith, making it challenging to ensure high-quality annotations.

Likert-style scales, used to assign quantitative values to qualitative judgments, can further amplify these differences. By introducing gradations rather than binary decisions, they increase ambiguity, especially when distinctions between scale points are subtle. For instance, a rating of two might correspond to “rarely”, while a three might indicate “occasionally”. Determining what meaningfully distinguishes these points can be difficult, and without shared calibration, annotators may interpret scale values differently, leading to ratings that drift over time or vary across annotators. Even if descriptions are made more distinctive, for example, by assigning numeric values to indicate what it means for something to occur “rarely” versus “occasionally”, human perception and prior experience can still influence scale selections. Some annotators may focus intently on the distinction, while others quickly revert to their own understanding of the scale as they annotate.

Best Practices for Mitigating Subjectivity in Annotation

Despite these challenges, careful design and calibration can help mitigate variability. By providing clear definitions, examples, and opportunities for discussion, teams can align interpretations, improve consistency, and maintain high-quality annotations. The Learning Agency employs several of these strategies in its own annotation work.

1. Develop Clear and Explicit Rubrics

Reducing subjectivity begins with creating detailed and well-defined rubrics, especially when using Likert scales. Effective rubrics clearly outline each criterion and provide concrete examples for each score, helping annotators understand exactly what to look for and which criteria must be met when labeling data. Testing the rubric with domain experts on a small sample of data is especially valuable: it surfaces feedback on clarity, potential overlaps, and suggested refinements, which can then be incorporated before full-scale annotation.
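
As a hypothetical sketch of what such a rubric entry might look like when written down as structured data, the Python snippet below pairs each scale point with a descriptor and a concrete example. The criterion, descriptors, and examples are invented for illustration and are not taken from the CareerNet rubric:

```python
# Hypothetical rubric entry for a single criterion. Every field here
# (the criterion, descriptors, and examples) is illustrative only.
COMPLETENESS_RUBRIC = {
    "criterion": "completeness",
    "question": "Does the response address every part of the prompt?",
    "scale": {
        1: {
            "descriptor": "Addresses none or almost none of the prompt.",
            "example": "Restates the question without answering it.",
        },
        2: {
            "descriptor": "Addresses some parts; major gaps remain.",
            "example": "Names a career path but omits the requested steps.",
        },
        3: {
            "descriptor": "Addresses most parts with minor omissions.",
            "example": "Covers steps and resources but skips timelines.",
        },
        4: {
            "descriptor": "Fully addresses every part of the prompt.",
            "example": "Covers steps, resources, and timelines as asked.",
        },
    },
}
```

Writing the rubric in this explicit form also makes it easier to spot scale points that lack a concrete example or whose descriptors overlap.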

For CareerNet, each rubric used by annotators was first tested by our advisory board, experts in career navigation and machine learning, who provided guidance throughout the process. Their feedback helped ensure that the scales were clear and unambiguous for annotators.

2. Collapse Scales When Needed

If using a Likert scale for annotation, it is important to be intentional about the number of scale points. While 5- or 7-point scales offer a wide range of options and can capture subtle differences in opinion, such as when annotating videos to evaluate their effectiveness, too much granularity can make it difficult for annotators to interpret and apply the scale consistently. For example, distinctions like “strongly agree,” “somewhat agree,” and “agree” may be confusing, and excessive variation can reduce alignment across annotators. In such cases, collapsing the scale to 4 or 3 points can simplify decision-making, reduce ambiguity between adjacent points, and increase consistency in annotations, as the sketch below illustrates.
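
One lightweight way to test whether a collapsed scale improves consistency is to remap existing 5-point labels onto a coarser scale before recomputing agreement. The mapping below is a minimal sketch in Python; the specific merge of adjacent points is one reasonable choice, not a prescribed rule:

```python
# Collapse a 5-point Likert rating to 3 points by merging the two
# "agree" levels and the two "disagree" levels. This mapping is one
# reasonable choice among several, not a fixed rule.
FIVE_TO_THREE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def collapse(ratings):
    """Remap 5-point ratings onto the collapsed 3-point scale."""
    return [FIVE_TO_THREE[r] for r in ratings]

# Two annotators who split on granularity (4 vs. 5, 2 vs. 1)
# agree once the scale is collapsed.
annotator_a = [5, 2, 3, 4]
annotator_b = [4, 1, 3, 5]
print(collapse(annotator_a))  # [3, 1, 2, 3]
print(collapse(annotator_b))  # [3, 1, 2, 3]
```

Comparing agreement before and after collapsing can show whether the extra granularity was carrying real signal or mostly noise.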

3. Have Training Led by ML Experts

Involving machine learning (ML) experts in training can be highly beneficial for certain annotation tasks. For example, for CareerNet, ML experts acted as master coders for questions focused on technology careers. They provided feedback to other annotators and explained how annotations would be used to train large language models (LLMs). Many annotators initially rated responses harshly, comparing them to what they personally would have written or expecting every answer to be “perfect.” The ML experts helped them shift their perspective, emphasizing that each response should be evaluated on its own merits. By understanding that a range of good and bad answers is valuable, annotators could better evaluate the full spectrum of responses, helping LLMs learn what constitutes high- and low-quality answers. When ML experts guide annotators through the rubric, especially for tasks evaluating quality, they can clarify how annotations relate to LLM training, helping keep the annotation process focused and as objective as possible.

4. Develop a Strong Norming Process

A robust norming process is critical when labels are susceptible to subjective interpretation. At The Learning Agency, our approach involves two to three rounds in which annotators label small data samples and assess agreement by comparing their labels against a gold standard created by experienced master coders. Annotators discuss any discrepancies and reflect on how their interpretation of the rubric may have differed. These discussions may lead to refinements in rubric language, which requires updating the gold standard. Norming continues until annotation alignment stabilizes. While exact agreement is not expected, the goal is to achieve annotations that are closely aligned.

It is important to note that when assessing agreement during norming, a single inter-rater reliability metric may not fully capture alignment. In addition to a measure like Cohen’s Kappa, it can be helpful to consider other statistics such as percent exact agreement. Using multiple metrics provides a broader view of consistency, showing both how often annotations match exactly and how closely they align overall. This approach also helps distinguish areas where some variation is acceptable from those that may benefit from further calibration.
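
As a minimal sketch of reporting multiple metrics side by side, assuming scikit-learn is available, the function below compares an annotator’s ratings against gold-standard labels using Cohen’s Kappa, percent exact agreement, and within-one-point agreement (the last is a common complement for ordinal scales, added here as an illustration rather than a measure named above):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_report(annotator, gold):
    """Compare one annotator's Likert ratings against gold-standard
    labels using several complementary metrics."""
    a, g = np.asarray(annotator), np.asarray(gold)
    return {
        # Chance-corrected agreement; weights="linear" gives partial
        # credit for near-misses on an ordinal scale.
        "cohen_kappa": cohen_kappa_score(a, g, weights="linear"),
        # Share of items rated identically.
        "percent_exact": float(np.mean(a == g)),
        # Share of items within one scale point of the gold label.
        "percent_adjacent": float(np.mean(np.abs(a - g) <= 1)),
    }

print(agreement_report([3, 4, 2, 5, 1], [3, 5, 2, 4, 1]))
```

Reading the three numbers together distinguishes an annotator who is consistently one point harsh (high adjacent agreement, lower exact agreement) from one whose ratings scatter unpredictably.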

5. Employ Ongoing Quality Checks

Continuous quality checks are essential to mitigate subjectivity over time. After norming, periodically review annotations for consistency. In CareerNet, annotator agreement was continuously monitored by having annotators complete both double-coded and single-coded sets. Teams can adopt a similar approach, for example by regularly double-coding a subset of items to assess agreement.

If significant drift is detected, pause annotation and conduct refresher discussions to realign the team. Annotators should flag ambiguous cases rather than guessing, and master coders can provide guidance on these instances. This process ensures that quality and consistency are maintained throughout the annotation project.
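
A simple way to operationalize these checks, sketched below assuming scikit-learn is available and that each batch includes a double-coded subset, is to compute agreement per batch and flag any batch that falls below a chosen floor; the 0.6 cutoff is illustrative only:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative agreement floor; an appropriate cutoff depends on
# the task and the expectations the team sets during norming.
KAPPA_FLOOR = 0.6

def check_batch(coder_a, coder_b, batch_id):
    """Check a double-coded batch against the agreement floor and
    flag it for a refresher discussion if it falls short."""
    kappa = cohen_kappa_score(coder_a, coder_b, weights="linear")
    if kappa < KAPPA_FLOOR:
        print(f"{batch_id}: kappa={kappa:.2f} below floor; pause and realign.")
        return False
    print(f"{batch_id}: kappa={kappa:.2f} within range.")
    return True

# Example: run the check on each new double-coded batch.
check_batch([4, 3, 5, 2, 4], [4, 3, 4, 2, 4], "batch-07")
```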

Managing Subjectivity in AI Annotation

Subjectivity is often an unavoidable aspect of educational AI annotation tasks. When annotators rate essays or craft questions, personal opinions can influence quality assessments, and if left unmanaged, this can lead to inconsistencies that compromise dataset quality. For this reason, careful process design is essential.

Expert annotators alone are not sufficient; alignment, training, and thoughtful selection of measurement approaches are equally critical. Using clear rubrics, ML-informed guidance, strong norming processes, and multiple agreement metrics can help reduce inconsistency. While subjectivity cannot be fully eliminated, it can be managed in a way that produces reliable, high-quality datasets.

Jules King

Program Manager

L Burleigh

Data Analyst

Kennedy Smith

Program Associate
