Introduction
For many, one of ChatGPT’s most promising educational applications is as an assisted writing feedback tool (AWFT). In other words, ChatGPT could help students hone their writing skills and support teachers in evaluating student work. Many educators already use tools like ChatGPT in the classroom.
However, for ChatGPT to be an effective feedback tool, it must understand the different components of student writing and provide targeted feedback anchored in the organization and development of these components. Research shows that struggling writers can improve when they receive this kind of formative, granular feedback. ChatGPT is less likely to improve writing outcomes by offering only summative feedback, such as holistic evaluations of writing samples or letter grades.
It is unknown how well ChatGPT can provide students with either summative or formative feedback. Thus, we tested the default GPT-4 version of ChatGPT on providing summative and formative feedback on student writing, using two benchmarks for automated writing evaluation: 1) the Hewlett Foundation ASAP dataset, a corpus of student essays that pioneered innovation in automated essay scoring algorithms, and 2) the PERSUADE (Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements) Corpus, a large-scale collection of student essays with annotated discourse elements corresponding to individual components of student writing, along with ratings of how effectively each element supports the student's ideas.
ChatGPT shows human-level performance in holistic scoring using the ASAP dataset as a benchmark, but it struggles with more granular, discourse-level evaluation using the PERSUADE dataset as a benchmark. ChatGPT's greatest challenge was identifying the distinct elements of argumentative writing (e.g., claim, counterclaim, rebuttal, supporting evidence) and breaking an essay into meaningful and coherent segments. ChatGPT also tends to be a lenient grader when focused on smaller writing segments, like discourse elements, rating them at a higher effectiveness level than a human would. This finding isn't entirely surprising because the chatbot comes from a family of generative language models that weren't trained on data representing tasks like essay segmentation or classification. Put differently, generative language models see far fewer annotated essay segments during training than data for their usual use cases (text generation, sentiment analysis, etc.), so they cannot be expected to be as reliable for discourse-level evaluation.
Methods
In assessing ChatGPT's performance on the ASAP benchmark, we tasked the chatbot with assigning holistic and analytic scores to approximately 300 essays sampled from the ASAP dataset and stratified by score. These essays were drawn from the eight ASAP essay prompts, and we interacted with the chatbot via OpenAI's API using its standard or default settings (e.g., temperature). We instructed ChatGPT via few-shot prompting, a method in which a few task-related examples are provided to demonstrate ideal performance and supplement the general knowledge of the language model. We took this approach deliberately to see how well the default version of ChatGPT could perform without special-purpose tuning. The API prompts were also adjusted to accommodate variations in rubrics and score ranges across the essay sets.
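For readers unfamiliar with this setup, the sketch below shows what one such few-shot scoring call could look like using the openai Python client. The rubric text, example essays, and scores are placeholders for illustration, not the study's actual prompts.

```python
# Minimal sketch of few-shot holistic scoring via the OpenAI API.
# The rubric, example essays, and scores below are placeholders, not the study's prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = "Score the following essay holistically on a 1-6 scale according to this rubric: ..."
FEW_SHOT_EXAMPLES = [
    ("Example essay text A ...", "4"),
    ("Example essay text B ...", "2"),
]

def score_essay(essay_text: str) -> str:
    messages = [{"role": "system", "content": RUBRIC}]
    # Few-shot prompting: show the model a few graded examples before the target essay.
    for example_essay, example_score in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_essay})
        messages.append({"role": "assistant", "content": example_score})
    messages.append({"role": "user", "content": essay_text})
    response = client.chat.completions.create(
        model="gpt-4",  # default settings; temperature left at its default value
        messages=messages,
    )
    return response.choices[0].message.content

print(score_essay("Student essay to be scored ..."))
```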
Performance was evaluated using the quadratic weighted kappa, a standard metric for comparing machine-generated holistic scores against human evaluations. This metric was computed for each essay set by comparing the human raters' final score to GPT-4's predictions. Quadratic weighted kappa was also calculated for the machine-generated analytic scores (e.g., organization, sentence fluency, conventions) by comparing GPT-4's predictions with those of the human raters.
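For reference, quadratic weighted kappa can be computed with scikit-learn's cohen_kappa_score; the score lists below are made-up values for illustration, not data from this study.

```python
# Illustrative quadratic weighted kappa computation; the scores are invented examples.
from sklearn.metrics import cohen_kappa_score

human_final_scores = [4, 3, 5, 2, 4, 3, 1, 5]
gpt_scores         = [4, 4, 5, 2, 3, 3, 2, 5]

qwk = cohen_kappa_score(human_final_scores, gpt_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.2f}")
```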
In our assessment of ChatGPT’s performance on the PERSUADE benchmark, we tasked the chatbot with three key assignments:
- Segmenting an essay into distinct rhetorical or argumentative segments.
- Labeling each segment according to a discourse element type (e.g., lead, position, claim, counterclaim, rebuttal, evidence, concluding statement).
- Rating the effectiveness of each discourse element as Ineffective, Adequate, or Effective.
We randomly sampled 300 essays from the PERSUADE dataset for ChatGPT's evaluation and interacted with the chatbot via OpenAI's API. Using a standardized API prompt for each essay, ChatGPT (GPT-4) received a scoring and segmentation rubric along with several examples for each discourse element annotation label. Additionally, the prompt instructed ChatGPT to generate a rationale before assigning a label, following a chain-of-thought prompting method. We assessed accuracy by checking whether the chatbot matched the human-annotated writing elements with at least 50% text overlap relative to both the machine prediction and the ground truth, and whether it correctly identified the discourse type and effectiveness rating.
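The overlap criterion can be illustrated with a small sketch that assumes, for simplicity, that each segment is represented as a range of word indices within the essay; the study's exact matching implementation may differ.

```python
# Sketch of the 50% overlap criterion, assuming segments are word-index ranges.
def is_match(pred: range, truth: range) -> bool:
    """True if the overlap covers at least 50% of BOTH the prediction and the ground truth."""
    overlap = len(set(pred) & set(truth))
    if overlap == 0:
        return False
    return overlap / len(pred) >= 0.5 and overlap / len(truth) >= 0.5

# Example: the predicted segment spans words 10-39, the annotated segment spans words 15-44.
print(is_match(range(10, 40), range(15, 45)))  # True: 25 shared words out of 30 on each side
```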
Findings
On the ASAP benchmark, which examines overall essay quality, ChatGPT performed comparably to a human in holistic scoring. Specifically, as shown in Table 1, the quadratic weighted kappa statistic for the chatbot on final holistic scores ranged from 0.67 to 0.84, indicating an acceptable level of human agreement. However, while ChatGPT generally agreed with human raters in holistic scores, the manner in which ChatGPT's scores differed from those of humans (when they did differ) was not consistent with how the humans' scores differed from each other. Table 2 presents kappa agreement on score differences as another metric, comparing GPT's agreement with each human rater to the agreement between the raters themselves. Note that essay set 2 had two holistic domains: 1) a holistic score considering ideas, voice, organization, and style, and 2) a holistic score considering only language conventions and skills. Overall, there was lower agreement (e.g., 0.13 for essay set 1, domain 1) on the patterns of score differences between ChatGPT and human rater 1 compared to the differences between the human raters. This low agreement indicates it will be difficult to predict how ChatGPT's holistic scoring would diverge from that of a human evaluator.
ChatGPT did not perform as well as a human on analytic scoring of specific traits such as ideas and conventions in the ASAP benchmark. As shown in Tables 3 and 4, kappa scores for GPT ranged from 0.18 to 0.58 on essay set 7's analytic traits and from 0.33 to 0.63 on essay set 8's analytic traits. Across both essay sets, the chatbot demonstrated low agreement on the analytic trait of ideas, and it performed especially poorly when scoring conventions in essay set 7. GPT also demonstrated an inconsistent pattern in how its scores differed from those of human raters, compared to the differences among the human raters themselves, as shown in Tables 5 and 6. Note that combined scores for raters were not available for the individual traits in ASAP essay sets 7 and 8, so comparisons on a "final" holistic score were not possible.
Table 1: Kappa Agreement Between ChatGPT and Human Raters in Holistic Scoring of the ASAP Dataset
| Essay set and domain (holistic) | GPT vs. final score | Rater 1 vs. Rater 2 | GPT vs. Rater 1 | GPT vs. Rater 2 |
| --- | --- | --- | --- | --- |
| Set 1, domain 1 | 0.67 | 0.72 | 0.66 | 0.67 |
| Set 2, domain 1 | 0.84 | 0.81 | 0.84 | 0.78 |
| Set 2, domain 2 | 0.55 | 0.80 | 0.55 | 0.47 |
| Set 3, domain 1 | 0.67 | 0.77 | 0.70 | 0.65 |
| Set 4, domain 1 | 0.73 | 0.85 | 0.71 | 0.71 |
| Set 5, domain 1 | 0.80 | 0.75 | 0.76 | 0.79 |
| Set 6, domain 1 | 0.84 | 0.78 | 0.81 | 0.86 |
| Set 7, domain 1 | 0.66 | 0.72 | 0.57 | 0.64 |
| Set 8, domain 1 | 0.74 | 0.62 | 0.73 | 0.50 |
Table 2: Kappa Agreement on Score Differences Between ChatGPT and Human Raters in Holistic Scoring of the ASAP Dataset
| Essay set and domain (holistic) | (GPT – R1) vs. (R1 – R2) | (GPT – R2) vs. (R1 – R2) |
| --- | --- | --- |
| Set 1, domain 1 | 0.13 | 0.14 |
| Set 2, domain 1 | 0.30 | 0.44 |
| Set 2, domain 2 | 0.15 | 0.15 |
| Set 3, domain 1 | 0.40 | 0.33 |
| Set 4, domain 1 | 0.48 | 0.23 |
| Set 5, domain 1 | 0.52 | 0.26 |
| Set 6, domain 1 | 0.49 | 0.27 |
| Set 7, domain 1 | 0.53 | 0.15 |
| Set 8, domain 1 | 0.23 | 0.56 |
*R1 = Rater 1, R2 = Rater 2
Table 3: Kappa Agreement Between ChatGPT and Human Raters in Analytic Scoring of the ASAP Dataset, Essay Set 7
| Essay set 7 trait | Comparison | Kappa |
| --- | --- | --- |
| Ideas | R1 vs. R2 | 0.70 |
| | R1 vs. GPT | 0.35 |
| | R2 vs. GPT | 0.34 |
| Organization | R1 vs. R2 | 0.58 |
| | R1 vs. GPT | 0.53 |
| | R2 vs. GPT | 0.58 |
| Style | R1 vs. R2 | 0.54 |
| | R1 vs. GPT | 0.55 |
| | R2 vs. GPT | 0.51 |
| Conventions | R1 vs. R2 | 0.57 |
| | R1 vs. GPT | 0.18 |
| | R2 vs. GPT | 0.26 |
*R1 = Rater 1, R2 = Rater 2
Table 4: Kappa Agreement Between ChatGPT and Human Raters in Analytic Scoring of the ASAP Dataset, Essay Set 8
| Essay set 8 trait | Comparison | Kappa |
| --- | --- | --- |
| Ideas and Content | R1 vs. R2 | 0.65 |
| | R1 vs. GPT | 0.55 |
| | R2 vs. GPT | 0.42 |
| Organization | R1 vs. R2 | 0.54 |
| | R1 vs. GPT | 0.63 |
| | R2 vs. GPT | 0.47 |
| Voice | R1 vs. R2 | 0.40 |
| | R1 vs. GPT | 0.61 |
| | R2 vs. GPT | 0.33 |
| Word Choice | R1 vs. R2 | 0.54 |
| | R1 vs. GPT | 0.64 |
| | R2 vs. GPT | 0.37 |
| Sentence Fluency | R1 vs. R2 | 0.58 |
| | R1 vs. GPT | 0.68 |
| | R2 vs. GPT | 0.44 |
| Conventions | R1 vs. R2 | 0.64 |
| | R1 vs. GPT | 0.73 |
| | R2 vs. GPT | 0.54 |
*R1 = Rater 1, R2 = Rater 2
Table 5: Kappa Agreement on Score Differences Between ChatGPT and Human Raters in Analytic Scoring of the ASAP Dataset, Essay Set 7
| Essay set 7 trait | Comparison | Kappa |
| --- | --- | --- |
| Ideas | (R1 – GPT) vs. (R1 – R2) | 0.32 |
| | (R2 – GPT) vs. (R2 – R1) | 0.26 |
| Organization | (R1 – GPT) vs. (R1 – R2) | 0.39 |
| | (R2 – GPT) vs. (R2 – R1) | 0.13 |
| Style | (R1 – GPT) vs. (R1 – R2) | 0.40 |
| | (R2 – GPT) vs. (R2 – R1) | 0.29 |
| Conventions | (R1 – GPT) vs. (R1 – R2) | 0.41 |
| | (R2 – GPT) vs. (R2 – R1) | 0.10 |
*R1 = Rater 1, R2 = Rater 2
Table 6: Kappa Agreement on Score Differences Between ChatGPT and Human Raters in Analytic Scoring of the ASAP Dataset, Essay Set 8
| Essay set 8 trait | Comparison | Kappa |
| --- | --- | --- |
| Ideas and Content | (R1 – GPT) vs. (R1 – R2) | 0.41 |
| | (R2 – GPT) vs. (R2 – R1) | 0.33 |
| Organization | (R1 – GPT) vs. (R1 – R2) | 0.51 |
| | (R2 – GPT) vs. (R2 – R1) | 0.30 |
| Voice | (R1 – GPT) vs. (R1 – R2) | 0.64 |
| | (R2 – GPT) vs. (R2 – R1) | 0.36 |
| Word Choice | (R1 – GPT) vs. (R1 – R2) | 0.55 |
| | (R2 – GPT) vs. (R2 – R1) | 0.18 |
| Sentence Fluency | (R1 – GPT) vs. (R1 – R2) | 0.55 |
| | (R2 – GPT) vs. (R2 – R1) | 0.28 |
| Conventions | (R1 – GPT) vs. (R1 – R2) | 0.56 |
| | (R2 – GPT) vs. (R2 – R1) | 0.22 |
*R1 = Rater 1, R2 = Rater 2
On the PERSUADE benchmark, which measures how well algorithms identify and evaluate elements of argumentative writing, ChatGPT's predictions were poor across the board. Table 7 shows that the chatbot matched only 33% of the annotated writing elements, on average, and Table 8 shows it averaged 52% predictive accuracy across the three effectiveness labels (Ineffective, Adequate, Effective). One major challenge ChatGPT encountered with the PERSUADE benchmark was text segmentation, or breaking the essay into meaningful units. First, ChatGPT struggled to faithfully replicate the original essay text in its outputs because it tended to automatically correct spelling, grammar, formatting, and other elements. As a result, only 80% of ChatGPT's generated segments faithfully reproduced the text of the original segment. This is unsurprising, however, because ChatGPT is based on a generative language model: it is trained to predict and generate text one word at a time, so unintentional omissions or alterations can occur when it replicates text at its default temperature setting.
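One simple way to check this kind of faithfulness is to test whether a generated segment appears verbatim in the original essay once whitespace is normalized. The sketch below illustrates the idea; it is not the study's exact procedure.

```python
# Illustrative faithfulness check: does the generated segment appear verbatim in the essay?
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def reproduces_original(segment: str, essay: str) -> bool:
    return normalize(segment) in normalize(essay)

essay = "My oppinion is that school should start later.  Studies show teens need sleep."
print(reproduces_original("My oppinion is that school should start later.", essay))  # True
print(reproduces_original("My opinion is that school should start later.", essay))   # False: the typo was silently corrected
```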
Table 7: ChatGPT’s Discourse Element Matching Accuracy by Category
| Discourse Element Category | Predictive Accuracy |
| --- | --- |
| Lead | 76% |
| Position | 20% |
| Claim | 19% |
| Counterclaim | 26% |
| Rebuttal | 21% |
| Evidence | 31% |
| Concluding Statement | 79% |
| All | 33% |
Table 8: ChatGPT’s Predictive Accuracy for Discourse Effectiveness
| Discourse Effectiveness Label | Predictive Accuracy |
| --- | --- |
| Ineffective | 31% |
| Adequate | 49% |
| Effective | 74% |
| All | 52% |
ChatGPT also tended to overestimate segment sizes compared to the human annotations. Although there were 2,100 annotated discourse segments to match in the sample, ChatGPT generated only 1,600 predictions, immediately ruling out matches for roughly a quarter of the ground-truth segments. This observation suggests two possible scenarios: (1) ChatGPT may be grouping adjacent writing elements or overestimating segment sizes, and/or (2) ChatGPT may be overly cautious when making predictions. In either case, the model generates a substantial number of predictions but struggles to match them accurately to the human annotations.
Additionally, ChatGPT tended to be an “easy grader” when scoring discourse elements of writing. It often rated writing elements as “Effective” when humans had rated them as “Adequate,” or it rated elements as “Adequate” when humans had assessed them as “Ineffective.” More specifically, it rated 40% of Adequate elements as Effective, and 54% of Ineffective elements as Adequate.
ChatGPT's performance is also significantly lower than that of other language models trained on the PERSUADE dataset, specifically those that won the related Feedback Prize competition series. These winning models matched upwards of 78% of the human-annotated elements, and their effectiveness labels were correct 75% of the time. It's worth noting that the competition-winning solutions did not use GPT-4 but rather DeBERTa, which belongs to the BERT family of Transformer models and is notably different from GPT in its architecture.
Discussion
ChatGPT can perform comparably to a human in assigning a final holistic score for a student essay, but it struggles to identify and evaluate the structural pieces of argumentative writing in our experimental setup. It can assess an essay’s overall quality, but it can’t meaningfully break essays into units that form the student’s argument, such as claims, counterclaims, or supporting evidence.
One likely explanation is that the model performs well on datasets that may have leaked into its pretraining or instruction-tuning data but less well on datasets it has not seen. It can work well in a zero-shot manner when the task resembles one from the datasets it was fine-tuned on, but PERSUADE poses a less common segmentation task on which simple prompting does not perform well. Although performance can likely improve with more sophisticated prompting strategies, fine-tuning, and parameter tuning, the gap between ChatGPT's performance and the PERSUADE baselines is large enough to suggest that foundation models have trouble generalizing to new, less common tasks.
One alternative solution is to combine ChatGPT with BERT-based models, which are designed to analyze and categorize text, in order to improve writing evaluation and feedback systems. For instance, a BERT-based model could handle segmentation and classification while ChatGPT enhances these systems by providing text augmentations, rephrasing, and stylistic improvements. Overall, these distinct families of language models can work powerfully together to provide feedback on student writing and improve learning outcomes.
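As a rough illustration of how such a hybrid might be wired together, the sketch below uses a hypothetical fine-tuned DeBERTa-style token classifier for segmenting and labeling discourse elements, with ChatGPT drafting feedback for each labeled segment. The model checkpoint path and prompt wording are placeholders, not resources from this work.

```python
# Hypothetical hybrid pipeline: a fine-tuned encoder model segments and labels the essay,
# and ChatGPT generates feedback for each labeled segment. The checkpoint path is a placeholder.
from transformers import pipeline
from openai import OpenAI

segmenter = pipeline(
    "token-classification",
    model="path/to/fine-tuned-deberta-discourse-model",  # placeholder checkpoint
    aggregation_strategy="simple",
)
client = OpenAI()

def feedback_for_essay(essay: str) -> list[str]:
    feedback = []
    for segment in segmenter(essay):  # each entry carries an entity_group label and a text span
        label = segment["entity_group"]
        text = essay[segment["start"]:segment["end"]]
        prompt = f"This passage was labeled '{label}'. Suggest one concrete improvement:\n{text}"
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        feedback.append(reply.choices[0].message.content)
    return feedback
```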
– This article was written by Perpetual Baffour, Tor Saxberg, Ralph Abboud, Ulrich Boser, and Scott Crossley.