Now That ChatGPT’s Been Introduced, It’s Time To Fine Tune It

January 29, 2024

Ulrich Boser

It’s been a year since ChatGPT arrived – landing in classrooms and causing leaders to reimagine education. Of course, there was AI before ChatGPT, but much like there was basketball before Michael Jordan, the game hasn’t been the same since.

Following its initial rollout period, researchers from Stanford, Worcester Polytechnic Institute, and the University of Toronto, among others, have been looking at AI models in education. In three recently published studies, they outline what works when it comes to ChatGPT – and what does not.

Notably, the new studies suggest that ChatGPT needs to be trained on education-specific data and targeted to specific contexts if it is to truly improve learning outcomes.

To be sure, ChatGPT can be a boon for students and teachers alike, freeing them from mundane tasks and alleviating inefficiencies. It may herald breakthrough solutions in scalable learning and help mitigate inequities, but it could also exacerbate them. The new research helps sort out which outcome is more likely, digging closely into the implications and execution of ChatGPT for learning technology.

New Kid On the Block

When ChatGPT, an AI-powered large language model (LLM), hit the scene in late 2022, it elicited everything from fear to fervent excitement. Some educators continue to welcome its potential to help streamline and scale lessons, bear the burden of cumbersome administrative tasks, and even engage students with the allure of new tech.

Other educators approached the introduction of ChatGPT with more caution – expressing concerns that the technology could lead to cheating, compromise critical thinking skills, push out erroneous information or even replace the work that teachers do.

More recently, researchers have set out to study the implications of LLMs and AI across a variety of scenarios, subjects, and skill sets, with mixed results.

The first study, released last year by a team at Worcester Polytechnic Institute, is titled Comparing Different Approaches to Generating Mathematics Explanations Using Large Language Models. It looked at whether LLMs could quickly produce explanations for math problems, shortening the time needed to add new math lessons to online learning platforms.

Specifically, the team explored the “possibility of large language models, specifically GPT-3, to write explanations for middle-school mathematics problems, with the goal of eventually using this process to rapidly generate explanations for the mathematics problems of new curricula as they emerge, shortening the time to integrate new curricula into online learning platforms.”

So what happened? The team took two approaches. The first attempted to “summarize the salient advice in tutoring chat logs between students and live tutors. The second approach attempted to generate explanations using few-shot learning from explanations written by teachers for similar mathematics problems.”
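For readers unfamiliar with the term, few-shot learning means showing the model a handful of worked examples in the prompt and then asking it to continue the pattern. The snippet below is only a rough sketch of that idea, not the study's actual method: the function name, the instruction wording, and the example problems are all made up for illustration.

```python
def build_few_shot_prompt(examples, new_problem):
    """Assemble a few-shot prompt from (problem, explanation) pairs.

    Hypothetical helper for illustration; not taken from the study.
    """
    parts = ["Write a short explanation for each middle-school math problem."]
    for problem, explanation in examples:
        parts.append(f"Problem: {problem}\nExplanation: {explanation}")
    # The final entry leaves the explanation blank for the model to complete.
    parts.append(f"Problem: {new_problem}\nExplanation:")
    return "\n\n".join(parts)


# Made-up stand-ins for teacher-written explanations of similar problems.
examples = [
    ("Solve 3x = 12.", "Divide both sides by 3 to get x = 4."),
    ("What is 25% of 80?", "25% is one quarter, and 80 divided by 4 is 20."),
]

prompt = build_few_shot_prompt(examples, "Solve x + 7 = 10.")
print(prompt)
```

The resulting text would then be sent to a language model, which is expected to fill in the final explanation in the style of the examples above it.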

Ultimately, teachers outperformed LLMs. The study’s authors concluded that: “In the future more powerful large language models may be employed, and GPT-3 may still be effective as a tool to augment teachers’ process for writing explanations, rather than as a tool to replace them.”

Meanwhile, researchers at Stanford explored another aspect of LLMs – namely their ability to become good coaches or trainers for teachers. One of the challenges facing teachers is a lack of high-quality coaching – a fundamental component of teacher training and something that requires classroom observation and knowledgeable feedback.

But amid teacher shortages and resource challenges, a majority of teachers don’t have access to such coaching and training expertise. Education leaders are studying to see if ChatGPT can help by becoming a sort of automated teacher coach – essentially providing cost-effective, scalable support for educators via generative AI.

The Stanford research team proposed “three teacher coaching tasks for generative AI: (A) scoring transcript segments based on classroom observation instruments, (B) identifying highlights and missed opportunities for good instructional strategies, and (C) providing actionable suggestions for eliciting more student reasoning.” They recruited expert math teachers to evaluate the zero-shot performance of ChatGPT (that is, its performance without any worked examples in the prompt) on each of the three tasks, using transcripts from elementary math classes.

The “Is ChatGPT a Good Teacher Coach?” study found that while the potential is there and ChatGPT offered relevant suggestions, those suggestions were neither novel nor particularly insightful. In fact, live teachers reached the same conclusions earlier and better, and 82% of the model’s suggestions pointed to places where teachers had already done what the model suggested.

In short, much like the previously discussed middle-school math study, the researchers found that there is significant work to be done to make ChatGPT work in the envisioned capacity.

Finally, last month, researchers at the University of Toronto and Microsoft published the results of their work investigating how exposure to LLM-based explanations affects learning. The study, “Math Education with Large Language Models: Peril or Promise?” included 1,200 participants and sought to capture insights into how large language models might serve as scalable, personalized tutors for students.

In the experiment’s learning phase, participants received practice problems. (Participants were recruited through Amazon’s Mechanical Turk program and included undergraduate students.) The study’s design manipulated two key factors: first, whether participants were required to attempt a math problem before or after seeing the correct answer; and second, whether participants were shown only the answer or were also shown an LLM-generated explanation of it. (All participants were later tested on new questions to assess how well they had learned the underlying concepts.)

This study was more positive than the previous two, and overall, the study’s authors concluded, “We found that LLM-based explanations positively impacted learning relative to seeing only correct answers.”

The authors were still cautious. The benefits were greatest for those who attempted to solve problems on their own first, but the positive trends held even for those who were exposed to LLM answers before attempting the practice problem themselves.

The “exposure to LLM explanations increased the amount people felt they learned and decreased the perceived difficulty of the test problems,” the study’s authors found.

LLMs And The Power of Yet

So what does this all mean? The studies suggest two things. First, ChatGPT can be better than nothing, as was the case in the Toronto study. Second, for ChatGPT to do teacher-level work, it will need to be trained on education-specific datasets.

More broadly, the studies suggest that the excitement over LLMs is warranted, at least for now. Think of the motivational wall poster that many students encounter in school, advocating “The Power of Yet.” Such is the case with ChatGPT and LLMs.

Early research indicates that ChatGPT may not be capable of offering large scale, actionable insights for teachers – yet. Nor can it equitably and honestly fast track online learning programs that help address complex educational challenges – yet. And it may not be capable of providing out-of-the-box teacher training solutions – yet. But the potential is there with more fine tuning, more research, and more time.

In particular, the field needs more datasets to train ChatGPT so that it can perform better. Such datasets will also be key to figuring out issues of equity. Do LLMs perform better for some students than others? Do LLMs have issues of bias?

Just like a sports team working out the kinks and finding its rhythm with a key new player, the initial rollout of ChatGPT and generative AI may be rocky, but big wins are possible, even if they aren’t quite here… yet.

This article first appeared on Forbes.com

Ulrich Boser

CEO, The Learning Agency