The rapid progress of large language models (LLMs) is opening remarkable new doors for developmental science. For decades, researchers studying parent–child interaction have relied on detailed human coding of conversation transcripts; that is, carefully trained observers categorizing utterances based on dimensions such as responsiveness, linguistic complexity, and decontextualized talk. These methods have provided essential insights into how parental language input influences children’s cognitive and language growth. However, they are costly, slow, and limited in the number of dimensions they can realistically capture at once. LLMs now have the potential to change this entirely.
Modern models can analyze natural language transcripts and evaluate conversational quality along dimensions previously out of reach for human coders working at scale. Researchers have demonstrated that LLMs can reliably classify aspects of classroom discourse, approaching (and sometimes matching) human inter-rater reliability. In one recent paper, Jang and colleagues showed that algorithms can detect fine-grained features of parent–child conversations that predict children’s syntactic development, surpassing what traditional frequency-based measures can reveal.
These advances suggest that LLM-based tools could create detailed, multidimensional profiles of parent–child interaction at a fraction of the cost and time required by traditional coding. This has major implications for intervention research, including, for example, the possibility of providing real-time feedback to families.
What We Need to Know About LLMs and Parent–Child Interaction
Enthusiasm about these tools is well warranted, but the field is still in its early days. Several methodological challenges deserve attention.
First, which constructs can LLMs measure well, and which require new approaches? Early evidence suggests that LLMs are relatively strong at identifying surface-level linguistic features (e.g., question types or vocabulary diversity) but less reliable when it comes to relational qualities such as warmth or scaffolding. The path forward likely involves developing construct-specific prompting strategies and combining LLM outputs with other signals (e.g., acoustic features) to capture dimensions that text alone cannot represent.
Second, how sensitive are LLM-derived measures to prompt design, and how can we ensure robustness? Small changes in task framing can lead to significantly different outputs. This presents a practical challenge and an opportunity: systematic A/B testing can help researchers identify which methods produce the most consistent and valid results. The field would benefit from shared protocols and benchmarks, similar to existing validated coding systems that support observational research.
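To make this concrete, the sketch below shows one way such a robustness check might look: the same transcripts are scored under several prompt framings and the resulting ratings are correlated. The prompt wordings and the `score_with_llm` function are hypothetical placeholders, not a validated protocol.

```python
# Sketch of a simple prompt-robustness check: score the same transcripts under
# several prompt framings and quantify how much the ratings agree.
# `score_with_llm` is a hypothetical function wrapping whatever model is used.
import numpy as np
from itertools import combinations

PROMPT_VARIANTS = [
    "Rate the parent's responsiveness to the child on a 1-5 scale.",
    "On a scale of 1-5, how contingently does the parent respond to the child?",
    "Score 1-5: does the parent follow in on the child's focus of attention?",
]

def robustness_check(transcripts, score_with_llm):
    # ratings[i][j] = score of transcript j under prompt variant i
    ratings = np.array([[score_with_llm(p, t) for t in transcripts]
                        for p in PROMPT_VARIANTS])
    # Pairwise correlations between prompt variants; low values flag prompt sensitivity.
    return [np.corrcoef(ratings[i], ratings[j])[0, 1]
            for i, j in combinations(range(len(PROMPT_VARIANTS)), 2)]
```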
Third, how should we approach the diversity of real-world parent–child conversations? LLMs were trained largely on formal written English, which differs sharply from the fragmented, informal speech typical of everyday interactions between caregivers and children across diverse households. Targeted fine-tuning on corpora of naturalistic parent–child speech and systematic evaluation across demographic groups are key next steps.
Case Study: Measuring Young Children’s Creativity through Parent–Child Conversation
One especially promising application, drawn from ongoing work in our research lab, the Behavioral Insights and Parenting Lab at the University of Chicago, illustrates both the power and the methodological demands of this approach: using embedding-based models and LLMs to measure young children’s creativity from parent–child conversation transcripts.
The economic case for measuring creativity
Creativity, broadly defined as the production of new and useful ideas, drives innovation and long-term economic growth and is increasingly seen as vital for the future workforce. Its economic importance also extends to individuals: in a landmark study using the National Child Development Study—a British birth cohort followed from age 7 into their mid-fifties—Gill and Prowse showed that teacher-rated creativity in childhood strongly predicted higher earnings, better occupational status, and greater educational success decades later. These links remained even after accounting for cognitive ability, socioeconomic background, school and parental factors, and teacher evaluation errors. If the ability to explore different ideas during early childhood conversations reflects a fundamental skill that supports later creative success, then naturalistic conversational data might provide a scalable way to assess this important yet hard-to-measure developmental trait.
Operationalizing creativity computationally
Methods for measuring creativity computationally have evolved rapidly, from semantic distance measures based on static and contextual word embeddings to LLM-based approaches.
Semantic distance as a measure of originality
Originality refers to the novelty or unexpectedness of an idea, and a large body of research has shown that semantic distance—the computed cosine distance between vector representations of words (embeddings)—provides a reliable, valid, and efficient proxy for human judgments of originality. Higher semantic distance reflects the ability to connect concepts that are not typically co-activated in semantic memory, a hallmark of creative thinking grounded in associative models of cognition. Early static embedding models (e.g., Word2Vec, GloVe) assign each word a fixed vector and are effective for computing distances between single words—as in Sarrazin (2026), who applies this approach within a novel creativity task to examine different dimensions of creativity—but cannot capture context-dependent shifts in meaning. Models employing BERT (Bidirectional Encoder Representations from Transformers) address this limitation by generating word representations that vary with surrounding text, allowing semantic distance to reflect meaning in context rather than in isolation.
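As a minimal illustration of the semantic distance computation itself, the sketch below uses pre-trained static GloVe vectors loaded through gensim; the model name and word pairs are illustrative choices, not those used in any particular study.

```python
# Minimal sketch: semantic distance as cosine distance between static word embeddings.
# The GloVe model name and the word pairs are illustrative assumptions.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")  # pre-trained static embeddings

def semantic_distance(word_a: str, word_b: str) -> float:
    """Cosine distance (1 - cosine similarity) between two word vectors."""
    a, b = vectors[word_a], vectors[word_b]
    cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine_similarity

# Higher distance = concepts that are less typically co-activated in semantic memory.
print(semantic_distance("lamp", "light"))    # close association -> small distance
print(semantic_distance("lamp", "octopus"))  # remote association -> larger distance
```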
From words to narratives: divergent semantic integration
While semantic distance is effective for scoring short-form responses, applying it to longer narrative text requires a different architecture. Johnson and colleagues addressed this gap with divergent semantic integration (DSI), which averages the semantic distances between all words in a narrative. Across nine studies, DSI explained up to 72 percent of the variance in human creativity ratings at the latent variable level (r = .85), exhibiting strong psychometric properties and generalizability across demographic groups.
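In the same spirit, the sketch below shows the core DSI idea: embed each token of a narrative with a contextual model and average the pairwise cosine distances. It follows the logic described above rather than Johnson and colleagues' exact implementation, and the model choice and token handling are simplifying assumptions.

```python
# Sketch of the divergent semantic integration (DSI) idea: embed each token of a
# narrative with a contextual model, then average pairwise cosine distances.
# Model choice and token handling (subwords, special tokens) are simplifications.
import itertools
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def dsi(narrative: str) -> float:
    inputs = tokenizer(narrative, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # One contextual vector per token (batch size 1).
        token_embeddings = model(**inputs).last_hidden_state[0]
    distances = [
        1.0 - torch.cosine_similarity(a, b, dim=0).item()
        for a, b in itertools.combinations(token_embeddings, 2)
    ]
    return sum(distances) / len(distances)

print(dsi("The lamp argued with the octopus about who owned the moon."))
```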
Beyond semantic distance: Large language models
As powerful as semantic distance approaches are, they remain unsupervised metrics that primarily capture originality. They receive no human feedback and, in conversational contexts, face a conceptual limitation: a child who goes completely off topic would score high, even though creativity requires not only originality but also usefulness—the idea must be appropriate and relevant.
LLMs offer a path beyond this limitation. Luchini and colleagues trained RoBERTa-base on nearly 1,000 stories and predicted human-rated originality at r = .81 on hold-out data, far exceeding DSI’s correlation of r = .34–.53 on the same data; multilingual versions using XLM-RoBERTa achieved r ≥ .72 across all 11 languages tested. A critical open question is whether these advances—developed and validated primarily on standard creativity tasks—transfer to more naturalistic forms of creative language use, such as parent–child conversation. Initial evidence is encouraging: DiStefano and colleagues extended automated scoring to metaphor generation—a more naturalistic, abstract task that requires nonliteral language. Fine-tuned RoBERTa and GPT-2 models predicted human creativity ratings at r = .70–.72, outperforming semantic distance (r = .42).
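For readers curious what such supervised scoring involves, the sketch below outlines how a RoBERTa regression head might be fine-tuned on human creativity ratings. It mirrors the general approach rather than the authors' code; the example stories, ratings, and hyperparameters are placeholders.

```python
# Hedged sketch of fine-tuning RoBERTa to predict continuous creativity ratings,
# in the spirit of the supervised approaches described above (not the authors' code).
# `stories` and `ratings` are placeholder data; hyperparameters are illustrative.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

stories = ["Once upon a time a lamp befriended an octopus.",
           "A placeholder second story."]
ratings = [0.82, 0.35]  # placeholder human originality ratings, rescaled to 0-1

class StoryDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="creativity-roberta", num_train_epochs=3),
    train_dataset=StoryDataset(stories, ratings),
)
trainer.train()
```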
Hadas and colleagues demonstrated that general-purpose LLMs with carefully designed prompts can perform comparably to fine-tuned models on a standard creativity test, and they validated this across children in grades 3–4. These results raise a key question for our context: should we fine-tune a model on human-rated creativity in parent–child conversation, or can a general-purpose LLM deliver comparable accuracy? Fine-tuning can offer superior predictive performance but requires substantial human-rated training data; prompt-based approaches are more flexible but depend heavily on prompt design.
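A prompt-based alternative might look like the sketch below, where a general-purpose model is asked to apply a short creativity rubric and return a numeric score. The rubric wording, model name, and output handling are assumptions for illustration, not the prompts validated by Hadas and colleagues.

```python
# Sketch of prompt-based creativity scoring with a general-purpose LLM.
# The rubric wording, model name, and output format are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are rating the creativity of a child's responses in a parent-child "
    "conversation. Creativity means the ideas are both original (novel, unexpected) "
    "and useful (relevant to the topic at hand). Return a single number from 1 to 5."
)

def rate_creativity(transcript_excerpt: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript_excerpt},
        ],
        temperature=0,  # keep scoring output as stable as possible
    )
    return float(response.choices[0].message.content.strip())
```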
Building an Integrated Pipeline for Parent–Child Creativity
In our work with Chat2Learn (a language intervention we developed to boost parent–child conversational interaction), we are applying computational creativity measures to naturalistic parent–child interaction transcripts in an intervention setting. Our pipeline combines embedding-based measures of originality (semantic distance from conversational prompts) and flexibility (pairwise semantic diversity across a child’s utterances) with LLM-based ratings of overall creativity, capturing both originality and usefulness. We also calculate responsiveness (turn-by-turn semantic similarity between parent and child), decontextualized language use (e.g., past-tense verbs, modal constructions, hypothetical framing), question-asking, and sentiment—all automatically derived from transcripts. Unlike laboratory story-writing tasks, this requires managing the fragmented, overlapping, and often ungrammatical speech of young children. A crucial challenge is the unit of analysis: a child’s creative contribution may develop across multiple turns or appear within a single utterance, and determining the appropriate level of aggregation remains an open question.
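To give a sense of how the embedding-based pieces of such a pipeline can be computed, the sketch below derives originality, flexibility, and responsiveness at the utterance level. The sentence-embedding model, data format, and aggregation choices are assumptions rather than the Chat2Learn pipeline itself, and it sidesteps the unit-of-analysis question by treating each utterance as one unit.

```python
# Illustrative sketch of utterance-level embedding measures like those described above.
# The sentence-embedding model and aggregation choices are assumptions, not the
# Chat2Learn pipeline itself.
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def originality(prompt: str, child_utterances: list[str]) -> float:
    """Mean semantic distance of the child's utterances from the conversational prompt."""
    p = model.encode(prompt)
    return float(np.mean([cosine_distance(p, model.encode(u)) for u in child_utterances]))

def flexibility(child_utterances: list[str]) -> float:
    """Mean pairwise semantic distance across the child's utterances."""
    embs = model.encode(child_utterances)
    return float(np.mean([cosine_distance(a, b) for a, b in combinations(embs, 2)]))

def responsiveness(turns: list[tuple[str, str]]) -> float:
    """Mean turn-by-turn semantic similarity between paired parent and child utterances."""
    sims = [1.0 - cosine_distance(model.encode(p), model.encode(c)) for p, c in turns]
    return float(np.mean(sims))
```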
The convergence of longitudinal evidence on the economic returns to creativity with new computational methods for measuring it in narrative text creates an unusually promising research moment. If childhood creativity reliably predicts lifetime economic outcomes—and if automated tools can now measure creativity at scale from naturalistic conversation—then interventions like Chat2Learn that enhance the quality of parent–child conversational interaction may be targeting a developmental outcome with far-reaching consequences. The work ahead requires shared validation benchmarks, systematic comparisons of prompting strategies, and clarity on where LLMs succeed and where they fall short. But the payoff—dramatically accelerating what we know about how everyday conversations shape children’s development, and how interventions like Chat2Learn can be designed to support them most effectively—makes the investment well worth it.


