Staying current with developments in artificial intelligence (AI) is essential not only for adopting new tools and models, but also for making informed decisions about the datasets and algorithms needed to support them. As AI becomes more capable of acting autonomously, questions about how these systems reason and how that reasoning should be evaluated are becoming more important.
One emerging capability is agentic AI: systems designed to plan, reason, and act toward goals with limited human intervention. These systems are being explored across domains such as healthcare, business, and higher education, often relying on explicit forms of reasoning to guide their decisions. A related development is Large Reasoning Models (LRMs), models trained to break complex problems into smaller, more tractable steps.
Renaissance Philanthropy recently released datasets as part of the CareerNet initiative, in partnership with The Learning Agency. Because career navigation and coaching depend heavily on individualized planning and reasoning, the domain offered a natural opportunity to support the emergence of agentic AI and LRMs. As part of the annotation process, The Learning Agency added reasoning labels to career navigation answers provided by experts. These labels are expected to help models better understand the reasoning used in an expert's response to a student question, ultimately improving the quality and relevance of model-generated answers.
Developing reasoning labels proved surprisingly difficult and surfaced several insights about how reasoning is defined, represented, and evaluated in AI systems.
This article examines agentic AI, how reasoning is understood in this context, and how it can be evaluated in practice.
Understanding Agentic AI and the Role of Reasoning
Agentic AI refers to AI systems that can act autonomously toward a goal. These systems are composed of models, known as agents, that mimic human decision-making and operate across multiple steps to complete tasks. In a system that uses multiple agents, each agent is responsible for a specific task that contributes to the goal.
For example, if a user asks the system to plan a trip to see a Broadway play, the system could coordinate with the user’s schedule while searching for flights, booking a hotel, and purchasing theater tickets. To accomplish such goals, agentic systems perform several core functions, including perception, goal setting, decision-making, execution, learning and adaptation, orchestration, and reasoning.
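As a rough sketch of this orchestration pattern, the Python example below decomposes the trip-planning goal across single-purpose agents. The agent names, stubbed results, and pipeline structure are hypothetical and not drawn from any particular framework.

```python
# A minimal, hypothetical sketch of multi-agent orchestration.
# Each "agent" handles one sub-task; an orchestrator sequences them
# toward the user's goal. Real systems add perception, memory, and
# error recovery on top of this skeleton.

from typing import Callable, Dict, List

def check_calendar(state: Dict) -> Dict:
    state["free_dates"] = ["2025-10-03", "2025-10-10"]  # stubbed result
    return state

def search_flights(state: Dict) -> Dict:
    state["flight"] = f"NYC flight on {state['free_dates'][0]}"
    return state

def book_hotel(state: Dict) -> Dict:
    state["hotel"] = "Hotel near Times Square"
    return state

def buy_tickets(state: Dict) -> Dict:
    state["tickets"] = "2 Broadway tickets"
    return state

# The orchestrator runs agents in order, passing shared state forward.
PIPELINE: List[Callable[[Dict], Dict]] = [
    check_calendar, search_flights, book_hotel, buy_tickets,
]

def plan_trip(goal: str) -> Dict:
    state: Dict = {"goal": goal}
    for agent in PIPELINE:
        state = agent(state)  # each step contributes to the overall goal
    return state

print(plan_trip("See a Broadway play"))
```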
Reasoning plays an important role in this process. It enables the system to extract meaningful insights from the data it receives. The system interprets the user’s request, detects patterns, and considers the broader context to determine appropriate actions. This capability helps agents decide what steps to take based on the situation.
Although agentic AI mimics aspects of human decision-making, the type of reasoning used by these systems differs from how reasoning is typically defined in psychology. This difference can make it challenging to determine how reasoning should be annotated, particularly when developing datasets intended to support or evaluate this capability.
Case Study: Annotating Reasoning in CareerNet
To build on Renaissance Philanthropy’s CareerNet initiative, The Learning Agency applied reasoning categories to career-related questions and answer pairs. The goal was to help models better understand the reasoning used in an expert’s response to a student question to improve models’ answers.
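To make the annotation target concrete, a labeled record might look something like the sketch below. The label names shown are illustrative placeholders, not the actual CareerNet coding scheme.

```python
# Illustrative shape of a reasoning-annotated Q&A record.
# The reasoning labels below are hypothetical examples, not
# the actual CareerNet coding scheme.

record = {
    "question": "Should I take a gap year before applying to nursing programs?",
    "expert_answer": (
        "It depends on your goals. If you use the year to gain clinical "
        "volunteer experience, it can strengthen your application."
    ),
    "reasoning_labels": [
        "conditional reasoning",  # answer hinges on the student's goals
        "causal reasoning",       # experience -> stronger application
    ],
}
```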
However, a challenge emerged early in the process. In our search for experts who could assist with the coding scheme and annotation process, it became clear that definitions of reasoning differed between experts in human cognition (psychologists) and experts in machine learning.
While there is some meaningful overlap between these disciplines, experts in each field often suggested the task was better suited to specialists in the other, indicating that the conceptual frameworks used in human cognition and AI are distinct.
This experience raised broader questions about how reasoning is defined and measured, particularly in the context of agentic AI systems.
Reasoning in Human Cognition vs. Machine Learning
In human psychology, reasoning is defined as a mental process used to draw conclusions from information. There are several types of reasoning, including deductive and inductive reasoning. Reasoning can be influenced by several factors, including prior knowledge, beliefs, biases, and inferences, which can sometimes lead to errors.
In machine learning, reasoning attempts to simulate aspects of human reasoning but is quite different from it. Rather than focusing on internal cognitive processes, reasoning is operationalized through task performance. AI models break tasks down into steps to arrive at the best response to a prompt, based on the data they were trained on or the rules they were provided. For example, an AI agent tasked with recommending a career path may evaluate the user's skills, experiences, and the job market, weigh possible options and outcomes, and select the recommendations most likely to maximize the user's success.
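A toy sketch of what "operationalized through task performance" can mean in practice is shown below: candidate options are scored against user data and the highest-scoring option is returned. The features, weights, and candidate paths are invented purely for illustration.

```python
# Toy sketch: "reasoning" operationalized as scoring candidate
# options against user data. Features and weights are invented
# for illustration only.

user = {"skills": {"writing", "statistics"}, "experience_years": 2}

candidates = [
    {"path": "Data analyst",     "needs": {"statistics", "sql"}, "demand": 0.8},
    {"path": "Technical writer", "needs": {"writing"},           "demand": 0.6},
]

def score(option):
    # Fraction of required skills the user already has, blended with
    # market demand using an arbitrary weighting.
    skill_fit = len(user["skills"] & option["needs"]) / len(option["needs"])
    return 0.7 * skill_fit + 0.3 * option["demand"]

best = max(candidates, key=score)
print(best["path"], round(score(best), 2))  # Technical writer 0.88
```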
Human reasoning, however, can be shaped by the context and may not always be explicitly verbalized. For example, when deciding which route to take home, someone might choose a longer path to avoid traffic or multiple lights, without articulating the intuition and past experiences that guide that decision. As much as AI can mimic the human mind, there are aspects that it may not be able to recreate.
Given these differences, the question emerges: how can we meaningfully evaluate reasoning in AI systems?
How Do We Benchmark Reasoning in AI?
AI benchmarks normally measure how well a model performs on a particular task. These benchmarks function as standardized tests that use the same data and evaluation criteria to compare model performance.
Many benchmarks are built around questions with clearly defined correct answers.
However, this approach may not fully capture how reasoning operates in practice, particularly when multiple steps or multiple reasoning types are required to generate a comprehensive response.
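A minimal sketch of this standardized-test pattern might look like the following, where every model is scored on identical items with the same exact-match rule. The items and the stand-in "model" are placeholders, not any real benchmark.

```python
# Minimal benchmark harness: identical items and an exact-match
# scoring rule applied to every model. Items and the stand-in
# "model" are placeholders for illustration.

BENCHMARK = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "Capital of France?", "answer": "Paris"},
]

def accuracy(model, items):
    correct = sum(model(it["question"]).strip() == it["answer"] for it in items)
    return correct / len(items)

# A stand-in "model" is just a function from question to answer string.
def toy_model(question: str) -> str:
    return {"What is 12 * 7?": "84", "Capital of France?": "Paris"}.get(question, "")

print(accuracy(toy_model, BENCHMARK))  # 1.0
```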
For example, benchmarks often evaluate large language models on mathematical and logical tasks, such as solving equations or following step-by-step procedures. While these benchmarks test reasoning, they primarily evaluate performance within a particular domain, such as mathematics, rather than distinguishing between different types of reasoning. Even within mathematics, problems may rely on deductive, inductive, or other forms of reasoning. Most benchmarks do not explicitly assess whether a system can identify or select the appropriate reasoning strategy, focusing instead on whether the correct answer is produced.
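One way to probe this gap, at least in sketch form, would be to tag each benchmark item with a reasoning type and report accuracy per type rather than a single overall score. The tags, items, and toy model below are hypothetical.

```python
# Extending the harness above: tag each item with a (hypothetical)
# reasoning type and report accuracy per type, rather than one
# overall score that hides which kinds of reasoning a model handles.

from collections import defaultdict

TAGGED = [
    {"q": "All squares are rectangles; this is a square. Is it a rectangle?",
     "a": "yes", "type": "deductive"},
    {"q": "2, 4, 8, 16, ... what comes next?",
     "a": "32", "type": "inductive"},
]

def per_type_accuracy(model, items):
    totals, hits = defaultdict(int), defaultdict(int)
    for it in items:
        totals[it["type"]] += 1
        hits[it["type"]] += model(it["q"]).strip() == it["a"]
    return {t: hits[t] / totals[t] for t in totals}

def toy_model(q: str) -> str:
    return {TAGGED[0]["q"]: "yes", TAGGED[1]["q"]: "32"}.get(q, "")

print(per_type_accuracy(toy_model, TAGGED))
# {'deductive': 1.0, 'inductive': 1.0}
```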
Agentic AI introduces additional challenges for evaluation because these systems are directed toward goals that require multiple steps, which can compound errors or hallucinations, and whose outcomes may vary depending on the user and the context of the request.
Performance is often judged by whether the system produces a correct answer, but that answer depends on planning, adapting to new information, and recovering from errors made along the way, all of which should also be evaluated.
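A back-of-the-envelope calculation illustrates how quickly per-step errors compound: if each step independently succeeds with probability p, an n-step task succeeds end-to-end with probability p^n. Independence is a simplifying assumption here, made only for illustration.

```python
# Back-of-the-envelope: if each step independently succeeds with
# probability p, an n-step agentic task succeeds end-to-end with
# probability p**n. Independence is a simplifying assumption.

for p in (0.99, 0.95, 0.90):
    for n in (5, 10, 20):
        print(f"p={p}, steps={n}: end-to-end success ~ {p**n:.2f}")

# e.g. a 95%-reliable step repeated 20 times succeeds only ~36%
# of the time, which is why error recovery matters for evaluation.
```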
This raises important questions about how benchmarks should be designed to ensure that agentic systems operate reliably and effectively.
Navigating the Challenges of Reasoning in Agentic AI
As agentic AI systems become more common, it is important to develop clear definitions of their reasoning and evaluation methods that capture how effectively they reason. Reasoning is not a concept that is easily defined, and differences between how psychology and machine learning experts view it can create challenges for both dataset development and model evaluation.
The CareerNet efforts demonstrate this clearly: even experts disagreed on what counts as reasoning, and frameworks from different disciplines did not always align. These differences highlight that reasoning in AI is complex and requires careful consideration to ensure that systems perform reliably and meaningfully.

