ARC-AGI-3 vs Human-Grade Interaction
Why stronger capability benchmarks still leave the interaction layer unsolved — and where AVA fits
ARC-AGI-3 tests whether agents can learn efficiently inside unfamiliar environments, but it does not test whether AI systems can communicate with humans in grounded, proportionate, useful ways. This piece separates capability benchmarks from human-grade interaction and shows where AVA fits at the exchange layer.
ARC-AGI-3 is interesting because it changes the object being tested.
Instead of asking a system to answer a static prompt, it places an agent inside unfamiliar, game-like environments where it has to figure out what is going on: which actions are available, what changes when it acts, what hidden rules organize the environment, and what success even means when no explicit win condition is supplied.
To perform well, the agent has to explore, infer rules, form hypotheses, choose actions, learn from results, and improve over time. Success depends on learning efficiency, not only eventual completion. That makes ARC-AGI-3 a test of adaptive agentic capability: perception, action, planning, memory, goal acquisition, and feedback loops under uncertainty.
That clean design is a strength. By bracketing cultural knowledge, verbal explanation, and ordinary conversational polish, ARC-AGI-3 becomes easier to interpret as a benchmark for learning inside novel environments. The same boundary also marks what it cannot tell us: whether a more capable AI system communicates in a grounded, proportionate, useful way when a person is trying to get something done.
Capability and coherence are related, but they live at different layers. The distinction becomes practical as stronger models keep arriving, because most teams don’t need to settle the definition of AGI to notice where improved capability still leaves users carrying extra work.
Human-grade interaction is the exchange layer
Human-grade interaction means an AI system can move through a conversation in a way people can actually use. The exchange has to interpret the request, keep the task in shape, handle uncertainty plainly, give the right amount of explanation, and reach a useful endpoint.
Some interaction failures come from limited capability, and stronger models reduce part of that burden by tracking more context, using tools better, and adapting more effectively from feedback. Many others come from the way the product organizes the exchange: routing, retrieval policy, source handling, UI constraints, evaluation design, and product incentives. Those are the parts a user actually experiences when raw capability becomes a product.
The distinction already shows up in ordinary AI products. Models can solve difficult math or programming tasks while missing simple human intent; document summaries can blur source material with inference; long, fluent plans can give users more to manage than shorter, bounded answers. The product problem is not always that the model lacks intelligence. Often, the exchange lacks a ruleset for using that intelligence well.
That’s the missing interaction layer. A benchmark may show that a system solves, adapts, explores, or plans; a product still has to decide how that system behaves while communicating with a human. There’s still a gap because the exchange has its own structure, and without rules to govern that structure, capability can arrive as extra work for the user.
More data does not define the exchange
One reason this gap persists is that the training surface itself is not a clean model of coherent interaction.
The public internet contains enormous amounts of human language, but much of it is shaped by incentives that differ from useful exchange. Posts, threads, essays, arguments, advice, and commentary often reward performance, compression, confident framing, and continuation. They teach systems how people sound when they’re explaining, arguing, reacting, or positioning; they don’t automatically teach a system how a good exchange should behave.
More data improves fluency, and more compute improves capability. Better models recognize more patterns and solve harder problems. Those gains still don’t define the rules of conversation: when to ask, when to act, when to support a claim, when to narrow the scope, or when the work is complete. Those rules have to be designed.
The problem grows as systems become more capable. They will operate across more workflows, touch more decisions, summarize more information, and act through more tools. If the exchange itself is underdesigned, users end up with systems powerful enough to do difficult things while still requiring constant human cleanup.
Where AVA fits
AVA is aimed at the interaction layer: a CC0 framework for improving AI behavior at the point where user input becomes model output. It complements model progress by giving teams a practical layer to improve as capability continues to advance.
The simplest way to describe AVA is as a conversational grammar.
A conversational grammar defines how an AI system should move through an exchange: how it understands the request, decides what kind of work is being asked for, grounds what needs support, generates a response, validates that response, and closes once the purpose has been met.
AVA’s core runtime names that sequence as:
Sense → Decide → Retrieve → Generate → Validate → Close.
That grammar is supported by validators for containment, drift, proportion, progression, recursion, language hygiene, and closure. Its purpose is to give teams a way to inspect and improve the behavior of the exchange itself, across different products, voices, and risk thresholds.
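To make the shape of that grammar concrete, here is a minimal sketch of the loop as a pipeline with validator hooks. AVA itself is a framework document rather than a library, so every name below (the Exchange record, the step functions, the specific checks) is an illustrative assumption, not a published API.

```python
# Minimal sketch of an exchange shaped like Sense -> Decide -> Retrieve ->
# Generate -> Validate -> Close. All names are illustrative placeholders,
# not AVA's published API.
from dataclasses import dataclass, field

@dataclass
class Exchange:
    request: str                      # the user's raw input
    intent: str = ""                  # what kind of work is being asked for
    needs_grounding: bool = False     # whether claims require source support
    sources: list[str] = field(default_factory=list)
    draft: str = ""
    issues: list[str] = field(default_factory=list)
    closed: bool = False

def sense(ex: Exchange) -> None:
    # Interpret the request; a real system would classify audience, scope, risk.
    ex.intent = "summary" if "summarize" in ex.request.lower() else "answer"

def decide(ex: Exchange) -> None:
    # Decide what kind of work this is and whether it needs grounding.
    ex.needs_grounding = ex.intent == "summary"

def retrieve(ex: Exchange) -> None:
    # Ground what needs support; here a stub that records what was consulted.
    if ex.needs_grounding:
        ex.sources.append("user-provided document (placeholder)")

def generate(ex: Exchange) -> None:
    # Produce the response from grounded material only.
    ex.draft = f"[{ex.intent}] based on {len(ex.sources)} source(s)."

def validate(ex: Exchange) -> None:
    # Simple stand-ins for proportion and containment checks.
    if len(ex.draft) > 2000:
        ex.issues.append("proportion: response longer than the task warrants")
    if ex.needs_grounding and not ex.sources:
        ex.issues.append("containment: claims without grounding")

def close(ex: Exchange) -> None:
    # Close once the purpose has been met, rather than continuing by default.
    ex.closed = not ex.issues

def run(request: str) -> Exchange:
    ex = Exchange(request=request)
    for step in (sense, decide, retrieve, generate, validate, close):
        step(ex)
    return ex

if __name__ == "__main__":
    result = run("Summarize this for the board.")
    print(result.draft, result.issues, result.closed)
```

The point of the sketch is not the specific checks; it is that each move in the exchange is a named, inspectable step that a team can test and tune independently.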
AVA can be tested at the prompt layer, but durable claims belong in evaluation, product instrumentation, transcript review, and deeper integration where the same checks can be measured against real use.
The overlap with ARC-AGI-3 is real but limited. Both point toward structured loops rather than one-shot generation. ARC-style environments reward systems that perceive, act, test, and revise; AVA applies a related discipline to human-facing communication. From there the domains diverge. ARC-AGI-3 tests action inside hidden environments; AVA shapes the conversation around the action.
What this looks like in practice
Imagine a user asks an AI system, “Summarize this for the board.” A capability-first assistant might produce a long, fluent synthesis immediately. The answer could sound polished while skipping the practical shape of the task:
who the board is,
what decision the summary supports,
what source material can be trusted,
and what kind of ending would actually help the user.
An AVA-shaped exchange treats that shape as part of the work. Before generating, the system has a grammar for deciding what must be understood, supported, compressed, checked, and completed. The difference is not abstract intelligence; it’s whether the system begins producing language immediately or first establishes the terms of the exchange.
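One way to picture "establishing the terms of the exchange" is a pre-generation check against the questions listed above. The field names and questions below are illustrative only; they mirror this example rather than any fixed schema.

```python
# Sketch of a pre-generation check for "Summarize this for the board."
# The required fields mirror the questions above; they are not a real schema.
REQUIRED_SHAPE = {
    "audience": "Who is the board, and what do they already know?",
    "decision": "What decision should this summary support?",
    "sources": "Which source material can be trusted for this?",
    "ending": "What kind of ending would actually help (recommendation, options, status)?",
}

def terms_of_exchange(known: dict[str, str]) -> list[str]:
    """Return the clarifying questions to ask before any text is generated."""
    return [question for key, question in REQUIRED_SHAPE.items() if not known.get(key)]

# A capability-first assistant generates immediately; an exchange with a grammar
# asks these first when the request leaves them undefined.
print(terms_of_exchange({"sources": "Q3 finance pack"}))
```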
Many product failures live at that level.
In support, weak closure creates user burden; in research, early synthesis can turn a useful assistant into a risk surface; in writing tools, plausible text loses value when the user has to fight to control it. Agents become harder to trust when their actions outrun the user’s intended scope, and companion or coaching products become difficult to exit when continuation is treated as care or success.
Those are product signals. AVA turns them into testable behavioral hypotheses: this flow needs a clearer exchange contract, this assistant needs stronger stopping rules, this agent needs better scope detection, this summary mode needs slower movement from source to synthesis.
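As a sketch of what "testable" can mean, here is one of those signals (weak closure) expressed as a simple transcript check. The heuristic and the trigger phrases are assumptions for illustration; a real review would calibrate them against the product's own transcripts.

```python
# Sketch: turn "weak closure creates user burden" into a check over transcripts.
# The phrases below are illustrative, not a validated closure detector.
CONTINUATION_TELLS = (
    "would you like me to",
    "i can also",
    "let me know if you want",
)

def weak_closure(final_assistant_turn: str) -> bool:
    """Flag transcripts whose last assistant turn reopens work instead of closing it."""
    turn = final_assistant_turn.lower()
    return any(tell in turn for tell in CONTINUATION_TELLS)

transcripts = [
    "Here is the two-paragraph board summary you asked for.",
    "Done. Would you like me to also draft talking points, an email, and an agenda?",
]
flagged = [t for t in transcripts if weak_closure(t)]
print(f"{len(flagged)} of {len(transcripts)} transcripts show weak closure")
```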
The first move is small: pick one transcript, flow, or output where the system technically works but still leaves the user carrying extra effort, then identify which part of the exchange created that burden.
The benchmark and the behavior
If the field continues to improve capability through benchmarks alone, AI systems will become more impressive without automatically becoming easier to live or work with. They will solve more tasks, handle more complex environments, plan across longer horizons, and use more tools. That progress raises the stakes of the exchange layer because the cost of incoherent behavior rises with the power of the system.
Capability benchmarks remain useful; they’re incomplete as a guide to human usefulness. A system that learns efficiently in a novel environment has crossed one kind of threshold. Clear, proportionate, reliable communication with people is another. The mistake would be treating the first threshold as proof of the second.
ARC-AGI-3 asks whether an agent can learn its way through a new world.
Human-grade interaction asks whether a system can move through a conversation in a way people can actually use.
Strong AI systems will need both.
The future may arrive with agents that solve unfamiliar environments faster than expected. That still leaves a human question on the table: can the system communicate around that power in ways that make ordinary use clearer instead of harder?
For product teams, the actionable space is the exchange itself: the place where capability becomes behavior a person can use, and where AVA gives teams something to test.
Human-Grade frameworks and tools can be found in this project’s GitHub repository.
AVA can be viewed and downloaded directly from avacovenant.org/AVA.pdf
Related: Where AVA Plugs Into AI Systems — the product-team application path
The ARC-AGI-3 piece explains why capability progress does not automatically solve human-facing interaction. That piece shows where teams can act on the distinction: prompts, product flows, orchestration, evaluation, and governance.
Related: Interaction-Layer Behavior Review Is a Missing AI Business Category — turning the capability gap into a business decision
This piece takes the gap between capability and human-grade interaction and makes it operational. It shows how companies already pay for behavior failures across UX, support, evals, and trust functions, and defines interaction-layer behavior review as a missing category that makes those costs inspectable and actionable.


