One-Prompt Test for Coherent AI Behavior
What becomes visible when the same model answers once by default and once under a stronger grammar.
This piece introduces a one-prompt test for conversational coherence: ask one question, have the model answer it twice under different constraints, and compare the results. The essay shows how AVA changes behavior at the interaction layer rather than merely changing tone.
There’s a basic test anyone can run that takes less time than reading an essay about it.
Ask an LLM one question, have it answer twice under different constraints, and compare the two answers.
Left to their defaults, AI models often answer the category of the question instead of the situation of the question. The difference is felt before any explanation describes it.
Here’s the whole test.
Open your language model of choice.
Add AVA.pdf to the chat.
Ask a single question and tell it to answer twice: once as it normally would, and once using the AVA framework.
Ordinary but slightly complex or ambiguous prompts are enough (a scripted version of the comparison is sketched after the examples):
I want to quit my job in 3 months.
I feel stuck and behind in life.
Explain what a blockchain is.
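If you would rather script the comparison than run it in a chat window, the same test is a few lines of code. The sketch below uses the OpenAI Python client as one possible backend; the model name, the AVA.txt path (a plain-text export of AVA.pdf), and the choice to place the framework text in the system message are illustrative assumptions, not part of the framework itself.

```python
# Minimal sketch of the one-prompt test, assuming the OpenAI Python client
# (pip install openai) and a plain-text export of the AVA document.
# The model name and file path are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"   # placeholder; use whichever model you want to test

ava_text = open("AVA.txt", encoding="utf-8").read()  # plain-text version of AVA.pdf
question = "I want to quit my job in 3 months. What should I do?"

def ask(system_prompt: str, user_prompt: str) -> str:
    """Send one question under one set of constraints and return the answer."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

# Answer 1: the model as it normally behaves.
baseline = ask("You are a helpful assistant.", question)

# Answer 2: the same model, the same question, guided by the AVA document.
constrained = ask(
    "Answer the user's question using the framework described below.\n\n" + ava_text,
    question,
)

print("--- default ---\n", baseline)
print("--- with AVA ---\n", constrained)
```

Either route, chat window or script, produces the same raw material: two answers to one question, differing only in the constraints they were generated under.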
The shift appears almost immediately: one answer begins producing familiar language right away; the other seems to settle the shape of the task before it speaks. That difference shows up as cleaner scope, less drift, fewer unsupported moves, and a better sense of where the answer should end.
Most AI answers sound fine on first read: fluent, organized, and usually close enough to the shape of help that the interaction can feel successful even when something important is missing.
You tend to notice the missing piece only when you try to use the answer in real life and realize that it blurred the task, generalized too early, kept going past sufficiency, or handed you a polished bundle of advice before establishing what kind of problem you were actually dealing with.
AVA is a framework for coherent AI behavior that can be wired into the interaction layer, meaning the runtime behavior of the exchange itself.
In its current dictionary/blueprint form, it can operate as a prompt-layer grammar, so a language model can approximate its behavior when guided by the document. It also gives a straightforward way to judge whether that approximation is doing anything useful: hold the model steady, change the constraints, and inspect what becomes visible in the response.
That’s what the sample tests below are for. Read the pairs as model behavior, not just style and tone, and watch whether the system starts acting differently.
“I want to quit my job in 3 months. What should I do?”
ChatGPT-5.4 default (one paragraph)
If you want to quit your job in 3 months, start by building a financial cushion so you can cover expenses for at least 3–6 months. Update your resume, begin networking, and apply for new roles if you plan to transition directly. If you’re switching careers, consider gaining relevant skills or certifications. You should also reduce unnecessary expenses and create a clear exit plan. Finally, try to leave on good terms so you can keep references.
ChatGPT-5.4 + AVA
Three months is a fixed constraint, so the decision depends on what you’re moving toward and what buffer you have.
First, clarify the path:
Are you quitting into another job, a gap, or moving in a different direction (school, business, etc.)?
That determines how much risk the plan can carry.
Second, check the floor:
Cash runway (how many months you can cover)
Any non-negotiables (rent, dependents, insurance)
If the runway is thin, the plan shifts toward securing income before exit. If it’s solid, you can use the three months to prepare the transition itself.
Third, use the timeline intentionally:
Month 1: define target + remove uncertainty (role, direction, constraints)
Month 2: active transition work (applications, skill gap, positioning)
Month 3: finalize next step or lock fallback
If any of those pieces are still unclear, the risk is quitting without a defined next state.
Analysis
The baseline answer moves straight into a familiar bundle of competent suggestions, which is exactly the kind of thing current systems do well. It sounds useful because it’s plausible, organized, and broad enough to fit a whole cluster of nearby situations.
The missing piece is the actual shape of the decision. The AVA-shaped answer treats the three-month timeline as a real constraint, separates scenarios that were collapsed into one sentence, and makes the plan depend on what the person is moving toward and what floor has to be protected first. The result feels more deliberate because the reply behaves as though the system identified the decision before it started handing out advice.
“I feel stuck and behind in life. What should I do?”
ChatGPT-5.4 default (one paragraph)
I’m sorry you’re feeling this way—feeling stuck and behind can be really overwhelming. Try to focus on small, manageable steps rather than everything at once. Set a few realistic goals, build a routine, and celebrate small wins. It can also help to compare yourself less to others and focus on your own path. Talking to someone you trust or a therapist can also make a big difference.
ChatGPT-5.4 + AVA
“Stuck” and “behind” usually mean two different things are getting mixed together: where you are, and what you think the timeline should be.
Start by separating them.
Where you are: what’s actually not moving (career, relationships, health, etc.)
The timeline: what you believe should have happened by now, and compared to whom
If those stay blended, everything feels like failure even when only one area is off.
Next, identify the constraint:
Is this a direction problem (don’t know what to move toward),
a capacity problem (burnout, energy, time),
or a friction problem (you know what to do but aren’t doing it)?
Each one needs a different response. Most people default to “try harder,” which only works for one of the three.
Then make the smallest move that changes state:
not a full reset, just something that breaks the “no movement” pattern in the specific area you identified
If nothing changes after that, the issue is that the constraint was not identified correctly.
Analysis
The same structural issue appears here, but in a more emotional register. The baseline answer acknowledges the feeling, offers reassurance, and then reaches for general-purpose advice before identifying what kind of stuckness is actually being described.
The AVA-shaped answer separates the feeling into parts and turns toward diagnosis: what is not moving, what timeline is being assumed, and what kind of constraint is present. That shift is important because this prompt needs more than comfort; it needs a clearer read on the problem the user is trying to solve.
“Explain what a blockchain is in simple terms.”
ChatGPT-5.4 default (one paragraph)
A blockchain is a type of digital ledger that records transactions across many computers in a way that makes it difficult to change or hack. Each transaction is grouped into a “block,” and these blocks are linked together in a chain, which is where the name comes from. Because the data is decentralized and shared across a network, no single person or organization controls it. Blockchain is commonly used for cryptocurrencies like Bitcoin, but it can also be used for things like supply chain tracking and secure data sharing.
ChatGPT-5.4 + AVA
A blockchain is a shared record of transactions that no single party controls, where new entries are added in a way that’s hard to change later.
Two key ideas:
Blocks: batches of transactions recorded together
Chain: each batch is linked to the previous one, so changing one would require changing everything after it
Because copies of this record are held by many participants, no single owner can quietly edit it, and changes are visible to the group.
That’s the core.
Analysis
This example is lighter and exposes the same behavioral difference in a tighter space. The baseline answer hears “simple terms” and still tries to deliver a fairly complete generic explanation. The language is simplified, yet the answer keeps the default impulse toward completeness: extra use cases and a fuller package than the question requires.
The AVA-shaped answer treats simplicity as a real boundary on the task. It gives the smallest correct explanation that satisfies the request, names the two pieces that matter, then stops. That stopping behavior looks modest on the page, yet it’s one of the clearest signals in the whole test. A system that can continue fluently after the task is complete often will; a system operating under a stronger conversational grammar is more likely to recognize sufficiency and end the answer there.
What the test shows
The one-prompt test shows something narrower and more useful than any grand claim about fixing AI.
When the same model is asked the same question and told to answer once normally and once under AVA, the difference becomes visible on the page. The model begins to act as though the exchange has an internal order, with a task to identify, constraints to respect, and a point at which the answer should stop.
That difference helps clarify what kind of intervention AVA is. Many prompt-layer techniques work at the surface: “don’t hype,” “be a teacher,” “speak in full paragraphs.” They tell the model what voice to use, what role to imitate, or what style to avoid.
AVA works at the level of behavior.
It supplies a conversational grammar for the exchange itself: how to approach the request, how to keep structure, emotion, and performance in proportion, how to avoid drift, and how to recognize sufficiency before the answer turns into continuation for its own sake. That’s why the framework can travel across very different models and prompts without depending on a fixed personality, a narrow domain, or a single preferred tone to help “make it land”.
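To make that concrete, the behavioral dimensions this essay keeps pointing at can be written down as explicit review questions. The checklist below is an illustration only: the labels paraphrase the essay’s own criteria (task identification, respected constraints, proportion, drift, sufficiency) and are not AVA’s terminology or scoring scheme.

```python
# Illustration only: the essay's behavioral criteria restated as explicit checks.
# The labels and wording are this sketch's own, not AVA's vocabulary.
COHERENCE_CHECKS = {
    "task": "Did the answer identify the actual task or decision before advising?",
    "constraints": "Were stated constraints (time, money, scope) treated as real limits?",
    "proportion": "Are structure, emotion, and performance kept in proportion?",
    "drift": "Does the answer stay with the situation, or drift to the category?",
    "sufficiency": "Does the answer stop once the request is satisfied?",
}

def review(label: str, answer: str) -> None:
    """Print an answer followed by the checklist, so two responses can be compared side by side."""
    print(f"=== {label} ===\n{answer}\n")
    for name, question in COHERENCE_CHECKS.items():
        print(f"[{name}] {question}")
```

Applied to the pairs above, the same questions that feel intuitive when reading the answers become repeatable review criteria.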
For anyone with experience using language models in real settings, that contrast alone is enough to show that the intervention is real at the interaction layer.
AVA is public domain and modular: its parts can be freely tested, adopted, rejected, or improved independently, making the framework usable even for teams that do not want to adopt it as a whole.
The choice
People don’t spend their days asking chatbots benchmark questions. They ask for help making decisions, reducing confusion, outlining plans, understanding difficult things, and building something real.
Those are ordinary requests, but they make up the actual workload of everyday AI. When a system drifts, overexplains, blurs the task, or continues past sufficiency, it becomes harder to use than it needs to be.
The AVA framework proposes that coherent model behavior can be specified, tested, and improved at the interaction layer — and it’s usable today by anyone, from machine learning engineers to everyday AI users. Here’s the GitHub repository.
This essay shows what that proposal looks like in practice using the simplest comparison available: the same model, the same prompt, and a different conversational grammar.
The behavioral difference is already visible.
There is still work to be done, but the situation has changed. Continuing to produce incoherent or drift-prone outputs will start to look less like a limitation of the technology and more like a design choice.
Related: The Coherence Receipt Tool — the same difference, read more analytically
The test makes coherence visible through contrast: same question, different behavioral result. This piece stays with that difference a little longer, giving the reader a clearer way to inspect where an answer loses structure, overextends, or reaches closure without enough support underneath.
Related: How AVA Fits Into AI Regulation and Evaluation — where the visible contrast becomes review language
What the prompt test shows quickly at the user level becomes harder to describe once systems need to be assessed in more formal settings. This essay follows that translation directly, showing how drift, weak grounding, and unstable completion can become legible as evaluative categories rather than remaining a vague sense that something felt off.


