You're Not Talking to a Model — You're Talking to a Harness

Yanbing Li
Apr 18
12 min read

A working vocabulary for AI: Context, Memory, Tools, Skills, Harness, Agent

By Yanbing Li, Founder of iSterna LLC

Two weeks ago I asked the same question, to the same AI, in two different tools.

In a web chat, I got a generic answer — perfectly fluent, immediately useless, because it had no idea what my codebase looked like.

In a terminal-based agent, the same question triggered the tool to read three of my files, run a search across my repository, and come back with an answer specific to my actual code. Correct, actionable, and five minutes of work done in fifteen seconds.

Same underlying model. Same question. Completely different quality.

If you work in IT, you've probably had this experience too. You use the "same" AI in different places and wonder why it feels like different products. The benchmarks don't explain it. The model version doesn't explain it. The vendor's marketing definitely doesn't explain it.

This short whitepaper offers the vocabulary to explain it — and, more importantly, to make better decisions about which AI tools to use, which to skip, and how to get more out of the ones you've already got.

The core insight

You are not talking to a model. You are talking to a harness that is running a model.

Think of the model as a CPU. Extraordinarily capable at what it does, but on its own it's just silicon. Everything useful about a computer comes from what wraps the CPU: an operating system that schedules work, drivers that let it talk to the world, a filesystem that remembers things, applications that do specific jobs. The same CPU in Linux, Windows, and an embedded RTOS gives you three completely different machines.

The AI equivalent of that wrapping layer is called a harness, the term that's become shorthand for the runtime around a model. Most AI tools — ChatGPT, Claude Code, Cursor, Copilot, your company's internal AI platform — are harnesses around one of a small number of underlying models. The quality of what you experience depends at least as much on the harness as on the model.

Below are the six moving parts inside any modern harness, organized into three natural pairs:

Informational — what the model sees / remembers: Context (now) and Memory (persistent)
Operational — how the model acts: Tools (primitive actions) and Skills (packaged expertise)
Runtime — where the model lives: Harness (the environment) and Agent (the loop)

Thinking in pairs makes the architecture much easier to hold in your head. Learn these six and the whole AI-tool landscape starts to make sense.

1. Context — what the model sees right now

Context is RAM. It's what the CPU is actively holding this instant. Finite. Flushed when the process ends.

When you ask an AI something, the model only sees what the harness assembles for that single call. Four things typically go in:

your message,
the system prompt the harness injects behind the scenes,
any files or tool results the harness pulled in for this call,
and a memory index, if the harness supports memory.

A closer look at the system prompt. This is the block of text the harness sends to the model on every call, before your message — a "hidden config" that shapes behavior before you type a word. It typically bundles the vendor's base instructions ("who this AI is, how to behave"), safety guardrails, the list of available tools and how to call them, environmental facts (current date, user timezone, platform), and any user- or project-level instructions the harness auto-loads. System prompts are often several thousand tokens long, and most harnesses never expose them to users. Two harnesses can build very different system prompts around the same underlying model — which is a big part of why the same model feels different from tool to tool.

Those four things together are context. It's bounded — typically 200,000 tokens for current-generation models, with some reaching a million — and it's the only thing the model can reason over.

Two implications matter:

Every token costs money and displaces other information. Dumping a 20,000-line file into context pushes earlier conversation out. Context engineering — deciding what to load, what to summarize, what to leave out — is a real skill, and it's usually the hidden determinant of answer quality.
The model's answer depends on what's in context, not what exists in the world. If the relevant file isn't loaded, the model will guess — confidently, often plausibly, sometimes disastrously. "Hallucination" is usually a context problem before it's a model problem.

That explains the web-chat vs. terminal-agent example from the opening. Same model, but the terminal agent had the harness to actually read my files and put the right content in context. The web chat didn't.

2. Memory — what persists across sessions

If context is RAM, memory is disk. It's what survives the restart and gets loaded into RAM when a process needs it.

A harness with no memory is a colleague with amnesia. Every session starts from zero: they don't remember your role, your preferences, your project, or the rules you've already taught them. Useful for a one-shot query, exhausting for sustained work.

Modern harnesses typically support several memory layers:

Global user memory — applies to every project (your role, working style, general preferences).
Project memory — scoped to a specific codebase or workspace (what's in flight, why, by when).
Semantic auto-memory — topic-organized notes the harness decides to load when relevant, rather than dumping everything into context.

A closer look at semantic auto-memory. Harnesses that accumulate a lot of memory face an obvious problem: if they dump everything into context every session, context fills up and quality drops. Semantic auto-memory solves this by storing memory as small, topic-organized items and loading only the ones relevant to the current conversation. How "relevance" is determined varies — some harnesses use vector embeddings (finding items semantically close to your query), some use tags or keywords, some maintain an index that updates as you work. The practical effect: you can accumulate hundreds of memory entries over months, but only a handful enter context at any moment, leaving budget for the actual work. It's the difference between a cluttered desk and a good filing cabinet with a smart assistant.

Good uses of memory include durable facts about you ("my team standardizes on Python 3.11"), rules you've established ("never mock the database in integration tests"), and pointers to external systems ("bugs are tracked in the INGEST project in Linear"). Bad uses include anything already in version control (the source is authoritative), ephemeral task state (use a task tracker), and secrets (never).

The test for a mature harness: can it learn something about you this week and apply it next week, without you repeating yourself? If not, it's a harness with no memory, and your productivity will hit a ceiling fast.

3. Tools — the primitive actions a model can take

Tools are the system calls and drivers exposed by the harness. Each one is a primitive action the model can perform: read a file, run a shell command, query a database, call an external API, search the web, spawn a helper process.

A model without tools can only write text. A model with tools can do work. That distinction — text versus work — is the entire line between a chatbot and an agent.

Every agent action decomposes into tool calls. The model emits structured output saying "call read_file with path X," the harness executes it, the result comes back in the next turn, the model reads the result and decides what's next. That cycle — emit a tool call, observe the result, decide — is what makes agentic behavior mechanically possible.

Two things worth knowing:

Tools are where permissions matter. "Can this AI read my files?" is really "does the harness expose a file-read tool, and does the harness gate it behind a permission check?" Every harness chooses a different permissions model — and that choice matters enormously for security, trust, and enterprise adoption.
Tools have schemas, not intentions. A tool has a name, a list of parameters, and a text description. The model decides when to call a tool based on its description. Clear descriptions = correct tool use. Vague descriptions = the model makes reasonable-looking but wrong calls. This is why harness authors obsess over tool schemas — bad schema design is the silent killer of agent reliability.

A pattern teams run into often. An agent keeps calling the wrong tool — instead of fetching customer records, it keeps hitting a generic search endpoint that returns noisy, irrelevant results. The team upgrades to a smarter model. Same problem. They refactor their prompts. Same problem. Someone finally opens the tool definitions and finds that both tools have one-line descriptions like "look up information." The model had no way to distinguish them.

The fix is not a model change. It is a ten-minute edit to the tool descriptions — adding a "Use when..." clause and a "Do NOT use when..." clause to each. The original, cheaper model starts picking correctly on the first call.

This is the shape of most agent reliability problems in practice. Teams spend weeks blaming the model and minutes fixing the schema. The schema was the bug the whole time.

A closer look at MCP (Model Context Protocol). Before MCP, every harness invented its own tool format, so a tool written for one harness couldn't run in another without rework. Your team's Slack integration, database connector, or custom API wrapper had to be rebuilt per tool. MCP standardizes the protocol: a tool is packaged as an "MCP server" that any "MCP client" (the harness) can connect to. Write a tool once, run it in Claude Code today and in a different IDE plugin tomorrow. The practical impact is ecosystem formation — open-source MCP servers already exist for Git, Postgres, Slack, GitHub, filesystems, and many more, so teams can adopt AI tools without re-engineering their toolchain. If you're building a custom agent today, betting on MCP for the tool layer is the cheap long-term move. Think of it as the USB-C of AI tooling.

4. Skills — packaged expertise, invoked by name

Skills are installed applications — packaged capabilities the user or the AI can invoke by name. They sit one level above tools: a tool is a primitive (read a file); a skill is packaged expertise (score a requirement against our quality gates), which under the hood may call several tools, run a specific prompt, and apply domain logic.

Skills don't bring their own tool infrastructure. They orchestrate the same tools — including any MCP-backed ones — that the rest of the harness already exposes. This is why skills are portable across teams: the skill ships the expertise; the harness provides the tools.

You don't re-derive "how to convert PDF to text" every time you need it. You install pdftotext once and call it. Skills work the same way for AI: a named, reusable unit of behavior — instructions plus metadata plus optionally code and resources — bundled so you or the AI can invoke it deterministically.

Common categories:

Analytical skills — scoring, review, classification. ("Score this requirement against our quality gates.")
Extraction skills — pulling structured data out of documents or code.
Review skills — running a specific reviewer persona against a target. ("Review this design from a security perspective.")
Workflow skills — multi-step operations. ("Deploy this branch to staging and verify.")

Take requirement-quality scoring as a concrete example. A skill pack for this might ship four skills — extract, score, review, export — each invoked by a slash command (/req-extract, /req-score, etc.). A team evaluating requirements doesn't have to retrain anyone: they run /req-score against their requirements and get consistent results. That pattern generalizes: every skill you codify is permanent leverage. The cookbook grows; the team doesn't have to start from scratch each time.

Skills are not prompts. A prompt is advice you type once; a skill is a codified capability that can be invoked repeatedly, by you or by another skill. The difference is the same as between a shell one-liner and a packaged CLI tool. Both work, but one compounds.

5. Harness — the runtime around the model

The harness is the operating system itself — what decides which drivers load, what the scheduler prioritizes, what applications run with what permissions, how memory is managed. Same CPU in Linux, Windows, and an embedded RTOS gives you three entirely different machines. Same model in three different harnesses does the same thing.

What a harness decides:

Which tools exist and which are exposed to the model — the harness provisions the tools described in §3 and enforces the permission model around them.
How context is managed — what auto-loads, what gets summarized, how compaction works when the conversation gets long.
What the system prompt looks like — the harness typically injects thousands of tokens of instructions before you ever type a word. Those instructions massively shape behavior.
What memory is persisted — global, per-project, semantic auto-memory, or none at all.
What sub-processes can spawn — can it launch helper agents? Can those helpers run in isolated contexts so noise doesn't pollute the parent?

A few harness categories worth knowing:

Web chat (ChatGPT web, Claude.ai) — polished UI, instant onboarding, good for conversational work. But no local file access, limited shell, usually single-session. Great for a first experiment, limiting once you want real work.
Terminal / CLI agents (Claude Code and similar) — full file access, shell, Git integration, custom tools. Higher learning curve but dramatically more capable for software work.
IDE plugins (Cursor, Copilot, Continue) — inline with your editor, low friction, scoped to editor concerns. Great for code suggestions, limited for cross-tool work.
Programmatic SDKs (Anthropic SDK, OpenAI SDK, Claude Agent SDK) — embeddable, maximally customizable, but you build the UX yourself.
Managed cloud agents — hosted, schedulable, run headless. Good for productionizing an agent; less transparent than local tools.

Here's the insight most people miss: when you evaluate AI tools, don't just compare models. Compare harnesses. What tools do they expose? How do they manage context? What memory do they persist? What permissions model do they use? How do they handle long tasks? These harness decisions usually matter more than a one-point benchmark difference between the underlying models.

A great harness with a mediocre model often beats a raw model with nothing around it. Picking an AI tool is picking an operating system. Don't just compare CPUs. (The CPU analogy is lossy at the edges; the direction isn't.)

6. Agent — a model that loops autonomously toward a goal

An agent is a daemon or service — a goal-oriented loop that keeps running, not a one-shot command. Starts, observes, acts, observes the result, decides what's next, acts again, keeps going until done.

The difference is the same as between a calculator and a junior engineer. A calculator gives one answer per input. A junior engineer takes a task, figures out what to do, does it, checks the result, and keeps going — asking questions or escalating when stuck.

Agents share a few characteristics:

Loop, not single-shot. The model's output includes tool calls (see §3); the harness runs them and feeds results back into the next turn.
Tools are the agent's vocabulary of action. An agent's entire behavior decomposes into: decide which tool, call it, observe the result, decide again. Take tools away and you have a chatbot.
Judgment required. At every step the model decides: done? retry? ask for help? change approach?
Memory plus context management keeps them coherent across long horizons.

In practice, you encounter agents at several layers:

Interactive agent — a human working with an AI assistant on a task. The assistant loops, the human corrects direction, they converge. This is most of what IT teams experience today.
Sub-agents — isolated agent loops a parent spawns for scoped work ("research this codebase," "review this code"). Each sub-agent has its own context window, so noise doesn't pollute the parent.
Scheduled or hosted agents — run on a schedule or triggered by events, often with less human supervision.
Fully autonomous agents — long-running, headless, no human in the loop for routine decisions. Used cautiously — autonomy without oversight is how things break.

A closer look at sub-agents. The parent agent's context has a finite budget. If you ask it to "research this 50-file codebase and summarize what it does," reading every file would blow that budget on a task that only needs the summary. Instead, the parent spawns a sub-agent with its own fresh context, hands it the scoped task, and receives back just the summary. The parent's context stays clean and focused on the bigger task; the raw noise of 50 file reads never pollutes it. The pattern scales: teams routinely spawn research sub-agents for code exploration, review sub-agents for different perspectives (architect, security, performance, UX), and generation sub-agents for writing isolated pieces in parallel. Each sub-agent costs extra model calls, so they are not free — but for any task where the raw work would swamp the parent's context, the math usually favors isolation.

A common confusion: agents are not scripts. A script follows fixed steps; an agent decides its own. Agents are not chatbots either — chatbots respond to turns, agents pursue goals across turns. And agents are not workflows — workflows are fixed DAGs, agents are adaptive loops. The overlap in everyday language is misleading; the technical differences are real.

How the pieces relate

Agent = a running loop inside a harness — a model using tools, drawing on memory, pursuing a task with some degree of autonomy.

Put all of that inside a harness and let the loop run, and you have an agent. Take any of those ingredients away and you have something less — a chatbot, a script, a prompt, an automation.

The ordering matters: the harness is the outer container that makes the loop possible. It's not another ingredient of the agent — it's the runtime in which the agent runs.

Why this vocabulary pays for itself

Before you had these six words, "AI tools are inconsistent" was a mystery. After, it's diagnosable:

Answer feels generic? Almost certainly a context problem — the right information isn't loaded. Check what the harness is showing the model.
You keep re-explaining yourself? You have a memory problem. The harness either doesn't support memory or you haven't used it.
Tool calls misfire or hallucinate parameters? Probably a tool-schema problem — the tool descriptions are vague and the model is guessing. Fix the descriptions.
Team keeps reinventing the same prompts? You have a skills problem. Codify the top three things everyone keeps asking and distribute.
Same AI feels brilliant here and frustrating there? The harness is different. Don't blame the model.
Tool can answer questions but can't do work? It's a chatbot, not an agent. Different tool class.

A useful final takeaway: Pick the right harness. Fill context with care. Build memory that compounds. Expose tools cleanly. Codify expertise as skills. Then — and only then — let the agent run.

What to do tomorrow

One concrete action: open your most-used AI tool and identify which of the six pieces it gives you and which it doesn't. You'll find at least two gaps. Those gaps are where your productivity ceiling is.

If you manage a team evaluating AI tools, use these six concepts as your evaluation rubric instead of a model leaderboard. You'll make better decisions and your team will catch up faster.

And if any of this resonates — especially if you disagree with something — I'd genuinely like to hear it. Where does this break? That's the question that makes the vocabulary sharper.

iSterna, LLC