
AI agents are often marketed as the natural evolution of large language models. The implication is that agents are smarter, more capable, and more useful than a standard chat interface. In reality, many so-called agents fail to add meaningful value and instead introduce friction, duplication, and confusion. To understand why this keeps happening, it helps to clearly separate what modern LLMs already do well from what agents are actually supposed to add.
The term agent itself has been diluted and misused. Truly agentic AI performs actions and may operate autonomously. Much of what is labeled an agent today falls far short of that definition.
What LLMs Already Do Well
There are now many publicly available large language models that nearly anyone can use for free or at very low cost. You have ChatGPT, Gemini, Grok, Claude, Meta’s AI, and Microsoft Copilot, to name a few leaders.
These services feel free, but they are not. Running large models costs real money, and the tradeoff is usually some combination of data reuse, marketing, telemetry, or advertising. In other words, you pay for low-cost AI with your data and your attention. Higher-priced tiers typically reduce advertising and offer stronger privacy guarantees.
Large language models are advancing at an incredible pace. They can reason across a wide range of topics, generate usable code, summarize dense material, and increasingly stay current through web integration. This is now well understood by most users.
Where Privacy Becomes the Real Constraint
By default, public LLMs cannot see your private data. That limitation is intentional and important. LLMs know what they have been trained to know and can extend that knowledge by accessing additional data in response to a prompt. Imagine an experienced expert who also has access to web search and a library of books to consult when personal expertise falls short.
We are increasingly encouraged to give LLMs more visibility into our lives. Email, local files, chat histories, meeting transcripts, browser activity, social media activity, finances, medical records, and more are all being positioned as sources that allow an AI assistant to do a better job and make our lives easier.
In many cases, that is genuinely useful. If an AI can manage your inbox, help pay bills, or flag unusual financial activity, it needs access. But access creates risk. This is not unlike the privacy risk of oversharing on social media.
Imagine a popular author working on an unreleased novel and using AI for editing or research. If that content leaks into model training or grounding data, even indirectly, spoilers could surface in future prompts. Information you share with a public LLM could potentially be surfaced by others in future interactions.
Privacy is not just about hiding prompts or files. It is about resisting the constant incentive to expose more and more context simply because it makes AI more useful.
The Enterprise Reality
In enterprise environments, this problem multiplies. Organizations hold enormous volumes of sensitive data, and employees are quickly becoming accustomed to working with their preferred LLMs. That preference is becoming personal, similar to choosing a browser or operating system.
Employees are nearly addicted to the public LLM of their choice, and unless they are blocked or given a ready alternative, they will use it for work-related tasks and research. Internal information then leaks out through prompts and file uploads.
Blocking AI websites outright is not a realistic solution. Employees will push back, find workarounds, or use personal devices.
The practical answer has been a combination of monitoring, data loss prevention, and safer alternatives. Vendors like Microsoft offer controls that can detect or block sensitive prompts and uploads to public LLMs.
Enterprise licenses for public LLMs typically include stronger privacy commitments and logging, though organizations may be hesitant to pay for these premium offerings.
More importantly, AI is now embedded directly into enterprise software. Microsoft Copilot inside Microsoft 365 and Windows is the clearest example.
These systems operate using on-behalf-of (OBO) access. The OBO model does not read everything you own. It can only access what you already have permission to see, and only when needed to answer a question or perform a task. In practice, this means the LLM can see everything the user can see: sharing a file with a user effectively makes it available to that user's prompts.
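That permission trimming can be sketched as a retrieval filter. The sketch below is a toy model of the idea, not Microsoft's implementation; `Document`, `allowed_users`, and `retrieve_for_user` are all hypothetical names, and the substring match stands in for real semantic search.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A grounding document with a simple access-control list."""
    title: str
    content: str
    allowed_users: set = field(default_factory=set)

def retrieve_for_user(user: str, corpus: list, query: str) -> list:
    """On-behalf-of retrieval: only documents the user can already
    open are eligible to ground the LLM's answer."""
    visible = [d for d in corpus if user in d.allowed_users]
    # Naive relevance filter; a real system would use semantic search.
    return [d for d in visible if query.lower() in d.content.lower()]

corpus = [
    Document("Q3 Budget", "q3 budget numbers ...", {"alice", "bob"}),
    Document("Layoff Plan", "confidential layoff plan ...", {"alice"}),
]

# Bob's prompt can only be grounded on what Bob is permitted to see.
hits = retrieve_for_user("bob", corpus, "budget")
```

The point of the sketch is the ordering: access control runs before grounding, so sharing a file with a user is what makes it visible to that user's prompts.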
Where AI Agents Start to Disappoint
This brings us to agents, and to my main frustration.
I am seeing a growing number of platforms that allow users to create their own agents. Examples include custom GPTs in ChatGPT and agents inside Microsoft Teams.
Most of these agents consist of the same basic components:
- A name and description
- A system prompt that sets tone or behavior
- Optional grounding through files or links
- The ability to display sample prompts
The marketing pitch is that anyone can now “build an agent.” In practice, most of these agents are little more than thin wrappers around the base model.
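To make the "thin wrapper" point concrete, here is roughly what most custom agents amount to. Every name in this sketch is hypothetical, and `base_llm` is a stub standing in for any chat-completion API.

```python
def base_llm(messages: list) -> str:
    """Stub for a real chat-completion API call; here it just echoes
    so the sketch is runnable."""
    return f"[{len(messages)} messages] {messages[-1]['content']}"

def make_agent(name, system_prompt, grounding_files=(), samples=()):
    """A typical 'custom agent': a system prompt, optional grounding
    text, and sample prompts stapled onto the base model."""
    context = "\n".join(open(f).read() for f in grounding_files)

    def agent(user_prompt: str) -> str:
        # The agent is just the base model with a prepended prompt.
        return base_llm([
            {"role": "system", "content": system_prompt + "\n" + context},
            {"role": "user", "content": user_prompt},
        ])

    agent.name, agent.samples = name, samples
    return agent

hr_bot = make_agent("HR Bot", "Be friendly.", samples=("How do I file PTO?",))
```

Everything after the `system` message is the base model doing all the work, which is exactly why such agents rarely justify the context switch.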
A system prompt can shape behavior to a degree. You can make the model sound friendlier, restrict topics, or adopt a specific persona.
Grounding with files can help, but only if that data is unique, authoritative, and not already known to the base model. Too often, agents are grounded on generic documents or publicly available information that the LLM already understands or can infer.
Sample prompts help users know what to ask, but they do not add capability.
The Core Failure Mode
The fundamental problem is simple. If an agent does not do something meaningfully better than the base LLM, it should not exist. Not just marginally better. It needs to be far better to justify using it instead of the base model.
Using a custom agent introduces friction. The user has to find it, bookmark it, remember it, and intentionally switch contexts. In many cases, it is faster and more effective to interact directly with the base model.
For an agent to be worth remembering, it must add real value. That usually means one of two things:
- Access to unique data the base model cannot see
- The ability to take actions the base model cannot perform
In enterprise systems with on-behalf-of access, this bar is even higher. These environments already combine the power of a public LLM with all of the user's private and company data. For example, a custom agent for expense reporting adds nothing if the expense data is already shared with the user; the base LLM, acting on the user's behalf, can already see it.
What I see instead is a flood of low-value agents. Generic system prompts. Grounding on public or previously shared data. Sample prompts that would be equally effective on the base model. These agents are published into marketplaces alongside hundreds of similarly useless agents.
Then comes the worst part: internal agent marketing. Users are asked to remember which agent to use for which task. Teams promote their agents as if they were products. We are now inundated with reminders and internal spam attempting to drive users to stand-alone agents. People rush to create agents for recognition or to follow the latest trend, often by blindly following creation wizards with no real understanding. Most of the time, the base LLM would do the job just as well.
Why Marketplaces Are the Wrong Model
Agent marketplaces are a dead end.
Users do not want to remember, bookmark, or search for specialized agents. They do not want internal spam advertising the “right” agent for each task. This is reminiscent of discovering websites before search engines became dominant.
Agents should not be stand-alone destinations. You should not have to remember to open an agent. They should be capabilities.
A better model is for agents to behave like plugins or extensions that are enabled behind the scenes, similar to browser extensions (MCP tools if you know that term). Once enabled, they become part of the primary AI experience and are invoked dynamically when appropriate.
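One way to sketch that plugin-style model is a registry the primary assistant consults on every request. This is a hand-rolled illustration of the idea, not the actual MCP protocol; the keyword triggers are a stand-in for real intent classification, and all names are hypothetical.

```python
AGENT_REGISTRY = []

def register_agent(name, description, triggers, handler):
    """Enable an agent as a background capability, extension-style."""
    AGENT_REGISTRY.append({
        "name": name, "description": description,
        "triggers": triggers, "handler": handler,
    })

def route(prompt: str) -> str:
    """The primary AI picks an agent by intent; the user never chooses."""
    text = prompt.lower()
    for agent in AGENT_REGISTRY:
        if any(t in text for t in agent["triggers"]):
            return agent["handler"](prompt)
    # No agent matched: fall through to the base model.
    return f"base model answers: {prompt}"

register_agent(
    "expenses", "Files and checks expense reports",
    triggers=("expense", "reimburse"),
    handler=lambda p: f"expenses agent handles: {p}")

print(route("Can I get reimbursed for this hotel?"))
print(route("What is the capital of France?"))
```

The user talks to one interface either way; whether a specialized agent fires is an implementation detail, which is the whole point.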
Microsoft Security Copilot already leans in this direction. Custom agents there have descriptions and triggers. The core LLM can select the appropriate agent based on user intent or present options when needed, without forcing the user to remember which agent to use.
That is the right direction.
Where We Should Go Next
Agent builder tools need to get smarter. Creation workflows should actively discourage low-value agents. Before publishing, builders should be shown how their agent compares to the base model and whether it actually adds capability. The creation of meaningless agents should be stopped early.
We also need more action-oriented workflows. Many agents today are little more than chatbots with loose guardrails and some extra data. Agents should take action. They should send reports, respond to emails, schedule tasks, and run repeatable processes. A news agent that automatically delivers a weekly briefing is far more useful than one that simply answers questions.
These basic agents should fade into the background. Users should interact with a single primary AI interface. The system should decide which agents to use, not the human.
Because honestly, nobody wants to manage an agent collection. We just want the AI to do what we ask from one place.