LLM Agnosticism Is Not Optional

If you built your AI stack around GPT-4 in 2023, you have already rewritten it at least once. Probably twice. The teams that hardcoded their system prompts, evaluation pipelines, token budgets, and message schemas to a single model learned the same lesson that cloud engineers learned a decade earlier: vendor lock-in is not a risk you might face someday. It is a cost you are already paying.

The LLM market has shifted three times in two years. It will shift again. The enterprises that survive these shifts will not be the ones that picked the right model. They will be the ones that built architectures where the model is a replaceable component and the context is the thing that endures.

The Horse Race

In early 2023, OpenAI held roughly 50% of the enterprise LLM market. GPT-4 was the only model that mattered. Teams evaluated alternatives by asking a single question: how close is it to GPT-4? The answer, for most of 2023, was "not close enough."

So enterprises built on GPT-4. They tuned system prompts to its specific instruction-following behavior. They calibrated evaluation suites against its output patterns. They structured RAG pipelines around its 8K context window, then scrambled to restructure them when GPT-4 Turbo expanded to 128K. They stored conversation memory as arrays of OpenAI message objects with role/content pairs in OpenAI's schema. They wrote tool integrations against OpenAI's function calling format. Every layer of the stack was coupled to a single provider.

Then January 2025 happened. DeepSeek released R1, a reasoning model that matched OpenAI's o1 on major benchmarks at a fraction of the cost. DeepSeek V3 delivered frontier-class general performance using a mixture-of-experts architecture trained for dramatically less than GPT-4. The market reaction was a $1 trillion wipeout in AI-related equities in a single day. The strategic reaction was slower but more consequential: the assumption that frontier capability required massive capital concentration was wrong. Any sufficiently resourced and technically capable team could reach frontier performance. The model layer was not a moat. It was a commodity.

Google made the same point from the opposite direction. Gemini 1.0 launched to underwhelming reviews. The AI community wrote Google off as a slow-moving incumbent that had squandered its research lead. Then Gemini 1.5 Pro shipped with a million-token context window and near-perfect needle-in-a-haystack retrieval across the full window. This was not incremental. It invalidated entire architectural patterns. Teams that had built elaborate chunking and retrieval strategies to work within 128K tokens suddenly had a model that could ingest entire document corpora in a single call. Gemini 2.0 and 2.5 pushed into agentic capabilities, multimodal reasoning, and speed optimizations. Google went from "behind" to "competitive on every axis" in twelve months.

Anthropic's Claude followed yet another trajectory. Claude 2 was capable but not dominant. Claude 3 Opus was impressive but expensive. Then Claude 3.5 Sonnet landed and within weeks became the default model for software engineering teams. Not because it won every benchmark. Because it was the most reliably useful model for real work: code generation, complex instruction following, long-context analysis, and structured output. Engineering teams that had standardized on GPT-4 found themselves migrating to Claude not because of a top-down strategic decision, but because individual developers kept switching and the results were better.

Meta's Llama 3 added another variable entirely. Open-weight models reached a quality threshold where enterprises could deploy them on their own infrastructure with no API dependency at all. For use cases where data sovereignty or latency requirements made external API calls impractical, the best option was no longer the best closed model. It was the best model you could run yourself.

This is not a market converging toward a winner. It is structurally divergent. Leadership rotates. Capabilities leapfrog. Pricing races to the bottom. New architectures (mixture of experts, test-time compute scaling, hybrid retrieval-generation) keep reshuffling the competitive landscape. Any enterprise strategy that depends on one provider maintaining its current position is built on a prediction that nobody in this industry has been able to make accurately for more than six months at a time.

The HashiCorp Lesson

We have seen this exact structural problem before, in a domain that has already solved it.

In 2014, the cloud infrastructure market looked remarkably like the LLM market does today. AWS was dominant. Azure was investing aggressively to catch up. Google Cloud was throwing engineering talent and capital at the problem. Every enterprise CTO understood that cloud vendor lock-in was a strategic risk, but the tooling to avoid it did not exist. So they picked a cloud, built on its native services, and accepted the coupling.

HashiCorp's Terraform solved this with a specific architectural pattern that maps directly to the LLM problem. The pattern has three components, and all three matter.

The first component is the provider interface contract. In Terraform, every cloud provider implements a standardized interface. It registers resource types, defines a schema for each resource (its attributes, which of them are required, and which are computed at apply time), and implements five operations per resource: Create, Read, Update, Delete, and Import. Terraform's core engine never calls AWS or GCP or Azure directly. It calls the provider interface. The provider translates those calls into vendor-specific API requests.

This is not a thin wrapper. The provider contract specifies how to compute a plan (what would change if this configuration were applied), how to diff current state against desired state, how to handle partially failed operations, and how to import existing unmanaged resources into Terraform's control. Each provider can be hundreds of thousands of lines of code. The abstraction is deep, not cosmetic.

The second component is state management. Terraform maintains a state file that records every resource it manages: the resource type, its unique identifier in the cloud provider, its current attribute values, and its dependency relationships with other resources. This state file is provider-agnostic. It represents an AWS EC2 instance and a GCP Compute Engine instance in the same canonical schema. The state file is the system of record. The cloud provider is the execution target.

This separation is Terraform's most underappreciated design decision. When an enterprise needs to migrate a workload from AWS to GCP, the state management layer, the dependency graph, the change detection, the plan/apply cycle, all of it remains identical. The migration cost is proportional to the number of resource definitions that change, not to the complexity of the infrastructure management system.

The third component is the intent layer. HCL (HashiCorp Configuration Language) expresses what you want to exist, not how to create it. A configuration says "I need a virtual machine with these properties in this network with this security group." The provider translates that into the vendor-specific API sequence required to make it real. The intent is portable. The execution is provider-specific.

Together, these three layers create genuine vendor agnosticism. Not the cosmetic kind where you abstract away vendor differences until you hit an edge case and the whole thing falls apart. The real kind, where switching providers is a bounded engineering effort proportional to the actual differences between providers, not a full rewrite of your infrastructure management system.

The LLM Provider Interface

Apply Terraform's architecture to LLMs and the parallels are precise.

The provider interface for an LLM system must define the same kind of contract that Terraform providers implement. The system that manages context, memory, and tool orchestration should never call OpenAI or Anthropic or Google directly. It should speak to a provider interface that defines five things:

Input schema normalization. Every LLM provider uses a different format for the same concepts. OpenAI expects role/content message arrays with a system message at the top. Anthropic takes a separate system parameter alongside a messages array of alternating user/assistant turns. Google uses a contents array with nested parts. Tool definitions follow different schemas across all three. A provider adapter must translate a canonical input format into whatever the target provider expects. The orchestration layer assembles context once and never considers how Anthropic structures its API requests versus how OpenAI does.
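
Here is a minimal sketch of that translation, assuming a canonical `Message` type of our own design. The payload shapes approximate OpenAI's and Anthropic's documented chat formats; the adapter function names are illustrative, not any SDK's API:

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "system", "user", or "assistant" in our canonical schema
    content: str

def to_openai(messages: list[Message]) -> dict:
    # OpenAI-style chat payload: the system prompt travels inside the messages array
    return {"messages": [{"role": m.role, "content": m.content} for m in messages]}

def to_anthropic(messages: list[Message]) -> dict:
    # Anthropic-style payload: the system prompt is a separate top-level parameter,
    # and the messages array holds only the user/assistant turns
    system = "\n".join(m.content for m in messages if m.role == "system")
    turns = [{"role": m.role, "content": m.content}
             for m in messages if m.role != "system"]
    return {"system": system, "messages": turns}
```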

Output schema normalization. Completions come back in different structures. Tool calls use different response schemas. OpenAI returns a function_call object. Anthropic returns tool_use content blocks. Google returns functionCall parts. Streaming uses different event formats. Structured output enforcement works differently across providers: JSON mode, strict tool schemas, grammar-constrained decoding. The provider adapter maps all of this back to a canonical output format that the rest of the system consumes uniformly.
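
Sketching the reverse direction, assuming a canonical `ToolCall` type: the response shapes below are simplified to the current tools-based formats (OpenAI tool calls carry JSON-encoded argument strings, Anthropic `tool_use` blocks carry already-parsed input objects), and the adapter functions are illustrative:

```python
import json
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict   # canonical: always a parsed dict, never a raw JSON string

def from_openai(message: dict) -> list[ToolCall]:
    # OpenAI returns tool calls whose arguments are a JSON-encoded string
    return [ToolCall(name=tc["function"]["name"],
                     arguments=json.loads(tc["function"]["arguments"]))
            for tc in message.get("tool_calls", [])]

def from_anthropic(content_blocks: list[dict]) -> list[ToolCall]:
    # Anthropic returns tool_use content blocks with the input already parsed
    return [ToolCall(name=block["name"], arguments=block["input"])
            for block in content_blocks if block.get("type") == "tool_use"]
```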

Capability declaration. Not every model supports every feature. GPT-4o supports vision and function calling. Claude supports tool use, extended thinking, and 200K context. Gemini 1.5 Pro supports a million-token window. Some models support prompt caching. Some support batch APIs. Some support fine-tuning. The provider interface must declare capabilities at registration time so the routing layer can make informed decisions at request time instead of discovering limitations at failure time.
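
One way to make that declaration explicit is a capability record registered alongside each adapter. A minimal sketch, with an illustrative registry (real values belong in configuration, and the model names and fields here are examples, not a catalogue):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelCapabilities:
    context_window: int        # in tokens
    supports_vision: bool
    supports_tool_use: bool

# Illustrative entries only; maintain these in config, not code
REGISTRY = {
    "gpt-4o":            ModelCapabilities(128_000,   True, True),
    "claude-3-5-sonnet": ModelCapabilities(200_000,   True, True),
    "gemini-1.5-pro":    ModelCapabilities(1_000_000, True, True),
}

def pick_model(required_context: int, needs_vision: bool) -> str:
    # Routing decides from declared capabilities, not from failed requests
    for name, caps in REGISTRY.items():
        if caps.context_window >= required_context and (caps.supports_vision or not needs_vision):
            return name
    raise LookupError("no registered model satisfies the request")
```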

Token accounting. Tokenizers differ across providers. The same text produces different token counts on OpenAI's tiktoken, Anthropic's tokenizer, and Google's SentencePiece-based tokenizer. Pricing structures differ. Context window limits differ. The provider adapter must handle token counting, cost estimation, and budget enforcement in provider-specific terms while exposing provider-agnostic metrics upstream.
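
A sketch of that accounting seam, assuming each adapter exposes a `count_tokens` method. The tiktoken library is OpenAI's real tokenizer; the fallback shown is a rough characters-per-token heuristic, not Anthropic's or Google's actual tokenizer:

```python
import tiktoken

class OpenAITokenCounter:
    def __init__(self, model: str = "gpt-4"):
        self.encoding = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

class ApproximateTokenCounter:
    """Fallback for providers whose tokenizer we do not run locally.
    A heuristic only; use the provider's counting API where billing accuracy matters."""
    def count_tokens(self, text: str) -> int:
        return max(1, len(text) // 4)

def fits_budget(counter, text: str, context_window: int, reserved_for_output: int) -> bool:
    # Provider-specific counting, provider-agnostic budget decision upstream
    return counter.count_tokens(text) <= context_window - reserved_for_output
```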

Error taxonomy. Rate limits, content filter rejections, context length overflows, and service outages all manifest differently. Anthropic returns a 429 with a retry-after header. OpenAI returns a 429 with different retry semantics. Google uses its own error codes. The provider adapter maps these to a canonical error taxonomy so retry logic, fallback routing, and error reporting work uniformly regardless of which provider triggered the error.
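
A sketch of the canonical taxonomy, assuming each adapter catches its SDK's exceptions and re-raises them in our own terms. The mapping below keys off HTTP status codes for illustration; a real adapter would inspect provider-specific error types and fields:

```python
from enum import Enum, auto

class LLMErrorKind(Enum):
    RATE_LIMITED = auto()
    CONTEXT_OVERFLOW = auto()
    CONTENT_FILTERED = auto()
    PROVIDER_UNAVAILABLE = auto()
    UNKNOWN = auto()

class LLMError(Exception):
    def __init__(self, kind: LLMErrorKind, retry_after: float | None = None):
        super().__init__(kind.name)
        self.kind = kind
        self.retry_after = retry_after   # seconds, if the provider supplied one

def classify_http_error(status: int, retry_after: float | None = None) -> LLMError:
    # Retry logic, fallback routing, and reporting consume only LLMErrorKind
    if status == 429:
        return LLMError(LLMErrorKind.RATE_LIMITED, retry_after)
    if status in (500, 502, 503, 529):
        return LLMError(LLMErrorKind.PROVIDER_UNAVAILABLE)
    return LLMError(LLMErrorKind.UNKNOWN)
```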

A thin wrapper around the OpenAI Python SDK is not LLM agnosticism. It is OpenAI with an extra function call in front of it. Real provider abstraction means the orchestration layer has zero knowledge of which model will process the request. If you can grep your orchestration code for the string "openai" or "anthropic" and get results, you are not agnostic. You have a wrapper. Wrappers break at the boundaries, and the boundaries are where migrations get expensive.

State Management Is the Hard Part

Terraform's provider model got the industry's attention. Terraform's state management is what made multi-cloud actually work. Without the state file, Terraform would have been a CLI that could create cloud resources but could not track them, detect drift, compute changes, or execute migrations. The state layer is what turned provider abstraction from a theoretical exercise into a practical capability.

The equivalent in LLM systems is context and memory management. And this is where most enterprises have the architecture catastrophically wrong.

Here is the common failure pattern. An enterprise builds a RAG pipeline. The pipeline retrieves documents and assembles them into a system prompt formatted for GPT-4. The prompt template exploits GPT-4's specific instruction-following tendencies. Chunk sizes are optimized for GPT-4's 128K window. Retrieval scoring is tuned against GPT-4 output quality using evaluation suites calibrated to GPT-4 reference answers. Conversation memory is stored as a JSON array of OpenAI-format message objects.

Every component of this system is load-bearing on a single provider. The context is not locked at one layer. It is locked at every layer.

Switching to Claude means rewriting prompt templates (Claude handles system prompts as a separate parameter, not a message), re-tuning chunk sizes (different context window, different attention characteristics), re-running retrieval evaluations (relevance ranking changes with a different model), migrating conversation memory from OpenAI message format to Anthropic's message format, and rewriting tool integrations from function calling to tool use. Switching to Gemini means doing it all again with a different message schema, different token budget math, and a million-token context window that changes the entire retrieval strategy.

This is Terraform's lesson stated plainly: if your state is entangled with your provider, you cannot switch providers without a migration. Migrations are expensive, risky, and slow. The enterprises that stored their infrastructure state in CloudFormation JSON paid the migration tax every time they needed to move workloads to Azure or GCP. The enterprises that used Terraform's provider-agnostic state file did not.

Context Must Be Provider-Agnostic

As we argued in Context Engineering, the model layer is commoditizing and the real competitive advantage is how enterprises assemble, govern, and deliver context to their AI systems. That argument has a direct corollary for LLM agnosticism: if context is the moat, then context must not be tied to any single provider. A moat that disappears when you switch models is not a moat. It is a liability.

Provider-agnostic context requires five layers to be completely independent of the LLM.

Document processing and chunking. Retrieval pipelines should produce chunks in a canonical format with metadata (source, timestamp, access control tags, relevance score) that is model-independent. Chunk sizes should be configurable at routing time, not baked in at ingestion time. The same document corpus should be servable to a model with a 128K window and a model with a million-token window without reprocessing. This means storing content at multiple granularities or using dynamic chunk assembly from atomic segments.
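
A sketch of a canonical chunk record under those assumptions, with illustrative field names, reusing the `count_tokens` interface sketched earlier:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str                  # document URI or identifier
    timestamp: str               # ISO 8601
    access_tags: list[str] = field(default_factory=list)
    relevance_score: float = 0.0

def assemble(chunks: list[Chunk], counter, token_budget: int) -> list[Chunk]:
    # Fit chunks to the target model's window at routing time, not ingestion time;
    # the stored corpus never changes when the provider does
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.relevance_score, reverse=True):
        cost = counter.count_tokens(chunk.text)
        if used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    return selected
```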

Memory systems. Conversation history, learned facts, organizational knowledge, and workflow patterns should be stored in a provider-agnostic schema. Not as OpenAI message arrays. Not as Anthropic conversation blocks. As structured data that a provider adapter serializes into whatever format the target model expects at inference time. The memory store is the system of record. The model is the consumer.

Tool and function definitions. Every tool available to an AI system should be defined once in a canonical schema specifying inputs, outputs, descriptions, and constraints. Provider adapters translate these into OpenAI's function calling format, Anthropic's tool use schema, or Google's function declarations at call time. The tool registry has no knowledge of which model will invoke the tools.
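
A sketch of the define-once pattern: the canonical `ToolSpec` is our own invention, while the target shapes approximate OpenAI's tools format and Anthropic's tool-use schema:

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict   # JSON Schema for the tool's inputs

def to_openai_tool(spec: ToolSpec) -> dict:
    return {"type": "function",
            "function": {"name": spec.name,
                         "description": spec.description,
                         "parameters": spec.parameters}}

def to_anthropic_tool(spec: ToolSpec) -> dict:
    return {"name": spec.name,
            "description": spec.description,
            "input_schema": spec.parameters}
```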

Prompt templates. System prompts should express intent, not provider-specific formatting tricks. A template says "analyze this document against these criteria using these retrieved facts." The provider adapter handles the model-specific formatting, token budget allocation, and behavioral tuning required to get consistent results from different models. The same intent should produce equivalent outcomes across providers without manual prompt rewriting.
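
A sketch of an intent-level template that carries no provider formatting; the type and rendering are illustrative, and any model-specific tuning would live in the adapter, not here:

```python
from dataclasses import dataclass

@dataclass
class AnalysisIntent:
    task: str                   # what the model is being asked to do
    criteria: list[str]         # the evaluation criteria, stated plainly
    retrieved_facts: list[str]  # context assembled upstream, provider-unknown

def render(intent: AnalysisIntent) -> str:
    # One canonical rendering of the intent; the adapter formats it per provider
    criteria = "\n".join(f"- {c}" for c in intent.criteria)
    facts = "\n".join(f"- {f}" for f in intent.retrieved_facts)
    return f"{intent.task}\n\nCriteria:\n{criteria}\n\nRelevant facts:\n{facts}"
```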

Evaluation baselines. Evaluation suites should test outcomes, not outputs. "Did the system correctly identify the contract risk?" Not "did the model produce text that matches this GPT-4 reference output?" Provider-specific evaluation is a trap that measures how well you tuned to one model, not how well your system performs its actual job. Outcome-based evaluation transfers across providers by definition.
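
A sketch of an outcome-based check in pytest style. The `run_pipeline` fixture, the document name, and the result fields are hypothetical; the point is that the assertion targets the decision the system must get right, not a reference string from any particular model:

```python
def test_contract_risk_identified(run_pipeline):
    """Outcome-based: passes for any provider whose answer contains the right finding."""
    result = run_pipeline(
        document="contract_with_unlimited_liability_clause.txt",
        question="What is the primary legal risk in this contract?",
    )
    # Assert on the structured verdict, not on token-for-token model output
    assert result.risk_category == "unlimited_liability"
    assert result.clause_reference is not None
```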

The Enterprise Memory Problem

Memory management is where provider lock-in is most acute and least discussed. Most enterprise AI systems store memory, to the extent they store it at all, as raw conversation logs in the provider's native message format. This is the equivalent of storing your cloud infrastructure state in CloudFormation JSON and then wondering why the GCP migration is taking eighteen months.

Enterprise memory has three distinct layers. All three must be owned by the enterprise, stored in the enterprise's schema, and independent of any LLM provider's format.

Episodic memory is what happened. Conversations held, decisions made, actions taken, outcomes observed. Most systems store this as a sequence of API call logs in the provider's message format. That is provider-locked state. Episodic memory should be stored as structured events: timestamped records with actors, actions, inputs, outputs, and results. A provider adapter serializes these events into whatever conversation format the target model expects. The events themselves never change when the model changes.
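
A sketch of an event record and its serialization into generic conversation turns at inference time; the field names and action vocabulary are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: str   # ISO 8601
    actor: str       # "user", "agent", or a system component
    action: str      # e.g. "asked", "answered", "invoked_tool"
    payload: str     # the input or output text
    result: str      # outcome annotation, if any

def to_conversation(events: list[Event]) -> list[dict]:
    # Serialize canonical events into user/assistant turns on demand;
    # the stored events never change when the model changes
    role_map = {"user": "user", "agent": "assistant"}
    return [{"role": role_map.get(e.actor, "user"), "content": e.payload}
            for e in events if e.action in ("asked", "answered")]
```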

Semantic memory is what the system knows. Entity relationships, domain knowledge, business rules, organizational structure, learned facts about the enterprise. This is the organizational knowledge graph. It should be queryable by any model through a context assembly layer, not locked inside conversation history or baked into fine-tuning weights that are bound to a specific model family. When you fine-tune GPT-4 on your organizational knowledge, that knowledge is locked in an OpenAI model checkpoint. When you store it in a knowledge graph and serve it as context, it works with any model you will ever use.

Procedural memory is how the system has learned to operate. Successful reasoning patterns, preferred workflows, domain-specific heuristics, escalation rules, quality thresholds. This is the most valuable and most fragile memory layer. If your procedural memory is encoded as prompt engineering tuned to GPT-4's specific behavior, with particular token sequences, instruction patterns, and few-shot examples selected because they happen to work well with that model, it is worthless when you switch to Claude. Worse than worthless. It may actively degrade performance because different models respond to different instruction patterns. If procedural memory is stored as structured workflow definitions with intent-level descriptions, it transfers to any model through the provider adapter layer.
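
A sketch of procedural memory stored as an intent-level workflow definition rather than model-specific prompt text; the structure and the invoice-triage example are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    intent: str                     # what this step must accomplish, stated plainly
    required_context: list[str]     # which memory or retrieval sources to assemble
    escalate_if: str | None = None  # condition that hands off to a human

@dataclass
class Workflow:
    name: str
    steps: list[WorkflowStep] = field(default_factory=list)
    quality_threshold: float = 0.8  # gate applied to outcome-based evaluation

# Example: a learned workflow that transfers to any model via the adapter layer
invoice_triage = Workflow(
    name="invoice_triage",
    steps=[
        WorkflowStep("extract vendor, amount, and due date", ["document"]),
        WorkflowStep("check against approved vendor list", ["semantic_memory"],
                     escalate_if="vendor not found"),
    ],
)
```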

These three memory layers represent the accumulated organizational intelligence of every interaction an enterprise has had with its AI systems. Every conversation where an agent learned a user's preferences. Every workflow where the system discovered an efficient process. Every correction where a human taught the system something about the business domain. This intelligence took months or years to accumulate. Losing it in a provider migration because it was stored in OpenAI's message format is not a technical inconvenience. It is an unforced destruction of institutional knowledge. And it is entirely preventable.

The Migration Tax Is Compounding

Every month that an enterprise operates with provider-locked context and memory, the migration cost increases. More conversations stored in provider-specific formats. More prompt templates tuned to provider-specific behavior. More evaluation baselines calibrated against provider-specific outputs. More procedural knowledge encoded in ways that do not transfer. The cost of switching grows linearly with time and exponentially with the number of AI applications in production.

The enterprises that adopted Terraform in 2015 saved years of engineering effort when multi-cloud became a strategic imperative in 2019. The enterprises that built exclusively on CloudFormation and ARM templates spent those years paying the tax: rewriting infrastructure definitions, migrating state, rebuilding automation. Some of them are still paying it.

The LLM market runs on a faster clock. Cloud provider leadership was relatively stable for years at a stretch. LLM leadership has rotated three times in twenty-four months. The migration tax compounds faster because the trigger events arrive faster.

Build your context systems to be model-agnostic. Own your memory layers in your own schema. Define provider interfaces that decouple orchestration from execution. Store state in canonical formats. Evaluate outcomes, not outputs. The model you use today will not be the model you use in eighteen months. Your context and memory will outlive every model you ever plug into them. Make sure they belong to you.