AI 15 min read April 22, 2025

AI Integration Handbook: How to Add LLMs to Any Web App in 2026

Asking whether to add AI to your product in 2026 is like asking whether to add a search bar. The question isn't if — it's how. Here's the practical guide, with real patterns, cost controls, and the mistakes I made so you don't have to.

Umar Saleem

Full Stack AI Engineer & Founder

In 2026, AI is just another part of the stack. After integrating LLMs into a dozen production apps, I've learned that the technical part is usually the easiest bit. The hard parts are cost control, failure handling, and resisting the urge to over-engineer things that should stay simple.

Before You Write a Single Line of Code

AI features that fail in production almost always fail because three questions weren't answered upfront: What specific user problem does this solve? How will we handle it when the model is wrong? What does success actually look like? If you can't answer all three, you're not ready to build yet.

Three Tiers of LLM Integration

I categorise every integration into one of three tiers before touching any code:

  1. Tier 1 (Utility): Single bounded task. Summarise this, classify that, extract the key points. Fast, cheap, reliable. Most products only ever need this.
  2. Tier 2 (Conversational): Maintains context across a session. Needs state management, token budgeting, and graceful context truncation. Well-understood patterns exist now.
  3. Tier 3 (Agentic): The model plans, reasons, and takes multi-step actions with real-world consequences. Genuinely powerful and genuinely hard to get right.

Most products only need Tier 1 or Tier 2. If you're reaching for an agent framework before your simpler AI feature is working well, you're getting ahead of yourself.
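
To make the Tier 2 requirements concrete, here is a minimal sketch of token budgeting with context truncation. The names `ChatMessage`, `estimateTokens`, and `truncateHistory` are hypothetical, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string }

// Rough heuristic (assumption): about 4 characters per token for English text.
// Swap in your provider's token counting when you need accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Keep the most recent messages that fit inside `budget` tokens,
// dropping the oldest turns first.
export function truncateHistory(messages: ChatMessage[], budget: number): ChatMessage[] {
  const kept: ChatMessage[] = []
  let used = 0

  // Walk from newest to oldest so recent context survives.
  for (const message of [...messages].reverse()) {
    const cost = estimateTokens(message.content)
    if (used + cost > budget) break
    kept.unshift(message)
    used += cost
  }

  // Assumption: the chat API expects the history to open with a user turn,
  // so drop any leading assistant messages left over after truncation.
  while (kept.length > 0 && kept[0]?.role !== 'user') kept.shift()

  return kept
}
```

Dropping the oldest turns is the simplest policy. Summarising the dropped turns into a single message is a common refinement, but it costs an extra model call.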

Setting Up the AI Layer (2026 Edition)

The model landscape has stabilised into clear tiers. Claude Sonnet 4 and GPT-4.1 are the daily workhorses — strong reasoning, fast, affordable. For cost-sensitive work, Claude Haiku 4 and GPT-4.1 mini handle most Tier 1 tasks at a fraction of the price. Gemini 2.0 Flash is my pick for heavy multimodal work. Always start small and upgrade only when you have a concrete reason.

typescript
// lib/ai.ts — initialise once, import everywhere
import Anthropic from '@anthropic-ai/sdk'

export const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
})

// app/api/summarise/route.ts
import { anthropic } from '@/lib/ai'
import { NextRequest } from 'next/server'

export async function POST(req: NextRequest) {
  const { text } = await req.json()

  const response = await anthropic.messages.create({
    model: 'claude-haiku-4-5',  // Haiku for Tier 1 tasks
    max_tokens: 300,
    system: 'You are a precise summariser. Be concise and factual.',
    messages: [
      { role: 'user', content: `Summarise in 3 sentences:\n\n${text}` }
    ],
  })

  const content = response.content[0]
  if (content.type !== 'text') throw new Error('Unexpected response type')
  return Response.json({ summary: content.text })
}
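
The route above passes `text` to the model without checking it. A minimal validation sketch follows. The helper name `validateSummariseInput` and the 20,000-character cap are assumptions for illustration, so tune the limit to your own documents and budget.

```typescript
type ValidationResult =
  | { ok: true; text: string }
  | { ok: false; error: string }

// Assumed upper bound on input size; adjust for your use case.
const MAX_INPUT_CHARS = 20_000

// Check the parsed JSON body before spending any tokens on it.
export function validateSummariseInput(body: unknown): ValidationResult {
  if (typeof body !== 'object' || body === null) {
    return { ok: false, error: 'Body must be a JSON object' }
  }

  const text = (body as Record<string, unknown>).text
  if (typeof text !== 'string') {
    return { ok: false, error: '`text` must be a string' }
  }

  const trimmed = text.trim()
  if (trimmed.length === 0) {
    return { ok: false, error: '`text` must not be empty' }
  }
  if (trimmed.length > MAX_INPUT_CHARS) {
    return { ok: false, error: `\`text\` must be at most ${MAX_INPUT_CHARS} characters` }
  }

  return { ok: true, text: trimmed }
}
```

In the route handler, the check would run before `anthropic.messages.create`, returning something like `Response.json({ error: result.error }, { status: 400 })` when `result.ok` is false.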

Keeping Costs Under Control

  • Always set max_tokens. A 3-sentence summary doesn't need 4,096 tokens — cap it at 300.
  • Cache deterministic results. If the same document gets processed twice, return the cached response.
  • Track per-user token usage in your database from day one. Set soft limits and notify before hard cuts.
  • Use a cheap model for triage first. Only route complex cases to the expensive one.
  • Log every request with input/output token counts. You cannot optimise what you cannot see.
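
Two of the controls above, caching and per-user limits, can be sketched in a few lines. Everything here is hypothetical: `ResponseCache` and `UsageTracker` are illustrative names, and the in-memory Maps stand in for whatever you would really use (Redis, Postgres, and so on).

```typescript
type Usage = { inputTokens: number; outputTokens: number }

// Cache key: the model plus the exact prompt. Only safe for deterministic
// calls (fixed prompt, temperature 0); never cache personalised output this way.
function cacheKey(model: string, prompt: string): string {
  return `${model}\u0000${prompt}`
}

export class ResponseCache {
  private store = new Map<string, string>()

  get(model: string, prompt: string): string | undefined {
    return this.store.get(cacheKey(model, prompt))
  }

  set(model: string, prompt: string, response: string): void {
    this.store.set(cacheKey(model, prompt), response)
  }
}

export type BudgetStatus = 'ok' | 'soft_limit' | 'hard_limit'

export class UsageTracker {
  private totals = new Map<string, number>()
  private softLimit: number
  private hardLimit: number

  constructor(softLimit: number, hardLimit: number) {
    this.softLimit = softLimit
    this.hardLimit = hardLimit
  }

  // Add one request's tokens to the user's running total and report
  // where they now stand against the limits.
  record(userId: string, usage: Usage): BudgetStatus {
    const total = (this.totals.get(userId) ?? 0) + usage.inputTokens + usage.outputTokens
    this.totals.set(userId, total)
    return this.status(userId)
  }

  status(userId: string): BudgetStatus {
    const total = this.totals.get(userId) ?? 0
    if (total >= this.hardLimit) return 'hard_limit'
    if (total >= this.softLimit) return 'soft_limit'
    return 'ok'
  }
}
```

With the Anthropic SDK, the token counts for `record` come from `response.usage.input_tokens` and `response.usage.output_tokens`. A `soft_limit` result is the moment to notify the user, and `hard_limit` is where requests get refused.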

Building a RAG Pipeline That Actually Works

RAG (Retrieval-Augmented Generation) lets your LLM answer questions about your own data. The idea: chunk documents, embed them as vectors, store them in a vector database, retrieve the most relevant chunks at query time, and pass them to the model as context. The magic is entirely in retrieval quality: if you surface the wrong chunks, the model hallucinates with complete confidence.

typescript
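// Simplified RAG with LangChain
import { ChatAnthropic } from '@langchain/anthropic'
import { OpenAIEmbeddings } from '@langchain/openai'
import { MemoryVectorStore } from 'langchain/vectorstores/memory'

// Placeholder inputs for illustration only. In a real app these come from
// your chunking step and the incoming request.
const documentChunks: string[] = ['First chunk of text...', 'Second chunk of text...']
const metadataArray = documentChunks.map((_, i) => ({ chunkIndex: i }))
const userQuestion = 'What does the first chunk say?'

// Embed the chunks and index them in memory (fine for demos, not for production)
const embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-small' })
const vectorStore = await MemoryVectorStore.fromTexts(
  documentChunks,
  metadataArray,
  embeddings
)

// Retrieve the 5 chunks most similar to the question
const retriever = vectorStore.asRetriever(5)
const llm = new ChatAnthropic({ model: 'claude-sonnet-4-6', temperature: 0 })

const relevantDocs = await retriever.invoke(userQuestion)
const context = relevantDocs.map(d => d.pageContent).join('\n\n')
const answer = await llm.invoke(
  `Context:\n${context}\n\nQuestion: ${userQuestion}\n\nAnswer using only the context above.`
)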
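
The RAG example assumes the documents have already been split into chunks. As a rough sketch of that step, here is a fixed-size character chunker with overlap. `chunkText` and its default sizes are hypothetical, and a splitter that respects sentence or paragraph boundaries will usually retrieve better.

```typescript
// Split text into windows of `chunkSize` characters, each sharing
// `overlap` characters with the previous one so that sentences cut at a
// boundary still appear whole in at least one chunk.
export function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive')
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error('overlap must be at least 0 and smaller than chunkSize')
  }

  const chunks: string[] = []
  const step = chunkSize - overlap

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once this window has reached the end of the text.
    if (start + chunkSize >= text.length) break
  }

  return chunks
}
```

The output of `chunkText(documentText)` would be what gets passed to the vector store as `documentChunks`. Chunk size and overlap are worth tuning against real queries, since they directly affect which passages are retrieved.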

The Mistakes I Made So You Don't Have To

  • Not streaming from day one. Users will abandon a non-streaming response after about 3 seconds. Add streaming before you launch, not after you see the drop-off.
  • Sending raw user input to the model. Always validate length, sanitise content, and strip anything sensitive before it hits the API.
  • Over-engineering the system prompt. A 2,000-token system prompt for a summariser isn't more powerful — it's just more expensive and harder to debug.
  • No fallback when the API is down. AI providers have outages. Every call needs try/catch and a degraded-mode user experience.
  • Skipping evals before shipping. Test your prompts against 20 to 30 real edge cases before going live. What breaks will genuinely surprise you.
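
For the fallback point in the list above, a small wrapper keeps the try/catch in one place. This is a sketch under assumptions: `withFallback` is a hypothetical helper, and the 10-second default timeout is arbitrary.

```typescript
// Run an AI call with a timeout; on any failure, return `fallback`
// instead of throwing so the caller can render a degraded experience.
export async function withFallback<T>(
  call: () => Promise<T>,
  fallback: T,
  timeoutMs = 10_000,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined

  // Rejects if the call takes longer than `timeoutMs`.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs}ms`)), timeoutMs)
  })

  try {
    return await Promise.race([call(), timeout])
  } catch (err) {
    // Log the failure so outages show up in monitoring.
    console.error('AI call failed, serving fallback:', err)
    return fallback
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer)
  }
}
```

A caller might wrap the model request as `withFallback(() => summarise(text), null)`, where `summarise` is a hypothetical function around the API call, and show a "summary unavailable" message when the result is `null`. Note that the timeout only stops waiting; it does not cancel the underlying request.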

AI is engineering now, not magic. Treat it with the same rigour you'd apply to a payment integration — test it properly, handle the failures, and measure what actually matters to users.
