Building a Multilingual RAG Chatbot for E-commerce: 1,614 Docs, 25 Languages, 70% Ticket Resolution

How I built a RAG chatbot that resolves 70% of support queries across 25 languages — ingestion pipeline, hybrid search, confidence thresholds, and streaming UI.

AI
RAG
Next.js
TypeScript
E-commerce
OpenAI

Your support inbox has a pattern. Every morning: "Where is my order?" "Do you ship to Germany?" "Can I return this?" "What size should I choose?" Seventy percent of tickets are the same twelve questions, translated into eight different languages.

I built Pikkuna — a Finnish e-commerce platform operating across 30 languages and 35 countries. After shipping the multilingual storefront, the support load was the next obvious bottleneck. A human agent can't maintain quality across 25 active languages. I needed a system that could.

This is the architecture that got us to 70% of queries resolved without a human, with P95 retrieval latency under 500ms, working across 25 languages from a single document index in English.

The Architecture in Plain Terms

Documents (FAQ, products, delivery, returns)
  → Python ingestion pipeline (SHA-256 cache)
  → OpenAI text-embedding-3-large (3,072 dimensions)
  → Upstash Vector (hybrid index: semantic + BM25)

User question (any language)
  → embed with same model
  → hybrid search → top-5 chunks
  → GPT-4o-mini with context
  → streaming response in user's language

The key insight: text-embedding-3-large is trained on 100+ languages. A question in Finnish maps to roughly the same embedding space as the equivalent question in English. That means you only need one document index. No per-language duplication, no translation step at query time, no separate pipelines.

Building the Ingestion Pipeline

We indexed 1,614 documents: FAQs, product category descriptions, delivery policy, returns, size guides, payment methods. Chunk size matters more than most RAG tutorials admit.

Chunk size for e-commerce content: 250–500 tokens is the sweet spot. Too small, and you lose context (a return policy paragraph split mid-sentence retrieves poorly). Too large, and the vector represents a blend of topics — retrieval precision drops.

Here is the full Python ingestion pipeline with SHA-256 caching so we only re-embed documents that actually changed:

# ingest.py
import hashlib
import json
import os
from pathlib import Path

import tiktoken
from openai import OpenAI
from upstash_vector import Index

EMBEDDING_MODEL = "text-embedding-3-large"
MAX_TOKENS_PER_CHUNK = 400
OVERLAP_TOKENS = 50
CACHE_FILE = ".embed_cache.json"

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
index = Index(
    url=os.environ["UPSTASH_VECTOR_REST_URL"],
    token=os.environ["UPSTASH_VECTOR_REST_TOKEN"],
)
enc = tiktoken.encoding_for_model("text-embedding-3-large")


def sha256(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def load_cache() -> dict[str, str]:
    if Path(CACHE_FILE).exists():
        return json.loads(Path(CACHE_FILE).read_text())
    return {}


def chunk_text(text: str) -> list[str]:
    """Split text into overlapping token chunks."""
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + MAX_TOKENS_PER_CHUNK, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        start += MAX_TOKENS_PER_CHUNK - OVERLAP_TOKENS
    return chunks


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts. OpenAI allows up to 2048 inputs per call."""
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]


def ingest_documents(docs_dir: str) -> None:
    cache = load_cache()
    docs_path = Path(docs_dir)
    vectors_to_upsert = []

    for doc_file in sorted(docs_path.glob("**/*.txt")):
        content = doc_file.read_text(encoding="utf-8")
        doc_hash = sha256(content)
        doc_id = str(doc_file.relative_to(docs_path))

        # Skip if this document hasn't changed since last ingest
        if cache.get(doc_id) == doc_hash:
            print(f"  skip (unchanged): {doc_id}")
            continue

        print(f"  indexing: {doc_id}")
        chunks = chunk_text(content)

        for i, chunk in enumerate(chunks):
            vectors_to_upsert.append({
                "id": f"{doc_id}::chunk_{i}",
                "data": chunk,
                "metadata": {
                    "source": doc_id,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                    "text": chunk,
                },
            })

        cache[doc_id] = doc_hash

    if not vectors_to_upsert:
        print("Nothing to update.")
        return

    # Embed in batches of 100
    texts = [v["metadata"]["text"] for v in vectors_to_upsert]
    all_embeddings: list[list[float]] = []
    for i in range(0, len(texts), 100):
        batch = texts[i : i + 100]
        all_embeddings.extend(embed_batch(batch))
        print(f"  embedded {min(i + 100, len(texts))}/{len(texts)}")

    for vec, embedding in zip(vectors_to_upsert, all_embeddings):
        vec["vector"] = embedding

    index.upsert(vectors=vectors_to_upsert)
    Path(CACHE_FILE).write_text(json.dumps(cache, indent=2))
    print(f"Upserted {len(vectors_to_upsert)} chunks.")


if __name__ == "__main__":
    ingest_documents("./docs")

The SHA-256 cache means that updating one FAQ entry re-embeds only that file's chunks. At 1,614 documents, a full re-index costs around $4 in API fees. Incremental updates cost cents.

Why Upstash Vector

I evaluated Pinecone, Weaviate, and pgvector before choosing Upstash Vector:

| Criterion      | Upstash Vector | Pinecone  | pgvector            |
| -------------- | -------------- | --------- | ------------------- |
| Serverless     | Yes            | Yes       | No                  |
| Hybrid search  | Built-in       | Paid plan | Extension needed    |
| Infrastructure | None           | None      | PostgreSQL instance |
| Cold start     | ~50ms          | ~100ms    | N/A                 |

For a Next.js app on Vercel with no dedicated infrastructure, Upstash Vector was the only option that gave me hybrid search without managing a server.

Cross-Lingual Retrieval: One Index for 25 Languages

When a user asks "Missä on tilaukseni?" (Finnish for "Where is my order?"), the query embedding sits close to the English FAQ chunk that answers the same question — because text-embedding-3-large encodes semantic meaning across languages, not surface-level word matches.

I verified this by testing queries in all 25 active languages against an English-only index. Retrieval precision was within 3% of English-to-English queries.

One ingestion pipeline, one vector index, zero translation infrastructure.
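You can sanity-check the cross-lingual alignment claim yourself on a small scale. This is a sketch, not the full 25-language test harness — the example queries and the cosine helper are mine:

```python
# Quick cross-lingual sanity check: the same question in three languages
# should land close together in text-embedding-3-large's embedding space.
import math
import os


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


queries = [
    "Where is my order?",        # English
    "Missä on tilaukseni?",      # Finnish
    "Wo ist meine Bestellung?",  # German
]

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # imported lazily so cosine() works without the SDK

    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-large", input=queries)
    en, fi, de = (item.embedding for item in resp.data)
    print(f"en-fi similarity: {cosine(en, fi):.3f}")
    print(f"en-de similarity: {cosine(en, de):.3f}")
```

If the alignment holds, the translated pairs score far above what unrelated sentences would.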

Hybrid Search: Why Semantic Alone Is Not Enough

Pure semantic search has a blind spot: exact matches. If a user types an order number or a specific product SKU, vector search returns "semantically similar" documents — which may be completely wrong. BM25 keyword matching catches exact terms that embeddings generalize away from.

// lib/rag/retrieve.ts
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

interface RetrievedChunk {
  text: string;
  source: string;
  score: number;
}

export async function retrieveContext(query: string, topK: number = 5): Promise<RetrievedChunk[]> {
  const results = await index.query({
    data: query, // Upstash embeds the query server-side — query and document embeddings must come from the same models
    topK,
    includeMetadata: true,
    // 0.0 = pure BM25 keyword, 1.0 = pure semantic
    // 0.6 gives slight semantic bias — works well for e-commerce FAQ
    hybridAlpha: 0.6,
  });

  return results
    .filter((r) => r.score > 0.35) // below this, escalate to human agent
    .map((r) => ({
      text: r.metadata?.text as string,
      source: r.metadata?.source as string,
      score: r.score,
    }));
}

The score > 0.35 threshold is the escalation trigger. When no retrieved chunk crosses this threshold, the chatbot does not hallucinate — it creates a Zoho Desk ticket with the full conversation context.

The Streaming Response

// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { retrieveContext } from "@/lib/rag/retrieve";
import { createZohoDeskTicket } from "@/lib/zoho/desk";

export const maxDuration = 30;

export async function POST(request: Request) {
  const { messages, sessionId } = await request.json();

  const userQuery = messages[messages.length - 1].content as string;
  const chunks = await retrieveContext(userQuery);

  // No confident match — escalate to human
  if (chunks.length === 0) {
    await createZohoDeskTicket({
      subject: `Chatbot escalation: ${userQuery.slice(0, 80)}`,
      description: userQuery,
      sessionId,
    });

    return Response.json({
      role: "assistant",
      content:
        "I couldn't find a confident answer to your question. I've created a support ticket and a team member will follow up shortly.",
    });
  }

  const context = chunks.map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.text}`).join("\n\n");

  const result = streamText({
    model: openai("gpt-4o-mini"),
    system: `You are a helpful customer support assistant for Pikkuna, a Finnish e-commerce store.
Answer questions based ONLY on the provided context. If the context does not contain enough
information to answer confidently, say so — do not make up details.
Respond in the same language the user is writing in.
Keep answers concise and friendly.

Context:
${context}`,
    messages,
  });

  const response = result.toDataStreamResponse();
  response.headers.set("X-RAG-Sources", JSON.stringify([...new Set(chunks.map((c) => c.source))]));

  return response;
}

The "Respond in the same language the user is writing in" instruction handles multilingual generation. GPT-4o-mini reliably detects and mirrors the user's language — Finnish question, Finnish answer, even though the retrieved context is in English.

Displaying Sources

Trust is the hardest part of deploying an AI chatbot for a real business. After adding source labels below each answer — "Based on: returns-policy.txt, shipping-faq.txt" — we saw a 20% drop in users clicking "Talk to a human" for questions the bot had already answered correctly.

Small detail, measurable effect.

Measuring Quality: How We Got to 70%

"70% resolved" needs a definition. Mine: a query is considered resolved if the user did not open a Zoho Desk ticket within 10 minutes of receiving the chatbot response.

Over the first 90 days:

  • ~70% — chatbot answered above the confidence threshold, no escalation
  • ~18% — low-confidence escalation triggered, human resolved
  • ~12% — user escalated manually after receiving a bot answer

Every ticket from the second and third buckets gets reviewed weekly. If the same question appears three times and the document base should cover it, I add or improve the relevant document and re-ingest.

Gotchas From Production

Chunk size is not a configuration detail. I started with 800-token chunks — too large. Retrieved context covered multiple topics and GPT-4o-mini would sometimes answer a question adjacent to what was asked. Dropping to 350 tokens improved precision measurably.

The OpenAI embedding API has occasional 500 errors. The ingestion pipeline needs retry logic with exponential backoff. Without this, a failed batch leaves gaps in the index that are invisible until a user gets a bad answer.
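A minimal retry wrapper I would put around the embed_batch call from the ingestion pipeline — a sketch, with illustrative backoff constants; real code should catch the specific openai exception types rather than bare Exception:

```python
# Retry transient API failures with exponential backoff plus jitter.
# Constants are illustrative; tune max_attempts and base_delay for your rate limits.
import random
import time


def with_retries(fn, *args, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(*args), retrying on exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception as exc:  # narrow to openai.APIError / RateLimitError in real code
            if attempt == max_attempts:
                raise
            # 1s, 2s, 4s, 8s... plus jitter to avoid thundering-herd retries
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"  attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)


# Usage inside the batching loop from ingest.py:
#   all_embeddings.extend(with_retries(embed_batch, batch))
```

With this in place, a transient 500 on batch 12 of 17 no longer leaves a silent gap in the index.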

The confidence threshold needs tuning per domain. I started at 0.5 (standard recommendation) — too aggressive for Pikkuna's document base. After 30 days of data, 0.35 was right. Your number will differ.

Don't share the same Upstash Vector index between environments. Staging ingests pollute the production index. One index per environment.

GPT-4o-mini sometimes responds in English despite the language instruction. Fix: put the language instruction at the end of the system prompt, not the beginning. Recency bias in attention works in your favor.

Results

After 90 days in production at Pikkuna:

  • 70% of support queries resolved without a human agent
  • <500ms P95 retrieval latency (Upstash Vector in EU region)
  • 1,614 documents indexed across delivery, returns, products, payments
  • 25 languages served from a single English document index
  • ~$0.003 per query in combined API costs (embedding + generation)
  • 40% faster resolution time for escalated tickets — agents start with full conversation context

If you're running an e-commerce operation across multiple EU markets, customer support at scale is one of the first things that breaks. I solved it for Pikkuna by building a system that handles the repeatable 70% automatically and routes the rest to humans with full context.

The same architecture applies to any e-commerce with a structured document base — product catalogs, shipping policies, returns, FAQs. If your support team is answering the same questions in multiple languages every day, a RAG chatbot built this way will handle most of it.

If you need a senior developer to build this end-to-end — from ingestion pipeline to streaming chat UI — get in touch. I'm available for freelance projects and long-term engagements.


Related project: Pikkuna AI Chatbot — case study: production RAG chatbot for Pikkuna, ingestion pipeline architecture, and full performance metrics.

Iurii Rogulia

Senior Full-Stack Developer | Python, React, TypeScript, SaaS, APIs

Senior full-stack developer based in Finland. I write about Python, React, TypeScript, and real-world software engineering.