The right question isn't "which is better"

Every week a CTO asks me the same thing: "is it worth fine-tuning our model?" The right answer is almost always another question — how often does your data change?

If you say "every day" (catalog, prices, tickets, internal policies), RAG is the obvious choice. If you say "rarely" (brand voice, stable internal taxonomy, rigid output format), fine-tuning pays off. And if you have both scenarios at once, which is the usual case for mid-market Brazilian companies in production, you combine the two.
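As a rough sketch, that rule of thumb fits in a few lines of Python. The `Workload` fields and the `recommend` function are hypothetical names chosen for illustration, and defaulting to RAG when neither need applies is an assumption borrowed from the FAQ at the end of this article.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_changes_often: bool  # catalog, prices, tickets, internal policies
    needs_fixed_style: bool   # brand voice, stable taxonomy, rigid output format


def recommend(w: Workload) -> str:
    """Map the two questions from the text to an approach."""
    if w.data_changes_often and w.needs_fixed_style:
        return "hybrid"
    if w.needs_fixed_style:
        return "fine-tuning"
    # Assumption: RAG is the default when style is not the main concern.
    return "rag"


choice = recommend(Workload(data_changes_often=True, needs_fixed_style=True))
```

A real assessment weighs more than two booleans (budget, latency, compliance), but the first cut usually looks like this.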

"Fine-tuning is like a tattoo — beautiful, durable, but painful to change. RAG is like wearing a shirt: you change it based on the temperature."

— Principle I use in every technical onboarding

What is RAG (Retrieval Augmented Generation)

Definition
RAG · Retrieval Augmented Generation
RAG is an AI architecture that connects an LLM to an external knowledge base (typically a vector store). For each question, the system retrieves the most relevant passages from the knowledge base and passes them as context to the model, which generates its answer from that data without needing to be retrained.

In practice, the RAG pipeline has three steps:

  1. Ingestion — your documents (Notion, PDFs, databases) are split into chunks and converted to vectors (embeddings).
  2. Retrieval — when the user asks something, the question becomes a vector and we search for the most similar chunks in the base.
  3. Generation — the LLM receives the question + retrieved passages as context and produces the answer, citing sources.
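The three steps above can be sketched in plain Python. This is a minimal illustration under loud assumptions: the "embedding" is a toy word-count vector standing in for a real embedding model, the document names are made up, and the final LLM call is left out, so the sketch stops at the prompt the model would receive.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(docs: dict[str, str], chunk_words: int = 12) -> list[tuple[str, str, Counter]]:
    """Step 1: split each document into chunks and embed every chunk."""
    index = []
    for source, text in docs.items():
        words = text.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            index.append((source, chunk, embed(chunk)))
    return index


def retrieve(index, question: str, k: int = 2) -> list[tuple[str, str]]:
    """Step 2: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(source, chunk) for source, chunk, _ in ranked[:k]]


def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Step 3, input side: the text the LLM would receive."""
    context = "\n".join(f"[{source}] {chunk}" for source, chunk in passages)
    return (
        "Answer using only the passages below and cite the source in brackets.\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical documents, for illustration only.
docs = {
    "returns.md": "Customers may return any item within 30 days of delivery.",
    "shipping.md": "Standard shipping takes 5 business days in the Southeast region.",
}
index = ingest(docs)
question = "How many days do customers have to return an item?"
passages = retrieve(index, question, k=1)
prompt = build_prompt(question, passages)
```

In production, the list would be replaced by a vector store and `embed` by a real model, but the shape of the pipeline stays the same.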

The big win is near-instant updates: edit the document in Notion, and as soon as the changed chunks are re-embedded (seconds to a few minutes, depending on your sync), the AI answers from the new version, with no retraining.
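One common way to keep that update cheap is to re-embed only the documents whose content actually changed. The sketch below assumes a content hash per document; `sync`, `fingerprint`, and `fake_embed` are hypothetical names, and the in-memory dict stands in for a real vector store.

```python
import hashlib


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync(docs: dict[str, str], index: dict[str, dict], embed) -> list[str]:
    """Re-embed only new or changed documents; return the ids that were refreshed."""
    refreshed = []
    for doc_id, text in docs.items():
        digest = fingerprint(text)
        entry = index.get(doc_id)
        if entry is None or entry["hash"] != digest:
            index[doc_id] = {"hash": digest, "text": text, "vector": embed(text)}
            refreshed.append(doc_id)
    # Drop entries for documents that no longer exist in the source.
    for doc_id in list(index):
        if doc_id not in docs:
            del index[doc_id]
    return refreshed


def fake_embed(text: str) -> list[float]:
    # Placeholder for a real embedding call.
    return [float(len(text))]


index: dict[str, dict] = {}
docs = {"pricing": "Plan A costs R$ 99.", "returns": "Returns within 30 days."}
first = sync(docs, index, fake_embed)   # both documents are new
docs["pricing"] = "Plan A costs R$ 109."
second = sync(docs, index, fake_embed)  # only the edited document is re-embedded
```

The second call touches a single document, which is why an edit can reach the model quickly without reprocessing the whole base.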

What is Fine-tuning

Definition
Fine-tuning · model fine-tuning
Fine-tuning is the process of continuing the training of a pre-trained model on a dataset specific to your domain. The result is a model whose weights have been adjusted to reflect the style, vocabulary, and rules of your business — embedded inside the model itself.

There are variations with very different costs:

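  • Full fine-tuning: updates every parameter of the model (all 32B in a 32B model). Expensive, slow, rarely justified.
  • LoRA / QLoRA: trains only small low-rank adapter matrices next to the frozen weights (QLoRA also quantizes the base model to 4-bit). Roughly 10-100× cheaper, usually with little loss of in-domain quality.
  • Instruction tuning: teaches response format and style from curated examples.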
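To see why adapters are so much lighter, it helps to count trainable parameters for one weight matrix. LoRA freezes a d_out × d_in matrix and trains two low-rank factors of shapes d_out × r and r × d_in. The hidden size and rank below are assumed values for illustration, not figures for any specific model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (d_out x r) plus (r x d_in)."""
    return rank * (d_out + d_in)


d = 5120   # assumed hidden size
rank = 16  # assumed LoRA rank

full = full_params(d, d)        # 26_214_400
lora = lora_params(d, d, rank)  # 163_840
fraction = lora / full          # 0.00625, i.e. 0.625% of the matrix
```

The trainable-parameter ratio is not the same thing as the cost ratio, since the forward pass still runs through the full frozen model, but it explains most of the savings in optimizer state and gradient memory.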

For most Brazilian companies, QLoRA is the starting point: it runs on a single A100 GPU, trains in a few hours, and fits within an IT department budget without becoming a flagship project.
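A back-of-the-envelope estimate shows why a single GPU can be enough. The figures below count only memory for weights and optimizer state; activations, the adapters themselves, and framework overhead are deliberately ignored, and the 16 bytes per parameter for full fine-tuning is the usual rule of thumb for Adam with mixed precision, not a measurement.

```python
GB = 1e9  # decimal gigabytes


def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights."""
    return n_params * bits_per_param / 8 / GB


n_params = 32e9  # a 32B-parameter model

qlora_base_gb = weights_gb(n_params, 4)   # 4-bit frozen base: 16 GB
fp16_base_gb = weights_gb(n_params, 16)   # 16-bit weights: 64 GB

# Rule of thumb: fp16 weights + fp16 gradients + fp32 master weights
# + two fp32 Adam moments = about 16 bytes per parameter.
full_ft_gb = n_params * 16 / GB           # 512 GB
```

At roughly 16 GB for the quantized base, a 32B model leaves headroom on an A100 (sold in 40 GB and 80 GB versions), while full fine-tuning at around 512 GB needs a multi-GPU cluster before activations are even counted.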

Direct comparison · 8 criteria that matter

RAG vs Fine-tuning · 8 criteria

Criterion                | RAG                          | Fine-tuning (LoRA)
Knowledge updates        | Instant (edit doc)           | Retrain model
Startup cost             | Low (vector DB + embeddings) | Medium (GPU hours + curation)
Recurring cost           | Embedding + storage          | Inference only
Added latency            | +200-500ms (retrieval)       | ~0 (already in model)
Verifiable citation      | Yes (cites passage)          | No (opaque weights)
LGPD compliance          | Auditable document           | More complex (forgetting)
Brand voice / style      | Limited                      | Excellent
Structured format (JSON) | Good with prompt             | Nearly perfect

  • ~10× lower cost (RAG vs full FT)
  • 2-4h RAG iteration time
  • 7-14d FT iteration time

The hybrid architecture · LoRA + RAG

In production, the answer is almost never "one or the other". The standard configuration for MDA LLM is:

  1. Lightweight QLoRA on Qwen 3.6 32B to incorporate brand voice, internal vocabulary, and expected output format.
  2. RAG on vector store (Qdrant or pgvector) for volatile facts — prices, policies, customer history, tickets.
  3. Guardrails via LiteLLM to mask PII and prevent topic drift.
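The way the three layers fit together can be sketched as a single function that takes each layer as a callable. Everything here is illustrative: `answer` and the `demo_*` stand-ins are hypothetical names, and in a real deployment they would wrap the guardrail service, the vector store, and the QLoRA-tuned model respectively.

```python
from typing import Callable, Sequence


def answer(
    question: str,
    mask_pii: Callable[[str], str],
    retrieve: Callable[[str], Sequence[tuple[str, str]]],
    generate: Callable[[str], str],
) -> dict:
    """Guardrail, then retrieval, then generation with the tuned model."""
    safe_question = mask_pii(question)
    passages = retrieve(safe_question)
    context = "\n".join(f"[{source}] {text}" for source, text in passages)
    prompt = f"Context:\n{context}\n\nQuestion: {safe_question}"
    return {
        "answer": generate(prompt),
        "sources": [source for source, _ in passages],  # audit trail
        "prompt": prompt,
    }


# Stand-ins so the sketch runs; each would be a real component in production.
def demo_mask(text: str) -> str:
    return text.replace("maria@example.com", "<EMAIL>")


def demo_retrieve(query: str) -> list[tuple[str, str]]:
    return [("pricing.md", "Plan A costs R$ 99 per month.")]


def demo_generate(prompt: str) -> str:
    return "Plan A costs R$ 99 per month [pricing.md]."


result = answer(
    "I'm maria@example.com, how much is Plan A?",
    demo_mask,
    demo_retrieve,
    demo_generate,
)
```

Masking happens before retrieval so that personal data never reaches the vector search or the model, and returning `sources` alongside the answer is what makes the audit trail possible.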

Result: responses that sound like your brand, with facts updated in near real time and a complete audit trail of the cited passages. It's what CIOs in the Brazilian financial sector are choosing to unlock AI without legal friction.

Real case · mid-market Brazilian retailer

Customer: a retail chain with 1,200 stores and 17,000 SKUs whose prices change daily. In early 2025 it tested monthly full fine-tuning of a Llama 70B model for customer service. It spent ~USD 18k/month on GPUs, and the AI still answered with the previous month's prices.

It migrated to the MDA LLM hybrid architecture in February 2026:

  • A single QLoRA run (~USD 800, one-time) to calibrate brand voice
  • RAG over catalog and policies (the vector store updates with each ERP sync, ~3 min)
  • Presidio guardrails to mask CPF numbers before the text reaches the prompt
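To make the masking step concrete, here is a minimal stand-in using a plain regular expression. It is not Presidio's API, only an illustration of the idea: replace anything shaped like a CPF (11 digits, with or without the usual punctuation) before the text goes any further. The sample CPF is a commonly used test number, and `mask_cpf` is a hypothetical helper name.

```python
import re

# Matches 000.000.000-00 as well as the unformatted 11-digit form.
CPF_PATTERN = re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b")


def mask_cpf(text: str) -> str:
    """Replace CPF-shaped numbers with a placeholder token."""
    return CPF_PATTERN.sub("<CPF>", text)


masked = mask_cpf("My CPF is 529.982.247-25 and my order is 48213.")
# masked == "My CPF is <CPF> and my order is 48213."
```

A pattern this simple also catches any other 11-digit number, which is one reason dedicated tools add check-digit validation and context words on top of the regex.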

In 90 days: monthly cost dropped to ~USD 2.4k (-87%), prices were correct in 99.8% of tickets, and legal approved the deployment for the first time, thanks to complete citation traceability.
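The percentage follows directly from the two monthly figures quoted in this case. A quick check, using only those numbers:

```python
monthly_before_usd = 18_000  # full fine-tuning, GPU spend per month
monthly_after_usd = 2_400    # hybrid architecture, cost per month

monthly_saving_usd = monthly_before_usd - monthly_after_usd  # 15_600
reduction = monthly_saving_usd / monthly_before_usd          # about 0.867
reduction_pct = round(reduction * 100)                       # 87
```

Against a saving of that size, the one-time ~USD 800 for the QLoRA run is a small line item.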

"What matters isn't which technique you use — it's whether the architecture fits your budget, legal, and business velocity."

Frequently asked questions

What is RAG (Retrieval Augmented Generation)?

RAG is an architecture that connects a language model to an external knowledge base (typically a vector store). For each question, the system retrieves the most relevant passages from the knowledge base and passes them as context to the LLM, which generates its answer from that data without needing to be retrained.

What is fine-tuning an LLM?

Fine-tuning is the process of continuing the training of a pre-trained model on a dataset specific to your domain. The result is a model whose weights have been adjusted to reflect the style, vocabulary, and rules of your business.

When should I use RAG instead of fine-tuning?

Use RAG when data changes frequently (policies, prices, catalog, tickets), when you need verifiable citations, when the GPU budget is tight, or when LGPD requires source traceability. As a rule of thumb, RAG is the default choice for roughly 80% of Brazilian B2B cases.

When is fine-tuning worth the investment?

Fine-tuning is worth it when you need a very specific brand voice, a structured output format (rigid JSON), stable knowledge that stays unchanged for 6+ months, or a sub-200ms latency budget that leaves no room for the 200-500ms a retrieval step adds.

Can I combine RAG and fine-tuning?

Yes, and it's the most robust approach for enterprise use. Do lightweight fine-tuning (LoRA) for voice and format, and use RAG for facts and up-to-date knowledge. It's the standard MDA LLM architecture in enterprise deployments.