Implementa.

Building an AI agent · Guide 5 of 6

How to train an AI agent: data, evals, governance

Training an AI agent isn't "upload a PDF" or "fine-tune the model". In 2026, it almost always means building a good RAG system, writing the evals that measure whether the agent responds well, and designing the loop that improves it based on real data. The rest is confusing vocabulary, and it costs money.

The 4 levels of "training": prompt, RAG, fine-tune, agent training

| Level | What it is | When |
|---|---|---|
| Prompt engineering | Writing good instructions (system prompt) | Always: the foundation of everything |
| RAG | Connecting the model to your knowledge base | When the agent must use your specific info |
| Fine-tuning | Adjusting the model with your examples | Rarely: only very specific cases |
| Agent training | Iterating on the full agent with evals | Always: a continuous cycle |
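
To make the first two levels concrete, here is a minimal sketch of how a system prompt and retrieved knowledge end up in the same request. It is an illustration under simplifying assumptions, not a production design: the knowledge base, the word-overlap ranking, and the names `KNOWLEDGE_BASE`, `retrieve` and `build_prompt` are all hypothetical, and a real RAG system would typically rank with embeddings and a vector store.

```python
import re

# Hypothetical knowledge base; in practice these would be chunks of your own documents.
KNOWLEDGE_BASE = [
    "A refund is available within 30 days of purchase.",
    "Support is open Monday to Friday, 9:00 to 18:00.",
    "Shipping to the EU takes 3 to 5 business days.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question.

    Word overlap is a stand-in for real similarity search (assumption).
    """
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: list[str]) -> str:
    """Combine instructions (prompt level) with retrieved context (RAG level)."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("Can I get a refund after purchase?", KNOWLEDGE_BASE))
```

The string returned by `build_prompt` is what would be sent to the model. Nothing about the model itself changes, which is why these two levels are far cheaper to iterate on than fine-tuning.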

Which one applies to your case

  • 90% of cases: prompt + RAG + agent training. No fine-tuning.
  • You need a very specific tone or style the prompt can't reach: consider fine-tuning on a small model.
  • Latency/cost constraints: fine-tuning on Llama or a similar model to run faster and cheaper.
  • Highly specialized data (medical, legal): combination of strong RAG + selective fine-tuning.

How to build evals (the part almost nobody does)

Evals are what separate a serious agent from a pretty demo. And almost nobody builds them. The process:

  1. Gather 50-200 representative inputs of the real cases the agent will handle.
  2. Define the expected output for each, or the range of acceptable outputs.
  3. Define automated evaluation criteria: measurable metrics (factual correctness, format, absence of hallucinations).
  4. Run after every change to the agent (prompt, RAG config, model). If the score drops, it doesn't ship.
  5. Iterate the set β€” add edge cases you spot in production.
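
The steps above can be sketched as a small harness. This is a minimal illustration, assuming substring checks are a good enough criterion for your cases; real setups often add format validators or model-graded checks. `EvalCase`, `score_case`, `run_evals`, `should_ship` and `fake_agent` are hypothetical names, and `fake_agent` stands in for the real agent so the sketch runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str                  # representative input (step 1)
    must_contain: list[str]      # facts an acceptable output has to include (step 2)
    must_not_contain: list[str]  # phrases that would signal a wrong or invented answer


def score_case(output: str, case: EvalCase) -> bool:
    """Automated criterion (step 3): required facts present, forbidden phrases absent."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the agent and return the pass rate between 0 and 1."""
    passed = sum(score_case(agent(c.prompt), c) for c in cases)
    return passed / len(cases)


def should_ship(new_score: float, baseline: float) -> bool:
    """Release gate (step 4): block the change if the score drops below the baseline."""
    return new_score >= baseline


# Hypothetical stand-in for the real agent (assumption, not a real API).
def fake_agent(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "You can request a refund within 30 days."
    return "I don't know."


CASES = [
    EvalCase("What is the refund window?", ["30 days"], ["60 days"]),
    EvalCase("Do you ship to Mars?", ["don't know"], ["yes"]),
]

if __name__ == "__main__":
    score = run_evals(fake_agent, CASES)
    print(f"pass rate: {score:.0%}, ship: {should_ship(score, baseline=0.9)}")
```

Step 5 then amounts to appending new `EvalCase` entries whenever production surfaces an edge case, so the baseline gets harder to meet over time.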

The continuous-improvement loop

  1. Production captures real interactions with feedback (CSAT, detected errors).
  2. Weekly human review: identify error patterns.
  3. Update the KB / prompt / config based on what you found.
  4. Run evals to check there are no regressions.
  5. Deploy the change.
  6. Back to step 1.
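
Steps 1 and 2 of this loop can be supported with very little code. The sketch below assumes each logged interaction carries a 1 to 5 CSAT score and an optional error tag set by a reviewer or detector; the `Interaction` fields, the tag names and the sample log are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    question: str
    csat: int                 # assumed 1-5 satisfaction score from the user
    error_tag: Optional[str]  # label from a reviewer or detector, None if untagged


def error_patterns(log: list[Interaction]) -> list[tuple[str, int]]:
    """Count tagged errors, most frequent first, to guide the weekly review."""
    tags = Counter(i.error_tag for i in log if i.error_tag is not None)
    return tags.most_common()


def needs_review(log: list[Interaction], max_csat: int = 2) -> list[str]:
    """Low-rated interactions nobody has tagged yet: a human should look at these."""
    return [i.question for i in log if i.csat <= max_csat and i.error_tag is None]


def new_eval_candidates(log: list[Interaction], tag: str) -> list[str]:
    """Questions behind one error pattern, candidates to add to the eval set."""
    return [i.question for i in log if i.error_tag == tag]


# Hypothetical sample of one week of production traffic.
WEEK_LOG = [
    Interaction("Refund after 45 days?", 1, "outdated_kb"),
    Interaction("Opening hours on Sunday?", 5, None),
    Interaction("Refund for digital goods?", 2, "outdated_kb"),
    Interaction("Cancel my order", 2, "wrong_format"),
    Interaction("Where is my invoice?", 1, None),
]

if __name__ == "__main__":
    print(error_patterns(WEEK_LOG))
    print(needs_review(WEEK_LOG))
    print(new_eval_candidates(WEEK_LOG, "outdated_kb"))
```

In this sample the most frequent tag points at the knowledge base, which suggests updating the KB in step 3 rather than touching the prompt.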

Governance and sensitive data

  • Business-plan APIs. By default, the OpenAI API and the Anthropic API on business plans don't train on your data. Confirm it in your DPA.
  • Anonymize when possible. Patterns help the model; names don't.
  • Encrypted logs. If you store conversations, encrypt at least the ones containing personal data.
  • Minimum retention. Don't keep what you don't need. Clear deletion policy.
  • Regular audit of what data enters the model and from where.
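
Two of these points, anonymization and minimum retention, can be sketched in a few lines. This is a rough illustration, not a compliance tool: the regular expressions are deliberately simple and will both miss some personal data and over-match some numbers, and the 30-day retention window and the names `anonymize` and `expired` are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

# Deliberately simple patterns (assumption): real PII detection needs more than regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders before storing a log."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def expired(stored_at: datetime, now: datetime, retention_days: int = 30) -> bool:
    """True if a stored conversation is older than the retention window."""
    return now - stored_at > timedelta(days=retention_days)


if __name__ == "__main__":
    print(anonymize("Contact ana@example.com or +34 600 123 456 about order 42"))
    now = datetime(2026, 3, 1, tzinfo=timezone.utc)
    print(expired(datetime(2026, 1, 1, tzinfo=timezone.utc), now))
```

Running `anonymize` before anything is written to the log, and a scheduled job that deletes records for which `expired` is true, covers the two bullets without changing the agent itself.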

Frequently asked questions

Do I need to fine-tune a model?

In 2026, almost never. Base models are good enough and RAG is flexible enough that 90% of cases get solved with prompt + RAG done well. Fine-tuning helps when: (1) you need a very specific tone or style you can't get from prompting, (2) you have latency or cost constraints that justify a smaller specialized model. If a vendor offers fine-tuning as the first option, ask why not RAG.

How do I know whether my agent works well?

With evals: a set of representative inputs + expected outputs + automated evaluation criteria. Evals run on every agent change and produce a score. If your agent "works" but you don't have evals, you don't know; you have a hunch. The difference between a serious agent and a pretty demo is almost always whether evals exist.

Which tools should I use for evals?

To start: Promptfoo or LangSmith. For serious production: Braintrust or Galileo. If your team is very technical and you want open source: DeepEval. The pick matters less than the habit: the problem isn't which tool, it's that most teams don't do evals at all.

