
How We Built Agentic Retrieval at Ragie
Most RAG systems intercept a message, retrieve similar text with semantic and keyword search, and inject it into context so the LLM can answer. That works for simple lookups, but the pattern falls short when questions are messy, multi-step, or ambiguous.
Agentic retrieval adds reasoning to retrieval. It guides how we decompose a question, clean inputs, choose search strategies, and check results before generation. You cannot answer a question like “If we exclude the impact of M&A, which segment has dragged down 3M’s overall growth in 2022?” with one retrieval pass. It requires breaking the problem into parts, following dependencies, and validating evidence.
This is why we built deep-search at Ragie. Its goal is simple: get the right answer from your data with clearly verifiable sources. Below, we show where traditional retrieval breaks, how agentic retrieval fixes it, and how we implemented those ideas.
How Agentic Retrieval Solves These Problems
Agentic retrieval tackles these gaps in classic RAG head-on: it reasons about the query, chooses retrieval strategies dynamically, and evaluates results along the way.
Query Decomposition
Queries often mix conversational fluff with real search intent, or require multiple facts to answer. Agentic retrieval breaks these down into focused sub-queries, retrieves answers independently, and then synthesizes them into a coherent final response. Take an input like “Can you explain how diabetes affects kidney function and what the main treatment options are for diabetic nephropathy?” A single retrieval pass will struggle to recall the right chunks for that question, but decomposed into “How does diabetes affect kidney function?” and “What are the main treatment options for diabetic nephropathy?”, the retrieval system can easily find the knowledge needed to synthesize an answer.
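As a rough sketch of the idea (not Ragie’s actual implementation), the decomposition step can be a single LLM call that returns standalone sub-queries. The prompt, model choice, and decompose helper below are illustrative assumptions:

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY; any capable LLM works for this step

def decompose(question: str) -> list[str]:
    """Ask an LLM to split a compound question into standalone search queries."""
    prompt = (
        "Split this question into the minimal set of standalone search queries, "
        "one per line. Drop conversational filler.\n\nQuestion: " + question
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

sub_queries = decompose(
    "Can you explain how diabetes affects kidney function and what the main "
    "treatment options are for diabetic nephropathy?"
)
# Each sub-query is retrieved independently; the results are then synthesized.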
Multiple Search Strategies
Retrieval agents can flexibly choose how they search, tuning parameters like top_k, using hierarchical indexes, or generating hypothetical documents (HyDE) to improve recall.
Instead of a single static retrieval pass, agents can run several strategies in parallel, merge results, and adapt mid-search. This flexibility consistently outperforms rigid retrieval pipelines on large, unpredictable datasets.
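A minimal sketch of that pattern, assuming hypothetical semantic_search, keyword_search, and hyde_search retrievers (stubbed out here), might look like this:

import asyncio
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str
    score: float

# Hypothetical retrievers standing in for a vector store, a keyword index, and a
# HyDE pass (searching with an LLM-generated hypothetical answer). Stubbed out here.
async def semantic_search(query: str, top_k: int) -> list[Chunk]:
    return []

async def keyword_search(query: str, top_k: int) -> list[Chunk]:
    return []

async def hyde_search(query: str, top_k: int) -> list[Chunk]:
    return []

async def multi_strategy_retrieve(query: str, top_k: int = 8) -> list[Chunk]:
    # Run complementary strategies concurrently, then merge and deduplicate.
    batches = await asyncio.gather(
        semantic_search(query, top_k),
        keyword_search(query, top_k),
        hyde_search(query, top_k),
    )
    merged: dict[str, Chunk] = {}
    for chunk in (c for batch in batches for c in batch):
        if chunk.id not in merged or chunk.score > merged[chunk.id].score:
            merged[chunk.id] = chunk
    return sorted(merged.values(), key=lambda c: c.score, reverse=True)[:top_k]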
Dynamic Planning
The defining feature of an agent is its ability to plan. Advanced retrieval agents can form an initial plan, monitor their own progress, and adapt when results aren’t working.
A traditional RAG query just runs once and hopes for a good match. An agentic retriever can reflect, discard bad directions, or inject new sub-queries to fill gaps. This self-directed problem solving dramatically improves success rates.
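Here’s a toy version of that reflect-and-adapt loop; retrieve and find_gaps are hypothetical stand-ins for your index and an LLM reflection step, not Ragie’s code:

def decompose(question: str) -> list[str]:
    # Placeholder decomposition; see the earlier sketch for an LLM-backed version.
    return [q.strip() for q in question.split(" and ") if q.strip()]

def retrieve(query: str) -> list[str]:
    return []  # hypothetical retrieval call against your index

def find_gaps(question: str, evidence: list[str]) -> list[str]:
    return []  # hypothetical LLM reflection: "what is still missing?"

def plan_and_retrieve(question: str, max_rounds: int = 4) -> list[str]:
    """Toy planning loop: retrieve per sub-query, reflect, and queue follow-ups."""
    open_queries = decompose(question)
    evidence: list[str] = []
    for _ in range(max_rounds):
        if not open_queries:
            break
        query = open_queries.pop(0)
        chunks = retrieve(query)
        if not chunks:
            continue  # discard a dead-end direction instead of looping on it
        evidence.extend(chunks)
        open_queries.extend(find_gaps(question, evidence))  # inject new sub-queries
    return evidence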
Result Evaluation and Feedback Loops
Agentic RAG systems can use LLMs not only to retrieve and generate, but also to evaluate their own results. They can grade the quality of retrieved data and synthesized answers, record what failed, and try again.
This built-in feedback loop reduces hallucinations and catches wrong or unsupported claims, something static systems simply can’t do.
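A minimal sketch of such a grader, assuming an OpenAI-style client and an illustrative prompt (not Ragie’s evaluation prompt):

from openai import OpenAI

client = OpenAI()  # any grading-capable LLM works; the model name below is illustrative

def grade_answer(question: str, answer: str, evidence: list[str]) -> tuple[bool, str]:
    """Ask an LLM whether the draft answer is fully supported by the evidence."""
    prompt = (
        "Question:\n" + question + "\n\nDraft answer:\n" + answer +
        "\n\nEvidence:\n" + "\n---\n".join(evidence) +
        "\n\nReply 'PASS' if every claim is supported, "
        "otherwise 'FAIL: <what is missing or wrong>'."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return verdict.startswith("PASS"), verdict

# On FAIL, the notes are fed back into planning so the next attempt targets the gap.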
Inside Ragie’s Deep-Search
Now that we’ve covered the general principles, here’s how we built deep-search.
Multi-Agent Architecture
deep-search uses multiple specialized sub-agents, each with a clear role. Context is tightly managed so each agent can focus on its task, and every step produces structured, inspectable output.
- Orchestrate – Decides which sub-agent to invoke next.
- Plan – Breaks down queries, creates sub-queries, and tracks known information.
- Search – Selects retrieval strategies, searches, and reranks results.
- Code – Executes complex math or data processing tasks in a sandbox.
- Answer – Synthesizes answers from collected evidence, noting sources.
- Evaluate – Checks whether answers are well-supported and complete.
- Cite – Produces precise, statement-level citations.
- Fail – Handles graceful fallback and partial answers when data or budget is insufficient.
Dynamic Planning on Rails
The orchestrator controls which sub-agent runs next. While it’s mostly autonomous, we constrain its choices to keep it on track. Think of it as dynamic planning within guardrails.
deep-search behaves like a state machine: the sub-agents are its states, and the orchestrator’s rules act as conditional transitions between them. Some examples of this logic:
- Don’t answer until supporting evidence exists.
- Stop adding new sub-queries when too many are unresolved.
- Remove failed sub-queries to avoid loops.
- Prioritize wrapping up when nearing token or time limits.
This structure gives the agent freedom to reason, but within boundaries that ensure reliability.
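To make that concrete, here’s a toy orchestrator guard in Python; the state fields, thresholds, and agent names are illustrative, not Ragie’s production values:

from dataclasses import dataclass, field

@dataclass
class RunState:
    evidence: list = field(default_factory=list)
    unresolved_queries: list = field(default_factory=list)
    tokens_used: int = 0
    token_budget: int = 100_000

def next_agent(state: RunState, proposed: str) -> str:
    """The LLM proposes the next sub-agent; hard guards keep the run on rails."""
    if state.tokens_used > 0.9 * state.token_budget:
        return "answer" if state.evidence else "fail"  # prioritize wrapping up
    if proposed == "answer" and not state.evidence:
        return "search"                                # don't answer without evidence
    if proposed == "plan" and len(state.unresolved_queries) >= 5:
        return "search"                                # stop adding new sub-queries
    return proposed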
Smarter Search Strategies
deep-search attacks retrieval from multiple angles: query rewriting, chunk expansion, contextual reranking, and adaptive effort scaling when results are weak. By running complementary strategies in parallel and filtering intelligently, it can surface the best possible results, even when the source data is “messy”.
Metadata-Aware Filtering
deep-search runs within a Ragie partition, which can include an optional metadata filter schema (a JSON schema defining filterable fields).
For example, if a partition defines source_type and created_at, and you ask:
“What bugs did customers email me about last week?”
The agent can automatically construct a filter like: source_type = gmail and created_at >= last_week, narrowing the search to the most relevant documents.
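Illustratively (the exact schema and filter syntax Ragie accepts may differ, so treat the field names and operators here as assumptions), the partition’s filter schema and the filter the agent derives could look like this:

from datetime import datetime, timedelta, timezone

# Filterable fields declared on the partition.
partition_filter_schema = {
    "source_type": {"type": "string", "enum": ["gmail", "slack", "drive"]},
    "created_at": {"type": "string", "format": "date-time"},
}

# Structured filter the agent derives from "What bugs did customers email me about last week?"
derived_filter = {
    "source_type": {"$eq": "gmail"},
    "created_at": {"$gte": (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()},
}
# The filter is applied before semantic/keyword search, so only matching documents are scored.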
Verifiable Citations
Every source used in an answer is tracked throughout the run. When the final answer is produced, deep-search maps each statement to its supporting evidence. The result: clear provenance and confidence in every citation.
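Conceptually, the Cite step reduces to mapping each statement in the answer to the (document, chunk) pairs gathered during the run; the data shapes below are illustrative only:

from dataclasses import dataclass

@dataclass
class Citation:
    statement: str
    document_id: str
    chunk_id: str

def cite(statements: list[str], supporting: dict[str, list[tuple[str, str]]]) -> list[Citation]:
    """supporting maps each statement to the (document_id, chunk_id) pairs that back it."""
    citations = []
    for statement in statements:
        for doc_id, chunk_id in supporting.get(statement, []):
            citations.append(Citation(statement, doc_id, chunk_id))
    return citations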
Built-In Auto-Correction
deep-search continually evaluates its work, rejecting irrelevant or incorrect evidence and reassessing partial answers. When evaluation fails, it notes the issues and retries with those insights. This self-correcting loop yields more accurate and complete answers over time.
Adjustable Effort Levels
There’s always a tradeoff between quality, cost, and latency. deep-search supports low, medium, and high effort modes:
- Low for simple questions that can generally be answered with explicit text found in a handful of documents.
- Medium for moderately complex questions that can be answered by breaking the input into a few sub-questions, each of which can be answered by explicit text in your documents.
- High for deep, multi-hop reasoning across complex data, where a correct answer may depend on many facts and some of that information may be contradictory.
Effort level tunes model selection, token budgets, turn limits, and how aggressively sub-queries are evaluated. The goal is to match depth of reasoning to the complexity of the question.
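As a purely hypothetical illustration (these are not Ragie’s actual settings), an effort profile might bundle those knobs like this:

# Made-up values showing the shape of an effort profile, not production configuration.
EFFORT_PROFILES = {
    "low":    {"model": "small-fast-model", "max_turns": 4,  "token_budget": 30_000,  "strict_eval": False},
    "medium": {"model": "mid-tier-model",   "max_turns": 10, "token_budget": 120_000, "strict_eval": True},
    "high":   {"model": "frontier-model",   "max_turns": 25, "token_budget": 500_000, "strict_eval": True},
}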
Responses API Compatibility
deep-search uses the OpenAI Responses API schema, so compatible clients can integrate with minimal friction. The schema supports streaming, multimodal input, and structured, multi-turn interactions. This compatibility is ideal for agentic workflows.
Inspectable, Structured Outputs
Every step in a deep-search run is transparent. Outputs are structured and traceable, showing what each sub-agent did and why. This makes debugging, auditing, and explaining results straightforward. No black boxes! In user testing, people consistently preferred this transparency to opaque systems.
Benchmark Results
We’re currently preparing full production benchmarks for deep-search. During development, we used FinanceBench, a benchmark of 150 difficult questions over financial filings full of tables and graphs.
Example question:
“What is the FY2019 cash conversion cycle (CCC) for General Mills? CCC is defined as: DIO + DSO - DPO. DIO is defined as: 365 * (average inventory between FY2018 and FY2019) / (FY2019 COGS). DSO is defined as: 365 * (average accounts receivable between FY2018 and FY2019) / (FY2019 Revenue). DPO is defined as: 365 * (average accounts payable between FY2018 and FY2019) / (FY2019 COGS + change in inventory between FY2018 and FY2019). Round your answer to two decimal places. Address the question by using the line items and information shown within the income statement and the balance sheet.”
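Answering that correctly means pulling half a dozen line items and then doing the arithmetic, which is exactly the kind of task the Code sub-agent handles in its sandbox. With made-up numbers (not General Mills’ actual figures), the computation looks like this:

# Hypothetical line items purely to show how the pieces combine.
inventory_fy2018, inventory_fy2019 = 1_500.0, 1_600.0
receivables_fy2018, receivables_fy2019 = 1_650.0, 1_700.0
payables_fy2018, payables_fy2019 = 2_750.0, 2_850.0
cogs_fy2019, revenue_fy2019 = 11_100.0, 16_900.0

dio = 365 * ((inventory_fy2018 + inventory_fy2019) / 2) / cogs_fy2019
dso = 365 * ((receivables_fy2018 + receivables_fy2019) / 2) / revenue_fy2019
dpo = 365 * ((payables_fy2018 + payables_fy2019) / 2) / (
    cogs_fy2019 + (inventory_fy2019 - inventory_fy2018))
ccc = round(dio + dso - dpo, 2)  # CCC = DIO + DSO - DPO, rounded to two decimals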
Our best traditional RAG run scored 45%, while the deep-search prototype hit 91%, possibly SOTA.
That prototype used no budget limits and the most capable models (o3 for every step). Production deep-search uses more practical configurations, and we’ll soon publish results for all three effort levels. We expect strong accuracy, especially at “high” effort, though 91% remains an upper bound.
How to Try Deep-Search
Base Chat
deep-search is live and free to try in our hosted Base Chat. It’s also open source, so you can see exactly how we integrated it.
API
deep-search is available through our new /responses endpoint, which follows the OpenAI schema.
Here’s an example request:
curl -X POST 'https://api.ragie.ai/responses' \
-H 'accept: application/json' \
-H 'Authorization: Bearer tnt_REDACTED' \
-H 'Content-Type: application/json' \
-d '{
"input": "If we exclude the impact of M&A, which segment has dragged down 3M’s overall growth in 2022?",
"instructions": "search for data in company financial filings. Expect to be searching over 10k filings",
"tools": [
{ "type": "retrieve", "partitions": ["fin_bench"] }
],
"model": "deep-search",
"reasoning": { "effort": "medium" },
"stream": true
}'
Key Parameters
- input – The question to answer.
- instructions – Optional guidance for the agent.
- tools – Defines which partition(s) to search. More tool options are in the works.
- model – Currently supports only "deep-search".
- reasoning – Effort level: "low", "medium", "high".
- stream – Stream intermediate updates via SSE or wait for the final result.
The response mirrors the Responses API, with a final message containing both the answer and a structured log of the full agent run.
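A minimal Python consumer of the stream could look like the sketch below; the SSE event names follow OpenAI Responses API conventions, and the exact events and payloads deep-search emits may differ:

import json
import requests

resp = requests.post(
    "https://api.ragie.ai/responses",
    headers={"Authorization": "Bearer tnt_REDACTED", "Content-Type": "application/json"},
    json={
        "input": "If we exclude the impact of M&A, which segment has dragged down 3M's overall growth in 2022?",
        "tools": [{"type": "retrieve", "partitions": ["fin_bench"]}],
        "model": "deep-search",
        "reasoning": {"effort": "medium"},
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        break
    event = json.loads(payload)
    # Intermediate events describe sub-agent progress; the final message carries
    # the answer plus the structured log of the run.
    print(event.get("type"))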
Partition Settings
Partitions now support descriptions and metadata filter schemas in the Ragie web app. This lets you define which metadata fields agents can use for filtering, making your retrieval more focused and reliable.
Wrapping up
These capabilities make deep-search not just a better retriever, but a more reliable reasoning system end-to-end.
Before we conclude, here are the key takeaways from everything above:
- RAG alone can’t reason about retrieval.
- Agentic retrieval dynamically adapts and verifies.
- deep-search operationalizes these ideas for production reliability.
👉 Sign up for Ragie to start building with Agentic Retrieval.
👉 Book a demo with our team and see how Agentic Retrieval can fit your needs.