RAG Failure Diagnostics & Architect

Diagnoses why a RAG system underperforms and architects the fix, with an evaluation harness and remediation plan.

Category: AI & Development
Deliverable: 1 .skill bundle
Outputs: 5
Last updated: 19 Jun 2026

$7 One-time · lifetime updates

Works in Claude Pro, Team, and Enterprise
Lifetime access to updates
Refundable for 30 days via the marketplace

StrategistKit Affiliate. Purchase happens on the marketplace, which handles payment, delivery and refunds.

Overview

What RAG Failure Diagnostics & Architect does.

This skill applies a structured debugging framework to RAG systems that return confident but wrong answers. You describe your setup (the corpus, chunking strategy, embedding model, retrieval pipeline, and a failing query or question type), and the skill first classifies the failure as a retrieval miss, a generation hallucination, or a structural incompatibility with vector search before recommending anything. It operates in three modes: DIAGNOSE for a specific bad query, ARCHITECT for choosing the right retrieval shape for a given question type, and SCHEMA for designing an institutional-memory layer that holds the context embeddings permanently discard.

A typical input: 'Our RAG system answers questions about internal engineering decisions. When users ask why a particular architecture was chosen two years ago, it retrieves plausible documents but stitches together a confident, fabricated rationale. Chunking is paragraph-level, embeddings are OpenAI text-embedding-3-small, no reranker.' The skill identifies this as a causal-chain failure — the answer requires decision provenance, not semantic proximity — and routes it to DIAGNOSE and ARCHITECT modes rather than chunk-size tuning.

The output for that input would include:

Failure mode: 'Causal/provenance miss: the rationale was never stored as a traversable relationship; top-k retrieval cannot reconstruct it.'

Structural cause: 'Vector search flattens decision context into token proximity; reranking cannot recover what was never indexed.'

Architect recommendation: 'Decision-provenance graph with typed edges (decision → option → rationale → author → date); plain RAG is a structural mismatch for this question class.'

Remediation plan: ranked steps from schema design to query routing, with effort estimates and the explicit 80% boundary where tuning alone would stop helping.
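
To make the 'typed edges' idea concrete, here is a minimal Python sketch of a decision-provenance graph. It is an illustration under stated assumptions, not the skill's actual output: the class name `ProvenanceGraph`, the edge-type names (`has_option`, `has_rationale`, `has_author`, `has_date`), and the sample nodes are all hypothetical. It shows why a 'why' question becomes a traversal instead of a similarity search.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Tiny in-memory graph with typed edges (hypothetical schema)."""

    def __init__(self):
        # source node -> list of (edge_type, target node)
        self.edges = defaultdict(list)

    def add_edge(self, source, edge_type, target):
        self.edges[source].append((edge_type, target))

    def follow(self, start, edge_types):
        """Follow a fixed sequence of edge types from `start`.

        Returns every complete path; branches that lack the next
        edge type are dropped.
        """
        paths = [[start]]
        for edge_type in edge_types:
            paths = [
                path + [target]
                for path in paths
                for etype, target in self.edges.get(path[-1], [])
                if etype == edge_type
            ]
        return paths


# Hypothetical sample data: one decision, two options, one recorded rationale.
graph = ProvenanceGraph()
graph.add_edge("decision:message-queue", "has_option", "option:kafka")
graph.add_edge("decision:message-queue", "has_option", "option:rabbitmq")
graph.add_edge("option:kafka", "has_rationale", "rationale:event-replay")
graph.add_edge("rationale:event-replay", "has_author", "author:platform-lead")
graph.add_edge("author:platform-lead", "has_date", "date:2023-04")

# decision -> option -> rationale -> author -> date
chain = ["has_option", "has_rationale", "has_author", "has_date"]
provenance = graph.follow("decision:message-queue", chain)
```

Only the option with a recorded rationale survives the traversal. If the rationale edge was never stored, the result is empty, which is the honest answer that top-k retrieval cannot give.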

Who it's for

ML engineers and AI architects who built a RAG system that is underperforming in production and need to know whether the problem is tunable or structurally wrong. Also useful for technical leads scoping a new retrieval system who want to avoid defaulting to pure vector search for question types it cannot reliably answer.

What you get

One skill. 5 outputs.

One .skill bundle. Run it on your material and it returns:

Failure-mode diagnosis

Retrieval vs generation isolation

Chunking/embedding/rerank review

Eval harness design

Prioritized remediation plan

How it works

Three steps. About two minutes.

Install

Add the .skill file to your Claude app. ~10 seconds.

Run it on your work

Invoke the skill and paste in your material.

Apply the output

Review, keep what works, and use it.

In depth

Why a Claude skill beats a prompt template.

A copy-paste prompt runs one static pass and stops. A skill is a bundled program — instructions, examples, and a workflow Claude runs as a unit: it asks for the right input, applies the same pattern every time, and returns the structured outputs above.

FAQ

Common questions.

What do I need to provide as input for the skill to be useful?

Describe your retrieval pipeline (corpus type, chunking approach, embedding model, any reranker or hybrid search in use) and give at least one concrete failing query with the wrong answer it returned. The more specific the failure description, the more targeted the diagnosis.

Will it just tell me to fix my chunk size?

No — the skill explicitly classifies the failure before recommending any parameter tuning. If the query requires multi-hop reasoning, temporal ordering, causal explanation, or aggregation, the skill names that as a structural mismatch and recommends the appropriate retrieval architecture instead of tuning advice that cannot fix the root cause.

What does the evaluation harness output look like?

It designs a harness matched to your failure mode: for retrieval failures it specifies recall metrics against a labeled query set; for generation failures it specifies faithfulness checks against retrieved context. It does not produce runnable code, but it produces the test design, metric selection, and labeling criteria you hand to an engineer to implement.
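
As a rough idea of what an engineer might build from the retrieval side of such a design, here is a minimal Python sketch of recall@k over a labeled query set. The skill itself does not emit this code. The function names, the `retrieve` callable, and the sample document IDs are assumptions made for illustration. Faithfulness checks are left out because they usually need a judge model or human labels.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(labeled_queries, retrieve, k):
    """Average recall@k over a labeled set.

    labeled_queries: dict mapping query text -> set of relevant document ids
    retrieve: callable returning a ranked list of document ids for a query
              (stands in for your real retriever)
    """
    scores = [
        recall_at_k(retrieve(query), relevant, k)
        for query, relevant in labeled_queries.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical labeled set and a canned retriever standing in for the pipeline.
labeled = {
    "why was kafka chosen": {"doc-1", "doc-2"},
    "who approved the migration": {"doc-9"},
}
canned_results = {
    "why was kafka chosen": ["doc-1", "doc-3", "doc-2"],
    "who approved the migration": ["doc-4", "doc-5", "doc-6"],
}


def fake_retrieve(query):
    return canned_results[query]


recall_at_2 = mean_recall_at_k(labeled, fake_retrieve, k=2)
recall_at_3 = mean_recall_at_k(labeled, fake_retrieve, k=3)
```

Running the same labeled set at several values of k helps separate 'the document is ranked too low' from 'the document is never retrieved', which is the retrieval-versus-generation isolation described above.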

Does the skill cover knowledge graphs and structured retrieval, or only vector RAG?

It covers the full decision space — vector RAG with hybrid search and reranking, knowledge graph and GraphRAG patterns, temporal and event-sourced indexes, structured query layers, and hybrid routers that dispatch by query type. It recommends the cheapest architecture that actually answers the question class, including cases where the right answer is a SQL query rather than RAG at all.
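
A hybrid router that dispatches by query type could be sketched as follows. This is a deliberately naive keyword version in Python, assuming hypothetical backend names (`sql`, `provenance_graph`, `temporal_index`, `vector_rag`) and hand-written patterns; a production router would more likely use a trained classifier or an LLM call. It only illustrates the shape of the dispatch.

```python
import re

# (query class, pattern, backend) checked in order; all names are hypothetical.
ROUTES = [
    ("aggregation", re.compile(r"\b(how many|count|total|average|sum)\b"), "sql"),
    ("causal", re.compile(r"\b(why|what caused|rationale)\b"), "provenance_graph"),
    ("temporal", re.compile(r"\b(before|after|when|timeline)\b"), "temporal_index"),
]

DEFAULT_ROUTE = ("semantic_lookup", "vector_rag")


def route(query):
    """Return (query_class, backend) for a query using simple keyword rules."""
    text = query.lower()
    for query_class, pattern, backend in ROUTES:
        if pattern.search(text):
            return query_class, backend
    # Nothing structural detected: plain vector retrieval is the cheapest fit.
    return DEFAULT_ROUTE


decisions = {
    q: route(q)
    for q in [
        "Why was Postgres chosen over MySQL?",
        "How many incidents were filed last quarter?",
        "What changed after the 2023 migration?",
        "What does the onboarding guide say about VPN access?",
    ]
}
```

Putting the aggregation rule first reflects the point above that some questions are best answered by a SQL query and should never reach the vector index.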

Can I use this skill while designing a RAG system, before I have a failing query?

Yes. The ARCHITECT mode takes a question type or system description and maps it to the right retrieval shape, explaining what breaks if you default to pure vector search for that workload. The SCHEMA mode designs the institutional-memory layer for organizations that need agents to answer 'why' and 'what caused' questions that embeddings cannot support.
