BenchRAG: A Modular RAG Evaluation Toolkit
A modular and extensible Retrieval-Augmented Generation (RAG) evaluation framework with independent modules for query interpretation, retrieval, compression, and answer generation.
This project separates the RAG pipeline into four independent, reusable components (their data flow is sketched in code after this list):
- Interpreter: Understands query intent and expands or decomposes complex questions
- Retriever: Fetches relevant documents from a corpus
- Compressor: Compresses context using extractive or generative methods
- Generator: Generates answers based on the compressed context
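
Because each stage consumes only the previous stage's output, the four components compose into a simple function chain, and any one of them can be swapped or benchmarked in isolation. The sketch below illustrates the intended data flow; the class and method names (`interpret`, `retrieve`, `compress`, `generate`) are hypothetical placeholders rather than BenchRAG's actual API, so consult `examples/` for runnable scripts.

```python
# Hypothetical sketch of the four-stage data flow. The method names are
# illustrative placeholders, not BenchRAG's actual API; see examples/
# for real usage.

def run_rag_pipeline(query, interpreter, retriever, compressor, generator, top_k=5):
    # 1. Interpreter: expand or decompose the raw query into sub-queries.
    sub_queries = interpreter.interpret(query)
    # 2. Retriever: fetch candidate documents for every sub-query.
    docs = [doc for q in sub_queries for doc in retriever.retrieve(q, top_k=top_k)]
    # 3. Compressor: shrink the retrieved context to fit the generator's budget.
    context = compressor.compress(query, docs)
    # 4. Generator: produce the final answer from the compressed context.
    return generator.generate(query, context)
```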
🧱 Project Structure
```
BenchRAG/
├── interpreter/        # Query understanding and expansion
├── retriever/          # BM25, dense, and hybrid retrievers
├── compressor/         # LLM-based or rule-based compressors
├── generator/          # LLM-based answer generators
├── datasets/           # Loaders for BEIR, MTEB, HotpotQA, BRIGHT
├── pipelines/          # Full RAG pipeline runner
├── examples/           # Examples for running each component
├── requirements.txt
└── README.md
```
⚙️ Installation
```bash
git clone https://github.com/gomate-community/BenchRAG.git
cd BenchRAG
pip install -r requirements.txt
```
🗄️ Datasets
| Dataset | Task | Year | Corpus | Questions | Answers | Metrics |
|---|---|---|---|---|---|---|
| Natural Questions (NQ) | Factoid QA | 2019 | Wikipedia | 323,045 questions, each paired with a Wikipedia page | paragraph/span | ROUGE, EM |
| TriviaQA | Factoid QA | 2017 | 662,659 evidence documents | 95,956 QA pairs | text string (92.85% are Wikipedia titles) | EM |
| NarrativeQA (NQA) | Factoid QA | 2017 | 1,572 stories (books, movie scripts) with human-written summaries | 46,765 human-written questions | human-written, short (4.73 tokens on average) | ROUGE |
| SQuAD | Factoid QA | 2016 | 536 articles | 107,785 QA pairs | spans | EM |
| PopQA | Factoid QA | 2023 | Wikipedia | 14k questions | long-tail entities | EM |
| HellaSwag | Multiple Choice QA | 2019 | 25k ActivityNet contexts and 45k WikiHow contexts | 70k examples | classification | Accuracy |
| StrategyQA | Factoid QA | 2021 | Wikipedia (1,799 terms) | 2,780 strategy questions | yes/no, plus decomposition and evidence paragraphs | EM |
| Fermi | Factoid QA | 2021 | - | 928 Fermi problems (question Q, answer A, supporting facts F, explanation P) | spans | Accuracy |
| 2WikiMultihopQA | Multi-Hop QA | 2020 | Wikipedia and Wikidata articles | 192,606 questions, each with a context | text spans, sentence-level supporting facts, evidence triples | F1 |
| HotpotQA | Multi-Hop QA | 2018 | full Wikipedia dump | 112,779 QA pairs | text span | F1 |
| BRIGHT | Reasoning-Intensive Retrieval | 2025 | - | 12 tasks, ~100 questions each | multiple sentences | nDCG@10, LLMScore |
| ELI5 | Long-Form QA | 2019 | 250 billion pages from Common Crawl | 272k questions | multiple sentences | Citation Recall, Citation Precision, Claim Recall |
| WikiEval | Long-Form QA | 2023 | 50 Wikipedia pages | 50 questions | text spans (sentences) | Ragas |
| ASQA | Long-Form QA | 2022 | Wikipedia | 6,316 ambiguous factoid questions | long-form answers | Disambig-F1, ROUGE-L, EM |
| WebGLM-QA | Long-Form QA | 2023 | - | 44,979 samples | sentences | ROUGE-L, Citation Recall, Citation Precision |
| TruthfulQA | Multiple Choice QA | 2021 | - | 817 questions spanning 38 categories | sentence answer / multiple choice | EM |
| MMLU | Multiple Choice QA | 2021 | - | 15,908 multiple-choice questions | 4-way multiple choice | Accuracy |
| OpenBookQA | Multiple Choice QA | 2018 | 7,326 facts from an open book | 5,957 questions | 4-way multiple choice | Accuracy |
| QuALITY (QLTY) | Multiple Choice QA | 2022 | - | 6,737 questions | 4-way multiple choice | Accuracy |
| WikiAsp | Open-Domain Summarization | 2021 | Wikipedia articles from 20 domains | 320,272 samples | (1) aspect selection (section title), (2) summary generation (section paragraph) | ROUGE, F1, UniEval |
| SciFact | Fact-Checking | 2020 | 5,183 abstracts | 1,409 claim-abstract pairs | 3-class classification (Supports/Refutes/NoInfo) | nDCG@10 |
| FEVER | Fact-Checking | 2018 | 50,000 popular Wikipedia pages | 185,445 claims | 3-class classification | Accuracy |
| FEVEROUS | Fact-Checking | 2021 | Wikipedia | 87,026 claims | 3-class classification / evidence retrieval | Accuracy |
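
Whatever the source format, each loader can normalize its corpus into a common (question, answers, documents) shape so the same metric code runs across datasets. Below is a minimal sketch of such a record together with a simplified Exact Match (EM) check; the `QAExample` class and its field names are hypothetical, not the actual loader interface, and full SQuAD-style EM normalization additionally strips punctuation and articles.

```python
from dataclasses import dataclass, field

# Hypothetical normalized record; the real loaders in datasets/ may use
# different field names.
@dataclass
class QAExample:
    question: str
    answers: list[str]                                  # gold answer strings
    documents: list[str] = field(default_factory=list)  # supporting passages

def exact_match(prediction: str, answers: list[str]) -> bool:
    """Simplified EM: case- and whitespace-insensitive match against any
    gold answer. SQuAD-style EM also strips punctuation and articles."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(prediction) in {norm(a) for a in answers}

# EM scores 1 when the prediction matches any gold answer exactly.
assert exact_match("Paris ", ["paris", "Paris, France"])
```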