HotpotQA benchmark

HotpotQA is a dataset of 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) sentence-level supporting facts required for reasoning are provided, allowing QA systems to reason with strong supervision and explain their predictions; (4) it offers a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform the necessary comparisons.
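
For readers who want to inspect the sentence-level supporting facts directly, here is a minimal sketch using the Hugging Face `datasets` library. The dataset id `hotpot_qa`, the `distractor` config, and the field layout shown reflect the public hub version and are assumptions here, not part of the original benchmark release.

```python
# Minimal sketch: load HotpotQA from the Hugging Face hub and inspect one
# example. Dataset id/config and field names are assumptions based on the
# public hub listing.
from datasets import load_dataset

dev = load_dataset("hotpot_qa", "distractor", split="validation")
example = dev[0]

print(example["question"])
print(example["answer"])

# Sentence-level supporting facts: parallel lists of paragraph titles and
# sentence indices into the corresponding paragraphs in `context`.
for title, sent_id in zip(example["supporting_facts"]["title"],
                          example["supporting_facts"]["sent_id"]):
    print(f"{title} -> sentence {sent_id}")
```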

Performance

1. SOTA Leaderboard

| Paper | Year | Model | Model Details | NDCG@10 | Recall@5 | EM |
|---|---|---|---|---|---|---|
| Bridging the Preference Gap between Retrievers and LLMs | 2024 | BGM | R: T5-XXL (11B), G: PaLM2-S | - | - | 45.37 |
| | | Baseline1 | R: GTR, G: PaLM2-S | - | - | 43.79 |
| | | Baseline2 | R: :x:, G: PaLM2-S | - | - | 33.07 |
| ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents (evaluated on a 500-question sample) | 2024 | ActiveRAG ![](https://img.shields.io/github/stars/OpenMatch/ActiveRAG.svg?style=social) | R: DPR, G: GPT-4o-mini | - | - | 71.6 |
| | | | R: DPR, G: Llama-3-Ins-70B | - | - | 70.8 |
| | | | R: DPR, G: Llama-3-Ins-8B | - | - | 65.0 |
| | | Baseline1 | R: :x:, G: Llama-3-Ins-8B | - | - | 39.2 |
| | | Baseline2 | R: :x:, G: Llama-3-Ins-70B | - | - | 54.2 |
| Large Dual Encoders Are Generalizable Retrievers | 2021 | GTR | R: GTR-XXL (4.8B), G: :x: | 56.8 | - | - |

R = retriever, G = generator; :x: marks a pipeline that omits that component.
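
The leaderboard reports NDCG@10 and Recall@5 for the retriever and exact match (EM) for the generated answer. Below is a minimal illustrative sketch of all three: the answer normalization follows the SQuAD-style convention used by the official HotpotQA script (lowercasing, stripping articles, punctuation, and extra whitespace), while the retrieval helpers assume binary relevance labels and are simplifications, not the papers' exact evaluation code.

```python
# Illustrative metric sketches, not the official evaluation scripts.
import math
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD/HotpotQA-style normalization: lowercase, drop punctuation,
    drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(gold))

def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    """NDCG@k with binary gains: gain 1 if a retrieved doc is relevant."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage with hypothetical document ids:
print(exact_match("The Beatles", "the beatles"))         # 1.0
print(recall_at_k(["d1", "d9", "d2"], {"d1", "d2"}, 5))  # 1.0
print(ndcg_at_k(["d1", "d9", "d2"], {"d1", "d2"}, 10))   # ~0.92
```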

2. LLM-based Methods (Reproducible)