HotpotQA benchmark
HotpotQA is a dataset of 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge base or schema; (3) it provides the sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain their predictions; (4) it offers a new type of factoid comparison question to test QA systems' ability to extract relevant facts and perform the necessary comparison.
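The sentence-level supporting facts in (3) are stored as (title, sentence index) pairs that point into the context paragraphs. A minimal sketch of that structure and how to resolve it, assuming the field layout of the publicly released JSON files (the toy record below is abridged and illustrative, not a verbatim dataset entry):

```python
# Illustrative HotpotQA-style record (distractor setting): `context` holds
# (title, sentences) paragraphs; `supporting_facts` holds (title, sent_idx) pairs.
example = {
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "Arthur's Magazine",
    "type": "comparison",  # comparison questions are feature (4) above
    "context": [
        ["Arthur's Magazine", [
            "Arthur's Magazine was an American literary periodical.",
            "It was published in Philadelphia in the 19th century.",
        ]],
        ["First for Women", [
            "First for Women is a woman's magazine published by Bauer Media Group.",
            "It was started in 1989.",
        ]],
    ],
    "supporting_facts": [["Arthur's Magazine", 0], ["First for Women", 1]],
}

def supporting_sentences(ex):
    """Resolve the sentence-level supporting facts against the context paragraphs."""
    paras = {title: sents for title, sents in ex["context"]}
    return [paras[title][idx] for title, idx in ex["supporting_facts"]]
```

Resolving the two supporting facts here yields one sentence from each gold paragraph, which is exactly the evidence a multi-hop system is expected to surface.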
Performance
1. Leaderboard of SOTA results
| Paper | Year | Model | Model Details | NDCG@10 | Recall@5 | EM |
|---|---|---|---|---|---|---|
| Bridging the Preference Gap between Retrievers and LLMs | 2024 | BGM | R: T5-XXL (11B), G: PaLM2-S | - | - | 45.37 |
| | | Baseline1 | R: GTR, G: PaLM2-S | - | - | 43.79 |
| | | Baseline2 | | - | - | 33.07 |
| ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents (only samples 500 questions for eval) | 2024 | ActiveRAG | R: DPR, G: ChatGPT-4o-mini | - | - | 71.6 |
| | | | R: DPR, G: Llama-3-Ins-70B | - | - | 70.8 |
| | | | R: DPR, G: Llama-3-Ins-8B | - | - | 65.0 |
| | | Baseline1 | | - | - | 39.2 |
| | | Baseline2 | | - | - | 54.2 |
| Large Dual Encoders Are Generalizable Retrievers | 2021 | GTR | R: GTR-XXL (4.8B) | 56.8 | - | - |
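The metric columns above follow standard definitions: EM applies SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) before comparing strings, and Recall@k measures the fraction of gold supporting documents found among the top-k retrieved. A hedged sketch of both, with function names of my own choosing:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation and articles, collapse whitespace
    (the SQuAD-style normalization HotpotQA-style EM is built on)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold supporting documents appearing in the top-k retrieved."""
    top = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top) / len(gold_ids)
```

Leaderboard EM scores are then the mean of `exact_match` over the evaluation set; retriever rows that report only NDCG@10 or Recall@5 are scored on retrieval alone, without a generator.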