HotpotQA benchmark

HotpotQA is a dataset of 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) sentence-level supporting facts required for reasoning are provided, allowing QA systems to reason with strong supervision and explain their predictions; (4) it offers a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform the necessary comparisons.
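
For readers who want to inspect the sentence-level supporting facts directly, here is a minimal sketch using the Hugging Face `datasets` library. The dataset id `hotpot_qa`, the `distractor` config, and the field layout shown reflect the public hub version and are assumptions here, not part of the original benchmark release.

```python
# Minimal sketch: load HotpotQA from the Hugging Face hub and inspect one
# example. Dataset id/config and field names are assumptions based on the
# public hub listing.
from datasets import load_dataset

dev = load_dataset("hotpot_qa", "distractor", split="validation")
example = dev[0]

print(example["question"])
print(example["answer"])

# Sentence-level supporting facts: parallel lists of paragraph titles and
# sentence indices into the corresponding paragraphs in `context`.
for title, sent_id in zip(example["supporting_facts"]["title"],
                          example["supporting_facts"]["sent_id"]):
    print(f"{title} -> sentence {sent_id}")
```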

Performance

1. SOTA Leaderboard

| Paper | Year | Model | Model Details | NDCG@10 | Recall@5 | EM |
|---|---|---|---|---|---|---|
| Bridging the Preference Gap between Retrievers and LLMs | 2024 | BGM | R: T5-XXL (11B), G: PaLM2-S | - | - | 45.37 |
| | | Baseline1 | R: GTR, G: PaLM2-S | - | - | 43.79 |
| | | Baseline2 | R: :x:, G: PaLM2-S | - | - | 33.07 |
| ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents (evaluated on a 500-question sample) | 2024 | ActiveRAG ![](https://img.shields.io/github/stars/OpenMatch/ActiveRAG.svg?style=social) | R: DPR, G: GPT-4o-mini | - | - | 71.6 |
| | | | R: DPR, G: Llama-3-Ins-70B | - | - | 70.8 |
| | | | R: DPR, G: Llama-3-Ins-8B | - | - | 65.0 |
| | | Baseline1 | R: :x:, G: Llama-3-Ins-8B | - | - | 39.2 |
| | | Baseline2 | R: :x:, G: Llama-3-Ins-70B | - | - | 54.2 |
| Large Dual Encoders Are Generalizable Retrievers | 2021 | GTR | R: GTR-XXL (4.8B), G: :x: | 56.8 | - | - |

R = retriever, G = generator; :x: marks a pipeline that omits that component.
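
The leaderboard reports NDCG@10 and Recall@5 for the retriever and exact match (EM) for the generated answer. Below is a minimal illustrative sketch of all three: the answer normalization follows the SQuAD-style convention used by the official HotpotQA script (lowercasing, stripping articles, punctuation, and extra whitespace), while the retrieval helpers assume binary relevance labels and are simplifications, not the papers' exact evaluation code.

```python
# Illustrative metric sketches, not the official evaluation scripts.
import math
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD/HotpotQA-style normalization: lowercase, drop punctuation,
    drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(gold))

def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int = 10) -> float:
    """NDCG@k with binary gains: gain 1 if a retrieved doc is relevant."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage with hypothetical document ids:
print(exact_match("The Beatles", "the beatles"))         # 1.0
print(recall_at_k(["d1", "d9", "d2"], {"d1", "d2"}, 5))  # 1.0
print(ndcg_at_k(["d1", "d9", "d2"], {"d1", "d2"}, 10))   # ~0.92
```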

2. LLM-based Methods (Reproducible)