Awesome Papers about Retrievers
Introduction
Coming soon …
1. Retrieval Methods
1.1 Two-tower Retriever (Tuning Embeddings)
Date | Title | Authors | Organization | Abs | Base | Dataset |
---|---|---|---|---|---|---|
2024/12/17 (⭐⭐⭐⭐⭐) | LLMs are Also Effective Embedding Models: An In-depth Overview | Chongyang Tao, Tao Shen, Shen Gao, et al. | Beihang University, Tencent | <details><summary>Survey</summary>…</details> | - | - |
2024/8/29 (⭐⭐⭐) | Conan-embedding: General Text Embedding with More and Better Negative Samples | Shiyu Li, Yang Tang, Shi-Zhen Chen, et al. | Peking University, Tencent | <details><summary>This paper presents Conan-embedding …</summary>This work presents Conan-embedding, which maximizes the utilization of more and higher-quality negative examples.</details> | - | MTEB |
2024/4/9 | LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [code] | Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, et al. | McGill University, Meta | <details><summary>This paper presents LLM2Vec …</summary>This work introduces LLM2Vec, an unsupervised method to transform decoder-only large language models (LLMs) into powerful text encoders. The approach comprises three steps: (1) enabling bidirectional attention, (2) masked next token prediction, and (3) unsupervised contrastive learning. Applied to models ranging from 1.3B to 8B parameters, LLM2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available data. The method is parameter-efficient and does not require expensive adaptation or synthetic data.</details> | - | MTEB |
2024/3/29 (⭐⭐⭐) | Gecko: Versatile Text Embeddings Distilled from Large Language Models | Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, et al. | Google DeepMind | <details><summary>This paper presents Gecko …</summary>This work presents Gecko, a compact and versatile text embedding model, which employs a two-stage distillation process: generating data with large language models and then refining the data quality.</details> | - | MTEB |
2024/2/23 (⭐⭐⭐) | Repetition Improves Language Model Embeddings [code] | Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, et al. | CMU | <details><summary>This paper presents echo embeddings …</summary>This work presents "echo embeddings", in which the input is repeated twice in context and embeddings are extracted from the second occurrence (i.e., repetition captures bidirectional information).</details> | - | MTEB |
2024/2/15 (⭐⭐⭐) | Generative Representational Instruction Tuning [code] | Niklas Muennighoff, Hongjin Su, Liang Wang, et al. | Contextual AI, The University of Hong Kong, Microsoft | <details><summary>This paper presents GRIT …</summary>This work introduces generative representational instruction tuning (GRIT), whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions.</details> | - | MTEB |
2024/2/10 (⭐⭐⭐⭐) | BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation | Jianlv Chen, Shitao Xiao, Peitian Zhang, et al. | BAAI, USTC | <details><summary>This paper presents M3-Embedding …</summary>This work presents a new embedding model, M3-Embedding, distinguished by its versatility in multi-linguality, multi-functionality, and multi-granularity.</details> | - | MIRACL, MKQA, MLDR |
2024/1/19 (⭐⭐) | Improving Text Embeddings with Large Language Models | Liang Wang, Nan Yang, Xiaolong Huang, et al. | Microsoft | <details><summary>This paper presents …</summary>This work introduces a novel and simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps.</details> | - | BEIR, MTEB |
2023/10/25 (⭐⭐⭐) | Retrieve Anything To Augment Large Language Models [code] | Peitian Zhang, Shitao Xiao, Zheng Liu, et al. | BAAI, Renmin University of China, University of Montreal | <details><summary>This paper presents LLM-Embedder …</summary>This work presents LLM-Embedder, a novel approach that comprehensively supports the diverse retrieval-augmentation needs of LLMs with one unified embedding model.</details> | - | MMLU, PopQA |
2023/8/7 | Towards General Text Embeddings with Multi-stage Contrastive Learning | Zehan Li, Xin Zhang, Yanzhao Zhang, et al. | Alibaba | <details><summary>This paper presents GTE …</summary>This work introduces GTE (General Text Embeddings), a model trained using a multi-stage contrastive learning framework. The training involves large-scale unsupervised pre-training followed by supervised fine-tuning across diverse datasets. Despite its relatively modest parameter count of 110M, GTE-base outperforms larger models and even surpasses the performance of OpenAI's black-box embedding API on the Massive Text Embedding Benchmark (MTEB). Notably, GTE also demonstrates strong capabilities in code retrieval tasks by treating code as text, achieving superior results without additional fine-tuning on specific programming languages.</details> | - | MTEB, BEIR, code retrieval |
2023/5/30 (⭐⭐⭐⭐) | One Embedder, Any Task: Instruction-Finetuned Text Embeddings [code] | Hongjin Su, Weijia Shi, Jungo Kasai, et al. | University of Hong Kong, University of Washington, Meta AI, Allen Institute for AI | <details><summary>This paper presents INSTRUCTOR …</summary>This work introduces INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (e.g., task and domain descriptions).</details> | GTR | MTEB |
2022/9/23 (⭐⭐⭐) | Promptagator: Few-shot Dense Retrieval From 8 Examples | Zhuyun Dai, Vincent Y. Zhao, Ji Ma, et al. | Google Research | <details><summary>This paper presents Promptagator …</summary>This work proposes Prompt-based Query Generation for Retriever (Promptagator), which leverages large language models (LLMs) as few-shot query generators and creates task-specific retrievers based on the generated data.</details> | FLAN | BEIR |
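As a concrete illustration of one of the tricks above, the "echo embedding" idea from *Repetition Improves Language Model Embeddings* can be sketched in a few lines. This is a minimal sketch, not the authors' code: `encode_tokens` is a hypothetical stand-in for a causal LM that returns one hidden-state vector per input token.

```python
from typing import Callable, List

Vector = List[float]

def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def echo_embedding(tokens: List[str],
                   encode_tokens: Callable[[List[str]], List[Vector]]) -> Vector:
    """Embed `tokens` by feeding them twice and pooling only the second copy.

    Because the model is causal, tokens in the second occurrence can attend
    to the entire first occurrence, so their hidden states carry
    bidirectional context.
    """
    doubled = tokens + tokens            # repeat the input in context
    hidden = encode_tokens(doubled)      # one vector per input token (mock LM)
    second_half = hidden[len(tokens):]   # keep only the second occurrence
    return mean_pool(second_half)
```

In practice the duplicated input is wrapped in a prompt such as "Rewrite the sentence: x. Rewritten: x", but the pooling logic is the essential part.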
1.2 LLM-based Retriever
Date | Title | Authors | Organization | Abs | Dataset |
---|---|---|---|---|---|
2025/04/29 (⭐) | ReasonIR: Training Retrievers for Reasoning Tasks [code] | Rulin Shao, Rui Qiao, Varsha Kishore, et al. | FAIR at Meta | <details><summary>ReasonIR: pointwise, finetune, LLaMA3.1-8B.</summary>This work introduces **ReasonIR**, which is trained on ReasonIR-Synthesizer data (1,383,877 public samples, 244,970 varied-length samples, and 100,521 hard samples).</details> | BRIGHT, MMLU, GPQA |
2023/12/24 (⭐⭐⭐⭐) | Making Large Language Models A Better Foundation For Dense Retrieval [code] | Chaofan Li, Zheng Liu, Shitao Xiao, et al. | BAAI, BUPT | <details><summary>This paper presents LLaRA …</summary>This work introduces LLaRA (LLM adapted for dense RetrievAl), which adds two pretext training tasks, EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), to adapt LLaMA for dense retrieval.</details> | MS MARCO passage & document, BEIR |
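The retrieval side shared by LLaRA and similar decoder-only retrievers is simple to sketch: the hidden state of the final token (e.g. the EOS token) serves as the text embedding, and documents are ranked by their similarity to the query embedding. A minimal sketch under that assumption, with `encode_tokens` as a hypothetical stand-in for the LM:

```python
from typing import Callable, List, Tuple

Vector = List[float]

def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))

def last_token_embedding(tokens: List[str],
                         encode_tokens: Callable[[List[str]], List[Vector]]) -> Vector:
    """Use the final token's hidden state as the text embedding, as
    decoder-only dense retrievers commonly do."""
    return encode_tokens(tokens)[-1]

def rank_documents(query: List[str], docs: List[List[str]],
                   encode_tokens: Callable[[List[str]], List[Vector]]
                   ) -> List[Tuple[int, float]]:
    """Score every document against the query and sort by similarity."""
    q = last_token_embedding(query, encode_tokens)
    scores = [(i, dot(q, last_token_embedding(d, encode_tokens)))
              for i, d in enumerate(docs)]
    return sorted(scores, key=lambda s: -s[1])
```

LLaRA's contribution is the pretext training (EBAE/EBAR) that makes this last-token representation informative; the scoring itself stays a plain dot product.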
1.3 LLM-guided Retriever
Date | Title | Authors | Organization | Abs | Dataset |
---|---|---|---|---|---|
2024/03/27 (⭐⭐) | LLatrieval: LLM-Verified Retrieval for Verifiable Generation [code] | Xiaonan Li, Changtai Zhu, Linyang Li, et al. | Fudan University | <details><summary>This paper presents LLatrieval …</summary>This work proposes LLatrieval (LLM-verified Retrieval), where the LLM iteratively provides feedback to the retriever through verify-update iterations: 1) retrieval verification prompts the LLM for a binary label on whether the current documents are sufficient, and 2) retrieval update has the LLM progressively scan the candidate documents returned by the retriever and select the supporting ones.</details> | ALCE |
2024/3/15 (⭐⭐⭐) | DRAGIN: Dynamic Retrieval Augmented Generation based on the Real-time Information Needs of Large Language Models [code] | Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, Yiqun Liu | Tsinghua University, Beijing Institute of Technology | <details><summary>DRAGIN …</summary>This work introduces DRAGIN (Dynamic Retrieval Augmented Generation based on the real-time Information Needs of LLMs), a framework specifically designed to decide when and what to retrieve based on the LLM's real-time information needs during text generation.</details> | 2WikiMHQA, HotpotQA, IIRC, StrategyQA |
2023/5/26 (⭐⭐) | Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In [code] | Zichun Yu, Chenyan Xiong, Shi Yu, Zhiyuan Liu | Tsinghua University, Microsoft | <details><summary>This paper presents AAR …</summary>This work introduces the augmentation-adapted retriever (AAR), which uses a black-box LLM to score positive documents (so-called LLM-preferred signals) for fine-tuning a pre-trained retriever.</details> | MMLU, PopQA |
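The verify-update loop that LLatrieval describes can be sketched with injected callables. This is an assumed control-flow sketch, not the paper's code: `verify` stands in for prompting the LLM for a binary "do these documents support the answer?" label, and `update` stands in for prompting it to rescan the candidates and pick a better supporting set.

```python
from typing import Callable, List

def llm_verified_retrieval(question: str,
                           candidates: List[str],
                           verify: Callable[[str, List[str]], bool],
                           update: Callable[[str, List[str], List[str]], List[str]],
                           k: int = 3,
                           max_rounds: int = 4) -> List[str]:
    """Iteratively refine the selected documents until the LLM verifier
    accepts them, or the round budget runs out."""
    selected = candidates[:k]            # start from the retriever's top-k
    for _ in range(max_rounds):
        if verify(question, selected):   # LLM judges the docs sufficient
            break
        selected = update(question, selected, candidates)
    return selected
```

The round budget matters in practice: each round costs extra LLM calls, so the loop trades latency for verifiability.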
1.4 Structured Retriever
Date | Title | Authors | Organization | Abs | Dataset |
---|---|---|---|---|---|
2025/04/04 (⭐) | EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline [code] | Peter Baile Chen, Tomer Wolfson, Michael Cafarella, Dan Roth | MIT, University of Pennsylvania | <details><summary>EnrichIndex: zeroshot, GPT-4o-mini.</summary>**EnrichIndex** uses off-the-shelf GPT-4o-mini to enrich each object with three additional representations during the indexing phase: 1) its purpose, 2) a summary, and 3) question-answer pairs. The final score of each query-object pair is a weighted sum of the similarity scores between the query and each representation.</details> | BRIGHT, Spider2, Beaver, Fiben |
2024/01/31 (⭐⭐⭐⭐) | RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval [code] | Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning | Stanford | <details><summary>RAPTOR …</summary>This work recursively embeds, clusters, and summarizes chunks of text, constructing a tree with differing levels of summarization from the bottom up. At inference time, RAPTOR retrieves from this tree, integrating information across lengthy documents at different levels of abstraction. On question-answering tasks that involve complex, multi-step reasoning, it achieves state-of-the-art results; for example, coupling RAPTOR retrieval with GPT-4 improves the best performance on the QuALITY benchmark by 20% in absolute accuracy.</details> | NarrativeQA, QASPER, QuALITY |
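EnrichIndex's final scoring step, a weighted sum of per-representation similarities, is small enough to write down. A minimal sketch, assuming representation names like `"purpose"`, `"summary"`, and `"qa_pairs"` and an injected `sim` function standing in for a real retriever's query-text similarity:

```python
from typing import Callable, Dict

def enriched_score(query: str,
                   representations: Dict[str, str],
                   sim: Callable[[str, str], float],
                   weights: Dict[str, float]) -> float:
    """Score one query-object pair as a weighted sum of the similarity
    between the query and each offline-enriched representation of the
    object (purpose, summary, question-answer pairs, ...)."""
    return sum(weights[name] * sim(query, text)
               for name, text in representations.items())
```

Because the enrichment happens offline at indexing time, query-time cost stays that of an ordinary multi-field retriever: one similarity lookup per representation plus the weighted sum.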
2. ReRanking Methods
2.1 LLM for Ranking
These methods leverage LLMs to directly rerank documents, usually in a listwise setting. The core of these methods lies in how to divide the candidate list into small local groups, and how to aggregate the local results into a global ranking.
Date | Title | Authors | Organization | Abs | Dataset |
---|---|---|---|---|---|
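The most common realization of this divide-and-aggregate idea is a sliding window (used by RankGPT, RankVicuna, and RankZephyr below): rerank a window of documents at a time, moving from the bottom of the list to the top with a fixed stride, so relevant documents bubble upward. A minimal sketch, where `rerank_window` is a stand-in for prompting an LLM to permute one window:

```python
from typing import Callable, List

def sliding_window_rerank(docs: List[str],
                          rerank_window: Callable[[List[str]], List[str]],
                          window: int = 4,
                          stride: int = 2) -> List[str]:
    """Apply a listwise reranker over overlapping windows from the bottom
    of the ranked list to the top; overlapping windows let a relevant
    document climb more than one window per pass."""
    docs = list(docs)
    start = max(0, len(docs) - window)
    while True:
        docs[start:start + window] = rerank_window(docs[start:start + window])
        if start == 0:
            break
        start = max(0, start - stride)   # slide toward the head of the list
    return docs
```

With window size $w$ and stride $s$, one pass needs roughly $N/s$ LLM calls of $w$ documents each, versus a single call with all $N$ documents that would exceed the context window.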
2025/02/06 (⭐⭐⭐) | TourRank: Utilizing Large Language Models for Documents Ranking with a Tournament-Inspired Strategy [code] | Yiqun Chen, Qi Liu, Yi Zhang, et al. | Renmin University & Baidu | <details><summary>TourRank: listwise, zeroshot, GPT-3.5.</summary>This paper proposes a zero-shot document ranking method called TourRank. It first groups candidate documents and prompts the LLM to select the most relevant one in each group. It also designs a points system that assigns points to each document based on its ranking in each tournament round.</details> | BEIR, TREC-DL |
2024/06/21 (⭐⭐⭐⭐) | FIRST: Faster Improved Listwise Reranking with Single Token Decoding [code] | Revanth Gangi Reddy, JaeHyeok Doo, Yifei Xu, et al. | UIUC | <details><summary>FIRST: listwise, finetune, Zephyr-β.</summary>This work introduces FIRST, which extracts the output logits of the candidate identifier tokens while generating the first identifier y1, and returns the passage ranking in order of decreasing logit values. It uses 40k GPT-4-labeled instances (5k queries from MS MARCO) from RankZephyr for fine-tuning LLM rerankers.</details> | BEIR |
2024/05/30 (⭐⭐⭐) | A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models [code] | Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, Guido Zuccon | CSIRO, Australia | <details><summary>Setwise Prompting: setwise, zeroshot, Flan-T5.</summary>This work focuses on LLM-based zero-shot document ranking and introduces a Setwise prompting strategy, which instructs the LLM to select the most relevant document to the query from a set of candidate documents.</details> | BEIR, TREC-DL |
2024/05/23 (⭐⭐⭐) | Top-Down Partitioning for Efficient List-Wise Ranking [code] | Andrew Parry, Sean MacAvaney, Debasis Ganguly | University of Glasgow | <details><summary>TDPart: listwise, zeroshot, GPT-3.5.</summary>This work partitions a ranking to depth k and processes documents top-down. In each round, it selects a pivot document and compares it with the documents in each group; the documents that beat the pivot in each group are collected as the input for the next round.</details> | MSMARCO, TREC-DL, BEIR |
2024/03/28 (⭐⭐) | Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting | Zhen Qin, Rolf Jagerman, Kai Hui, et al. | Google Research | <details><summary>PRP: pairwise, zeroshot, FLAN-T5 (3B, 11B, 20B).</summary>This work introduces Pairwise Ranking Prompting (PRP) for ranking with LLMs, in three variants: 1) all pairwise comparisons, $O(N^2)$; 2) sorting-based (Heapsort), $O(N \log N)$; 3) sliding window (Bubble Sort for top-K), $O(N)$.</details> | TREC-DL 2019&2020, BEIR |
2024/02/20 (⭐⭐) | Bridging the Preference Gap between Retrievers and LLMs | Zixuan Ke, Weize Kong, Cheng Li, et al. | Google Research | <details><summary>BGM: listwise, finetune, T5-XXL (11B).</summary>This work trains a seq2seq bridge model (which directly generates passage IDs) to jointly perform reranking and selection, adapting the retrieved information to be LLM-friendly. It employs a supervised learning (SL) and reinforcement learning (RL) training scheme to optimize the adaptation process.</details> | NQ, HotpotQA, Avocado Email, Amazon Book |
2023/12/05 (⭐⭐) | RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! [code] | Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin | University of Waterloo | <details><summary>RankZephyr: listwise, finetune, Zephyr-β.</summary>RankZephyr is trained in two stages: the first stage uses 100k queries from MS MARCO v1, where the 20 candidate documents for each query are ranked by RankGPT-3.5; the second stage uses fewer than 5k queries labeled by RankGPT-4. When ranking the top-100 candidates, it employs a sliding window akin to RankGPT and RankVicuna.</details> | BEIR, TREC-DL 19&20, 21, 22 |
2023/10/21 (⭐⭐) | Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels | Honglei Zhuang, Zhen Qin, Kai Hui, Junru Wu, et al. | Google Research | <details><summary>RG-S: pointwise, zeroshot, FLAN-PaLM2-S.</summary>This work proposes incorporating fine-grained relevance labels (not relevant / somewhat relevant / highly relevant) into the prompt for pointwise LLM rankers. The method, RG-S (rating-scale Relevance Generation, RG-S(0,k)), directly prompts the LLM to rate the relevance of each query-document pair on a scale from 0 to k.</details> | BEIR |
2023/10/20 (⭐⭐) | Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking [code] | Shengyao Zhuang, Bing Liu, Bevan Koopman, Guido Zuccon | CSIRO | <details><summary>LLM-QLM: pointwise, zeroshot, LLaMA (7B).</summary>This work finds that open-source LLMs can be effective pointwise rankers when asked to generate the query given the content of a document.</details> | BEIR |
2023/10/12 (⭐⭐⭐) | Fine-Tuning LLaMA for Multi-Stage Text Retrieval | Xueguang Ma, Liang Wang, Nan Yang, et al. | University of Waterloo | <details><summary>RepLLaMA and RankLLaMA: pointwise, finetune, LLaMA.</summary>This paper introduces RepLLaMA and RankLLaMA, which fine-tune the LLaMA model as a dense retriever and a pointwise reranker on the MS MARCO datasets.</details> | MS MARCO passage/document, BEIR |
2023/09/26 (⭐⭐⭐) | RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models [code] | Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin | University of Waterloo | <details><summary>RankVicuna: listwise, finetune, Vicuna.</summary>RankVicuna is trained on the ranked lists generated by RankGPT-3.5 for 100k queries from MS MARCO v1, with 20 BM25 candidates per query. The training data is cleaned in two ways: 1) malformed generations are filtered out (12% of outputs had incorrect list formatting), and 2) the order of candidate documents is shuffled as data augmentation.</details> | TREC-DL 19&20 |
2023/04/19 (⭐⭐⭐⭐) | Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [code] | Weiwei Sun, Lingyong Yan, Xinyu Ma, et al. | Shandong University | <details><summary>RankGPT: listwise, zeroshot, GPT-3.5.</summary>This work investigates prompting ChatGPT on passage re-ranking tasks and finds that naive prompting yields limited performance. It proposes an instructional permutation generation method and uses a sliding window to address the context-length limitation of LLMs.</details> | BEIR, TREC-DL |
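PRP's sliding-window variant (the $O(N)$ option above) can be sketched with an injected pairwise comparator: each backward bubble-sort pass of pairwise "which passage is more relevant?" judgments moves the best remaining document to the front, so $k$ passes surface the top-$k$. A sketch under that reading, where `prefer_first` stands in for prompting the LLM with one document pair:

```python
from typing import Callable, List

def prp_sliding_window(docs: List[str],
                       prefer_first: Callable[[str, str], bool],
                       k: int = 3) -> List[str]:
    """Pairwise Ranking Prompting, sliding-window variant: k bubble passes
    driven by pairwise LLM judgments place the top-k documents in order.
    `prefer_first(a, b)` stands in for asking the LLM whether `a` is more
    relevant to the query than `b`."""
    docs = list(docs)
    for top in range(min(k, len(docs))):
        # one backward pass bubbles the best remaining doc to position `top`
        for i in range(len(docs) - 1, top, -1):
            if prefer_first(docs[i], docs[i - 1]):
                docs[i], docs[i - 1] = docs[i - 1], docs[i]
    return docs
```

Each pass costs at most $N-1$ LLM calls, so retrieving a top-$k$ list needs $O(kN)$ pairwise prompts instead of the $O(N^2)$ of all-pairs comparison.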
2.2 Reasoning for Ranking (fine-tuned LLM)
These methods try to improve the reasoning ability of LLMs in ranking documents. Usually, they first construct a large-scale dataset in which each sample contains a "reasoning chain", and then fine-tune an LLM on the dataset to obtain a reasoning-enhanced reranking model.
Date | Title | Authors | Organization | Abs | Dataset |
---|---|---|---|---|---|
2025/02/25 (⭐⭐) | Rank1: Test-Time Compute for Reranking in Information Retrieval [code] | Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, et al. | Johns Hopkins University | <details><summary>Rank1: pointwise, finetune, Qwen2.5.</summary>This work samples 635,000 examples of R1's reasoning on the MS MARCO dataset, then fine-tunes Qwen2.5 models on these reasoning chains, finding that they show remarkable reasoning capabilities.</details> | BRIGHT |
2024/10/31 (⭐) | JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking | Tong Niu, Shafiq Joty, Ye Liu, et al. | Salesforce AI Research | <details><summary>JudgeRank: pointwise, zeroshot, Llama-3.1.</summary>JudgeRank estimates the relevance of query-document pairs in three steps: 1) query analysis to identify the core problem (query expansion), 2) document analysis to extract a query-aware summary, and 3) relevance judgment, producing a score based on the probabilities of "yes" and "no".</details> | BRIGHT, BEIR |
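The final scoring step that JudgeRank describes is the standard pointwise trick of normalizing the model's likelihoods for the two judgment tokens. A minimal sketch, where `token_logprobs` stands in for the log-probabilities an LLM would assign to "yes" and "no" at the judgment position:

```python
import math
from typing import Dict

def yes_no_relevance(token_logprobs: Dict[str, float]) -> float:
    """Turn log-probabilities of the judgment tokens into a relevance score
    in [0, 1]: P(yes) / (P(yes) + P(no)).  Normalizing over just these two
    tokens makes scores comparable across documents even when the rest of
    the vocabulary absorbs different amounts of probability mass."""
    p_yes = math.exp(token_logprobs["yes"])
    p_no = math.exp(token_logprobs["no"])
    return p_yes / (p_yes + p_no)
```

Because the score is continuous rather than a hard yes/no label, documents can be totally ordered by it, which is what makes the judgment usable as a pointwise reranker.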
3. Analysis about Retrieval
Date | Title | Authors | Organization | Abs | Dataset |
---|---|---|---|---|---|
2024/7/14 (⭐⭐⭐⭐) | The Power of Noise: Redefining Retrieval for RAG Systems [code] | F. Cuconasu, G. Trappolini, F. Siciliano, et al. | Sapienza University of Rome | <details><summary>This paper studies noise passages …</summary>This paper studies the impact of noise passages in RAG and finds that adding random documents to the prompt improves LLM accuracy by up to 35% on the NQ dataset.</details> | NQ-open (subset of NQ) |