Introduction

Coming soon …

1. Instruction Fine-tuning

1.1 Knowledge Enhancement

Date Title Authors Organization Abstract Dataset
2024/03/15
(🌟🌟🌟🌟)
RAFT: Adapting Language Model to Domain Specific RAG [code: ] Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, et al. UC Berkeley
RAFT fine-tunes on question-answer pairs while referencing documents in a simulated imperfect-retrieval setting, thereby effectively preparing the model for the open-book exam setting. Given a question (Q) and document(s) (D), RAFT is trained to generate an answer (A) that includes chain-of-thought reasoning (see the sketch after this entry).
PubMed, HotpotQA, Gorilla
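
A minimal sketch of how RAFT-style training instances could be assembled; the field names and the `p_no_oracle` ratio are illustrative, not the paper's exact recipe:

```python
import random

def build_raft_example(question, oracle_doc, corpus, k_distractors=3, p_no_oracle=0.2):
    """Pair a question with its oracle document plus sampled distractors
    (names and ratios here are illustrative, not the paper's recipe)."""
    distractors = random.sample([d for d in corpus if d != oracle_doc], k_distractors)
    # With probability p_no_oracle, drop the oracle so the model also
    # learns to answer when retrieval misses the golden document.
    context = distractors if random.random() < p_no_oracle else [oracle_doc] + distractors
    random.shuffle(context)
    # The target answer (A) is expected to contain chain-of-thought
    # reasoning grounded in the oracle document.
    return {"question": question, "context": context}
```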
2023/05/18
(🌟🌟)
Augmented Large Language Models with Parametric Knowledge Guiding Ziyang Luo, Can Xu, Pu Zhao, et al. Hong Kong Baptist University, Microsoft
This work proposes Parametric Knowledge Guiding (PKG), which injects domain knowledge into LLaMa-7B via instruction fine-tuning to capture the necessary expertise. The PKG module is then used to generate context for a given question, serving as a background-augmented prompt for the LLM (sketch below).
FM2, NQ-Table, MedMC-QA, ScienceQA
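
A sketch of the two-step PKG inference flow under stated assumptions: `pkg_generate` stands in for the instruction-fine-tuned knowledge module and `llm_generate` for the frozen black-box LLM; the prompt templates are illustrative.

```python
def pkg_answer(question, pkg_generate, llm_generate):
    """Two-step PKG flow (sketch): the small fine-tuned module writes
    background knowledge, which is prepended to the LLM's prompt."""
    background = pkg_generate(f"Provide background knowledge for: {question}")
    prompt = f"Background: {background}\nQuestion: {question}\nAnswer:"
    return llm_generate(prompt)
```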

1.2 Attribution Enhancement

Date Title Authors Organization Abstract Dataset
2025/02/13
(🌟🌟🌟🌟)
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih Meta FAIR, MIT
SelfCite derives a reward signal from the LLM itself through context ablation: if a citation is necessary, removing the cited text from the context should prevent the same response; if it is sufficient, retaining the cited text alone should preserve the same response. This reward can guide inference-time best-of-N sampling to significantly improve citation quality, and can also be used in preference optimization to directly fine-tune the model to generate better citations (sketch below).
LongBench-Cite
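
A simplified sketch of the context-ablation reward, treating the context as a list of sentences; `logprob` is an assumed helper returning log P(response | context) under the LLM, and the paper's exact scoring differs in detail.

```python
def selfcite_reward(logprob, response, sentences, cited_idx):
    """Context-ablation reward (sketch): compare the response probability
    under the full context, the context minus the cited sentences, and
    the cited sentences alone."""
    full = logprob(response, sentences)
    ablated = logprob(response, [s for i, s in enumerate(sentences) if i not in cited_idx])
    cited_only = logprob(response, [s for i, s in enumerate(sentences) if i in cited_idx])
    necessity = full - ablated       # large drop means the citation was necessary
    sufficiency = cited_only - full  # near zero means the citation suffices
    return necessity + sufficiency
```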
2024/12/19
(🌟🌟🌟)
VISA: Retrieval Augmented Generation with Visual Source Attribution Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin University of Waterloo, CSIRO, University of Queensland
This work proposes Retrieval-Augmented Generation with Visual Source Attribution (VISA), which processes one or more retrieved document images and generates an answer together with the bounding box of the relevant region within the evidence document. The authors curated two datasets, Wiki-VISA and Paper-VISA, to fine-tune Qwen2-VL-72B.
Wiki-VISA, Paper-VISA
2024/09/10
(🌟🌟🌟)
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA [code: ] Jiajie Zhang, Yushi Bai, Xin Lv, et al. Tsinghua University
This work proposes CoF ("Coarse to Fine"), which utilizes off-the-shelf LLMs to automatically construct long-context QA instances with precise sentence-level citations. CoF comprises four stages (sketch below): (1) starting from a long text, it invokes the LLM to produce a query and its answer through Self-Instruct; (2) it uses the answer to retrieve several chunks from the context, which are fed into the LLM to add coarse-grained chunk-level citations to the answer; (3) the LLM identifies relevant sentences from each cited chunk to produce fine-grained citations; (4) instances with an insufficient number of citations are discarded.
LongBench-Cite
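
A condensed sketch of the four CoF stages; `llm(prompt)` and `retrieve(query, text)` are assumed callables, and the prompts and citation format (spans like [3-5]) are illustrative.

```python
import re

def count_citations(text):
    # Assumed citation format like [3-5] marking sentence spans; illustrative only.
    return len(re.findall(r"\[\d+-\d+\]", text))

def cof_pipeline(long_text, llm, retrieve, min_citations=2):
    """CoF data-construction sketch with paraphrased prompts."""
    # (1) Self-Instruct a QA pair from the long material.
    query = llm(f"Write a question about the following text:\n{long_text}")
    answer = llm(f"Text:\n{long_text}\nAnswer the question: {query}")
    # (2) Retrieve chunks relevant to the answer; add coarse chunk-level citations.
    chunks = retrieve(answer, long_text)
    chunk_cited = llm(f"Insert chunk-level citations into: {answer}\nChunks: {chunks}")
    # (3) Refine each chunk citation into sentence-level citations.
    fine_cited = llm(f"Replace each chunk citation with the supporting sentences: {chunk_cited}")
    # (4) Discard instances with too few citations.
    return {"query": query, "answer": fine_cited} if count_citations(fine_cited) >= min_citations else None
```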
2024/08/20
(🌟🌟🌟🌟)
INSTRUCTRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales [code: ] Zhepei Wei, Wei-Lin Chen, Yu Meng University of Virginia
This work proposes InstructRAG, which generates rationales along with the answer, improving both generation accuracy and trustworthiness. It first prompts an instruction-tuned LLM (the rationale generator) to synthesize rationales explaining how to derive the correct answer from noisy retrieved documents (sketch below). It then guides the LM to learn explicit denoising by leveraging these rationales either as in-context learning demonstrations or as supervised fine-tuning data.
PopQA, TriviaQA, NQ, ASQA, 2WikiMHQA
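
A sketch of the rationale-synthesis step, assuming a generic `llm` callable and paraphrased prompts:

```python
def synthesize_rationale(llm, question, docs, gold_answer):
    """Ask an instruction-tuned LLM to explain how the gold answer follows
    from the (possibly noisy) retrieved documents."""
    prompt = (
        f"Documents:\n{docs}\nQuestion: {question}\n"
        f"The correct answer is: {gold_answer}\n"
        "Explain step by step how the documents support this answer, "
        "ignoring irrelevant or noisy passages."
    )
    rationale = llm(prompt)
    # The (question, docs, rationale) triple can then be used either as an
    # in-context demonstration or as a supervised fine-tuning target.
    return {"question": question, "docs": docs, "target": rationale}
```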
2024/08/08
(🌟🌟🌟🌟)
Learning Fine-Grained Grounded Citations for Attributed Large Language Models [code: ] Lei Huang, Xiaocheng Feng, Weitao Ma, et al. Harbin Institute of Technology, Harbin
This work introduces FRONT, a two-stage training framework that teaches LLMs to generate Fine-gRained grOuNded ciTations, consisting of Grounding Guided Generation (G3) and Consistency-Aware Alignment (CAA). In the G3 stage, the LLM first selects supporting quotes from retrieved sources (grounding) and then conditions the generation process on them (generation). The CAA stage uses preference optimization to further align the grounding and generation processes by automatically constructing preference signals.
ALCE(ASQA, ELI5, QAMPARI)
2024/07/01
(🌟🌟🌟)
Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation Sirui Xia, Xintao Wang, Jiaqing Liang, et al. Fudan University, AntGroup
Contributions: 1) ReClaim alternately generates citations and answer sentences, enabling large models to produce answers with citations (see the sketch after this entry); 2) the authors construct a training dataset and fine-tune models with different approaches to improve attribution capability; 3) multiple experiments demonstrate the method's effectiveness in enhancing the model's verifiability and credibility.
ASQA, ELI5
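
A sketch of interleaved reference-claim decoding under stated assumptions: `llm` is a generic continuation callable, and the `<cite>` tags and prompts are illustrative rather than ReClaim's actual format.

```python
def reclaim_decode(llm, question, docs, n_sentences=4):
    """Alternate between quoting a supporting reference and writing the
    answer sentence grounded on it (sketch)."""
    prefix = f"Documents:\n{docs}\nQuestion: {question}\n"
    output = ""
    for _ in range(n_sentences):
        ref = llm(prefix + output + "Next supporting quote:")
        claim = llm(prefix + output + f"<cite>{ref}</cite> Next answer sentence:")
        output += f"<cite>{ref}</cite>{claim}\n"
    return output
```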
2024/03/27
(🌟)
Improving Attributed Text Generation of Large Language Models via Preference Learning [code: ] Dongfang Li, Zetian Sun, Baotian Hu, et al. Harbin Institute of Technology (Shenzhen)
This work conceptualizes the attribution task for LLMs as preference learning and proposes an Automatic Preference Optimization (APO) framework. The authors assemble a curated dataset of 6,330 examples, sourced and refined from existing datasets, for post-training. They further propose an automatic method to synthesize attribution preference data, yielding 95,263 pairs.
ASQA, StrategyQA, ELI5
2024/03/04
(🌟🌟)
Citation-Enhanced Generation for LLM-based Chatbots Weitao Li, Junkai Li, Weizhi Ma, Yang Liu Tsinghua University
This work proposes a post-hoc Citation-Enhanced Generation (CEG) approach combined with RAG. It consists of three components: 1) a Retrieval Augmentation Module that uses NLTK as the sentence tokenizer to obtain claims, then dense retrieval (SimCSE BERT) to fetch documents; 2) a Citation Generation Module that uses an NLI model to determine the relationship between each claim-document pair and select valid references for each claim; 3) a Response Regeneration Module that takes the question, the original response, the nonfactual claims, and the relevant docs as prompt input to regenerate the response (sketch below).
WikiBio GPT-3, FELM, HaluEval, WikiRetr
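
A sketch of the three-module pipeline; NLTK's `sent_tokenize` is the tokenizer the paper names, while `retrieve`, `nli_label`, and `llm` are assumed stand-ins for SimCSE retrieval, the NLI model, and the chatbot.

```python
from nltk.tokenize import sent_tokenize  # NLTK sentence tokenizer, as in the paper

def cite_enhance(response, corpus, retrieve, nli_label, llm):
    """Post-hoc citation pipeline (sketch)."""
    citations, nonfactual = {}, []
    for claim in sent_tokenize(response):           # 1) split response into claims
        docs = retrieve(claim, corpus)              # 2) dense retrieval per claim
        support = [d for d in docs if nli_label(claim, d) == "entailment"]
        if support:
            citations[claim] = support
        else:
            nonfactual.append(claim)                # claim without a valid reference
    if nonfactual:                                  # 3) regenerate unsupported parts
        response = llm(f"Revise the response; these claims lack support: {nonfactual}")
    return response, citations
```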

1.3 Long-context Enhancement

Date Title Authors Organization Abstract Dataset
2025/06/04
(🌟🌟🌟🌟)
Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models [code: ] Alex Laitenberger, Christopher D. Manning, Nelson F. Liu Stanford University
This paper studies whether, given long-context LLMs (GPT-4o), multistage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches. The results show that DOS RAG consistently matches or outperforms more intricate methods on ∞Bench, QuALITY, and NarrativeQA.
∞Bench, QuALITY, NarrativeQA
2024/09/01
(🌟🌟🌟)
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs [code: ] Ziyan Jiang, Xueguang Ma, Wenhu Chen University of Waterloo
LongRAG consists of a "long retriever" and a "long reader": it groups the entire Wikipedia into 4K-token retrieval units, roughly 30x longer than before (unit-packing sketch below). It adopts off-the-shelf BGE as the retriever and Gemini-1.5-Pro or GPT-4o as the reader, without any further tuning.
NQ, HotpotQA, Qasper, MultiFieldQA-en
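
A sketch of packing documents into long retrieval units; `count_tokens` is an assumed tokenizer-length callable, and LongRAG's actual grouping (which follows Wikipedia's structure) is more involved.

```python
def make_long_units(docs, count_tokens, unit_size=4096):
    """Greedily pack whole documents into ~4K-token retrieval units (sketch)."""
    units, current, current_len = [], [], 0
    for doc in docs:
        n = count_tokens(doc)
        if current and current_len + n > unit_size:
            units.append("\n".join(current))    # close the current unit
            current, current_len = [], 0
        current.append(doc)
        current_len += n
    if current:
        units.append("\n".join(current))
    return units
```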
2024/07/11
(🌟🌟🌟🌟🌟)
LLM Maybe LongLM: SelfExtend LLM Context Window Without Tuning [code: ] Zhongye Jin, Xiaotian Han, Jingfeng Yang, et al. Texas A&M University
SelfExtend extends the context window of LLMs without fine-tuning by constructing bi-level attention: 1) grouped attention captures dependencies among tokens that are far apart; 2) neighbor attention captures dependencies among adjacent tokens within a specified range (position-remapping sketch below).
LongBench, L-Eval
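
A small sketch of the position remapping behind the two attention levels: exact relative positions within a neighbor window, floor-divided (grouped) positions beyond it, shifted so the two regimes meet continuously. Hyperparameter values are illustrative.

```python
import numpy as np

def self_extend_positions(seq_len, group_size=8, window=512):
    """Remapped relative-position matrix (sketch): neighbor attention uses
    exact positions; grouped attention reuses in-pretraining positions by
    floor division, shifted to stay continuous at the window boundary."""
    rel = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    grouped = rel // group_size + window - window // group_size
    return np.where(rel < window, rel, grouped)
```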
2024/05/29
(🌟🌟🌟)
Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models Xindi Wang, Mahsa Salmani, Parsa Omidi, et al. Huawei Tech. Canada, University of Western Ontario
This paper surveys techniques for enabling LLMs to handle long sequences, covering length extrapolation, attention approximation, attention-free transformers, model compression, and hardware-aware transformers.
None

1.4 Reasoning Enhancement

Date Title Authors Organization Abstract Dataset
2025/05/07
(🌟🌟)
ZEROSEARCH: Incentivize the Search Capability of LLMs without Searching [code: ] Hao Sun, Zile Qiao, Jiayan Guo, et al. Tongyi Lab, Alibaba Group
ZeroSearch (RL, Qwen2.5). This work proposes ZeroSearch, an RL framework that incentivizes the search capability of LLMs. It transforms an LLM into a retrieval module via supervised fine-tuning, and introduces a curriculum rollout mechanism that progressively elicits the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. The rollout trajectory contains [] (rollout-loop sketch below).
NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle
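
A sketch of the multi-turn search rollouts these RL methods collect; the `<search>`/`<information>`/`<answer>` tag convention is assumed, and `policy` / `simulated_search` are generic callables (in ZeroSearch the search engine is itself a fine-tuned LLM).

```python
import re

def rollout(policy, simulated_search, question, max_turns=4):
    """Multi-turn rollout (sketch): the policy interleaves reasoning with
    search queries answered by the simulator, then emits a final answer."""
    trajectory = question
    for _ in range(max_turns):
        step = policy(trajectory)
        trajectory += step
        if (m := re.search(r"<answer>(.*?)</answer>", step, re.S)):
            return trajectory, m.group(1)        # terminal answer reached
        if (m := re.search(r"<search>(.*?)</search>", step, re.S)):
            docs = simulated_search(m.group(1))  # LLM simulates the search engine
            trajectory += f"<information>{docs}</information>"
    return trajectory, None                       # no answer within the turn budget
```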
2025/03/27
(🌟🌟)
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning [code: ] Mingyang Chen, Tianpeng Li, Haoze Sun, et al. Baichuan Inc.
ReSearch (RL, Qwen2.5). ReSearch trains LLMs to reason with search via RL, without any supervised data. The rollout trajectory contains [].
HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle
2025/03/18
(🌟🌟)
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [code: ] Huatong Song, Jinhao Jiang, Yingqian Min, et al. Renmin University
R1-Searcher (RL, Qwen2.5). R1-Searcher uses a two-stage, outcome-based training strategy: the first stage uses a retrieval reward to incentivize the model to perform retrieval operations; the second stage introduces an answer reward to encourage the model to learn to use an external retrieval system to solve questions. The rollout trajectory contains [].
HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle
2025/03/12
(🌟🌟)
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [code: ] Bowen Jin, Hansi Zeng, Zhenrui Yue, et al. UIUC, Google
Search-R1 (RL, Qwen2.5). Search-R1 optimizes the LLM to [search/reason/answer] within multi-turn search interactions, leveraging retrieved-token masking for stable RL training and a simple outcome-based reward function (loss-masking sketch below). The rollout trajectory contains [].
NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle
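
A sketch of retrieved-token masking in the policy loss, using a REINFORCE-style objective for brevity (Search-R1 itself trains with PPO/GRPO); tensors are per-token over one trajectory.

```python
import torch

def masked_policy_loss(logprobs, advantages, retrieved_mask):
    """Policy-gradient loss with retrieved-token masking (sketch): tokens
    copied from the search engine (mask == 1) contribute no gradient, so
    the policy is only optimized on tokens it actually generated."""
    keep = 1.0 - retrieved_mask.float()
    return -(logprobs * advantages * keep).sum() / keep.sum().clamp(min=1.0)
```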
2025/01/09
(🌟🌟🌟)
Search-o1: Agentic Search-Enhanced Large Reasoning Models [code: ] Xiaoxi Li, Guanting Dong, Jiajie Jin, et al. Renmin University
Search-o1 (zero-shot, QwQ-32B). Search-o1 combines the reasoning process with an agentic RAG mechanism and a knowledge refinement module. The reason-in-document module operates independently of the main reasoning chain, conducting a thorough analysis of retrieved documents and producing refined information.
GPQA, MATH500, AMC2023, AIME2024, LiveCodeBench, NQ, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle
2024/09/01
(🌟🌟)
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates [code: ] Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang Princeton University, Peking University
ReasonFlux (fine-tuning, Qwen2.5). ReasonFlux-32B is trained with 8 GPUs and introduces three innovations: (i) a structured thought-template library containing around 500 high-level thought templates; (ii) hierarchical reinforcement learning over sequences of thought templates, optimizing a base LLM to plan an optimal template trajectory for gradually handling complex problems; (iii) a new inference-scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time.
MATH, AIME 2024, AMC 2023, OlympiadBench, Gaokao-En 2023

2. Hallucinations

Date Title Authors Organization Abstract Dataset
2023/09/13
(🌟🌟)
Cognitive Mirage: A Review of Hallucinations in Large Language Models Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, Weiqiang Jia Zhejiang Lab
This work provides a literature review of hallucinations: a taxonomy of hallucinations across several text-generation tasks, a mechanism analysis covering three sources (data collection, knowledge gap, and optimization process), detection methods, and improvement approaches.
-

3. Datasets

3.1 Factoid QA

Date Title Authors Organization Abstract Dataset
2024/01/26 Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Sheng Bi, Tongtong Wu, Jeff Z. Pan Southeast University, The University of Manchester, The University of Edinburgh
CAQA is a benchmark for complex question answering attribution, constructed with the help of knowledge graphs and designed to evaluate how well LLMs handle attribution for complex questions.
CAQA
2022/04/12 ASQA: Factoid Questions Meet Long-Form Answers [dataset] Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, Ming-Wei Chang Carnegie Mellon University, Duke University, Google Research
ASQA is the first long-form question answering dataset that focuses on ambiguous factoid questions.
ASQA

3.2 Long-form QA