Introduction

Coming soon …

1. Instruction Fine-tuning

1.1 Knowledge Enhance

Date Title Authors Organization Abs Dataset
2025/04/29
(🌟🌟🌟)
Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG [code: ] Kushagra Bhushan, Yatin Nandwani, Dinesh Khandelwal, et al. IIT(ISM) Dhanbad, IBM
<summary>This paper presents PA-RAG …</summary> PA-RAG (Paraphrase Augmentation for Retrieval-Augmented Generation) is a novel fine-tuning framework that improves knowledge injection into LLMs for domain-specific RAG tasks. PA-RAG introduces two forms of training-data augmentation: 1) context augmentation, which simulates both retrieval-success and retrieval-failure scenarios for every training question; 2) synthetic generation of multiple answers for each training question, which mitigates canonical answer overfitting.
MMLU, GSM8k, Hellaswag, TruthfulQA
2024/03/15
(🌟🌟🌟🌟)
RAFT: Adapting Language Model to Domain Specific RAG [code: ] Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, et al. UC Berkeley
<summary>This paper presents RAFT …</summary> RAFT fine-tunes on question-answer pairs while referencing documents in a simulated imperfect retrieval setting, thereby preparing the model for the open-book exam setting. The model is trained to generate an answer (A) to a question (Q) from the document(s) (D), where A includes chain-of-thought reasoning.
PubMed, HotpotQA, Gorilla
2023/05/18
(🌟🌟)
Augmented Large Language Models with Parametric Knowledge Guiding Ziyang Luo, Can Xu, Pu Zhao, et al. Hong Kong Baptist University, Microsoft
<summary>This paper presents PKG …</summary>This work proposes Parametric Knowledge Guiding (PKG), which injects domain knowledge into LLaMA-7B via instruction fine-tuning to capture the necessary expertise. The PKG module is then used to generate context for a given question, which serves as background-augmented prompting for LLMs.
FM2, NQ-Table, MedMC-QA, ScienceQA
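
The RAFT and PA-RAG entries above both describe training on contexts that sometimes contain the supporting document and sometimes do not. The following is a minimal sketch of that data-construction idea, not the authors' implementation: the function names, the prompt layout, and the default of two distractors are all assumptions made for illustration.

```python
import random


def build_training_example(question, answer, golden_doc, distractor_pool,
                           num_distractors=2, keep_golden=True, rng=None):
    """Build one fine-tuning example with a simulated retrieval result.

    keep_golden=True simulates retrieval success (golden doc is in context);
    keep_golden=False simulates retrieval failure (distractors only).
    All names and the prompt format are hypothetical.
    """
    rng = rng or random.Random(0)
    docs = rng.sample(distractor_pool, num_distractors)
    if keep_golden:
        docs.append(golden_doc)
    rng.shuffle(docs)  # avoid teaching the model a fixed golden position
    context = "\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(docs))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "completion": " " + answer,
            "has_golden": keep_golden}


def augment_with_retrieval_outcomes(question, answer, golden_doc,
                                    distractor_pool, seed=0):
    """Return two examples per question: one success case, one failure case."""
    rng = random.Random(seed)
    return [
        build_training_example(question, answer, golden_doc, distractor_pool,
                               keep_golden=flag, rng=rng)
        for flag in (True, False)
    ]
```

Emitting both variants for every question mirrors the "retrieval success and retrieval failure" wording in the PA-RAG summary; the papers may mix the two cases in different proportions.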

1.2 Attribution Enhance

Date Title Authors Organization Abs Dataset
2025/04/03
(🌟🌟🌟🌟)
ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations [code: ] Yubo Wang, Xueguang Ma, Ping Nie, et al. University of Waterloo, CMU
<summary>This paper presents ScholarCopilot …</summary> ScholarCopilot is a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET].
500k arXiv, 10M citations
2025/02/13
(🌟🌟🌟🌟)
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, et al. Meta FAIR, MIT
<summary>This paper presents SelfCite …</summary>SelfCite leverages a reward signal provided by the LLM itself through context ablation: if a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations.
LongBench-Cite
2024/12/19
(🌟🌟🌟)
VISA: Retrieval Augmented Generation with Visual Source Attribution Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin University of Waterloo, CSIRO, University of Queensland
<summary>This paper presents VISA …</summary>This work proposes Retrieval-Augmented Generation with Visual Source Attribution (VISA), which processes single or multiple retrieved document images and generates an answer as well as the bounding box of the relevant region within the evidence document. The authors curated two datasets, Wiki-VISA and Paper-VISA, to fine-tune Qwen2-VL-72B.
Wiki-VISA, Paper-VISA
2024/09/10
(🌟🌟🌟)
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA [code: ] Jiajie Zhang, Yushi Bai, Xin Lv, et al. Tsinghua University
<summary>This paper presents LongCite …</summary>This work proposes CoF (short for "Coarse to Fine"), which uses off-the-shelf LLMs to automatically construct long-context QA instances with precise sentence-level citations. CoF comprises four stages: (1) Starting with a long text material, CoF first invokes the LLM to produce a query and its answer through Self-Instruct. (2) CoF uses the answer to retrieve several chunks from the context, which are then fed into the LLM to incorporate coarse-grained chunk-level citations within the answer. (3) The LLM identifies relevant sentences from each cited chunk to produce fine-grained citations. (4) Instances with an insufficient number of citations are discarded.
LongBench-Cite
2024/08/20
(🌟🌟🌟🌟)
InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales [code: ] Zhepei Wei, Wei-Lin Chen, Yu Meng University of Virginia
<summary>This paper presents InstructRAG …</summary>This work proposes InstructRAG, which generates rationales along with the answer, enhancing both generation accuracy and trustworthiness. It first prompts an instruction-tuned LLM (the rationale generator) to synthesize rationales that explain how to derive the correct answer from noisy retrieved documents. It then guides the LM to learn explicit denoising by using these rationales as either in-context learning demonstrations or supervised fine-tuning data.
PopQA, TriviaQA, NQ, ASQA, 2WikiMHQA
2024/08/08
(🌟🌟🌟🌟)
Learning Fine-Grained Grounded Citations for Attributed Large Language Models [code: ] Lei Huang, Xiaocheng Feng, Weitao Ma, et al. Harbin Institute of Technology, Harbin
<summary>This paper presents FRONT …</summary>This work introduces FRONT, a two-stage training framework designed to teach LLMs to generate Fine-gRained grOuNded ciTations, consisting of Grounding Guided Generation (G3) and Consistency-Aware Alignment (CAA). During the G3 stage, the LLM first selects supporting quotes from retrieved sources (grounding) and then conditions the generation process on them (generation). The CAA stage then utilizes preference optimization to further align the grounding and generation process by automatically constructing preference signals.
ALCE(ASQA, ELI5, QAMPARI)
2024/07/01
(🌟🌟🌟)
Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation Sirui Xia, Xintao Wang, Jiaqing Liang, et al. Fudan University, AntGroup
<summary>This paper presents ReClaim …</summary>Contributions: 1) ReClaim alternately generates citations and answer sentences, enabling large models to generate answers with citations. 2) The authors construct a training dataset for ReClaim and fine-tune the model with different approaches to improve its attribution capability. 3) Multiple experiments demonstrate the method's effectiveness in enhancing the model's verifiability and credibility.
ASQA, ELI5
2024/03/27
(🌟)
Improving Attributed Text Generation of Large Language Models via Preference Learning [code: ] Dongfang Li, Zetian Sun, Baotian Hu, et al. Harbin Institute of Technology (Shenzhen)
<summary>This paper presents APO …</summary>This work frames the attribution task for LLMs as preference learning and proposes an Automatic Preference Optimization (APO) framework. The authors assemble a curated dataset of 6,330 examples sourced and refined from existing datasets for post-training. They also propose an automatic method to synthesize attribution preference data, resulting in 95,263 pairs.
ASQA, StrategyQA, ELI5
2024/03/04
(🌟🌟)
Citation-Enhanced Generation for LLM-based Chatbots Weitao Li, Junkai Li, Weizhi Ma, Yang Liu Tsinghua University
<summary>This paper presents CEG …</summary>This work proposes a post-hoc Citation-Enhanced Generation (CEG) approach combined with RAG. It consists of three components: 1) the Retrieval Augmentation Module uses NLTK as the sentence tokenizer to obtain claims, then uses dense retrieval (SimCSE BERT) to retrieve documents; 2) the Citation Generation Module uses an NLI model to determine the relationship between each claim-document pair and select valid references for each claim; 3) the Response Regeneration Module takes the question, original response, nonfactual claims, and relevant documents as prompt input to regenerate a new response.
WikiBio GPT-3, FELM, HaluEval, WikiRetr
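
The SelfCite entry above describes a reward built from two context ablations: dropping the cited text (necessity) and keeping only the cited text (sufficiency). The following is a rough sketch of how such a reward could drive best-of-N selection. It is an illustration under assumptions, not the paper's code: `log_prob` stands in for whatever routine scores a response under the LLM given a context, summing the two terms is one plausible way to combine them, and `toy_log_prob` is a word-overlap placeholder so the example runs without a model.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: (context sentences, response) -> log-probability.
LogProbFn = Callable[[Sequence[str], str], float]


def ablation_reward(log_prob: LogProbFn, context: Sequence[str],
                    cited_idx: Sequence[int], response: str) -> float:
    """Score a citation by how the response likelihood changes under ablation."""
    cited = set(cited_idx)
    without_cited = [s for i, s in enumerate(context) if i not in cited]
    only_cited = [s for i, s in enumerate(context) if i in cited]
    base = log_prob(list(context), response)
    # Necessity: removing the cited text should make the response less likely.
    necessity = base - log_prob(without_cited, response)
    # Sufficiency: the cited text alone should keep the response likely.
    sufficiency = log_prob(only_cited, response) - base
    return necessity + sufficiency


def best_of_n(log_prob: LogProbFn, context: Sequence[str], response: str,
              candidates: List[List[int]]) -> List[int]:
    """Pick the candidate citation set with the highest ablation reward."""
    return max(candidates,
               key=lambda c: ablation_reward(log_prob, context, c, response))


def toy_log_prob(context: Sequence[str], response: str) -> float:
    """Placeholder scorer: penalize response words missing from the context."""
    words = response.lower().split()
    seen = set(" ".join(context).lower().split())
    return float(sum(w in seen for w in words) - len(words))
```

With a real model, `log_prob` would sum token log-probabilities of the response conditioned on the given context; the same reward could also rank pairs for preference optimization, as the summary notes.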

1.3 Long-context Enhance

Date Title Authors Organization Abs Dataset
2025/06/04
(🌟🌟🌟🌟)
Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models [code: ] Alex Laitenberger, Christopher D. Manning, Nelson F. Liu Stanford University
<summary>This paper compares DOS RAG …</summary>This paper studies the question: "With long-context LLMs (GPT-4o), do multistage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches?" The results show that DOS RAG consistently matches or outperforms more intricate methods on ∞Bench, QuALITY, and NarrativeQA.
∞Bench, QuALITY, NarrativeQA
2024/09/01
(🌟🌟🌟)
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs [code: ] Ziyan Jiang, Xueguang Ma, Wenhu Chen University of Waterloo
<summary>This paper presents LongRAG …</summary>LongRAG consists of a "long retriever" and a "long reader". It processes the entire Wikipedia into 4K-token units, which are 30x longer than before. It adopts off-the-shelf BGE as the retriever and Gemini-1.5-Pro or GPT-4o as the reader, without any further tuning.
NQ, HotpotQA, Qasper, MultiFieldQA-en
2024/07/11
(🌟🌟🌟🌟🌟)
LLM Maybe LongLM: SelfExtend LLM Context Window Without Tuning [code: ] Hongye Jin, Xiaotian Han, Jingfeng Yang, et al. Texas A&M University
<summary>This paper presents SelfExtend …</summary>SelfExtend extends the context window of LLMs by constructing bi-level attention information without fine-tuning: 1) the grouped attention captures the dependencies among tokens that are far apart; 2) the neighbor attention captures dependencies among adjacent tokens within a specified range.
LongBench, L-Eval
2024/05/29
(🌟🌟🌟)
Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models Xindi Wang, Mahsa Salmani, Parsa Omidi, et al. Huawei Tech. Canada, University of Western Ontario
<summary>This paper presents a survey …</summary>This paper surveys work on enabling LLMs to handle long sequences, including length extrapolation, attention approximation, attention-free transformers, model compression, and hardware-aware transformers.
None
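
The SelfExtend entry above distinguishes neighbor attention (exact positions for nearby tokens) from grouped attention (coarse positions for distant tokens). The following is a small sketch of how the relative position seen by the attention layer could be computed under that scheme. It reflects one reading of the method rather than the released code: the function name, the parameter names `group_size` and `neighbor_window`, and the shift term that joins the two regimes are assumptions for illustration.

```python
def self_extend_rel_pos(q_pos: int, k_pos: int,
                        group_size: int, neighbor_window: int) -> int:
    """Relative position between a query and a key under bi-level attention.

    Nearby keys keep their exact distance; distant keys use floor-divided
    (grouped) positions, so large distances map into a smaller range.
    """
    distance = q_pos - k_pos
    if distance < 0:
        raise ValueError("causal attention: key must not follow the query")
    if distance < neighbor_window:
        # Neighbor attention: ordinary relative position.
        return distance
    # Grouped attention: coarse positions, shifted so the grouped range
    # starts where the neighbor range ends.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size + shift - k_pos // group_size
```

With `group_size=4` and `neighbor_window=8`, a query at position 100 sees relative positions no larger than 31 instead of 100, which is how positions beyond the pretraining window can be mapped back into a range the model has already seen.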

1.4 Reasoning Enhance

Date Title Authors Organization Abs Dataset
2024/09/01
(🌟🌟)
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates [code: ] Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang Princeton University, Peking University
<summary>ReasonFlux: Finetuning, Qwen2.5.</summary>This work trains the ReasonFlux-32B model with 8 GPUs and introduces three innovations: (i) a structured thought template library containing around 500 high-level thought templates; (ii) hierarchical reinforcement learning on a sequence of thought templates, optimizing a base LLM to plan an optimal template trajectory for gradually handling complex problems; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time.
MATH, AIME 2024, AMC 2023, OlympiadBench, Gaokao En 2023

2. Hallucinations

Date Title Authors Organization Abs Dataset
2023/09/13
(🌟🌟)
Cognitive Mirage: A Review of Hallucinations in Large Language Models Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, Weiqiang Jia Zhejiang Lab
<summary>This paper presents a taxonomy of hallucinations …</summary>This work provides a literature review on hallucinations, presenting a taxonomy of hallucinations across several text generation tasks, a mechanism analysis (three types: data collection, knowledge gap, and optimization process), detection methods, and improvement approaches.
-

3. Understanding of LLM

Date Title Authors Organization Abs Dataset
2025/05/26
(🌟🌟🌟🌟)
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Tianzhe Chu, Yuexiang Zhai, Jihan Yang, et al. HKU
<summary>This paper studies the comparative effect of SFT and RL on generalization and memorization … </summary>
This paper introduces GeneralPoints, an arithmetic reasoning card game, and also considers V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants. Findings: 1) RL, especially when trained with an outcome-based reward, generalizes in both rule-based textual and visual environments. 2) SFT tends to memorize the training data and struggles to generalize out-of-distribution in either scenario.
GeneralPoints, V-IRL
2025/05/02
(🌟🌟🌟🌟)
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers Zeyuan Allen-Zhu Meta/FAIR Labs
<summary>This paper studies architectural differences in LMs … </summary>
This paper introduces controlled synthetic pretraining tasks that isolate and evaluate core model capacities. The author discovers Canon layers: lightweight architectural components that promote horizontal information flow across neighboring tokens.
controlled biography dataset
2024/04/08
(🌟🌟🌟🌟)
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws Zeyuan Allen-Zhu, Yuanzhi Li Meta/FAIR Labs
<summary>This paper studies knowledge capacity scaling laws … </summary>
This paper investigates the number of knowledge bits a model stores. It focuses on factual knowledge represented as tuples. Findings: 1) LMs can store only 2 bits of knowledge per parameter, even when quantized to int8, so a 7B model can store 14B bits of knowledge. 2) The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. 3) Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity.
controlled biography dataset
2023/09/18
(🌟🌟🌟🌟)
Physics of Language Models: Part 3.2, Knowledge Manipulation Zeyuan Allen-Zhu, Yuanzhi Li Meta/FAIR Labs
<summary>This paper studies knowledge manipulation of LLMs … </summary>
This paper investigates four knowledge manipulation tasks: retrieval, classification, comparison, and inverse search. Findings: 1) LLMs excel at knowledge retrieval but struggle even with the simplest classification or comparison tasks unless Chain of Thought (CoT) is used, and 2) the performance of inverse knowledge search is virtually 0%, regardless of the prompts.
controlled biography dataset
2023/09/18
(🌟🌟🌟🌟)
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction Zeyuan Allen-Zhu, Yuanzhi Li Meta/FAIR Labs
<summary>This paper studies knowledge storage and extraction of LLMs … </summary>
This paper investigates whether the question-answering capabilities of LLMs stem from pattern recognition and memorization or from a genuine ability to reason and extract knowledge from their training data. Findings: 1) rewrite the pre-training data, using small auxiliary models, to provide knowledge augmentation, and 2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.
controlled biography dataset

4. Datasets

4.1 Factoid QA

Date Title Authors Organization Abs Dataset
2024/01/26 Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Sheng Bi, Tongtong Wu, Jeff Z. Pan Southeast University, The University of Manchester, The University of Edinburgh
<summary>This paper presents CAQA …</summary>CAQA is a new benchmark for complex question answering attribution, which is designed to evaluate the ability of LLMs to answer complex questions with the help of knowledge graphs.
CAQA
2022/04/12 ASQA: Factoid Questions Meet Long-Form Answers [dataset] Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, Ming-Wei Chang Carnegie Mellon University, Duke University, Google Research
<summary>This paper presents ASQA …</summary>ASQA is the first long-form question answering dataset that focuses on ambiguous factoid questions.
ASQA

4.2 Long-form QA