1 Survey papers

| Date | Title | Organization | Code |
| --- | --- | --- | --- |
| 2025/04/10 | LLM-based NLG Evaluation: Current Status and Challenges | Peking University | - |
| 2024/05/13 | Evaluation of Retrieval-Augmented Generation: A Survey | Tencent | Code |
| 2024/01/30 | RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture | Microsoft | - |

2 Evaluation papers


2.1 Short answer evaluation

2.2 Long answer evaluation

| Date | Title | Organization | Method | Metric | Dataset | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 2023/09/15 | Investigating Answerability of LLMs for Long-Form Question Answering | Salesforce | Prompting GPT-4 to rate answers on a scale from 0 to 3 | <details><summary>Coherency, Relevance, Factual consistency, Accuracy</summary>Coherency: the answer should be well-structured and well-organized, not just a heap of related information. Relevance: the answer should be relevant to the question and the context, stay concise, and avoid drifting from the question asked. Factual consistency: the context should be the primary source for the answer; the answer should not contain fabricated facts and should entail information present in the context. Accuracy: the answer should be satisfactory and complete for the question asked; correctness is measured by checking whether the response answers the presented question.</details> | - | - |
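
As a rough illustration of the LLM-as-judge setup described in the row above, the sketch below prompts a judge model to rate a long-form answer from 0 to 3 on the four criteria. This is a minimal sketch, not the paper's actual prompt: the rubric wording, the `rate_long_answer` helper, and the choice of the OpenAI chat completions client are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Rubric paraphrased from the paper's four criteria (see table above);
# the exact wording here is an assumption, not the paper's prompt.
RUBRIC = """Rate the answer on a scale from 0 to 3 for each criterion:
- Coherency: well-structured and organized, not a heap of related information.
- Relevance: stays on the question and the context, without drifting.
- Factual consistency: grounded in the context, with no fabricated facts.
- Accuracy: satisfactory and complete for the question asked.
Reply with one "criterion: score" pair per line."""

def rate_long_answer(question: str, context: str, answer: str,
                     model: str = "gpt-4") -> str:
    """Ask the judge model to score a long-form answer; returns raw ratings text."""
    prompt = (f"{RUBRIC}\n\nQuestion:\n{question}\n\n"
              f"Context:\n{context}\n\nAnswer:\n{answer}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content
```

In practice the returned "criterion: score" lines would be parsed into integers and averaged over a test set.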

2.3 Context evaluation

2.4 Documents evaluation

3 Tools and Benchmarks

| Date | Title | Organization | Code |
| --- | --- | --- | --- |
| 2024/10/10 | HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly | Princeton | Code |
| 2024/08/16 | RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | NewsBreak | Code |
| 2024/08/16 | RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation | Amazon | Code |
| 2024/04/21 | Evaluating Retrieval Quality in Retrieval-Augmented Generation | UMass | - |
| 2024/04/08 | FaaF: Facts as a Function for the evaluation of generated text | IMMO Capital | Code |
| 2024/02/19 | CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models | University of Science and Technology of China | Code |
| 2024/01/11 | Seven Failure Points When Engineering a Retrieval Augmented Generation System | Applied Artificial Intelligence Institute | - |
| 2023/12/20 | Benchmarking Large Language Models in Retrieval-Augmented Generation | Chinese Information Processing Laboratory | Code |
| 2023/11/16 | ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems | Stanford | Code |
| 2023/11/14 | RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge | Peking University | - |
| 2023/10/31 | Enabling Large Language Models to Generate Text with Citations | Princeton University | Code |
| 2023/09/26 | RAGAS: Automated Evaluation of Retrieval Augmented Generation | Exploding Gradients | Code |
| 2021/08/05 | TruLens: Evaluation and Tracking for LLM Experiments | TruEra | Code |
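
To give a feel for how the automated evaluators above are used, here is a minimal sketch of scoring a RAG output with RAGAS. It follows the classic `ragas` v0.1-style API (`evaluate` plus metric objects from `ragas.metrics`); the column names and metric choices mirror that documentation and may differ in newer releases, and the sample data is invented for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One toy example: a question, the generated answer, the retrieved contexts,
# and a reference answer (required by context_precision).
eval_data = Dataset.from_dict({
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "contexts": [["Construction of the Eiffel Tower finished in 1889."]],
    "ground_truth": ["1889"],
})

# Each metric yields a score in [0, 1]; evaluate() reports them per sample.
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```

Tools such as ARES and TruLens expose broadly similar dataset-in, scores-out workflows, so the same harness shape carries over.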