1 Survey papers

| Date | Title | Organization | Code |
| --- | --- | --- | --- |
| 2025/04/10 | LLM-based NLG Evaluation: Current Status and Challenges | Peking University | - |
| 2024/05/13 | Evaluation of Retrieval-Augmented Generation: A Survey | Tencent | Code |
| 2024/01/30 | RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture | Microsoft | - |

2 Evaluation papers


2.1 Short answer evaluation

2.2 Long answer evaluation

| Date | Title | Organization | Method | Metric | Dataset | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 2023/09/15 | Investigating Answerability of LLMs for Long-Form Question Answering | Salesforce | Prompting GPT-4 to rate answers on a scale from 0 to 3 | <details><summary>Coherency, Relevance, Factual consistency, Accuracy</summary>Coherency: the answer should be well-structured and well-organized, not just a heap of related information. Relevance: the answer should be relevant to the question and the context, stay concise, and avoid drifting from the question asked. Factual consistency: the context should be the primary source for the answer; the answer should not contain fabricated facts and should entail information present in the context. Accuracy: the answer should be satisfactory and complete for the question asked; correctness is measured by checking whether the response answers the presented question.</details> | - | - |
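
As a rough illustration of the LLM-as-judge setup described in the row above, the sketch below prompts a judge model to rate a long-form answer from 0 to 3 on the four criteria. This is a minimal sketch, not the paper's actual prompt: the rubric wording, the `rate_long_answer` helper, and the choice of the OpenAI chat completions client are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Rubric paraphrased from the paper's four criteria (see table above);
# the exact wording here is an assumption, not the paper's prompt.
RUBRIC = """Rate the answer on a scale from 0 to 3 for each criterion:
- Coherency: well-structured and organized, not a heap of related information.
- Relevance: stays on the question and the context, without drifting.
- Factual consistency: grounded in the context, with no fabricated facts.
- Accuracy: satisfactory and complete for the question asked.
Reply with one "criterion: score" pair per line."""

def rate_long_answer(question: str, context: str, answer: str,
                     model: str = "gpt-4") -> str:
    """Ask the judge model to score a long-form answer; returns raw ratings text."""
    prompt = (f"{RUBRIC}\n\nQuestion:\n{question}\n\n"
              f"Context:\n{context}\n\nAnswer:\n{answer}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content
```

In practice the returned "criterion: score" lines would be parsed into integers and averaged over a test set.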

2.3 Context evaluation

2.4 Documents evaluation

3 Tools and Benchmarks

| Date | Title | Organization | Code |
| --- | --- | --- | --- |
| 2024/10/10 | HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly | Princeton | Code |
| 2024/08/16 | RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | NewsBreak | Code |
| 2024/08/16 | RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation | Amazon | Code |
| 2024/04/21 | Evaluating Retrieval Quality in Retrieval-Augmented Generation | UMass | - |
| 2024/04/08 | FaaF: Facts as a Function for the evaluation of generated text | IMMO Capital | Code |
| 2024/02/19 | CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models | University of Science and Technology of China | Code |
| 2024/01/11 | Seven Failure Points When Engineering a Retrieval Augmented Generation System | Applied Artificial Intelligence Institute | - |
| 2023/12/20 | Benchmarking Large Language Models in Retrieval-Augmented Generation | Chinese Information Processing Laboratory | Code |
| 2023/11/16 | ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems | Stanford | Code |
| 2023/11/14 | RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge | Peking University | - |
| 2023/10/31 | Enabling Large Language Models to Generate Text with Citations | Princeton University | Code |
| 2023/09/26 | RAGAS: Automated Evaluation of Retrieval Augmented Generation | Exploding Gradients | Code |
| 2021/08/05 | TruLens: Evaluation and Tracking for LLM Experiments | TruEra | Code |
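
To give a feel for how the automated evaluators above are used, here is a minimal sketch of scoring a RAG output with RAGAS. It follows the classic `ragas` v0.1-style API (`evaluate` plus metric objects from `ragas.metrics`); the column names and metric choices mirror that documentation and may differ in newer releases, and the sample data is invented for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One toy example: a question, the generated answer, the retrieved contexts,
# and a reference answer (required by context_precision).
eval_data = Dataset.from_dict({
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "contexts": [["Construction of the Eiffel Tower finished in 1889."]],
    "ground_truth": ["1889"],
})

# Each metric yields a score in [0, 1]; evaluate() reports them per sample.
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```

Tools such as ARES and TruLens expose broadly similar dataset-in, scores-out workflows, so the same harness shape carries over.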