[ad_1]
Responsibility and safety
Our comprehensive benchmark and online leaderboard provide a much-needed measure of how accurately LLMs base their answers on the source material provided and avoid hallucinations
Large language models (LLMs) are changing the way we access information, but their impact on factual accuracy remains incomplete. They can “hallucinate” false information, especially when there is complex input. This, in turn, can undermine confidence in LLMs and limit their real-world applications.
Today we introduce FACTS Grounding, a comprehensive benchmark for assessing the ability of LLMs to generate responses that are not only factually correct relative to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.
We hope our benchmark will drive industry-wide advances in factuality and groundedness. To track progress, we're also launching the FACTS leaderboard on Kaggle. We have already tested leading LLMs with FACTS Grounding and filled the first leaderboard with their grounding results. We will maintain and update the leaderboard as the field progresses.
FACTS Ground Dataset
To accurately assess the factuality and rationale of a particular LLM, the FACTS Grounding dataset includes 1,719 examples, each carefully crafted to require detailed responses based on the contextual document provided. Each example consists of a document, a system instruction requiring the LLM to refer exclusively to the document provided, and an accompanying user request.
All examples are divided into a “public” sentence (860) and a “private” (859) sentence. We're releasing the public set today so anyone can use it to assess an LLM. Of course, we know it's important to protect yourself from issues like benchmark contamination and leaderboard hacking. Therefore, in accordance with industry practice, we keep private valuations open. FACTS leaderboard results are average performance in both public and private sets.
To ensure diversity in input, FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (approximately 20,000 words), covering areas such as finance, technology, retail, medical and legal. User requests are similarly broad and include requests for summarization, Q&A generation, and rewriting tasks. We did not include examples that might require creativity, mathematics, or complex thinking – skills that might require the model to apply more advanced thinking in addition to reasoning.
Collective assessment by leading LLMs
To be successful in a given example, an LLM must synthesize the complex information in the document and generate a long response that is both a comprehensive answer to the user query and is entirely attributable to that document.
FACTS Grounding automatically evaluates model answers using three LLM border judges – namely Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate potential bias that could arise if a judge gave higher ratings to the answers of a member of their own model family. The automated assessment models were extensively evaluated against a test set to find the best-performing assessment prompt templates and verify consistency with human raters.
Each FACTS grounding example is assessed in two phases. First, the answers are checked for suitability and disqualified if they do not sufficiently address the user's request. Secondly, answers are judged to be factually correct if they are based entirely on the information contained in the document provided and do not contain any hallucinations.
After the suitability and reasoning accuracy of a given LLM answer are separately evaluated by multiple AI judge models, the results are then aggregated to determine whether the LLM successfully mastered the example. The final score for the entire grounding task is the average of the scores from all judge models in all examples. For more details on our FACTS Grounding assessment methodology, see our document.
FACTS Grounding will continue to evolve
We recognize that benchmarks can quickly be overtaken by progress, so the launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Facticity and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we are committed to expanding and iterating on FACTS Grounding as the field progresses, continually raising the bar.
We encourage the AI community to participate in FACTS Grounding, evaluate their models using the open example sets, or submit their models for evaluation. We believe that comprehensive benchmarking methodologies, coupled with continuous research and development, will further improve AI systems.
Acknowledgments
FACTS is a collaboration between Google DeepMind and Google Research.
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu and Nate Keating.
We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang and Sasha Goldshtein.
We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.
[ad_2]
Source link