FACTS Grounding: A new benchmark for evaluating the factuality of large language models


Responsibility and safety

Published
Authors

FACTS team

Our comprehensive benchmark and online leaderboard provide a much-needed measure of how accurately LLMs ground their responses in the provided source material and avoid hallucinations

Large language models (LLMs) are changing the way we access information, yet their factual accuracy remains imperfect. They can “hallucinate” false information, particularly when given complex inputs. This, in turn, can undermine confidence in LLMs and limit their real-world applications.

Today we introduce FACTS Grounding, a comprehensive benchmark for assessing the ability of LLMs to generate responses that are not only factually correct relative to given inputs, but also sufficiently detailed to provide satisfactory answers to user queries.

We hope our benchmark will drive industry-wide advances in factuality and groundedness. To track progress, we're also launching the FACTS leaderboard on Kaggle. We have already tested leading LLMs with FACTS Grounding and filled the first leaderboard with their grounding results. We will maintain and update the leaderboard as the field progresses.

Current leaderboard

The FACTS Grounding dataset

To accurately assess the factuality and grounding of a given LLM, the FACTS Grounding dataset comprises 1,719 examples, each carefully crafted to require long-form responses grounded in the context document provided. Each example consists of a document, a system instruction requiring the LLM to refer exclusively to the provided document, and an accompanying user request.

An example from the FACTS Grounding dataset
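For illustration only, one such example could be represented as a simple record like the sketch below; the field names and values are hypothetical, not the dataset's actual schema.

```python
# A purely illustrative sketch of one FACTS Grounding example.
# Field names and values are hypothetical, not the dataset's actual schema.
example = {
    "system_instruction": (
        "Answer the user's request using only the information in the "
        "provided document. Do not draw on outside knowledge."
    ),
    "context_document": "<long-form source text, up to ~32,000 tokens>",
    "user_request": "Summarize the key financial risks discussed in the document.",
    "split": "public",  # each example belongs to the public (860) or private (859) set
}
```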

All examples are divided into a “public” set (860) and a “private” set (859). We're releasing the public set today so anyone can use it to evaluate an LLM. Of course, we know it's important to guard against issues such as benchmark contamination and leaderboard hacking, so, in line with standard industry practice, we are keeping the private set held out. FACTS leaderboard results are the average performance across both the public and private sets.


To ensure diversity of inputs, FACTS Grounding examples include documents of varying lengths, up to a maximum of 32,000 tokens (approximately 20,000 words), covering domains such as finance, technology, retail, medicine and law. User requests are similarly broad, including requests for summarization, Q&A generation, and rewriting tasks. We did not include examples that could require creativity, mathematics, or complex reasoning – capabilities that might require the model to apply more advanced reasoning in addition to grounding.

Prompt distribution

Collective assessment by leading LLMs

To succeed on a given example, an LLM must synthesize the complex information in the document and generate a long-form response that both comprehensively answers the user query and is fully attributable to that document.

FACTS Grounding automatically evaluates model responses using three frontier LLM judges – namely Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet. We selected a combination of different judges to mitigate the potential bias that could arise if a judge gave higher ratings to responses from a member of its own model family. The automated judge models were extensively evaluated against a held-out test set to find the best-performing evaluation prompt templates and to verify agreement with human raters.

Each FACTS Grounding example is assessed in two phases. First, responses are evaluated for eligibility and disqualified if they do not sufficiently address the user's request. Second, responses are judged as factually accurate if they are fully grounded in the information contained in the provided document and contain no hallucinations.
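As a rough sketch of this two-phase check, the per-example judgment by a single judge model might look like the Python below. The helpers `judge_eligibility` and `judge_grounding` are hypothetical stand-ins for calls to a judge model with the tuned evaluation prompts, not the actual implementation.

```python
def judge_example(judge_model, example, response) -> float:
    """Two-phase judgment of one model response by one judge model.

    Phase 1 (eligibility): does the response sufficiently address the user request?
    Phase 2 (grounding):   is every claim supported by the provided document?
    Both helpers are hypothetical wrappers around a judge-model call.
    """
    if not judge_eligibility(judge_model, example["user_request"], response):
        return 0.0  # ineligible responses are disqualified outright
    if not judge_grounding(judge_model, example["context_document"], response):
        return 0.0  # unsupported or hallucinated content fails the example
    return 1.0      # eligible and fully grounded
```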

After the eligibility and grounding accuracy of a given LLM response are separately evaluated by the multiple AI judge models, the results are aggregated to determine whether the LLM handled the example successfully. The final score for the overall grounding task is the average of the scores from all judge models across all examples. For more details on our FACTS Grounding evaluation methodology, see our paper.
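A minimal sketch of that aggregation, reusing the hypothetical `judge_example` above, could look like this:

```python
def final_grounding_score(judge_models, examples, responses) -> float:
    """Average the per-judge, per-example scores into one benchmark score.

    `responses[i]` is the evaluated model's answer to `examples[i]`;
    judge_models might be, e.g., Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet.
    """
    scores = [
        judge_example(judge, ex, resp)
        for judge in judge_models
        for ex, resp in zip(examples, responses)
    ]
    return sum(scores) / len(scores)
```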


A factually accurate response that does not properly address the user's query fails the benchmark example. Here we see three examples of model responses that the automated LLM judges deemed ineligible

FACTS Grounding will continue to evolve

We recognize that benchmarks can quickly be overtaken by progress, so the launch of our FACTS Grounding benchmark and leaderboard is just the beginning. Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems, and we are committed to expanding and iterating on FACTS Grounding as the field progresses, continually raising the bar.

We encourage the AI community to participate in FACTS Grounding by evaluating their models on the open example set or by submitting their models for evaluation. We believe that comprehensive benchmarking methodologies, coupled with continuous research and development, will further improve AI systems.

Acknowledgments

FACTS is a collaboration between Google DeepMind and Google Research.
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu and Nate Keating.

We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang and Sasha Goldshtein.

We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.
