Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

M. Besta, L. Paleari, M. Copik, R. Gerstenberger, A. Kubicek, P. Nyczyk, P. Iff, E. Schreiber, T. Srindran, T. Lehmann, H. Niewiadomski, T. Hoefler:

 CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

(arXiv:2406.02524, Jun. 2025)

Abstract

Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern embedding models such as SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
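The core idea described in the abstract, reducing each sampled LLM answer to a single whole-answer embedding and comparing the samples for semantic agreement, can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the `checkembed_score` helper and the toy vectors are assumptions, and a real pipeline would obtain the embeddings from a model such as SFR-Embedding-Mistral (not called here).

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def checkembed_score(answer_embeddings: list) -> float:
    """Average pairwise cosine similarity across k sampled answers.

    High similarity suggests the sampled answers agree semantically
    (the model is stable on this task); low similarity suggests
    divergence and thus possible hallucination.
    """
    k = len(answer_embeddings)
    sims = [
        cosine_similarity(answer_embeddings[i], answer_embeddings[j])
        for i in range(k)
        for j in range(i + 1, k)
    ]
    return float(np.mean(sims))


# Toy vectors standing in for whole-answer embeddings of k = 3
# sampled answers (real embeddings would be high-dimensional).
consistent = [np.array([1.0, 0.1, 0.0]),
              np.array([0.9, 0.2, 0.1]),
              np.array([1.0, 0.15, 0.05])]
divergent = [np.array([1.0, 0.0, 0.0]),
             np.array([0.0, 1.0, 0.0]),
             np.array([0.0, 0.0, 1.0])]

print(checkembed_score(consistent))  # near 1.0: answers agree
print(checkembed_score(divergent))   # 0.0: orthogonal, answers diverge
```

Because each answer is collapsed to one vector, the comparison cost is a handful of dot products per answer pair, independent of answer length, which is what gives the whole-answer approach its scalability advantage over token- or sentence-level scoring.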


BibTeX

@article{besta2024checkembed,
  author={Maciej Besta and Lorenzo Paleari and Marcin Copik and Robert Gerstenberger and Ales Kubicek and Piotr Nyczyk and Patrick Iff and Eric Schreiber and Tanja Srindran and Tomasz Lehmann and Hubert Niewiadomski and Torsten Hoefler},
  title={{CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks}},
  journal={arXiv:2406.02524},
  year={2025},
  month={06},
  doi={10.48550/arXiv.2406.02524},
}