Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

M. Chrapek, M. Copik, E. Mettaz, T. Hoefler:

Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

(In Proceedings of the 2025 IEEE International Symposium on Workload Characterization (IISWC), presented in Irvine, CA, USA, IEEE Press, Oct. 2025)

Abstract

Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and under 20% latency overheads, both further reduced by AMX. We also run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4–8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).
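The throughput and latency overheads above come from timing inference at varying batch sizes and input lengths. As a rough illustration only, and not the paper's actual benchmark harness, the following Python sketch shows the kind of generate-and-time measurement this implies; the checkpoint name, shapes, and generation settings are assumptions for illustration, and access to the Llama2 weights is required.

# Illustrative sketch (not the authors' harness): measure generation
# throughput and per-token latency of a causal LM across batch sizes.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def benchmark(batch_size, input_len, new_tokens=128):
    # Fixed-shape random prompts (no padding needed) keep the measurement
    # focused on compute rather than data-dependent effects.
    ids = torch.randint(0, tok.vocab_size, (batch_size, input_len))
    start = time.perf_counter()
    with torch.no_grad():
        # min_new_tokens forces the full decode length so the timing
        # is not cut short by an early end-of-sequence token.
        model.generate(ids, max_new_tokens=new_tokens,
                       min_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    return (batch_size * new_tokens) / elapsed, elapsed / new_tokens

for bs in (1, 4, 16):
    tput, lat = benchmark(bs, input_len=512)
    print(f"batch={bs}: {tput:.1f} tok/s, {lat * 1000:.1f} ms/token")

Running the same script on native hardware and inside a TEE (e.g., a TDX guest or an H100 in confidential-compute mode) and comparing the numbers gives the relative overhead the abstract reports.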

Documents

access preprint on arXiv: https://arxiv.org/abs/2509.18886

BibTeX

@inproceedings{chrapek2025confidentialLLMs,
  author={Marcin Chrapek and Marcin Copik and Etienne Mettaz and Torsten Hoefler},
  title={{Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs}},
  year={2025},
  month=oct,
  booktitle={Proceedings of the 2025 IEEE International Symposium on Workload Characterization (IISWC)},
  address={Irvine, CA, USA},
  publisher={IEEE Press},
  doi={10.48550/arXiv.2509.18886},
}