How reliable are LLMs when it comes to playing dice?

Basic information

Source: arXiv
Original source: https://arxiv.org/abs/2606.07515v1
Authors: Luca Avena, Gianmarco Bet, Bernardo Busoni
Category: cs.CL
Paper date: 2026-06-05T17:59:42Z
Paper PDF: https://arxiv.org/pdf/2606.07515v1.pdf

Source abstract/excerpt

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.

Source notes

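Only the official paper abstract is currently saved; it does not represent the full text of the paper. Treat the original source as authoritative.

This page presents only source evidence that has been hash-bound. It does not include extended inferences based on earlier body text or on missing original material.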
