The Value Axis: Language Models Encode Whether They're on the Right Track

Basic information

Source: arXiv
Original source: https://arxiv.org/abs/2606.17056v1
Authors: Nick Jiang, Isaac Kauvar, Jack Lindsey
Category: cs.CL
Paper date: 2026-06-15T17:59:58Z
Paper PDF: https://arxiv.org/pdf/2606.17056v1.pdf

Source abstract / excerpt

We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a “value” axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g. use a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.

Source notes

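Only the official paper abstract is currently saved; it does not represent the full text of the paper. Defer to the original source.

This page presents only source evidence that has been hash-bound. It contains no extended inferences based on older body text or on missing original text.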
