Contents
Benchmarks
Entries: 129
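March 2026
1 article
| Type | Reading time | Entry |
|---|---|---|
| [Auto] [BLOGS_PODCASTS] | 3 min | Analysis of Anthropic model distillation and SWE-Bench cheating mechanisms (03-01). Tags: model distillation, synthetic data, SWE-Bench |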
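February 2026
90 articles
| Type | Reading time | Entry |
|---|---|---|
| [Auto] [BLOGS_PODCASTS] | 3 min | Analysis of Anthropic model distillation and SWE-Bench failure mechanisms (02-28). Tags: model distillation, SWE-Bench, Anthropic |
| [Auto] [BLOGS_PODCASTS] | 3 min | Anthropic distillation and model cheating mechanisms: a SWE-Bench failure analysis (02-27). Tags: Anthropic, model distillation, Constitutional AI |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI and Pacific Northwest National Laboratory launch DraftNEPABench to speed up the federal permitting process (02-27). Tags: OpenAI, PNNL, DraftNEPABench |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI and Pacific Northwest National Laboratory launch a benchmark to speed up the federal permitting process (02-27). Tags: OpenAI, benchmarks, AI coding agents |
| [Auto] [BLOGS_PODCASTS] | 2 min | Analysis of Anthropic model distillation and SWE-Bench cheating mechanisms (02-27). Tags: model distillation, SWE-bench, reward hacking |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI partners with Pacific Northwest National Laboratory to launch DraftNEPABench and speed up federal permit approvals (02-27). Tags: OpenAI, PNNL, DraftNEPABench |
| [Auto] [ARXIV] | 4 min | An efficient pipeline for automated translation of benchmarks and datasets (02-26). Tags: LLM, multilingual models, datasets |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI partners with Pacific Northwest National Laboratory to launch DraftNEPABench and speed up the federal permitting process (02-26). Tags: OpenAI, AI coding agents, DraftNEPABench |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI and Pacific Northwest National Laboratory launch a benchmark to speed up the federal permitting process (02-26). Tags: OpenAI, PNNL, DraftNEPABench |
| [Auto] [HACKER_NEWS] | 7 min | PA Bench: evaluating web agents on real-world personal assistant workflows (02-26). Tags: web agents, PA Bench, personal assistant |
| [Auto] [ARXIV] | 4 min | CxMP: a linguistic minimal-pair benchmark for evaluating language models' understanding of constructions (02-26). Tags: CxMP, construction grammar, minimal pairs |
| [Auto] [HACKER_NEWS] | 4 min | PA Bench: evaluating frontier models on multi-tab tasks (02-25). Tags: PA Bench, multi-tab, model evaluation |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI's head of frontier evals discusses what comes after SWE-Bench Verified (02-25). Tags: OpenAI, SWE-Bench, agents |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI frontier evals team: toward the next step in agent evaluation (02-25). Tags: OpenAI, SWE-Bench, agent evaluation |
| [Auto] [BLOGS_PODCASTS] | 4 min | OpenAI's head of frontier evals: new directions for agent evaluation after SWE-Bench Verified (02-25). Tags: OpenAI, SWE-Bench, agents |
| [Auto] [ARXIV] | 4 min | A comprehensive benchmark suite for large-scale video reasoning (02-25). Tags: video reasoning, VBVR, benchmarks |
| [Auto] [ARXIV] | 4 min | Skill-Inject: evaluating agents' vulnerability to skill-file attacks (02-25). Tags: LLM agents, prompt injection, agent security |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI frontier evals team: new directions for agent evaluation after SWE-Bench Verified (02-25). Tags: OpenAI, SWE-Bench, agents |
| [Auto] [ARXIV] | 4 min | A comprehensive benchmark suite for large-scale video reasoning (02-24). Tags: video reasoning, VBVR, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI frontier evals team discusses moving to the next stage of agent evaluation (02-24). Tags: OpenAI, SWE-Bench, agent evaluation |
| [Auto] [HACKER_NEWS] | 5 min | Hugging Face launches Skills and updates its model evaluation framework (02-24). Tags: Hugging Face, model evaluation, LLM |
| [Auto] [BLOGS_PODCASTS] | 4 min | Analysis of data leakage and flaws in SWE-bench Verified: why to move to SWE-bench Pro (02-24). Tags: SWE-bench, data leakage, data contamination |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI frontier evals team: the evolution of agent evaluation as seen through SWE-Bench Verified (02-24). Tags: OpenAI, SWE-Bench, agents |
| [Auto] [BLOGS_PODCASTS] | 3 min | SWE-bench Verified suffers from data contamination and evaluation bias; SWE-bench Pro recommended instead (02-24). Tags: SWE-bench, data contamination, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI frontier evals team discusses what comes after SWE-Bench Verified (02-24). Tags: OpenAI, SWE-Bench, Agent |
| [Auto] [BLOGS_PODCASTS] | 3 min | Analysis of data contamination and mismeasurement in SWE-bench Verified, with alternatives (02-24). Tags: SWE-bench, data contamination, code generation |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI frontier evals team: the next step after SWE-Bench Verified (02-24). Tags: OpenAI, SWE-Bench, agents |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-24). Tags: Gemini, Google, ARC-AGI |
| [Auto] [HACKER_NEWS] | 4 min | The "car wash" test across 53 models (02-24). Tags: model evaluation, benchmarks, LLM |
| [Auto] [HACKER_NEWS] | 3 min | The "car wash" test across 53 models: evaluating multimodal AI in physical scenarios (02-24). Tags: multimodal, physical scenarios, model evaluation |
| [Auto] [HACKER_NEWS] | 3 min | The "car wash" test across 53 models: evaluating code generation and repair ability (02-24). Tags: code generation, model evaluation, bug fixing |
| [Auto] [BLOGS_PODCASTS] | 2 min | SWE-bench Verified suffers from data contamination and flaws; migration to SWE-bench Pro recommended (02-24). Tags: SWE-bench, data contamination, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenAI frontier evals team: the evolution of agent evaluation after SWE-Bench Verified (02-24). Tags: OpenAI, SWE-Bench, Agent |
| [Auto] [BLOGS_PODCASTS] | 4 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-24). Tags: Gemini, Google, ARC-AGI |
| [Auto] [ARXIV] | 4 min | Benchmarking graph neural networks on hard constraint satisfaction problems (02-24). Tags: GNN, graph neural networks, constraint satisfaction problems |
| [Auto] [BLOGS_PODCASTS] | 3 min | Analysis of data leakage and test flaws in SWE-bench Verified: why to migrate to SWE-bench Pro (02-24). Tags: SWE-bench, data leakage, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI advances agent evaluation: directions after SWE-Bench Verified (02-24). Tags: OpenAI, SWE-Bench, agent evaluation |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-24). Tags: Gemini, Google, ARC-AGI |
| [Auto] [ARXIV] | 4 min | Benchmarking graph neural networks on solving hard constraint satisfaction problems (02-23). Tags: GNN, graph neural networks, constraint satisfaction problems |
| [Auto] [BLOGS_PODCASTS] | 3 min | Analysis of data leakage and test flaws in SWE-bench Verified: why SWE-bench Pro is recommended instead (02-23). Tags: SWE-bench, data leakage, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI frontier evals team: a new stage of agent evaluation beyond SWE-Bench Verified (02-23). Tags: OpenAI, SWE-Bench, agent evaluation |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-23). Tags: Gemini 3.1 Pro, Google, ARC-AGI 2 |
| [Auto] [BLOGS_PODCASTS] | 3 min | OpenAI proposes SWE-Bench-Dead: the next step for frontier agent evaluation (02-23). Tags: OpenAI, SWE-Bench, Agent |
| [Auto] [HACKER_NEWS] | 4 min | 53 models take the "car wash" benchmark (02-23). Tags: benchmarks, model evaluation, LLM |
| [Auto] [BLOGS_PODCASTS] | 2 min | SWE-bench Verified is heavily contaminated; SWE-bench Pro recommended (02-23). Tags: SWE-bench, data contamination, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 eval score is double that of 3.0 (02-23). Tags: Gemini, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-23). Tags: Gemini, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 2 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-23). Tags: Gemini, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 4 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-23). Tags: Gemini 3.1 Pro, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 eval score is double that of 3.0 (02-22). Tags: Gemini, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 3 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-21). Tags: Gemini, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 4 min | Google releases Gemini 3.1 Pro: ARC-AGI 2 performance is double that of 3.0 (02-21). Tags: Gemini 3.1 Pro, Google, ARC-AGI |
| [Auto] [BLOGS_PODCASTS] | 2 min | Anthropic publishes METR benchmark data on autonomous agents (02-20). Tags: Anthropic, agents, Agent |
| [Auto] [BLOGS_PODCASTS] | 3 min | Anthropic publishes agent autonomy research and METR benchmark data (02-20). Tags: Anthropic, Agent, autonomy |
| [Auto] [BLOGS_PODCASTS] | 2 min | Anthropic publishes METR benchmark data on autonomous agents (02-20). Tags: Anthropic, METR, autonomous agents |
| [Auto] [BLOGS_PODCASTS] | 4 min | Gemini 3.1 Pro released: ARC-AGI 2 score is double that of 3.0 (02-20). Tags: Gemini 3.1 Pro, Google, ARC-AGI |
| [Auto] [ARXIV] | 3 min | Measuring the effect of mid-2025 LLM assistance on the performance of biology novices (02-19). Tags: LLM, biosecurity, AI evaluation |
| [Auto] [BLOGS_PODCASTS] | 3 min | Anthropic publishes METR data evaluating agent autonomy (02-19). Tags: Anthropic, METR, Agent |
| [Auto] [BLOGS_PODCASTS] | 3 min | IBM and UC Berkeley release IT-Bench and MAST: diagnosing why enterprise agents fail (02-19). Tags: Agent, IT-Bench, MAST |
| [Auto] [BLOGS_PODCASTS] | 2 min | IBM and the University of California, Berkeley release IT-Bench and MAST to diagnose why enterprise agents fail (02-19). Tags: IBM, UC Berkeley, IT-Bench |
| [Auto] [BLOGS_PODCASTS] | 2 min | IBM teams up with UC Berkeley to release IT-Bench and MAST: diagnosing why enterprise agents fail (02-19). Tags: IBM, UC Berkeley, IT-Bench |
| [Auto] [BLOGS_PODCASTS] | 2 min | IBM and UC Berkeley release IT-Bench and MAST to diagnose why enterprise agents fail (02-19). Tags: IBM, UC Berkeley, IT-Bench |
| [Auto] [BLOGS_PODCASTS] | 2 min | IBM and UC Berkeley release IT-Bench and MAST to diagnose why enterprise agents fail (02-18). Tags: IBM, UC Berkeley, IT-Bench |
| [Auto] [BLOGS_PODCASTS] | 2 min | IBM and UC Berkeley use IT-Bench and MAST to diagnose why enterprise agents fail (02-18). Tags: IBM, UC Berkeley, IT-Bench |
| [Auto] [JUEJIN] | 2 min | SkillsBench paper explained: how cross-task benchmarking reveals the practical utility of agent skills (02-18). Tags: Agent, LLM, SkillsBench |
| [Auto] [JUEJIN] | 3 min | SkillsBench paper: evaluating the practical utility of agent skills across multiple tasks (02-17). Tags: Agent, LLM, SkillsBench |
| [Auto] [HACKER_NEWS] | 7 min | Evaluating AGENTS.md: an analysis of its practical utility for AI coding agents (02-17). Tags: AI Agent, LLM, code generation |
| [Auto] [HACKER_NEWS] | 7 min | SkillsBench: a benchmark for how agent skills perform across diverse tasks (02-17). Tags: SkillsBench, agents, Agent |
| [Auto] [BLOGS_PODCASTS] | 2 min | Z.ai releases open-source model GLM-5, outperforming Opus 4.5 (02-17). Tags: GLM-5, Z.ai, Opus 4.5 |
| [Auto] [BLOGS_PODCASTS] | 2 min | Z.ai GLM-5: a new-generation open-weights SOTA large model (02-14). Tags: GLM-5, Z.ai, SOTA |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenEnv in practice: evaluating tool-calling agents in real environments (02-13). Tags: agents, tool calling, OpenEnv |
| [Auto] [HACKER_NEWS] | 6 min | Improving the coding ability of 15 LLMs in one afternoon just by swapping the test harness (02-13). Tags: LLM, code generation, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenEnv hands-on: evaluating tool-calling agents in real environments (02-13). Tags: agents, tool calling, OpenEnv |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenEnv in practice: evaluating tool-calling agents in real environments (02-13). Tags: agents, tool calling, OpenEnv |
| [Auto] [HACKER_NEWS] | 7 min | Improving the programming ability of 15 LLMs in one afternoon just by adjusting the harness (02-13). Tags: LLM, code generation, model evaluation |
| [Auto] [ARXIV] | 3 min | GENIUS: a generative fluid intelligence evaluation suite (02-13). Tags: GENIUS, fluid intelligence, multimodal evaluation |
| [Auto] [BLOGS_PODCASTS] | 2 min | OpenEnv in practice: evaluating tool-calling agents in real environments (02-12). Tags: agents, tool calling, OpenEnv |
| [Auto] [BLOGS_PODCASTS] | 2 min | Z.ai releases open-source model GLM-5: performance surpasses Opus 4.5 (02-12). Tags: GLM-5, Z.ai, Opus 4.5 |
| [Auto] [HACKER_NEWS] | 5 min | MiniMax M2.5 released: 80.2% on SWE-bench Verified (02-12). Tags: MiniMax, M2.5, SWE-bench |
| [Auto] [HACKER_NEWS] | 5 min | MiniMax M2.5 released: 80.2% on SWE-bench Verified (02-12). Tags: MiniMax, M2.5, SWE-bench |
| [Auto] [BLOGS_PODCASTS] | 2 min | Z.ai open-sources GLM-5: performance surpasses Opus 4.5 (02-12). Tags: GLM-5, Z.ai, SOTA |
| [Auto] [ARXIV] | 2 min | GEBench: a benchmark evaluating image generation models as GUI environments (02-11). Tags: GEBench, GUI generation, image generation |
| [Auto] [ARXIV] | 3 min | GEBench: Benchmarking Image Generation Models as GUI En (02-10). Tags: GEBench, GUI generation, image generation |
| [Auto] [HACKER_NEWS] | 4 min | BioTradingArena: an LLM benchmark for predicting biotech stock movements (02-06). Tags: LLM, benchmarks, financial forecasting |
| [Auto] [HACKER_NEWS] | 4 min | BioTradingArena: a benchmark for evaluating LLMs at predicting biotech stock movements (02-06). Tags: LLM, benchmarks, financial forecasting |
| [Auto] [ARXIV] | 4 min | CRoSS: a continual robot simulation suite for scalable reinforcement learning (02-06). Tags: reinforcement learning, robot simulation, Gazebo |
| [Auto] [HACKER_NEWS] | 4 min | A benchmark for AI code review in real-world scenarios (02-05). Tags: code review, benchmarks, AI |
| [Auto] [HACKER_NEWS] | 5 min | A real-world benchmark for AI code review (02-05). Tags: code review, benchmarks, AI coding |
| [Auto] [HACKER_NEWS] | 5 min | A real-world benchmark for AI code review (02-04). Tags: code review, benchmarks, AI coding |
| [Auto] [ARXIV] | 4 min | UEval: a unified multimodal generation benchmark (02-02). Tags: UEval, multimodal, benchmarks |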
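January 2026
38 articles
| Type | Reading time | Entry |
|---|---|---|
| [Auto] [ARXIV] | 4 min | UEval: a unified multimodal generation benchmark (01-31). Tags: multimodal, UEval, benchmarks |
| [Auto] [ARXIV] | 3 min | UEval: a unified multimodal generation benchmark (01-30). Tags: UEval, multimodal, unified models |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code daily benchmarks for tracking performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code daily benchmarks for tracking performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 4 min | AGENTS.md architecture outperforms Skills in agent evaluations (01-30). Tags: agents, evaluation, AGENTS.md |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code daily benchmarks for tracking performance degradation (01-30). Tags: Claude, benchmarks, performance tracking |
| [Auto] [HACKER_NEWS] | 4 min | Claude Code daily benchmarks for tracking performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code daily benchmarks for tracking performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 4 min | Claude Code daily benchmarks: tracking performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code benchmarks: tracking daily performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 4 min | AGENTS.md architecture outperforms Skills in agent evaluations (01-30). Tags: agents, Agent, evaluation |
| [Auto] [ARXIV] | 4 min | A study of cross-direction contamination in machine translation evaluation (01-30). Tags: machine translation, data contamination, FLORES-200 |
| [Auto] [ARXIV] | 4 min | SokoBench: evaluating LLMs' long-horizon planning and reasoning (01-30). Tags: SokoBench, long-horizon planning, reasoning ability |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code daily benchmarks for tracking performance degradation (01-30). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [ARXIV] | 3 min | A study of cross-direction contamination in machine translation evaluation (01-29). Tags: machine translation, data contamination, FLORES |
| [Auto] [ARXIV] | 4 min | SokoBench: evaluating LLMs' long-horizon planning and reasoning (01-29). Tags: SokoBench, long-horizon planning, reasoning ability |
| [Auto] [HACKER_NEWS] | 7 min | Claude Code daily benchmarks: used to track performance degradation (01-29). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 7 min | OTelBench benchmark: Opus 4.5 scores only 29% on simple SRE tasks (01-29). Tags: LLM, SRE, benchmarks |
| [Auto] [HACKER_NEWS] | 4 min | Claude Code daily benchmarks: tracking model performance degradation (01-29). Tags: Claude, LLM, benchmarks |
| [Auto] [HACKER_NEWS] | 6 min | Claude Code daily benchmarks for tracking performance degradation (01-29). Tags: Claude Code, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 4 min | Claude Code daily benchmarks for tracking performance degradation (01-29). Tags: Claude, benchmarks, performance degradation |
| [Auto] [HACKER_NEWS] | 6 min | Opus 4.5 scores only 29% on the OTelBench benchmark (01-29). Tags: Opus 4.5, OTelBench, SRE |
| [Auto] [BLOGS_PODCASTS] | 2 min | Alyah: evaluating Arabic LLMs on the Emirati dialect (01-29). Tags: LLM, model evaluation, Arabic |
| [Auto] [ARXIV] | 4 min | 💥 MortalMATH: when reasoning objectives meet emergency scenarios, does AI crash? (01-28). Tags: LLM, reasoning models, MortalMATH |
| [Auto] [BLOGS_PODCASTS] | 3 min | 🇦🇪 Alyah ⭐️: a look at robust dialect evaluation for Arabic LLMs! (01-28). Tags: LLM, Arabic, dialect evaluation |
| [Auto] [ARXIV] | 4 min | MortalMATH: when reasoning objectives meet emergency contexts, how is the conflict resolved? 🧠🔥 (01-27). Tags: LLM, model evaluation, safety alignment |
| [Auto] [BLOGS_PODCASTS] | 3 min | ⭐️ Alyah: evaluating Emirati dialect ability! A new breakthrough for Arabic LLMs! (01-27). Tags: LLM, Arabic, dialect evaluation |
| [Auto] [HACKER_NEWS] | 3 min | 🧠 Stunning! Gemini Flash takes on Opus at Tetris with a 66% win rate! 🚀 (01-27). Tags: Gemini Flash, Claude Opus, TetrisBench |
| [Auto] [BLOGS_PODCASTS] | 3 min | 🚀 AssetOpsBench: breaking down the wall between AI benchmarks and industrial reality! 🤝 (01-27). Tags: AI Agent, AssetOpsBench, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 3 min | AssetOpsBench: bridging the gap between AI benchmarks and industrial reality! 🤖🏭🚀 (01-27). Tags: AssetOpsBench, AI Agent, LLM |
| [Auto] [ARXIV] | 5 min | 🚀 BONO-Bench: a bi-objective optimization benchmark with traceable Pareto sets! (01-27). Tags: multi-objective optimization, benchmarks, Pareto set |
| [Auto] [HACKER_NEWS] | 4 min | ⚡️ Opus crushed at Tetris! Gemini Flash hits a 66% win rate in hands-on testing 🎮 (01-27). Tags: LLM, Gemini Flash, Claude Opus |
| [Auto] [BLOGS_PODCASTS] | 3 min | 🔥 AssetOpsBench closes the gap! How can AI agent benchmarking land in real industrial settings? (01-27). Tags: AI Agent, LLM, benchmarks |
| [Auto] [BLOGS_PODCASTS] | 3 min | AssetOpsBench: breaking down the wall between AI agent evaluation and industrial reality! 🚀 (01-26). Tags: AI Agent, AssetOpsBench, industrial intelligence |
| [Auto] [ARXIV] | 5 min | AgentDrive: the first open benchmark! 🚗 LLM-generated scenarios drive agent reasoning (01-26). Tags: AgentDrive, autonomous driving, benchmarks |
| [Auto] [ARXIV] | 6 min | 🔥 BONO-Bench released! The first bi-objective optimization benchmark with traceable Pareto sets! (01-26). Tags: BONO-Bench, bi-objective optimization, Pareto set |
| [Auto] [BLOGS_PODCASTS] | 3 min | AssetOpsBench: how to bridge the gap between AI agent benchmarks and industrial reality? 🤖🔥 (01-26). Tags: AI Agent, benchmarks, industrial operations |
| [Auto] [BLOGS_PODCASTS] | 4 min | AssetOpsBench: connecting AI evaluation with industrial reality and bridging the gap! 🚀 (01-25). Tags: AssetOpsBench, AI Agent, industrial operations |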