AI has all the answers. Even the wrong ones - FT中文网


ChatGPT has the appearance of a brilliant logician and that’s a problem
An examination of how accurately and reliably large language models solve logic puzzles.

Can large language models solve logic puzzles? There’s one way to find out, which is to ask. That’s what Fernando Perez-Cruz and Hyun Song Shin recently did. (Perez-Cruz is an engineer; Shin is the head of research at the Bank for International Settlements as well as the man who, in the early 1990s, taught me some of the more mathematical pieces of economic theory.)

The puzzle in question is commonly known as the “Cheryl’s birthday puzzle”. Cheryl challenges her friends Albert and Bernard to guess her birthday, and for puzzle-reasons they know it’s one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guessing, Cheryl tells Albert her birth month, and tells Bernard the day of the month, but not the month itself.

Albert and Bernard think for a while. Then Albert announces, “I don’t know your birthday, and I know that Bernard doesn’t either.” Bernard replies, “In that case, I now know your birthday.” Albert responds, “Now I know your birthday too.” What is Cheryl’s birthday?* More to the point, what do we learn by asking GPT-4?

The puzzle is a challenging one. Solving it requires eliminating possibilities step by step while pondering questions such as “what is it that Albert must know, given what he knows that Bernard does not know?” It is, therefore, hugely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the answer right every time, fluently elaborating varied and accurate explanations of the logic of the problem. Yet this bravura performance of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin asked the computer a trivially modified version of the puzzle, changing the names of the characters and of the months.

GPT-4 continued to produce fluent, plausible explanations of the logic, so fluent, in fact, it takes real concentration to spot the moments when those explanations dissolve into nonsense. Both the original problem and its answer are available online, so presumably the computer had learnt to rephrase this text in a sophisticated way, giving the appearance of a brilliant logician.

When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill and Ted, and the months to January, February, March and April, I got the same disastrous result. GPT-4 and the new GPT-4o both authoritatively worked through the structure of the argument but reached false conclusions at several steps, including the final one. (I also realised that in my first attempt I introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn’t bat an eyelid and “solved” it anyway.)


Curious, I tried another famous puzzle. A game show contestant is trying to find a prize behind one of three doors. The quizmaster, Monty Hall, allows a provisional pick, opens another door to reveal no grand prize, and then offers the contestant the chance to switch doors. Should they switch?

The Monty Hall problem is actually much simpler than Cheryl’s Birthday, but bewilderingly counterintuitive. I made things harder for GPT-4o by adding some complications. I introduced a fourth door and asked not whether the contestant should switch (they should), but whether it was worth paying $3,500 to switch if two doors were open and the grand prize were $10,000.**

GPT-4o’s response was remarkable. It avoided the cognitive trap in this puzzle, clearly articulating the logic of every step. Then it fumbled at the finishing line, adding a nonsensical assumption and deriving the wrong answer as a result.

What should we make of all this? In some ways, Perez-Cruz and Shin have merely found a twist on the familiar problem that large language models sometimes insert believable fiction into their answers. Instead of plausible errors of fact, here the computer served up plausible errors of logic.

Defenders of large language models might respond that with a cleverly designed prompt, the computer may do better (which is true, although the word “may” is doing a lot of work). It is also almost certain that future models will do better. But as Perez-Cruz and Shin argue, that may be beside the point. A computer that is capable of seeming so right yet being so wrong is a risky tool to use. It’s as though we were relying on a spreadsheet for our analysis (hazardous enough already) and the spreadsheet would occasionally and sporadically forget how multiplication worked.

Not for the first time, we learn that large language models can be phenomenal bullshit engines. The difficulty here is that the bullshit is so terribly plausible. We have seen falsehoods before, and errors, and goodness knows we have seen fluent bluffers. But this? This is something new.

*If Bernard was told the 18th (or the 19th) he would know the birthday was June 18 (or that it was May 19). So when Albert says that he knows that Bernard doesn’t know the answer, that rules out these possibilities: Albert must have been told July or August instead of May or June. Bernard’s response that he now knows the answer for certain reveals that it can’t be the 14th (which would have left him guessing between July and August). The remaining dates are August 15 or 17, or July 16. Albert knows which month, and the statement that he now knows the answer reveals the month must be July and that Cheryl’s birthday is July 16.
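
For readers who would rather check the elimination mechanically, here is a minimal Python sketch of the three steps in the footnote above. The function names (`unique_days`, `solve`) are illustrative choices and not something from Perez-Cruz and Shin's work. The sketch assumes each statement in the dialogue is true and that both friends reason perfectly.

```python
from collections import Counter

# The ten candidate dates from the puzzle, as (month, day) pairs.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def unique_days(dates):
    """Return the day numbers that appear exactly once among `dates`."""
    counts = Counter(day for _, day in dates)
    return {day for day, n in counts.items() if n == 1}


def solve(dates):
    """Apply the three statements in order and return the surviving dates."""
    # Statement 1 (Albert): "I don't know, and I know Bernard doesn't either."
    # Albert's month must hold more than one date, and must not contain
    # any day number that would identify the birthday to Bernard outright.
    giveaway_days = unique_days(dates)
    month_sizes = Counter(month for month, _ in dates)
    risky_months = {month for month, day in dates if day in giveaway_days}
    step1 = [(m, d) for m, d in dates
             if m not in risky_months and month_sizes[m] > 1]

    # Statement 2 (Bernard): "In that case, I now know."
    # Bernard's day must be unique among the dates that survived step 1.
    now_unique = unique_days(step1)
    step2 = [(m, d) for m, d in step1 if d in now_unique]

    # Statement 3 (Albert): "Now I know too."
    # Albert's month must have exactly one date left.
    months_left = Counter(month for month, _ in step2)
    return [(m, d) for m, d in step2 if months_left[m] == 1]


print(solve(DATES))
```

If a variant of the puzzle is ill-posed, for example because of a typo in the dates, `solve` returns an empty list or more than one date instead of a confident single answer.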

**The chance of initially picking the correct door is 25 per cent, and that is not changed when Monty Hall opens two empty doors. Therefore the chance of winning $10,000 is 75 per cent if you switch to the remaining door, and 25 per cent if you stick with your initial choice. For a sufficiently steely risk-taker, it is worth paying up to $5,000 to switch.
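
The arithmetic in this footnote can be written out as a short Python sketch. It rests on assumptions worth stating: the host knows where the prize is and only ever opens empty doors, the contestant switches to a closed door chosen at random if more than one remains, and "worth paying" means break-even for a risk-neutral player. The function names are illustrative.

```python
from fractions import Fraction


def switch_win_probability(doors=4, opened=2):
    """Chance of winning by switching after the host opens `opened` empty doors.

    Assumes the host knows the prize location and never reveals it.
    """
    p_first_pick_right = Fraction(1, doors)
    # Closed doors other than the contestant's original pick.
    remaining = doors - 1 - opened
    # The prize is behind one of the other doors with probability
    # 1 - 1/doors, shared equally among those still closed.
    return (1 - p_first_pick_right) / remaining


def break_even_fee(prize, doors=4, opened=2):
    """Largest fee a risk-neutral contestant should pay for the right to switch."""
    p_stay = Fraction(1, doors)
    p_switch = switch_win_probability(doors, opened)
    return (p_switch - p_stay) * prize


print(switch_win_probability())   # four doors, two opened
print(break_even_fee(10_000))     # compare with the $3,500 asking price
```

With three doors and one opened, the same function gives the classic two-in-three chance for switching.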


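Copyright notice: The copyright of this article belongs to FT中文网. No organisation or individual may reproduce, copy or otherwise use all or part of this article without permission; infringement will be pursued.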
