Andrej Karpathy stirs debate: AI is not magic, it only imitates, and "real" reinforcement learning is still a long way off


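Key points

01 AI expert Andrej Karpathy argues that AI is not all-knowing "magic" but something closer to a highly sophisticated "composite of data labelers".

02 AI models answer by imitating examples in their training data rather than inventing answers; their output is stochastic and statistically driven rather than the product of genuine reasoning.

03 AI performs well in many fields, but it struggles with ambiguous or contested questions and is affected by bias in its data.

04 Today AI improves mainly by imitating human behavior rather than innovating independently; future directions include stronger context awareness and more capable reasoning.

The summary above was generated by the Tencent Hunyuan large model and is for reference only.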




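When you ask an AI a question, have you ever wondered where the answer comes from? Is it some intelligence beyond our own, or a mechanical pile-up of data? A recent discussion started by Andrej Karpathy (an OpenAI co-founder who has since left the company) clears away some of the fog: AI is not all-knowing "magic" but something closer to a highly sophisticated "composite of data labelers". This article walks through that view and how these models actually work, to give a more realistic picture of the logic behind the technology.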

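1. What is AI at its core? From "mysterious magic" to "statistical model"

Karpathy's core claim is that an AI model is not some mysterious "intelligent being" but a distillation of its human data labelers. In other words, when you ask an AI a question, you are effectively asking "the average data labeler".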


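The role of data labelers

AI training relies on example data supplied by human labelers. They come from different fields: some are programmers, others are doctors. Their job is to write or evaluate answers, building up a library of "ideal response" examples. For instance, if you ask an AI for the "top 10 sights in Amsterdam", a hired labeler probably once researched a similar question and wrote up a list, or rated existing answers. The AI then answers by imitating those examples.

AI is not really a "creator"

Karpathy's point is that an AI's answer is statistical imitation: it does not invent the answer but draws on patterns in a vast body of data to produce a response of the expected form. He adds that in domains such as creative writing, code, and medicine the companies hire skilled labelers, so in those domains you are effectively asking those professionals.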

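2. AI's limitations: randomness, no true reasoning, and statistical drive

Although AI models perform well in many fields, their limitations are just as clear:

Randomness: a statistical "roll of the dice" that does not weigh opposing views

When a question is ambiguous or contested, a model trained purely by imitation does not naturally give a neutral, balanced answer. For example:

If 100 labelers say an object is blue and another 100 say it is yellow, the AI will answer blue or yellow with roughly equal probability. It will not say "this is disputed" unless the labelers were instructed to write answers like that. The behavior comes from matching the statistics of the training data, not from logical reasoning.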
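The blue-or-yellow behavior can be sketched in a few lines of Python. This is only a toy under stated assumptions: the "model" here is just the empirical frequency of labeler answers, and the function names `train` and `sample` are made up for illustration. A real LLM works on tokens and neural network weights, not on a lookup of whole answers.

```python
import random
from collections import Counter

def train(labels):
    """Estimate answer probabilities from labeler answers (pure imitation)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}

def sample(model, rng):
    """Draw one answer in proportion to its probability under the model."""
    answers = list(model)
    weights = [model[a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]

# Hypothetical labeling data: 100 labelers say blue, 100 say yellow.
labels = ["blue"] * 100 + ["yellow"] * 100
model = train(labels)

rng = random.Random(0)  # fixed seed so the run is repeatable
outputs = Counter(sample(model, rng) for _ in range(1000))

print(model)    # each answer has probability 0.5
print(outputs)  # only "blue" and "yellow" ever appear
```

The sentence "it's a debated question" never shows up in the output because no labeler wrote it, so the toy model assigns it zero probability. It would only appear if labelers were told to write such answers, which is the point Karpathy makes later in the thread.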

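No true reasoning: weak on complex problems

When a task falls outside what its data covers, the AI still does not truly understand the problem. In mathematics, for example, a top mathematician such as Terence Tao has contributed some training data, and imitating it lets a model produce sophisticated solutions. That does not mean the model can answer at his level: the underlying knowledge and reasoning ability may simply not be there, and the answer still rests on statistical patterns.

Bias in the data sources

The thread also touches on bias in the training data. A commenter pointed out that, by volume, much of the online discussion of the moon landing probably comes from people who deny it. Karpathy's reply is that pretraining covers everything, including the denial, and that it is only in the later finetuning stage that the model learns to answer in a way that matches reality.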

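3. How is AI trained? From "hodgepodge" to "assistant"

Pretraining: absorbing the internet at scale

In the first stage, the model is trained on a huge volume of documents scraped from the internet, which gives it a vast store of knowledge. It ends up "knowing everything", scientific theories and conspiracy theories alike. This stage gives the model its broad background knowledge.

Finetuning: training in the style of an assistant

Finetuning is what makes the AI "useful". In this stage the model is trained to imitate an assistant-style conversation, using dialogues written by labelers as examples. Its answers begin to show "helpful, honest, harmless" traits. That style makes it look like a "human assistant", but it is still a product of statistical patterns.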
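The two-stage picture can be caricatured with two answer distributions. The sketch below is a cartoon, not how gradient-based finetuning works: it simply blends a "pretraining" distribution with a "finetuning" distribution. All counts, the `weight` value, and the function names are invented for illustration.

```python
def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}

def mix(pretrained, finetuned, weight):
    """Blend two distributions; `weight` is the share given to finetuning data."""
    answers = set(pretrained) | set(finetuned)
    return {
        a: (1 - weight) * pretrained.get(a, 0.0) + weight * finetuned.get(a, 0.0)
        for a in answers
    }

# Made-up counts: by volume, internet text leans toward denial.
pretrain_counts = {"it was a hoax": 60, "it happened in 1969": 40}
# Made-up counts: labeler-written Assistant answers all reject the hoax claim.
finetune_counts = {"it happened in 1969": 50}

p_pre = normalize(pretrain_counts)
p_ft = normalize(finetune_counts)

# Assumption: in an Assistant-style conversation the finetuning statistics dominate.
p_assistant = mix(p_pre, p_ft, weight=0.95)
print(p_assistant)
```

In this toy the hoax answer keeps a small nonzero probability, which loosely mirrors Karpathy's remark that the knowledge from pretraining is still "somewhere in there" while the model takes on the style of the finetuning data.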

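4. What RLHF does and where it stops: from generation to discrimination

What is RLHF?

RLHF (reinforcement learning from human feedback) is one of the main methods used today to improve AI performance. Karpathy describes it as moving a model from "generative human" grade up to "discriminative human" grade:

  • Generative tasks: for example writing a poem about X, where the labeler has to produce the answer

  • Discriminative tasks: for example picking the best of five poems about X, which is easier for an average person than writing one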

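RLHF is not truly "superhuman"

RLHF-trained models can look "superhuman" in some domains, such as medical question answering, but Karpathy argues that this reflects the pooled judgment of many skilled human experts rather than a breakthrough by the AI itself.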
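The "pooled judgment" effect, which Karpathy calls a wisdom-of-crowds boost in the thread, can be illustrated with a majority vote. The sketch assumes each labeler is independently right with the same probability `p` on a yes/no question. Those assumptions, and the function name `majority_accuracy`, are introduced here for illustration and do not come from the thread.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent labelers is right.

    Assumes n is odd (no ties) and each labeler is right with probability p.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a majority always exists")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for n in (1, 3, 11):
    print(n, round(majority_accuracy(0.7, n), 3))
```

With `p = 0.7`, a single labeler is right 70% of the time and a panel of three is right 78.4% of the time, and the figure keeps rising as the panel grows. The ensemble beats any one member, but it is still bounded by what humans can judge, which is why Karpathy counts this as superhuman only "in some sense".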

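Karpathy says bluntly that RLHF is "just barely RL". "Actual" reinforcement learning needs a clear reward function, and for now it is still early and largely confined to domains where rewards are easy to define, such as math. As a result, today's AI improves mostly by imitating human behavior rather than by innovating on its own.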

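The potential of reinforcement learning

Looking ahead, reinforcement learning on chain-of-thought reasoning, combined with better context management, may let AI perform much better in complex decisions. For example:

  • Medical diagnosis: AI could handle limited information more effectively and support doctors' decisions

  • Task decomposition: by reasoning step by step, AI could catch and recover from its own mistakes while solving complex problems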
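A small back-of-the-envelope calculation shows why recovering from mistakes matters for multi-step reasoning. The sketch assumes steps succeed independently with probability `p_step` and that a failed step is always detected and can be retried. Both assumptions, and the function name `chain_success`, are simplifications made here and are not claims from the thread.

```python
def chain_success(p_step, n_steps, attempts=1):
    """Probability that every step in a chain succeeds.

    Each step may be tried up to `attempts` times, and it counts as done
    if any one attempt succeeds (assumes failures are always detected).
    """
    p_effective = 1 - (1 - p_step) ** attempts
    return p_effective ** n_steps

print(chain_success(0.9, 10))              # no recovery
print(chain_success(0.9, 10, attempts=2))  # one retry per step
```

With ten steps at 90% accuracy each, the whole chain succeeds only about 35% of the time. Allowing a single retry per step raises that to about 90%, which is the intuition behind doing each step "pretty well" and recovering when there is a mistake.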

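5. From "statistical imitation" to "real intelligence"

Although today's AI models still rely mainly on "statistical imitation", Jason Wei, who leads chain-of-thought reasoning work at OpenAI, is optimistic about where things are heading:

Better use of context: with improvements in retrieval, browsing, and long-context management, AI should be able to exploit its inherent advantages over humans, such as not getting tired or distracted and having nearly unlimited memory.

More capable reasoning: Wei observes that a big decision can be viewed as a tree of individual reasoning steps. He expects reinforcement learning on chain of thought to let AI do each step well and recover from mistakes, an important step toward "real intelligence".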

Original thread (English):

Andrej Karpathy:

People have too inflated sense of what it means to "ask an AI" about something. The AI are language models trained basically by imitation on data from human labelers. Instead of the mysticism of "asking an AI", think of it more as "asking the average data labeler" on the internet.

Few caveats apply because e.g. in many domains (e.g. code, math, creative writing) the companies hire skilled data labelers (so think of it as asking them instead), and this is not 100% true when reinforcement learning is involved, though I have an earlier rant on how RLHF is just barely RL, and "actual RL" is still too early and/or constrained to domains that offer easy reward functions (math etc.).

But roughly speaking (and today), you're not asking some magical AI. You're asking a human data labeler. Whose average essence was lossily distilled into statistical token tumblers that are LLMs. This can still be super useful of course. Post triggered by someone suggesting we ask an AI how to run the government etc. TLDR you're not asking an AI, you're asking some mashup spirit of its average data labeler.

Example when you ask eg “top 10 sights in Amsterdam” or something, some hired data labeler probably saw a similar question at some point, researched it for 20 minutes using Google and Trip Advisor or something, came up with some list of 10, which literally then becomes the correct answer, training the AI to give that answer for that question. If the exact place in question is not in the finetuning training set, the neural net imputes a list of statistically similar vibes based on its knowledge gained from the pretraining stage (language modeling of internet documents).

Marshal the Martian:

This doesn't feel right. The data labelers aren't hand-writing each curated list. They're grading whether the answer satisfies their RLHF rules. 

The llm weights should map the high dimensional surface of internet data about "good vacation spots" with positive sentiment.

Andrej Karpathy:

Clearly there's too many locations. The data labelers hand-write SOME of these curated lists, identifying (by example and statistics) the kind of correct answer. When asked that kind of question about something else & new, the LLM matches the form of the answer but pulls out and substitutes new locations from a similar region of the embedding space (e.g. good vacation spots with positive sentiment), now conditioned on the new location. (Imo that this happens is a non-intuitive and empirical finding and the magic of finetuning). But it is still the case that the human labeler programs the answer, it's just done via the statistics of the kinds of spots they picked out in their lists in the finetuning dataset. And imo it's still the case that what the LLM ends up giving you instantly right there and then is roughly what you'd get 1 hour later if you submitted your question directly to their labeling team instead.

Leo:

It doesn’t interpolate, does it? 

If I ask “What color is a Gropy?”, and we had 100 labellers say it’s blue and 100 labellers say it’s yellow, it’s going to randomly say blue or yellow - but never “It’s a debated question, some say blue, some say yellow”. Right?


Andrej Karpathy:

Excellent question and yes exactly, it responds with blue or yellow with 50% probability. Saying “It’s a debated question, some say blue, some say yellow” is just a sequence of tokens that would be super unlikely, it doesn't match the statistics of the training data at all.

Ian:

It says “it’s a debated question” on almost everything that’s a debated question. Try it.

Andrej Karpathy:

The human labelers are instructed in their training documentation to say stuff like that to keep things neutral.

roon:

RLHF can create superhuman outcomes

Andrej Karpathy:

Hmm. RLHF is still RL from _Human_ feedback, so I wouldn't say that exactly? RLHF moves the performance to "discriminative human" grade, up from SFT which is at "generative human" grade. But this is not so much "in principle" but more "in practice", because discrimination is easier for an average person than generation (e.g. label which of these 5 poems about X is best vs. write a poem about X). Separately you also get a separate boost from the wisdom of crowds effect, i.e. your LLM performance is not at human level, but at ensemble of human level. So with RLHF in principle the best you can hope for is to reach a performance where a panel of e.g. the top 10 human experts on some topic, with enough time given, will pick your answer over any other. So in some sense this counts as superhuman. To go proper superhuman in the way people think about it by default I think, you want to go to RL instead of RLHF, in the style of my earlier post on RLHF is just barely RL

Ig Nim:

I feel like the instant access to "skilled data labelers" in many domain is such a profound and useful function that we lacked prior to the LLM. We shouldn't take this new found accessibility feature for granted.

Andrej Karpathy:

💯 great way to put it

Alan Nicolas:

While you're technically right about training data, this view seems reductionist. The emergent patterns and insights I'm seeing in AI conversations go beyond simple averaging of labeler responses. It's like reducing human consciousness to 'just neurons firing'. Sometimes the whole becomes more than the sum of its parts.

Andrej Karpathy:

Agree that there can be a kind of compressed, emergent awareness that no individual person can practically achieve. We see hints of it but not clearly enough yet probably. See my short story on the topic 

Liam McCoy, MD MSc:

How do you square this with the recurrently superhuman performance in medical question answering domains? 

Are you implying they hire the best physicians to label? Or is it just that the breadth of factual knowledge retrieval makes up for the reasoning gaps

Andrej Karpathy:

Yes they hire professional physicians to label. You don't need to label every single possible query. You label enough that the LLM learns to answer medical questions in the style of a trained physician. For new queries, the LLM can then to some extent lean on and transfer from its general understanding of medicine from reading all the internet documents and papers and such.

Famously, for example, Terence Tao (a top tier mathematician) contributed some training data to LLMs. This doesn't mean that the LLMs can now answer at his level for all questions in math. The underlying knowledge and reasoning capability might just not be there in the underlying model. But it does mean that you're getting something much better than a redditor or something.

So basically "the average labeler" are allowed to be professionals - programmers, or doctors, or etc., in various categories of expertise. It's not necessarily a random person on the internet. It depends on how the LLM companies ran their hiring for these data labeler roles. Increasingly, they try to hire more higher-skilled workers.  You're then asking questions to a kind of simulation of those people, to the best of LLMs ability.

AI Furry Art (SFW-ish):

I disagree with it being the average. By volume, the average discussion around the moon landing is probably moon landing denial, because most of the people still discussing it on a regular basis are deniers, but most LLMs will not deny it. They learn some sense of correctness.

Andrej Karpathy:

First there is the pretraining stage where the AI is trained on everything, including moon landing denial.
The second, finetuning stage is where the dataset suddenly changes from internet documents to conversations between a "human" and an "Assistant", where the Assistant text comes from human labeler data, collected by paid workers. It's in this second stage that the token statistics are "matched up" to those in this finetuning dataset, which now looks like a helpful, honest, harmless Assistant.
The non-intuitive and slightly magical, empirical and not very well understood part is that the LLM (which is a couple hundred billion parameter neural net) retains the knowledge from the pretraining stage (Stage 1), but starts to match the style of the finetuning data (Stage 2). It starts to imitate an Assistant.
Because the Assistant data all has the same "vibe" (helpful, honest, harmless), the LLM ends up taking on that role. It still has all of the knowledge somewhere in there (of moon landing denying), but it's also adapted to the kind of person who would reject that as a hoax

Jason Wei (OpenAI):

Andrej’s tweet is the right way to think about it right now but I totally believe that in one or two years we will start relying on AI for very challenging decisions like diagnosing disease under limited information. Key thing to note here is that big decisions can be viewed as a tree of individual reasoning steps and RL on chain of thought seems like a feasible way for AI to do any single step pretty well and probably recover if there is a mistake

In addition, with better scaffolding like improvements in retrieval, browsing, and long context management, AI will be able to leverage its inherent advantages over humans like not getting tired or distracted, having nearly infinite memory, and not being clouded by emotions. So i think we will reach the “magical AI feeling” soon :)


