Data-Driven Thoughts on Moving the AGI Mountain, the Foolish Old Man's Way | 5Y View

Author



Shi Yunfeng, Vice President at 5Y Capital





My hypothesis was inspired by friends in the industry and by Andrej Karpathy, who recently wrote on Twitter: "People have too inflated a sense of what it means to 'ask an AI' about something. The AI is a language model trained basically by imitation on data from human labelers. So instead of the mysticism of 'asking an AI', think of it more as 'asking the average data labeler on the internet'."


AI, or rather large models, involves no magic at its core, however hard interpretability research or theory tries to explain the so-called emergence of intelligence. Scaling laws, RL, and every other dazzling technical term mostly serve to dress up the motivation. Strip away the makeup and two plain first principles remain:


1. Obtain better, more comprehensive data

2. Model the data obtained in (1) more efficiently


The next-generation computing paradigm needed to drive a productivity shift may not be an AI with "human-level intelligence" so much as an intelligence that can do work in place of humans. If we define the goal of AGI as handling 90% of white-collar work, we can certainly spend years studying and reproducing the technical paths of OpenAI or Anthropic, second-guessing the brilliant ideas of the Ilyas and John Schulmans, and pushing the limits along route (2). But the simpler way to achieve a productivity shift is not necessarily (2). The most direct way is to commit unwaveringly to (1): collect, from every available source, the broadest long-tail data that captures and generalizes how humans across industries actually work, use it as ground truth for the pretraining and post-training stages, exhaustively merge it into the language model over a 3-5 year horizon, and then hope that an intelligence able to do all basic logical work in place of humans simply emerges.
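To make "long-tail ground truth" concrete, here is a minimal sketch of what one post-training (SFT) record for a single vertical domain could look like. The schema and every field value are hypothetical, not any lab's actual format; the point is only that each record ties a realistic work task to an expert-verified answer plus provenance.

```python
from dataclasses import dataclass


@dataclass
class VerticalSFTRecord:
    """One expert-verified example for a single vertical domain.

    Hypothetical schema: each record pairs a realistic work task with a
    verified answer and enough provenance to audit it later, so it can
    serve as ground truth for post-training (SFT).
    """
    domain: str            # e.g. "accounting_quarterly_close" (invented name)
    task_prompt: str       # the request as a practitioner would actually phrase it
    reference_answer: str  # answer written or approved by a domain expert
    source: str            # where the task came from (interview, workflow log, ...)
    annotator_id: str      # which labeling team produced and reviewed it


# A toy record; all contents are invented for illustration.
example = VerticalSFTRecord(
    domain="accounting_quarterly_close",
    task_prompt=(
        "Draft the journal entries to accrue Q3 cloud-hosting costs that "
        "were invoiced in October."
    ),
    reference_answer=(
        "Debit hosting expense and credit accrued liabilities for the "
        "September usage; reverse the accrual when the October invoice posts."
    ),
    source="practitioner_interview",
    annotator_id="team_fin_07",
)

print(example.domain, "->", example.task_prompt)
```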


For the leading LLM teams, the cost of this path, driving AGI with high-quality vertical-domain data, is entirely acceptable:


  • Goal: cover 400 vertical domains within 3 years

  • Each vertical domain takes roughly one quarter of effort from a small team of 3-5 solid CS undergraduates

    • Get the domain's data engineering pipeline running, tune prompts, align on requirements, etc.

    • This level of data investment is generally enough to produce a SOTA model for that vertical

  • Once the workflow is standardized, later domains go faster

  • About 30 such undergraduate teams (in a sense, tutors hired for the AI) are enough to cover 400 vertical domains in roughly 3 years; a rough back-of-envelope check follows this list
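A quick sanity check of the plan above. The team count, time horizon, and 400-domain target are the article's own numbers; the one-domain-per-team-per-quarter throughput and the 20% speedup from a standardized workflow are my assumptions.

```python
# Back-of-envelope check of the coverage plan above.
teams = 30                      # undergraduate "tutor" teams working in parallel
quarters = 3 * 4                # 3-year horizon
domains_per_team_quarter = 1.0  # assumed baseline throughput
speedup_after_year1 = 1.2       # assumed gain once the workflow is standardized

baseline = teams * quarters * domains_per_team_quarter
with_speedup = teams * (4 * domains_per_team_quarter
                        + (quarters - 4) * domains_per_team_quarter * speedup_after_year1)

print(f"baseline coverage:  {baseline:.0f} domains")    # 360
print(f"with later speedup: {with_speedup:.0f} domains")  # 408
# Either way the plan lands in the neighborhood of the 400-domain target.
```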


Brute-forcing data at scale does not have to depend entirely on a singular genius like Ilya. In 2024 we also happen to have the world's most lavish and most diverse pool of idle intellectual resources, and with those resources this hard, straight road may be the fool's shortcut to AGI: 10x the volume of high-quality data labeled by cross-domain experts, meticulous and down-to-earth data engineering, 5%-10% of a frontier model's compute budget, and in return an intelligent system of high practical value.


Data is all you need. If you are a team exploring this possible "fool's shortcut to AGI", working on data generation or labeling for vertical domains, whether the data is for code, operator use, or anything else, and whether the innovation lies in algorithms or in how the work is organized, I would love to hear from you: stevenshi@5ycap.com





Below is Andrej Karpathy's original post and the follow-up Q&A, reproduced for reference.


Author: Andrej Karpathy


Founder of Eureka Labs. Previously, Director of AI at Tesla, founding team member at OpenAI, CS231n/PhD at Stanford.




People have too inflated sense of what it means to "ask an AI" about something. The AI are language models trained basically by imitation on data from human labelers. Instead of the mysticism of "asking an AI", think of it more as "asking the average data labeler" on the internet.


Few caveats apply because e.g. in many domains (e.g. code, math, creative writing) the companies hire skilled data labelers (so think of it as asking them instead), and this is not 100% true when reinforcement learning is involved, though I have an earlier rant on how RLHF is just barely RL, and "actual RL" is still too early and/or constrained to domains that offer easy reward functions (math etc.).


But roughly speaking (and today), you're not asking some magical AI. You're asking a human data labeler. Whose average essence was lossily distilled into statistical token tumblers that are LLMs. This can still be super useful of course. Post triggered by someone suggesting we ask an AI how to run the government etc. TLDR you're not asking an AI, you're asking some mashup spirit of its average data labeler.


Example when you ask eg “top 10 sights in Amsterdam” or something, some hired data labeler probably saw a similar question at some point, researched it for 20 minutes using Google and Trip Advisor or something, came up with some list of 10, which literally then becomes the correct answer, training the AI to give that answer for that question. If the exact place in question is not in the finetuning training set, the neural net imputes a list of statistically similar vibes based on its knowledge gained from the pretraining stage (language modeling of internet documents).


Q: RLHF can create superhuman outcomes


AK: Hmm. RLHF is still RL from _Human_ feedback, so I wouldn't say that exactly? RLHF moves the performance to "discriminative human" grade, up from SFT which is at "generative human" grade. But this is not so much "in principle" but more "in practice", because discrimination is easier for an average person than generation (e.g. label which of these 5 poems about X is best vs. write a poem about X). Separately you also get a separate boost from the wisdom of crowds effect, i.e. your LLM performance is not at human level, but at ensemble of human level. So with RLHF in principle the best you can hope for is to reach a performance where a panel of e.g. the top 10 human experts on some topic, with enough time given, will pick your answer over any other. So in some sense this counts as superhuman. To go proper superhuman in the way people think about it by default I think, you want to go to RL instead of RLHF, in the style of my earlier post on RLHF is just barely RL

https://x.com/karpathy/status/1821277264996352246


Q: It doesn’t interpolate, does it?


If I ask “What color is a Gropy?”, and we had 100 labellers say it’s blue and 100 labellers say it’s yellow, it’s going to randomly say blue or yellow - but never “It’s a debated question, some say blue, some say yellow”. Right?


AK: Excellent question and yes exactly, it responds with blue or yellow with 50% probability. Saying “It’s a debated question, some say blue, some say yellow” is just a sequence of tokens that would be super unlikely, it doesn't match the statistics of the training data at all.


Q: It says “it’s a debated question” on almost everything that’s a debated question. Try it.


AK: The human labelers are instructed in their training documentation to say stuff like that to keep things neutral.


Q: I feel like the instant access to "skilled data labelers" in many domain is such a profound and useful function that we lacked prior to the LLM. We shouldn't take this new found accessibility feature for granted.


AK: 100% great way to put it.