Created on 2/20/2026, 7:57 · Updated on 2/20/2026, 8:06
Marvin Minsky(V): The Society of Mind
Preface: First Principles thinking, and Minsky’s second book that inspired Jensen Huang to create GPUs. Co-written with Gemini.
When Minsky and Seymour Papert laid out the seemingly insurmountable obstacles to artificial intelligence in Minsky's first book, Perceptrons (1969), building on Frank Rosenblatt's original 1958 perceptron paper, he did not know how to overcome them. He knew that the hardware infrastructure of the time could not carry the complex calculations that would be needed, and he did not know how to teach a machine to learn an algorithm. When Minsky's second book, The Society of Mind, was published in 1986, Jensen Huang was pursuing his Master's in Electrical Engineering at Stanford University (he graduated in 1992). This was a period when Minsky's theories were being heavily debated in academic circles. Jensen has often discussed how "First Principles" thinking, a cornerstone of Stanford's engineering culture, led him to look at computation through a biological and modular lens. We hear the phrase "First Principles" tossed around a lot, but essentially it is the act of boiling a problem down to its most fundamental, indisputable truths and building a solution from the ground up, rather than relying on how things were done in the past. It is often contrasted with "Reasoning by Analogy," which is how most of us think. Analogy says, "We should do X because it's like Y, and Y worked." First Principles says, "Forget Y. What are the physical laws and basic facts we are working with here?"
To think from first principles, you generally follow a loop. First, list everything you think you know about the problem (e.g., "Rockets are expensive because they’ve always been $60M"). Then strip away the "baggage" until you hit the "atomic" facts (e.g., "What is a rocket actually made of? Aluminum, titanium, and fuel. What do those raw materials cost on the market?"). Finally, use those raw facts to build a new path (e.g., "If the materials only cost $2M, how can we combine them more efficiently to build a cheaper rocket?"). Most people don't use first principles because it is mentally exhausting. Analogy is a shortcut; it saves energy by letting us copy existing patterns. First Principles requires you to "doubt everything" (a method famously used by René Descartes) until you find something you can't doubt anymore. When NVIDIA started, the standard was to build CPUs to handle everything. Jensen went back to first principles: "What does graphics math actually require?" He realized it wasn't one complex "genius" logic gate, but millions of tiny, simple "addition" steps. He didn't build a better CPU; he built an entirely different machine: the GPU. When Jensen, Chris Malachowsky, and Curtis Priem met at Denny's in 1993, they didn't ask, "How do we make a better graphics card?" They asked, "What is the most difficult problem in computing that people will pay to solve?" The team broke 3D graphics down to its atomic level. They realized that a 3D image isn't a "picture"; it's a massive collection of triangles and pixels. The traditional CPU is one powerful processor that does everything: it calculates where a triangle is, then how light hits it, then what color it should be, one after another. Because every pixel's color is independent of the others, you don't need a "genius" processor; you need a "society of simple workers" working in parallel.
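The three-step loop above can be turned into a toy calculation. The dollar figures below are the illustrative numbers from the rocket example in the text, not real aerospace costs:

```python
# Toy first-principles cost check, using the illustrative rocket figures
# from the text above (not real aerospace data).

# Step 1: the assumption inherited by analogy.
market_price = 60_000_000  # "rockets have always cost ~$60M"

# Step 2: strip to atomic facts. What is the thing physically made of?
# (The breakdown into these three materials is a made-up illustration.)
raw_materials = {"aluminum": 1_200_000, "titanium": 600_000, "fuel": 200_000}
floor_cost = sum(raw_materials.values())  # the $2M "atomic" floor

# Step 3: rebuild from the facts. The gap between price and floor is the opportunity.
analogy_markup = market_price / floor_cost

print(f"Raw-material floor: ${floor_cost:,}")
print(f"Market price is {analogy_markup:.0f}x the physical floor")
```

The point of the sketch is that analogy only ever sees `market_price`; first principles asks what `floor_cost` actually is, and treats the ratio between them as the room for a new design.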
In the early 90s, the CPU was the "boss" of the computer. It did all the thinking and just sent the final results to the graphics card to be displayed. The "wire" (bus) between the CPU and the graphics card was too slow, and as games got more complex, the CPU couldn't keep up. At the most basic level, a computer chip does math using an ALU (Arithmetic Logic Unit). A CPU has a few very large, very complex ALUs. They are "Generalists": they can add, subtract, predict which way a program will branch, and handle "if/then" logic at clock speeds that today reach around 5 GHz. The NVIDIA team realized that to draw a 3D world, you don't need "if/then" logic; you just need to multiply and add numbers over and over (Matrix Math). This is the fundamental fork in the road between the two most important types of chips in history. To understand it, we have to look at why "if/then" logic is the enemy of speed, and why Matrix Math is the secret to rendering reality.
A CPU is designed to handle unpredictable tasks. When you use Excel, browse the web, or run an OS, the computer is constantly making decisions: "If the user clicks this button, then open this menu; else, stay idle," or "If the password matches, then grant access." To do this at 5 GHz, 5 billion cycles per second, the CPU spends a massive amount of its "brain power" on Branch Prediction. It literally tries to guess the future: it guesses which way an "if" statement will go and starts working on it before you even click. If it guesses wrong, it has to throw away all that work and start over. This makes the CPU a "genius," but a very lonely one; it spends most of its time managing its own thoughts.
When you render a 3D world (or train an AI), there are almost no "if" statements. You don't ask, "If this pixel is red..." You say, "Multiply these 1 million light values by these 1 million surface colors." This is Matrix Math. In physics and graphics, almost everything is a "Dot Product"—a fancy way of saying "multiply a bunch of pairs and add them up."
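Since the text defines a dot product as "multiply a bunch of pairs and add them up," here is exactly that definition as a minimal plain-Python sketch; the light and color values are made-up sample numbers:

```python
# A dot product is nothing but "multiply pairs, then add them up".
def dot(lights, colors):
    total = 0.0
    for l, c in zip(lights, colors):  # each multiply is independent of the others
        total += l * c                # only the final sum ties them together
    return total

# Three made-up light intensities paired with three made-up surface colors.
light_values = [0.5, 1.0, 0.25]
surface_colors = [0.8, 0.6, 0.4]
result = dot(light_values, surface_colors)  # 0.5*0.8 + 1.0*0.6 + 0.25*0.4
print(result)
```

Notice there is not a single `if` in the loop: every pair can be multiplied at the same time by a different worker, which is exactly why this operation maps so naturally onto thousands of tiny parallel cores.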

A CPU is built to be a high-performance soloist. To make sure it never stops working, it devotes massive amounts of silicon to Branch Prediction, a giant "guessing engine" that tries to predict which way your code will turn (if/else). It also uses Deep Caches, large pools of memory built right into the chip to store data "just in case" the CPU needs it. And then there is Out-of-Order Execution Logic, complex circuitry that re-arranges your code on the fly to find parts that can be done sooner. In this setup, a great deal of the chip's power goes into predicting, computing ahead of time, and caching data just in case it is needed; when it isn't, that work is simply thrown away. In a typical CPU, only about 10% to 20% of the actual chip surface is dedicated to the ALUs (the parts that actually do the math). The rest is "the manager," making sure those few ALUs are always busy.
NVIDIA’s "First Principles" realization was that for graphics, stalls don't matter. If one pixel takes a millisecond longer than another, the user won't notice, as long as all 8 million pixels show up on time. Because they didn't care about "latency" (the speed of a single task), they could strip away Branch Prediction entirely, since the GPU assumes every thread is doing the same thing. Instead of huge internal memory, they built a massive "highway" (high bandwidth) to external memory, so the deep caches could be replaced by tiny ones. Instead of Out-of-Order Execution Logic that re-arranges code on the fly, GPUs use simple control logic, which allows one "manager" to handle a group of 32 or more ALUs. By removing this "Management Tax," NVIDIA could fit thousands of ALUs in the same square millimeter where a CPU could fit only a few. They stripped away the "smart" parts of the processor and made the ALUs as small as possible so they could cram thousands of them onto one piece of silicon. This was the transition from Serial Processing to Parallel Processing. In Serial Processing, the chip solves problem A, then problem B, then problem C. If you have 1,000 pixels to color, the CPU does them one by one; even if it is very fast, it is still a queue. In the Massively Parallel approach (the NVIDIA way), the GPU takes all 1,000 pixels and hands one pixel to each of its 1,000 tiny cores. They all "fire" at the exact same time, and the task is finished in the time it takes to do one calculation, rather than one thousand.
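The serial-versus-parallel contrast can be sketched in a few lines. Here NumPy's vectorized multiply stands in for the GPU's army of ALUs (an analogy for one-instruction-many-data execution, not actual GPU code), and the pixel and light values are invented sample data:

```python
import numpy as np

# Serial (the CPU way): visit each pixel one at a time -- a queue.
def shade_serial(pixels, lights):
    out = []
    for p, l in zip(pixels, lights):
        out.append(p * l)          # the same tiny operation, repeated in order
    return out

# Parallel (the GPU way): one identical instruction applied to every pixel
# at once. NumPy's elementwise multiply plays the role of the ALU army here.
def shade_parallel(pixels, lights):
    return pixels * lights         # all elements "fire" in a single step

pixels = np.array([0.2, 0.4, 0.6, 0.8])   # made-up brightness values
lights = np.array([1.0, 0.5, 0.5, 1.0])   # made-up light multipliers

assert np.allclose(shade_serial(pixels, lights), shade_parallel(pixels, lights))
print(shade_parallel(pixels, lights))
```

Both functions compute the same answer; the difference is purely in how the work is scheduled, which is the whole point of the "society of simple workers" design.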
And this concept of Parallel Processing traces back to Minsky's second book, The Society of Mind. In The Society of Mind (1986), Marvin Minsky argues that the human mind is not a single, unified "thing," but a vast society of tiny, mindless processes called agents. His core question is: how can intelligence emerge from non-intelligence? His answer is that while each individual agent is simple and "dumb" (capable of only one specific, tiny task), when agents are joined together in specific organizational structures, they produce the complex behaviors we call "thinking," "feeling," and "consciousness." This rests on the concept of Emergence: the phenomenon where a complex system exhibits properties that none of its individual parts possess. It is the scientific version of the phrase "the whole is greater than the sum of its parts." A classic example is the ant colony. An individual ant is "dumb": it has a tiny brain, limited senses, and a very simple set of rules (e.g., "If you find food, leave a scent trail"). Put millions of these simple ants together, however, and a colony emerges. The colony as a whole acts like a "super-organism." It can solve complex geometry, bridge gaps, farm fungus, and wage war. No single ant "understands" the blueprint of the nest, yet the nest gets built. The wetness of water is another classic example of Emergence. Neither hydrogen nor oxygen gas is "wet," but when you combine them into H2O at the right temperature, "wetness" emerges. You cannot find the "wet" property by looking at a single molecule; it only exists when trillions of them interact. This is exactly how ChatGPT and other Large Language Models (LLMs) work: billions of simple mathematical "weights" on an NVIDIA GPU are the agents, and each weight just does a simple "multiply and add" calculation. Developers did not "program" the AI to understand sarcasm or how to write code.
They simply built a massive "society" of math agents, and as the system got larger, these complex abilities emerged spontaneously. Philosophers like Daniel Dennett often distinguish between two types of emergence. In Weak Emergence, we can see how the parts create the whole, even if the mechanism is complex (e.g., a car engine's power comes from thousands of small explosions). In Strong Emergence, the new property is so different from the parts that it is almost impossible to explain how it happened (e.g., how "meat" in the brain creates the "feeling" of love, or consciousness). In the context of Marvin Minsky's Society of Mind and NVIDIA's AI, emergence is the "magic" that happens when you connect enough simple "agents" together.
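As a toy illustration of the "weights as agents" idea, a neural-network layer really is just a batch of multiply-and-add operations. The sizes and random values below are arbitrary, chosen only for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "society": 4 inputs feeding 3 output agents. Every weight is a dumb
# agent that only knows how to multiply its input and add to a running sum.
weights = rng.normal(size=(3, 4))   # arbitrary values standing in for learned weights
inputs = rng.normal(size=4)         # arbitrary values standing in for an input signal

# The entire layer is one batch of dot products -- the same multiply-and-add
# described in the text, just repeated billions of times in a real model.
layer_output = weights @ inputs

# Sanity check: output agent 0 is literally sum(w * x) over its inputs.
manual = sum(weights[0, j] * inputs[j] for j in range(4))
assert np.isclose(layer_output[0], manual)
print(layer_output)
```

No single weight "knows" anything; whatever capability the full model shows lives only in the configuration of billions of these identical little operations, which is the emergence claim in miniature.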
Minsky breaks the mind down into a hierarchy. At the bottom are agents, the smallest units of thought. An agent might be responsible for "seeing a red color," "moving a finger," or "recognizing a vertical line." By itself, an agent has no "mind." When agents work together to accomplish a larger task, they form an agency. For example, a "Builder" agency might consist of separate agents for "Find a block," "Pick up a block," and "Add it to the tower." Minsky introduces the concept of K-lines (Knowledge-lines) to explain memory. A K-line is essentially a mental "wire" that, when activated, turns on the specific group of agents that were active when you learned something or solved a problem in the past. Instead of storing "data" like a hard drive, the mind stores configurations; memory is the act of re-activating a previous "state" of your mental society. To explain how we handle new situations, Minsky uses Frames. A frame is a mental template or "skeleton" with slots for details. When you walk into a "birthday party," you don't have to relearn what a chair is. You invoke a "Birthday Frame" that already has slots for cake, presents, and guests; you only need to fill in the specific details (e.g., "The cake is chocolate"). Because the mind is a society, different agents often want different things. One agency might want to "Sleep" while another wants to "Finish this book." Minsky explains that we have "Manager" agents whose only job is to settle disputes between subordinate agents: if the "Sleep" agency is stronger, it suppresses the "Read" agency. Minsky believes that consciousness is a "user illusion," a simplified story that our higher-level agents tell themselves to keep track of what the millions of lower-level agents are doing. More on this in the next post. ☀️
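Minsky's hierarchy can be caricatured in a few lines of code. Every class name, agency, and numeric "urgency" below is an illustrative invention for this sketch, not notation from the book:

```python
# A toy caricature of Minsky's hierarchy: agents, agencies, and a manager.
# All names and numeric "urgencies" are invented for illustration.

class Agent:
    """Smallest unit: performs one tiny task and has no mind of its own."""
    def __init__(self, name, action):
        self.name, self.action = name, action

    def run(self):
        return self.action()

class Agency:
    """A group of agents cooperating on a larger task, with a drive strength."""
    def __init__(self, name, urgency, agents):
        self.name, self.urgency, self.agents = name, urgency, agents

    def run(self):
        return [a.run() for a in self.agents]

def manager(agencies):
    """A 'manager' settles disputes: the strongest agency suppresses the rest."""
    return max(agencies, key=lambda ag: ag.urgency)

# The "Builder" agency from the text, decomposed into its three dumb agents.
builder = Agency("Build", urgency=0.6, agents=[
    Agent("find",  lambda: "find a block"),
    Agent("grasp", lambda: "pick it up"),
    Agent("stack", lambda: "add it to the tower"),
])
sleeper = Agency("Sleep", urgency=0.9, agents=[Agent("yawn", lambda: "close eyes")])

winner = manager([builder, sleeper])
print(winner.name, "->", winner.run())  # Sleep outranks Build, so Build is suppressed
```

The suppression here is just a `max` over urgencies; Minsky's point is that nothing smarter than this kind of mechanical arbitration needs to exist anywhere in the system for coherent behavior to come out of it.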
References & Recommended Reading
I. The Core Theoretical Foundations
These are the primary works that shaped the "Society of Mind" philosophy and the "AI Winter" that preceded NVIDIA's rise.
Minsky, Marvin. The Society of Mind. Simon & Schuster, 1986.
The Inspiration: Jensen Huang’s "bible" for parallel architecture. It argues that intelligence emerges from thousands of simple, mindless agents working together.
Minsky, Marvin, and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
The Catalyst: This book famously shut down early neural network research by proving the limitations of single-layer "Perceptrons," leading to the first AI Winter. It set the stage for the hardware breakthroughs we have today.
Rosenblatt, Frank. "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review, 1958.
The Origin: The original paper that proposed a machine could "learn" like a brain.
II. First Principles & Cognitive Science
To understand how Jensen thought through these problems, these readings cover the methodology of First Principles and the philosophy of the mind.
Dennett, Daniel C. Consciousness Explained. Little, Brown and Co., 1991.
The Philosophy: Explains the "User Illusion" and the "Intentional Stance," providing the philosophical basis for why a society of simple agents can feel like a single "mind."
Pinker, Steven. How the Mind Works. W. W. Norton & Company, 1997.
The Bridge: Connects evolutionary biology to the computational theory of mind—the idea that the brain is a system of specialized, evolved "gadgets."
Descartes, René. Meditations on First Philosophy. 1641.
The Method: The historical root of "Doubting Everything" to find fundamental truths, which Jensen and Musk have modernized into "First Principles Thinking."
III. The History of the GPU & NVIDIA
For the "hardware" side of the story—how the silicon actually changed to match the philosophy.
Peddie, Jon. The History of the GPU (Series). Springer, 2023.
Vol 1: Steps to Invention
Vol 2: Eras and Environment
Vol 3: From Inception to AI
Why Read: The most comprehensive technical history of how NVIDIA won the graphics war and created the "World's First GPU" (the GeForce 256).
"Jensen Huang on How to Use First-Principles Thinking to Drive Decisions." View From The Top, Stanford Graduate School of Business (Podcast/Transcript), 2024.
The Source: A direct deep-dive from Jensen on his Stanford roots and his "First Principles" framework for running NVIDIA.
IV. Modern Synthesis: Emergence & AI
These works explain the "magic" of emergence that we see in current models like ChatGPT.
Bennett, Max. A Brief History of Intelligence. HarperCollins, 2023.
The Modern View: A fantastic look at how AI and neuroscience have finally converged to prove the "Society of Mind" theories in real-time.
Mitchell, Melanie. Complexity: A Guided Tour. Oxford University Press, 2009.
The Concept: The best entry-point for understanding "Emergence"—how ants, water, and AI models all follow the same rules of collective intelligence.
