Dissecting GPT, Part 2: Why Can Transformers Compress World Knowledge?


Why Can Transformers Compress World Knowledge?

From linguistic “redundancy” and worldly “regularities” to a vast differentiable function


Preface: Did AI really “pack the world into its brain”?



If you follow AI news, you’ve probably seen claims like:


“GPT-4’s parameters already encode a huge amount of human world knowledge.”


It sounds mystical:

How could a combination of hundreds of billions of numbers

possibly contain centuries of books, papers, code, financial reports, and novels?


More intuitive doubts follow:


- The human world is so complex—did the model really “learn it”?

- Or is it just a very sophisticated text cut-and-paste machine?

- And if it truly compresses knowledge, what is the “format” of that compression?


In Part 1 we said:

An LLM is not a database—it’s a world model that can “compute answers.”


This chapter asks:


How is that world model “compressed” into the network?

What gives Transformers the power to do this?


We won’t go full hardcore math,

but you’ll get a solid, headache-free intuition.


1) First, admit this: language is “highly compressible”


If you’ve written reports, decks, or proposals, you know the feeling:


A lot of content is just “saying the same thing differently.”


The same logic gets repackaged over and over with new sentences, stories, and metaphors.


- The “rate-hike cycle” narrative gets reused across markets and countries

- “Growth vs. value” debates from a decade ago play just fine today

- Tales of “platform economics,” “network effects,” and “economies of scale” recur with new protagonists


This isn’t laziness—language is inherently templated and patterned.


From an AI perspective, this is crucial:


Human language isn’t random. It’s a vast set of reusable structures, patterns, and templates.


Two consequences follow:


- The data volume is large, but the information is extremely “repetitive”

- Once you capture the recurring structures, you can compress massively


Think of a beginner taking notes:

They copy every paragraph.


A seasoned analyst writes the “skeleton”:

They capture core templates for scenarios.


A Transformer’s job is to become that analyst who remembers the “skeleton,” not every paragraph.
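You can see this redundancy directly with an off-the-shelf compressor. The sketch below (plain Python with `zlib`; the sample sentence is made up for illustration) compares how well templated, repetitive text compresses versus random bytes of the same length:

```python
import os
import zlib

# Templated, repetitive text vs. random bytes of the same length.
template = ("The rate-hike cycle pressures growth stocks, "
            "while value stocks tend to hold up better. ") * 200
repetitive = template.encode()
random_bytes = os.urandom(len(repetitive))

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower = more compressible)."""
    return len(zlib.compress(data)) / len(data)

print(f"repetitive text: {ratio(repetitive):.3f}")   # far below 1.0
print(f"random bytes:    {ratio(random_bytes):.3f}")  # roughly 1.0
```

Repetitive structure compresses to a tiny fraction of its size; random data barely compresses at all. Language sits much closer to the first case than the second, which is exactly what makes it learnable.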


2) The world isn’t random either: regularity is compression


Language being compressible is only half the story.


The other half is: the world itself is structured by regularities.


- Physics: force, velocity, and acceleration have stable relationships

- Economics: supply–demand, rates, and inflation follow predictable dynamics

- Society: humans have consistent biases and inertia

- Narrative: stories revolve around conflict, turning points, and resolution


These regularities enable one pivotal move:


You don’t need to memorize every concrete case.

You can learn the underlying “generative mechanisms.”


Like a veteran trader who doesn’t remember every daily chart for 20 years, but knows:


- How rates affect risk assets

- How liquidity cycles drive valuations

- When market sentiment tends to overheat or freeze


In AI terms:

World regularities often admit “low-dimensional representations.”


Meaning—

Though you see myriad texts, events, and datasets,

the structures driving them don’t require that many dimensions.


For example:


National monetary policies look diverse, but project onto a few common axes:

- Prices

- Employment

- Growth

- Debt


Corporate strategies vary in style, but commonly project to axes like:

- Growth-first vs. profit-first

- Asset-light vs. asset-heavy

- Ecosystem vs. single product


To the model, what’s learned isn’t each case, but those “axes.”


That’s the plain-language version of “world regularities are low-dimensional.”
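A toy experiment makes "low-dimensional" concrete. The sketch below (NumPy only; the dataset sizes and the choice of 3 latent factors are arbitrary assumptions) builds 50-feature data secretly driven by 3 hidden axes, then checks how much variance those axes explain:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 "documents" described by 50 surface features,
# but actually driven by only 3 latent axes.
latent = rng.normal(size=(1000, 3))   # hidden driving factors
mixing = rng.normal(size=(3, 50))     # how each factor shows up in features
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 50))  # tiny noise

# Singular values reveal the effective dimensionality of the data.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
explained = (s**2) / (s**2).sum()
print(f"variance captured by top 3 axes: {explained[:3].sum():.4f}")
```

Real text is of course messier, but the same diagnostic (a sharp drop-off in singular values after a handful of components) is what "low-dimensional structure" looks like in practice.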


3) The Transformer’s key: learning “far-apart causality” together


Why Transformers, not earlier RNN-style architectures?


Because Transformers have a decisive advantage:


They handle “things that are far apart but related” extremely well.


Think about real narratives:


- Today’s stock price often traces back to a strategic decision three years ago

- A nation’s political trajectory can hinge on institutional choices decades back

- A relationship’s collapse may stem from a long-ignored detail early on


A model that only sees “nearby words” learns grammar at best.

It won’t see how “the initial promise” shapes “the final outcome.”


Transformer Attention lets the model—within a sentence, a section, even a whole document—


- Link an opening setup

- With a midstream twist

- And a concluding synthesis


In engineering terms: “long-range dependency.”

In plain terms:


It doesn’t just attend to “what you just said,”

it remembers “what you said before,”

and combines both to decide what comes next.


Practically, this means:


- The model can compress “causes” and “effects” into the same representation

- It can learn “conditions” and “outcomes” within the same pattern


Over time, it learns more than “A is often followed by B.”

It increasingly learns “this kind of A leads to that kind of B.”


That’s why modern LLMs:


- Don’t merely complete sentences

- They can, to an extent, “discuss logic” with you
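Here is a minimal sketch of the mechanism behind this: scaled dot-product attention in plain NumPy, applied to a toy "document" where token 0 (the setup) and token 5 (the conclusion) share a topic vector. The construction is contrived purely for illustration:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
x = 0.1 * rng.normal(size=(6, d))  # six mostly unrelated token vectors
topic = rng.normal(size=d)
x[0] += topic                      # the opening setup introduces a topic
x[5] += topic                      # ...and the conclusion returns to it

out, w = attention(x, x, x)
print(w[5].round(2))  # row 5 attends mainly to position 0 (and itself)
```

Despite four unrelated tokens in between, the last token's attention weights concentrate on position 0: what matters is similarity, not distance. That is the "long-range dependency" advantage in miniature.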


4) Another face of compression: from “text” to a “differentiable function”


So far we’ve stayed intuitive. Let’s edge toward engineering while keeping it readable.


One sentence to summarize an LLM’s essence:


Massive text (training data)

👉 compressed into parameters (weights)

👉 those parameters define a function (the model)

👉 that function accepts “questions” and outputs “plausible text”


The emphasis is on “function.”


The model isn’t storing answers. It learns a rule:


“In this context, what’s the most reasonable next token?”


Mathematically, that rule is an enormous function—

input is your prompt,

output is a probability distribution over the next token.
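As a concrete (and entirely made-up) miniature: four candidate tokens, raw scores from the model, and the softmax that turns them into the distribution the model actually outputs:

```python
import numpy as np

# Hypothetical 4-token vocabulary and raw model scores ("logits").
vocab = ["rates", "rise", "fall", "banana"]
logits = np.array([1.0, 3.5, 3.0, -2.0])

# Softmax turns logits into the next-token probability distribution.
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()
for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.3f}")
```

The output is not one answer but a ranking: "rise" and "fall" both get real probability mass, "banana" almost none. Sampling from this distribution is what generation is.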


What does “differentiable” mean?

Skip the calculus textbook—just note the key implication:


Because it’s differentiable, you can “tune the parameters” to improve it gradually.


During training, the model repeatedly:


- Predicts the next token

- Compares with the ground truth—computes the “error”

- Uses that error to adjust its parameters


This is “gradient descent.”
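The three-step loop above can be sketched with a single-parameter model. Assuming a made-up stream in which "B follows A" 80% of the time, gradient descent on cross-entropy tunes the parameter toward that frequency:

```python
import numpy as np

# One-parameter toy of the training loop. Hypothetical data: a stream in
# which token B follows token A 80% of the time.
data = np.array([1] * 80 + [0] * 20)   # 1 = "B came next"

w = 0.0      # the single parameter (a logit)
lr = 0.1     # learning rate
for _ in range(2000):
    p = 1 / (1 + np.exp(-w))    # 1. predict: current P(next = B)
    grad = np.mean(p - data)    # 2. compare: gradient of mean cross-entropy
    w -= lr * grad              # 3. adjust the parameter downhill
print(f"learned P(next = B) = {1 / (1 + np.exp(-w)):.2f}")  # → 0.80
```

A real LLM does exactly this, except with hundreds of billions of parameters and the "difference" backpropagated through every layer. Differentiability is what makes step 3 possible at all.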


Over time, the model doesn’t memorize article content.

It learns “how to infer the next step from what it has seen.”


Think of it as an extreme version of “auto-complete,”

but it completes not just sentences—it completes:


- Story structure

- Conceptual linkages

- Operational patterns of the world


Thus:


- Training data = the raw world (pre-compression)

- Weights = the compressed world model

- Forward generation (answering, continuing text) = part of the decompression process


That’s the real meaning of “world knowledge approximated as a high-dimensional differentiable function.”


5) Multi-head, multi-layer Attention: AI’s “multi-stage analysis team”


In Part 1, we likened multi-head, multi-layer Attention to a finely divided analysis team.


Here’s another angle:


Those layers and heads are doing “stage-wise compression.”


Imagine this pipeline:


- Early layers:

  Break text into more “syntactic” elements—

  who’s the subject, who’s the object, tense, who negates whom.


- Middle layers:

  Shift focus to “who did what to whom,” “event order,” “role relationships.”

  It’s like turning sentences into an event graph: “Under what conditions did A do X to B?”


- Deep layers:

  Abstract higher-level regularities atop those events.

  For example: “business-cycle narrative templates,” “typical paths of industry transformation,” “patterns of tech adoption and organizational change.”


Layer-by-layer abstraction yields:


- Text becomes “structure”

- Structure becomes “geometry”

- Geometry solidifies into weights—your model’s worldview


So when you ask:


“A company that historically sold hardware wants to enter cloud services—what’s the biggest challenge?”


It doesn’t search for a similar case.

It asks itself:


- How many “hardware-to-services” stories have I seen?

- What common pain points recur?

- What frictions typically appear in org, culture, and business model?


Then it draws on that internalized structure

to give you a “synthesized” answer.


6) What are the limits and costs of this compression?


Yes, Transformers can look magical.

But every compression has a price; LLMs are no exception.


1. Not a “full backup,” more like a “world map with holes”

Because the model “approximates” a world function,

some regions are very detailed (rich data, strong regularities),

others are rough (sparse data, unclear patterns).


That’s why:


- Major languages (English) outperform low-resource languages

- Popular domains (general knowledge, mainstream industries) beat niche topics

- Brand-new events are often answered poorly (they never appeared in training)


2. Compression inevitably introduces “hallucination” risk

Since the model answers via “internalized patterns,” not table lookups,

when information is thin, it still “reasonably guesses” by its own logic.


That’s “hallucination”:

the logic flows, the prose is elegant—but facts may be off.


By our earlier analogy:


It’s a consultant who tells great stories and has seen many cases,

not a database that monitors the freshest numbers daily.


That’s why Part 1 concluded:

LLMs should pair with external memory (RAG, GraphRAG, vector DBs).


Conclusion: A Transformer is a “world-regularity encoding machine”


One business-book line to summarize a Transformer:


It’s not memory—it’s a machine that encodes world regularities into a function.


It works because of three factors:


- Language is highly redundant: templates and repetition make it compressible

- World regularities are low-dimensional: the driving axes are fewer than they appear

- Transformers excel at long-range dependencies: they consider cause/effect, conditions, roles, and time together


With gradient descent,

all of this gets packed into a “high-dimensional differentiable function,”

becoming the large models we call GPT, Claude, and Gemini.


In the next chapter, we’ll shift the lens from the model’s “internal brain”

to the “external memory systems”:


Implicit knowledge vs. explicit memory:

Why LLMs need RAG and GraphRAG to work alongside them

to become truly usable, enterprise-grade AI.


This work is created by Dalishi’s AI World and released under the Creative Commons Attribution–NonCommercial–NoDerivatives 4.0 International license.
