What Is the SLM (Small Language Model) That Even NVIDIA's Jensen Huang Is Promoting, and Why? (What's SLM? Why SLM?)
Strategic Choices for Agentic AI
Why a “Pure LLM Architecture” Is Slow-Motion Commercial Suicide
— A Full Breakdown from Cost Curves and Decision Rights to Moats
Author: Dali-Shi
Preface: The Demo Is Gorgeous, the Bill Is Deadly
The story usually goes like this:
The team builds a jaw‑dropping demo on GPT‑5.
The CEO and the board see it and applaud: “Perfect, let’s turn this into a product!”
One year later, the project is quietly shut down. Nobody mentions it again.
The problem was never that “the AI isn’t smart enough.”
The real problem is:
Taking the way you build demos and deploying it at business scale
is a form of financial self‑destruction.
This is especially true for the pattern where everything is sent to one giant LLM —
the so‑called pure LLM architecture.
From a business and operating standpoint, this architecture is almost guaranteed to fail.
I. Two Very Different Companies:
How You Use AI = How You Run the Company
Let’s temporarily forget model names and just look at organizational design.
Company A: Pure LLM Architecture —
“The Consultant Answers Every Call”
This company relies on one super‑smart, super‑expensive consultant (the LLM):
- Customer asks: “Has my order shipped yet?” → Ask the consultant
- Employee asks: “Should we approve this refund?” → Ask the consultant
- Lawyer asks: “How should we revise this clause?” → Still ask the consultant
The consultant is brilliant and knows almost everything. But there are problems:
- Every time you ask, it’s expensive (tokens)
- Every question incurs latency
- All decision‑making is bottlenecked at a single point (rate limits = organizational traffic jams)
You can summarize a pure LLM architecture in one sentence:
“We’ve decided to use the most expensive brain to handle all the trivial work.”
Company B: LLM + SLM + SOP —
A Normal Human Organization
This company also uses AI, but the “org chart” looks like this:
LLM = CEO / Senior Advisor
- Handles a small number of high‑risk, highly uncertain decisions
- Needs to aggregate diverse information and balance risk before deciding
SLM (Small Language Model) = Managers / Specialists
- Handles 80% of routine work:
- Queries and lookups
- Classification and intent detection
- Format transformation, data cleaning
- Can run on‑premise, on the edge, or on cheap compute
- Cost is stable, latency is in milliseconds
SOP / Rule‑based Logic = Standard Operations
- Spending limits, anomaly detection, compliance checks
- No “thinking” required — just “do exactly this”
This company does two seemingly boring but critical things:
- Designs clear decision boundaries — who decides what.
- Defines which tasks should never require thinking — they’re hard‑coded into the system.
Translated back into management language:
- Company A is playing with centralized, command‑style AI.
- Company B is building a delegated, layered AI organization.
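Company B's two "boring but critical" design moves can be sketched as a tiny dispatch layer: SOP rules first, an SLM for routine work, and the LLM only as a last resort. This is a minimal illustrative sketch, not the author's actual system; the rule table, risk field, and threshold are all assumptions:

```python
# A minimal sketch of the Company-B dispatch logic: SOP rules decide first,
# the SLM handles routine low-risk work, and the LLM is the last resort.
# The rule table and the "risk" field are illustrative assumptions.

RULES = {
    # Spending limit as a hard-coded SOP: auto-approve refunds up to 1,000.
    "refund": lambda req: req["amount"] <= 1000,
}

def dispatch(request: dict) -> str:
    """Return which layer handles this request: 'sop', 'slm', or 'llm'."""
    task = request["task"]
    # Layer 1: SOP / rule-based. No thinking required, just follow the rule;
    # if the rule rejects the case, escalate to the expensive brain.
    if task in RULES:
        return "sop" if RULES[task](request) else "llm"
    # Layer 2: SLM for routine, low-risk work (lookup, classification, formatting).
    if request.get("risk", "low") == "low":
        return "slm"
    # Layer 3: LLM only for uncertain or high-risk cases that need synthesis.
    return "llm"
```

For example, a small refund resolves at the SOP layer, a routine order-status lookup goes to the SLM, and only a high-risk contract question reaches the LLM.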
II. Why Pure LLM Architectures Often Go
“Demo Success → Production Failure”
Here are three of the most common ways these systems die inside enterprises.
1. Cost-Curve Death:
Great Margins in the Demo, Bleeding Cash in Production
During the PoC:
- ~1,000 requests per day
- In meetings, everyone only looks at:
- “The answers are amazing”
- “The tone feels so natural”
After going live:
- ~1,000,000 requests per day
- The CFO looks at something else entirely:
- “Why did our AI bill this month equal the cost of running an entire factory?”
Pure LLM architectures have a fatal financial property:
Traffic scales linearly, but token cost often scales super‑linearly.
Because:
- To improve accuracy, prompts get longer
- To personalize, you stuff in more context
- To do multi‑step reasoning, you call the model multiple times
You eventually end up here:
“The more customers we have, the worse our gross margin gets.”
Commercially, this simply doesn’t hold.
2. Black-Box Death:
All Logic Buried in Prompts, Nobody Dares to Touch It
Many pure‑LLM projects look like this:
- One enormous system prompt
- All rules, workflows, and tone of voice baked into that prompt
- With some luck, it runs surprisingly well in the beginning
But:
- Want to change a workflow? → You edit the prompt
- Want to A/B test? → You clone the prompt
- Change the wrong line? → System behavior changes unpredictably
Technically, this is high coupling.
Organizationally, it becomes:
“We have a magic AI box that nobody dares to touch.”
Black boxes are great for demos.
They’re terrible as long‑term operational infrastructure.
3. Organizational Death:
The AI Team Turns into Everyone’s “Service Counter”
When all capabilities are concentrated in “that one big model”, the org drifts into this pattern:
- Every department lines up in front of the AI team for changes
- Every process adjustment requires “technical support”
- Product and operations teams lose autonomy over their own workflows
Soon you reach a bizarre conclusion:
“AI was supposed to help us move faster,
but because of AI, the company is actually moving slower.”
This has nothing to do with model intelligence.
It’s a problem of architecture and decision‑rights design.
III. Finance & TCO:
You’re Not Choosing a Model, You’re Choosing a Cost Curve
Forget jargon for a moment and ask a single question:
What will this cost curve look like as the business grows?
The Reality of Pure LLM Architectures
Typical characteristics:
- High unit cost per call
- High complexity per task
And also:
- More context → more expensive
- More multi‑step reasoning → more expensive
If you don’t split tasks and add layers, you’ll see:
“Revenue grew 3× this year,
but AI cost grew 10×.”
This is not a technology issue.
It’s a gross margin issue.
The Financial Logic of Hybrid Architectures
You split the work:
SLM handles:
- Intent detection
- Simple Q&A and data lookup
- Formatting, normalization, data cleaning
→ Runs on your own servers, on the edge, or on cheap compute
LLM handles:
- Exceptions
- High‑value conversations
- Multi‑source integration and strategic judgment
The result:
Each new user pushes your marginal cost closer to the SLM cost,
not the LLM cost.
This is why we keep emphasizing:
This is not merely a “model selection problem”.
It is a unit economics and gross margin problem.
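The unit-economics argument above is back-of-the-envelope arithmetic. The per-call prices and the 80/20 task split below are illustrative assumptions, not real vendor quotes; the point is the shape of the curve, not the exact numbers:

```python
# Back-of-the-envelope cost curves: pure-LLM vs hybrid serving.
# Prices and the 80/20 routine/hard split are illustrative assumptions.

LLM_COST_PER_CALL = 0.02   # assumed big-model cost per request (USD)
SLM_COST_PER_CALL = 0.001  # assumed small-model / on-prem cost per request

def pure_llm_cost(requests: int) -> float:
    # Every request, trivial or not, pays the big-model price.
    return requests * LLM_COST_PER_CALL

def hybrid_cost(requests: int, routine_share: float = 0.8) -> float:
    routine = requests * routine_share     # handled by SLM / SOP
    hard = requests * (1 - routine_share)  # escalated to the LLM
    return routine * SLM_COST_PER_CALL + hard * LLM_COST_PER_CALL
```

At one million requests a day, the pure-LLM bill is dominated by trivial traffic, while the hybrid bill sits close to the SLM price times total volume: the marginal cost of each new user approaches the SLM's cost, which is exactly the gross-margin point made above.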
IV. AI Architecture = Power Structure:
You’re Deciding Who Gets to Decide
This is the part most technical documents gloss over,
but it’s exactly what CEOs care most about.
Pure LLM Architecture:
Every Decision Goes to Central Command
- Every decision flows into the same giant brain
- Every workflow and department waits for its response
- Any latency or incident is global and impacts everyone
This is like a company where:
“The chairman personally signs off on everything.”
We already know what these companies look like in the end:
- Slow
- Blurry accountability
- Nobody dares to make decisions
Hybrid Architecture:
Delegation, Layers, and Local Autonomy
In a hybrid design, the power structure looks more like this:
SOP / SLM:
- Authorized to handle 80% of standard, predictable scenarios
- No need to escalate in routine cases
LLM:
- Only invoked for “uncertain”, “high‑risk”, or “requires synthesis” problems
- Its capacity is reserved for what truly matters
You can write this sentence directly on your architecture diagram:
“We’re not just choosing an AI architecture.
We’re designing how this company makes decisions.”
That is the true strategic meaning of Agentic AI.
V. Time & Risk:
What Enterprises Fear Most Is “Not Being Able to Do the Math Three Years Out”
When people design AI architectures, they usually look at today’s:
- Token prices
- Compute availability
But enterprises really care about:
“Will this still be viable three years from now?”
Time Risk of Pure LLM Architectures
Over the next few years, several critical factors are outside your control:
- Model prices → determined by vendors
- API policies → determined by vendors
- Data residency / cross‑border flows → determined by regulators
If all your capabilities are tied to a
cloud LLM + closed API, then your future is:
“Someone else’s roadmap is your destiny.”
Time Advantage of Hybrid Architectures
What you’re actually doing is:
Pushing the uncontrollable parts out to the boundaries.
Concretely:
- Keep sensitive data inside your own systems
- Let SLMs handle high‑frequency, stable workloads
- Wrap all replaceable components behind clear, well‑defined interfaces
This gives you options three years from now:
- Switch models
- Change vendors
- Move to private or hybrid deployments
All without ripping the entire system apart and rebuilding from scratch.
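"Wrap replaceable components behind clear interfaces" can be made concrete with a structural interface. This is a hypothetical sketch of the boundary, not the author's code; the class names and the placeholder responses are assumptions:

```python
# A minimal sketch of "wrap replaceable parts behind a clear interface".
# The Protocol and both implementations are illustrative assumptions;
# any vendor SDK or local SLM can sit behind the same contract.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class LocalSLM:
    """On-prem small model: sensitive data never crosses this boundary."""
    def complete(self, prompt: str) -> str:
        return f"[local-slm] {prompt[:20]}"

class VendorLLM:
    """Cloud model behind the same interface, replaceable by contract."""
    def complete(self, prompt: str) -> str:
        return f"[vendor-llm] {prompt[:20]}"

def answer(model: ChatModel, prompt: str) -> str:
    # The caller depends only on the interface, never on a specific vendor,
    # so switching models or going private is a constructor change, not a rewrite.
    return model.complete(prompt)
```

Swapping vendors three years from now then means swapping the object you pass in, which is the "options without a rewrite" property described above.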
VI. Moats: The Most Valuable Asset Isn’t the Model —
It’s Your Decision Trace
Everyone is chasing:
- “Who uses the latest GPT / Gemini / Claude first?”
- “Whose benchmark scores are a few points higher?”
From a business standpoint, these differences rarely become durable moats, because:
- Competitors can buy the same models tomorrow
- Prompt templates are easily copied and improved upon
What truly creates separation over time is:
- What your data actually looks like
- How you label data and decompose tasks
- How you turn human decisions into machine‑learnable decision traces
And these all live in:
- The SLM layer
- The workflow layer
- The rule / event / log layer
In other words:
Your moat is not “which LLM you use”.
Your moat is how you turn your company into a machine that continuously learns.
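One way to picture the "decision trace" moat is a tiny logging schema for the rule/event/log layer. The field names and the distillation heuristic below are illustrative assumptions, not a prescribed format:

```python
# A minimal sketch of capturing decision traces so that routine LLM
# decisions can later be distilled into the SLM layer.
# Field names and the selection heuristic are illustrative assumptions.
import time
from typing import Optional

def log_decision(traces: list, *, task: str, layer: str, inputs: str,
                 decision: str, human_override: Optional[str] = None) -> None:
    """Append one machine-learnable decision record."""
    traces.append({
        "ts": time.time(),
        "task": task,                # what kind of work this was
        "layer": layer,              # sop / slm / llm: who decided
        "inputs": inputs,            # what the decider saw
        "decision": decision,        # what it chose
        "override": human_override,  # human correction, if any (the gold label)
    })

def distillation_candidates(traces: list) -> list:
    # LLM decisions that humans never corrected are stable patterns:
    # candidates for moving down to the cheaper SLM layer.
    return [t for t in traces if t["layer"] == "llm" and t["override"] is None]
```

The asset here is not any single model call; it is the accumulating log of who decided what, on what inputs, and whether a human agreed, which is precisely what competitors cannot buy tomorrow.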
VII. A Practical Checklist for Executives:
Is This AI Project Worth Continuing?
If you’re a CEO, board member, or BU head, you can focus on just three questions:
Does the system clearly distinguish:
- What is SOP (no thinking required)
- What is for SLM (cheap, fast thinking)
- What truly goes to the LLM (expensive, rare, requiring synthesis)
Can the team calculate:
- The cost for 10,000 tasks
- The cost for 1,000,000 tasks
- What happens to gross margin if revenue grows 10×
In the architecture diagram, is it clearly marked:
- Which data will never leave the company
- Which components can switch vendors without a full rewrite
If the technical team can’t answer these three questions clearly,
you likely need to push your “go‑live” date back by a few months.
VIII. A Concrete Case:
Building an AI Investment Analyst the Wrong Way — Then Fixing It
Let’s ground this in a real system:
an AI “investment analyst” I’ve been building.
Version 1: Pure LLM, Pure Vibe Coding
In the first version (pure vibe coding):
- Writing investment strategies
- Running backtests
- Analyzing individual Taiwan stocks
all shared the same pattern:
- A lot of the data was repeated
- Input tokens were often huge: 8k, 10k, even 20k
If you just dump all of this into one LLM:
- A single investment analysis could take 1–3 minutes.
When I asked the AI why it took so long, its answer boiled down to:
"Throw those tens of thousands of characters at a human analyst
and see whether they can produce a solid report in 30 minutes."
Fair enough, but completely unacceptable for a production system.
Version 2: From Monolithic LLM to Agentic, Layered Architecture
After some architecture discussions with GPT itself, I refactored the system into a ReAct‑style agent with a hybrid design:
- Built a custom LLM router
- Introduced multiple models (LLM + SLM)
- Created atomic tools (atomics) for the AI to compose
- Defined bullet‑style system prompts to standardize patterns
- Added input caching and simple numerical calculations via SLM
- Let SLMs generate structured technical reports by calling tools and templates
Under this design, the large LLM has exactly one job:
Make the hard calls —
- determine entry/exit points
- infer the main players’ intentions
- detect accumulation, shakeouts, markups, and distribution patterns
Everything else is delegated downward.
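The router decision visible in the logs (question type, model tier, matched keywords) can be reconstructed in simplified form. The keyword table and tier names below are assumptions for illustration; the real router's vocabulary belongs to the author's system:

```python
# Simplified reconstruction of a keyword-based LLM router, shaped like the
# decision record in the system logs. The keyword table is an assumption.
TIER_KEYWORDS = {
    "mini": ["trend", "entry", "suitable"],              # routine trend questions
    "full": ["shakeout", "distribution", "main force"],  # hard judgment calls
}

def route(question: str) -> dict:
    matched = {tier: [kw for kw in kws if kw in question]
               for tier, kws in TIER_KEYWORDS.items()}
    # Escalate only when a hard-call keyword appears; default to the cheap tier.
    tier = "full" if matched["full"] else "mini"
    model = "gpt-5" if tier == "full" else "gpt-5-mini"
    return {"question": question, "model_tier": tier,
            "model_name": model, "matched_keywords": matched[tier]}
```

A routine "how is the recent trend, is it a good entry?" stays on the mini tier, while a "is this a shakeout before distribution?" question escalates, which is the "gpt-5-nano can do it, don't send it to gpt-5" rule in code form.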
After putting the system together:
- Response time dropped to under 30 seconds
- Token usage became far more efficient
- Anything GPT‑5‑mini / nano can do is never escalated to full GPT‑5
This resulted in huge improvements in both latency and cost.
Concrete Token Numbers: From 15,000 Characters to Under 1,000
Take this example:
- Original technical data length: 7,787 characters
- Actual text sent to the LLM: 983 characters
- Final answer length: 2,273 characters
Here's what the logs (translated to English) look like after refactoring:

INFO:llm_investment_advisor:📝 Raw data summary length: 7787 characters
INFO:llm_investment_advisor:✅ Technical analysis report ready (length: 983 characters)
INFO:llm_investment_advisor:✅ Report will be included in the prompt for the LLM
INFO:llm_investment_advisor:🔀 LLM Router decision: {'question': 'How is the recent trend? Is it a good time to enter?', 'question_type': 'individual stock trend', 'model_tier': 'mini', 'model_name': 'gpt-5-mini', 'matched_keywords': ['trend', 'suitable', 'entry']}
INFO:llm_investment_advisor:🔀 Router selected model: gpt-5-mini
INFO:llm_investment_advisor:🔫 Preparing DYNAMIC_CONTEXT (dynamic bullet) mode...
INFO:industry_searcher:[SUCCESS] Industry for stock 2382: Computer & Peripheral Equipment
INFO:industry_searcher:Search for industry 'Computer & Peripheral Equipment' found 10 stocks
INFO:llm_investment_advisor:✅ Retrieved 5 peer companies
INFO:llm_investment_advisor:📨 DYNAMIC_CONTEXT ready (length: 937 characters)
INFO:llm_investment_advisor:📨 Total message length: 6051 characters (user portion: 937)
INFO:llm_investment_advisor:🤖 Model in use: gpt-5-mini
INFO:llm_investment_advisor:📥 Answer length: 2273 characters, finish reason: stop

Without any preprocessing, the prompt would instead carry everything at once:

7,787 (raw technical report) + 6,051 (messages) + 937 (user content) ≈ 15,000 characters sent to the LLM per request
Now imagine this system handling 10,000 Q&A per day:
- Input volume alone would be around 150 million characters daily.
By enforcing division of labor:
- The technical report is generated and summarized by an SLM
- Only the compressed report goes into the LLM prompt
- The output is also kept concise at ~2,273 characters
And remember: output tokens are often more expensive than input tokens,
so keeping answers concise is crucial.
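The scaling arithmetic behind the 150-million figure is worth spelling out once, using the character counts from the logs above:

```python
# The arithmetic behind the daily-volume claim, in characters (from the logs).
raw_prompt = 7787 + 6051 + 937   # unpreprocessed prompt: ~15k characters
compressed_prompt = 6051         # after SLM pre-summarization of the report
daily_qna = 10_000

daily_input_raw = raw_prompt * daily_qna          # ~150 million characters/day
daily_input_hybrid = compressed_prompt * daily_qna
savings = 1 - daily_input_hybrid / daily_input_raw  # fraction of input avoided
```

Even before counting the (pricier) output tokens, pre-summarization removes roughly 60% of the daily input volume.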
This is what it means to turn “Agentic AI” from a slide into a working, sustainable system.
Conclusion: The Smartest Companies Don’t Max Out AI —
They Use It with Extreme Discipline
The more real systems I build, the more convinced I am:
Mature systems don’t optimize for “maximum intelligence”.
They optimize for “minimal necessary intelligence”.
- If something doesn’t need thinking, write it into rules
- If a small model can handle it, never send it to a large model
- If a workflow can be automated, don’t “re‑reason” from scratch every time
SLMs are not backups for LLMs.
They are the profit engine that determines whether your AI system can survive as a business.
A pure LLM architecture is an expensive, fragile toy.
A hybrid LLM + SLM + SOP architecture is a machine you can confidently
integrate into your financials and build a moat around.
License
This work is created by 大力士的AI天地創造
and released under the Creative Commons Attribution–NonCommercial–NoDerivatives 4.0 International (CC BY‑NC‑ND 4.0) license.
