What Is the SLM (Small Language Model) That Even NVIDIA's Jensen Huang Is Promoting, and Why? (What's SLM? Why SLM?)
Strategic Choices for Agentic AI
Why a “Pure LLM Architecture” Is Slow-Motion Commercial Suicide
— A Full Breakdown from Cost Curves and Decision Rights to Moats
Author: Dali-Shi
Preface: The Demo Is Gorgeous, the Bill Is Deadly
The story usually goes like this:
The team builds a jaw‑dropping demo on GPT‑5.
The CEO and the board see it and applaud: “Perfect, let’s turn this into a product!”
One year later, the project is quietly shut down. Nobody mentions it again.
The problem was never that “the AI isn’t smart enough.”
The real problem is:
Taking the way you build demos and deploying it at business scale
is a form of financial self‑destruction.
This is especially true for the pattern where everything is sent to one giant LLM —
the so‑called pure LLM architecture.
From a business and operating standpoint, this architecture is almost guaranteed to fail.
I. Two Very Different Companies:
How You Use AI = How You Run the Company
Let’s temporarily forget model names and just look at organizational design.
Company A: Pure LLM Architecture —
“The Consultant Answers Every Call”
This company relies on one super‑smart, super‑expensive consultant (the LLM):
- Customer asks: “Has my order shipped yet?” → Ask the consultant
- Employee asks: “Should we approve this refund?” → Ask the consultant
- Lawyer asks: “How should we revise this clause?” → Still ask the consultant
The consultant is brilliant and knows almost everything. But there are problems:
- Every time you ask, it’s expensive (tokens)
- Every question incurs latency
- All decision‑making is bottlenecked at a single point (rate limits = organizational traffic jams)
You can summarize a pure LLM architecture in one sentence:
“We’ve decided to use the most expensive brain to handle all the trivial work.”
Company B: LLM + SLM + SOP —
A Normal Human Organization
This company also uses AI, but the “org chart” looks like this:
LLM = CEO / Senior Advisor
- Handles a small number of high‑risk, highly uncertain decisions
- Needs to aggregate diverse information and balance risk before deciding
SLM (Small Language Model) = Managers / Specialists
- Handles 80% of routine work:
- Queries and lookups
- Classification and intent detection
- Format transformation, data cleaning
- Can run on‑premise, on the edge, or on cheap compute
- Cost is stable, latency is in milliseconds
SOP / Rule‑based Logic = Standard Operations
- Spending limits, anomaly detection, compliance checks
- No “thinking” required — just “do exactly this”
This company does two seemingly boring but critical things:
- Designs clear decision boundaries — who decides what.
- Defines which tasks should never require thinking — they’re hard‑coded into the system.
Translated back into management language:
- Company A is playing with centralized, command‑style AI.
- Company B is building a delegated, layered AI organization.
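Company B's two "boring but critical" design moves can be sketched as a tiny dispatch layer: SOP rules first, an SLM for routine work, and the LLM only as a last resort. This is a minimal illustrative sketch, not the author's actual system; the rule table, risk field, and threshold are all assumptions:

```python
# A minimal sketch of the Company-B dispatch logic: SOP rules decide first,
# the SLM handles routine low-risk work, and the LLM is the last resort.
# The rule table and the "risk" field are illustrative assumptions.

RULES = {
    # Spending limit as a hard-coded SOP: auto-approve refunds up to 1,000.
    "refund": lambda req: req["amount"] <= 1000,
}

def dispatch(request: dict) -> str:
    """Return which layer handles this request: 'sop', 'slm', or 'llm'."""
    task = request["task"]
    # Layer 1: SOP / rule-based. No thinking required, just follow the rule;
    # if the rule rejects the case, escalate to the expensive brain.
    if task in RULES:
        return "sop" if RULES[task](request) else "llm"
    # Layer 2: SLM for routine, low-risk work (lookup, classification, formatting).
    if request.get("risk", "low") == "low":
        return "slm"
    # Layer 3: LLM only for uncertain or high-risk cases that need synthesis.
    return "llm"
```

For example, a small refund resolves at the SOP layer, a routine order-status lookup goes to the SLM, and only a high-risk contract question reaches the LLM.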
II. Why Pure LLM Architectures Often Go
“Demo Success → Production Failure”
Here are three of the most common ways these systems die inside enterprises.
1. Cost-Curve Death:
Great Margins in the Demo, Bleeding Cash in Production
During the PoC:
- ~1,000 requests per day
- In meetings, everyone only looks at:
- “The answers are amazing”
- “The tone feels so natural”
After going live:
- ~1,000,000 requests per day
- The CFO looks at something else entirely:
- “Why did our AI bill this month equal the cost of running an entire factory?”
Pure LLM architectures have a fatal financial property:
Traffic scales linearly, but token cost often scales super‑linearly.
Because:
- To improve accuracy, prompts get longer
- To personalize, you stuff in more context
- To do multi‑step reasoning, you call the model multiple times
You eventually end up here:
“The more customers we have, the worse our gross margin gets.”
Commercially, this simply doesn’t hold.
2. Black-Box Death:
All Logic Buried in Prompts, Nobody Dares to Touch It
Many pure‑LLM projects look like this:
- One enormous system prompt
- All rules, workflows, and tone of voice baked into that prompt
- With some luck, it runs surprisingly well in the beginning
But:
- Want to change a workflow? → You edit the prompt
- Want to A/B test? → You clone the prompt
- Change the wrong line? → System behavior changes unpredictably
Technically, this is high coupling.
Organizationally, it becomes:
“We have a magic AI box that nobody dares to touch.”
Black boxes are great for demos.
They’re terrible as long‑term operational infrastructure.
3. Organizational Death:
The AI Team Turns into Everyone’s “Service Counter”
When all capabilities are concentrated in “that one big model”, the org drifts into this pattern:
- Every department lines up in front of the AI team for changes
- Every process adjustment requires “technical support”
- Product and operations teams lose autonomy over their own workflows
Soon you reach a bizarre conclusion:
“AI was supposed to help us move faster,
but because of AI, the company is actually moving slower.”
This has nothing to do with model intelligence.
It’s a problem of architecture and decision‑rights design.
III. Finance & TCO:
You’re Not Choosing a Model, You’re Choosing a Cost Curve
Forget jargon for a moment and ask a single question:
What will this cost curve look like as the business grows?
The Reality of Pure LLM Architectures
Typical characteristics:
- High unit cost per call
- High complexity per task
And also:
- More context → more expensive
- More multi‑step reasoning → more expensive
If you don’t split tasks and add layers, you’ll see:
“Revenue grew 3× this year,
but AI cost grew 10×.”
This is not a technology issue.
It’s a gross margin issue.
The Financial Logic of Hybrid Architectures
You split the work:
SLM handles:
- Intent detection
- Simple Q&A and data lookup
- Formatting, normalization, data cleaning
→ Runs on your own servers, on the edge, or on cheap compute
LLM handles:
- Exceptions
- High‑value conversations
- Multi‑source integration and strategic judgment
The result:
Each new user pushes your marginal cost closer to the SLM cost,
not the LLM cost.
This is why we keep emphasizing:
This is not merely a “model selection problem”.
It is a unit economics and gross margin problem.
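The unit-economics argument above is back-of-the-envelope arithmetic. The per-call prices and the 80/20 task split below are illustrative assumptions, not real vendor quotes; the point is the shape of the curve, not the exact numbers:

```python
# Back-of-the-envelope cost curves: pure-LLM vs hybrid serving.
# Prices and the 80/20 routine/hard split are illustrative assumptions.

LLM_COST_PER_CALL = 0.02   # assumed big-model cost per request (USD)
SLM_COST_PER_CALL = 0.001  # assumed small-model / on-prem cost per request

def pure_llm_cost(requests: int) -> float:
    # Every request, trivial or not, pays the big-model price.
    return requests * LLM_COST_PER_CALL

def hybrid_cost(requests: int, routine_share: float = 0.8) -> float:
    routine = requests * routine_share     # handled by SLM / SOP
    hard = requests * (1 - routine_share)  # escalated to the LLM
    return routine * SLM_COST_PER_CALL + hard * LLM_COST_PER_CALL
```

At one million requests a day, the pure-LLM bill is dominated by trivial traffic, while the hybrid bill sits close to the SLM price times total volume: the marginal cost of each new user approaches the SLM's cost, which is exactly the gross-margin point made above.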
IV. AI Architecture = Power Structure:
You’re Deciding Who Gets to Decide
This is the part most technical documents gloss over,
but it’s exactly what CEOs care most about.
Pure LLM Architecture:
Every Decision Goes to Central Command
- Every decision flows into the same giant brain
- Every workflow and department waits for its response
- Any latency or incident is global and impacts everyone
This is like a company where:
“The chairman personally signs off on everything.”
We already know what these companies look like in the end:
- Slow
- Blurry accountability
- Nobody dares to make decisions
Hybrid Architecture:
Delegation, Layers, and Local Autonomy
In a hybrid design, the power structure looks more like this:
SOP / SLM:
- Authorized to handle 80% of standard, predictable scenarios
- No need to escalate in routine cases
LLM:
- Only invoked for “uncertain”, “high‑risk”, or “requires synthesis” problems
- Its capacity is reserved for what truly matters
You can write this sentence directly on your architecture diagram:
“We’re not just choosing an AI architecture.
We’re designing how this company makes decisions.”
That is the true strategic meaning of Agentic AI.
V. Time & Risk:
What Enterprises Fear Most Is “Not Being Able to Do the Math Three Years Out”
When people design AI architectures, they usually look at today’s:
- Token prices
- Compute availability
But enterprises really care about:
“Will this still be viable three years from now?”
Time Risk of Pure LLM Architectures
Over the next few years, several critical factors are outside your control:
- Model prices → determined by vendors
- API policies → determined by vendors
- Data residency / cross‑border flows → determined by regulators
If all your capabilities are tied to a
cloud LLM + closed API, then your future is:
“Someone else’s roadmap is your destiny.”
Time Advantage of Hybrid Architectures
What you’re actually doing is:
Pushing the uncontrollable parts out to the boundaries.
Concretely:
- Keep sensitive data inside your own systems
- Let SLMs handle high‑frequency, stable workloads
- Wrap all replaceable components behind clear, well‑defined interfaces
This gives you options three years from now:
- Switch models
- Change vendors
- Move to private or hybrid deployments
All without ripping the entire system apart and rebuilding from scratch.
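"Wrap replaceable components behind clear interfaces" can be made concrete with a structural interface. This is a hypothetical sketch of the boundary, not the author's code; the class names and the placeholder responses are assumptions:

```python
# A minimal sketch of "wrap replaceable parts behind a clear interface".
# The Protocol and both implementations are illustrative assumptions;
# any vendor SDK or local SLM can sit behind the same contract.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class LocalSLM:
    """On-prem small model: sensitive data never crosses this boundary."""
    def complete(self, prompt: str) -> str:
        return f"[local-slm] {prompt[:20]}"

class VendorLLM:
    """Cloud model behind the same interface, replaceable by contract."""
    def complete(self, prompt: str) -> str:
        return f"[vendor-llm] {prompt[:20]}"

def answer(model: ChatModel, prompt: str) -> str:
    # The caller depends only on the interface, never on a specific vendor,
    # so switching models or going private is a constructor change, not a rewrite.
    return model.complete(prompt)
```

Swapping vendors three years from now then means swapping the object you pass in, which is the "options without a rewrite" property described above.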
VI. Moats: The Most Valuable Asset Isn’t the Model —
It’s Your Decision Trace
Everyone is chasing:
- “Who uses the latest GPT / Gemini / Claude first?”
- “Whose benchmark scores are a few points higher?”
From a business standpoint, these differences rarely become durable moats, because:
- Competitors can buy the same models tomorrow
- Prompt templates are easily copied and improved upon
What truly creates separation over time is:
- What your data actually looks like
- How you label data and decompose tasks
- How you turn human decisions into machine‑learnable decision traces
And these all live in:
- The SLM layer
- The workflow layer
- The rule / event / log layer
In other words:
Your moat is not “which LLM you use”.
Your moat is how you turn your company into a machine that continuously learns.
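One way to picture the "decision trace" moat is a tiny logging schema for the rule/event/log layer. The field names and the distillation heuristic below are illustrative assumptions, not a prescribed format:

```python
# A minimal sketch of capturing decision traces so that routine LLM
# decisions can later be distilled into the SLM layer.
# Field names and the selection heuristic are illustrative assumptions.
import time
from typing import Optional

def log_decision(traces: list, *, task: str, layer: str, inputs: str,
                 decision: str, human_override: Optional[str] = None) -> None:
    """Append one machine-learnable decision record."""
    traces.append({
        "ts": time.time(),
        "task": task,                # what kind of work this was
        "layer": layer,              # sop / slm / llm: who decided
        "inputs": inputs,            # what the decider saw
        "decision": decision,        # what it chose
        "override": human_override,  # human correction, if any (the gold label)
    })

def distillation_candidates(traces: list) -> list:
    # LLM decisions that humans never corrected are stable patterns:
    # candidates for moving down to the cheaper SLM layer.
    return [t for t in traces if t["layer"] == "llm" and t["override"] is None]
```

The asset here is not any single model call; it is the accumulating log of who decided what, on what inputs, and whether a human agreed, which is precisely what competitors cannot buy tomorrow.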
VII. A Practical Checklist for Executives:
Is This AI Project Worth Continuing?
If you’re a CEO, board member, or BU head, you can focus on just three questions:
Does the system clearly distinguish:
- What is SOP (no thinking required)
- What is for SLM (cheap, fast thinking)
- What truly goes to the LLM (expensive, rare, requiring synthesis)
Can the team calculate:
- The cost for 10,000 tasks
- The cost for 1,000,000 tasks
- What happens to gross margin if revenue grows 10×
In the architecture diagram, is it clearly marked:
- Which data will never leave the company
- Which components can switch vendors without a full rewrite
If the technical team can’t answer these three questions clearly,
you likely need to push your “go‑live” date back by a few months.
VIII. A Concrete Case:
Building an AI Investment Analyst the Wrong Way — Then Fixing It
Let’s ground this in a real system:
an AI “investment analyst” I’ve been building.
Version 1: Pure LLM, Pure Vibe Coding
In the first version (pure vibe coding):
- Writing investment strategies
- Running backtests
- Analyzing individual Taiwan stocks
all shared the same pattern:
- A lot of the data was repeated
- Input tokens were often huge: 8k, 10k, even 20k
If you just dump all of this into one LLM:
- A single investment analysis could take 1–3 minutes.
When I asked the AI why it took so long, its answer boiled down to:
"Throw those tens of thousands of characters at a human analyst
and see whether they can produce a solid report in 30 minutes."
Fair enough, but completely unacceptable for a production system.
Version 2: From Monolithic LLM to Agentic, Layered Architecture
After some architecture discussions with GPT itself, I refactored the system into a ReAct‑style agent with a hybrid design:
- Built a custom LLM router
- Introduced multiple models (LLM + SLM)
- Created atomic tools (atomics) for the AI to compose
- Defined bullet‑style system prompts to standardize patterns
- Added input caching and simple numerical calculations via SLM
- Let SLMs generate structured technical reports by calling tools and templates
Under this design, the large LLM has exactly one job:
Make the hard calls —
- determine entry/exit points
- infer the main players’ intentions
- detect accumulation, shakeouts, markups, and distribution patterns
Everything else is delegated downward.
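The router decision visible in the logs (question type, model tier, matched keywords) can be reconstructed in simplified form. The keyword table and tier names below are assumptions for illustration; the real router's vocabulary belongs to the author's system:

```python
# Simplified reconstruction of a keyword-based LLM router, shaped like the
# decision record in the system logs. The keyword table is an assumption.
TIER_KEYWORDS = {
    "mini": ["trend", "entry", "suitable"],              # routine trend questions
    "full": ["shakeout", "distribution", "main force"],  # hard judgment calls
}

def route(question: str) -> dict:
    matched = {tier: [kw for kw in kws if kw in question]
               for tier, kws in TIER_KEYWORDS.items()}
    # Escalate only when a hard-call keyword appears; default to the cheap tier.
    tier = "full" if matched["full"] else "mini"
    model = "gpt-5" if tier == "full" else "gpt-5-mini"
    return {"question": question, "model_tier": tier,
            "model_name": model, "matched_keywords": matched[tier]}
```

A routine "how is the recent trend, is it a good entry?" stays on the mini tier, while a "is this a shakeout before distribution?" question escalates, which is the "gpt-5-nano can do it, don't send it to gpt-5" rule in code form.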
After putting the system together:
- Response time dropped to under 30 seconds
- Token usage became far more efficient
- Anything GPT‑5‑mini / nano can do is never escalated to full GPT‑5
This resulted in huge improvements in both latency and cost.
Concrete Token Numbers: From 15,000 Characters to Under 1,000
Take this example:
- Original technical data length: 7,787 characters
- Actual text sent to the LLM: 983 characters
- Final answer length: 2,273 characters
Here's what the logs (translated to English) look like after refactoring:

INFO:llm_investment_advisor:📝 Raw data summary length: 7787 characters
INFO:llm_investment_advisor:✅ Technical analysis report ready (length: 983 characters)
INFO:llm_investment_advisor:✅ Report will be included in the prompt for the LLM
INFO:llm_investment_advisor:🔀 LLM Router decision: {'question': 'How is the recent trend? Is it a good time to enter?', 'question_type': 'individual stock trend', 'model_tier': 'mini', 'model_name': 'gpt-5-mini', 'matched_keywords': ['trend', 'suitable', 'entry']}
INFO:llm_investment_advisor:🔀 Router selected model: gpt-5-mini
INFO:llm_investment_advisor:🔫 Preparing DYNAMIC_CONTEXT (dynamic bullet) mode...
INFO:industry_searcher:[SUCCESS] Industry for stock 2382: Computer & Peripheral Equipment
INFO:industry_searcher:Search for industry 'Computer & Peripheral Equipment' found 10 stocks
INFO:llm_investment_advisor:✅ Retrieved 5 peer companies
INFO:llm_investment_advisor:📨 DYNAMIC_CONTEXT ready (length: 937 characters)
INFO:llm_investment_advisor:📨 Total message length: 6051 characters (user portion: 937)
INFO:llm_investment_advisor:🤖 Model in use: gpt-5-mini
INFO:llm_investment_advisor:📥 Answer length: 2273 characters, finish reason: stop

Without any preprocessing, the prompt would instead carry everything at once:

7,787 (raw technical report) + 6,051 (messages) + 937 (user content) ≈ 15,000 characters sent to the LLM per request
Now imagine this system handling 10,000 Q&A per day:
- Input volume alone would be around 150 million characters daily.
By enforcing division of labor:
- The technical report is generated and summarized by an SLM
- Only the compressed report goes into the LLM prompt
- The output is also kept concise at ~2,273 characters
And remember: output tokens are often more expensive than input tokens,
so keeping answers concise is crucial.
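The scaling arithmetic behind the 150-million figure is worth spelling out once, using the character counts from the logs above:

```python
# The arithmetic behind the daily-volume claim, in characters (from the logs).
raw_prompt = 7787 + 6051 + 937   # unpreprocessed prompt: ~15k characters
compressed_prompt = 6051         # after SLM pre-summarization of the report
daily_qna = 10_000

daily_input_raw = raw_prompt * daily_qna          # ~150 million characters/day
daily_input_hybrid = compressed_prompt * daily_qna
savings = 1 - daily_input_hybrid / daily_input_raw  # fraction of input avoided
```

Even before counting the (pricier) output tokens, pre-summarization removes roughly 60% of the daily input volume.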
This is what it means to turn “Agentic AI” from a slide into a working, sustainable system.
Conclusion: The Smartest Companies Don’t Max Out AI —
They Use It with Extreme Discipline
The more real systems I build, the more convinced I am:
Mature systems don’t optimize for “maximum intelligence”.
They optimize for “minimal necessary intelligence”.
- If something doesn’t need thinking, write it into rules
- If a small model can handle it, never send it to a large model
- If a workflow can be automated, don’t “re‑reason” from scratch every time
SLMs are not backups for LLMs.
They are the profit engine that determines whether your AI system can survive as a business.
A pure LLM architecture is an expensive, fragile toy.
A hybrid LLM + SLM + SOP architecture is a machine you can confidently
integrate into your financials and build a moat around.
License
This work is created by 大力士的AI天地創造
and released under the Creative Commons Attribution–NonCommercial–NoDerivatives 4.0 International (CC BY‑NC‑ND 4.0) license.
