NVIDIA 黃仁勳都在推的小語言模型(SLM)是什麼?為什麼要推 SLM?(What's an SLM? Why SLM?)

Agentic AI 的經營選擇

為什麼「純 LLM 架構」是商業上的慢性自殺?

——從成本曲線、決策權力到企業護城河的完整拆解




作者:大力士


前言:Demo 很香,帳單很毒

劇本:

  • 團隊用 GPT-5 做了一個超炫的 Demo

  • 老闆、董事會看完,拍手:「就照這個做成產品吧!」

  • 一年後,專案悄悄收掉,沒人提起

問題從來不是「AI 不夠聰明」,而是:

把 Demo 的玩法,直接套進商業規模,是一種財務自殺。

尤其是那種「所有事情都丟給一個超大 LLM」的純 LLM 架構——
從經營角度來看,這幾乎註定會死。


一、兩種完全不同的公司:用 AI 的方式 = 管公司的方式

我們先把模型名字全部忘掉,只看「組織」。

公司 A:純 LLM 架構 —— 顧問接電話公司

這家公司有一個超強、超貴的顧問(LLM):

  • 客戶問:「訂單出貨了沒?」 → 找顧問

  • 員工問:「這筆退款要不要同意?」 → 找顧問

  • 律師問:「這段條款怎麼改?」 → 還是找顧問

顧問非常聰明,什麼都會,但有幾個問題:

  1. 每次問他都很貴(token)

  2. 每個問題都要想一下(延遲)

  3. 所有決策權集中在一個點(rate limit = 公司塞車)

你可以把純 LLM 架構翻譯成一句話:

「我們決定用最貴的大腦,做所有瑣碎的事。」


公司 B:LLM + SLM + SOP —— 正常人類公司

這家公司也是用 AI,但組織長這樣:

  • LLM = CEO / 顧問

    • 處理少數高風險、高不確定性問題

    • 決策前要整合多方資訊、平衡風險

  • SLM = 經理 / 專員

    • 做 80% 的日常工作:查詢、分類、意圖判斷、格式轉換

    • 可以本地部署、成本固定、毫秒級回應

  • SOP / Rule-based = 作業標準

    • 例如金額上限、異常偵測、合規檢查

    • 不需要「思考」,只要「照做」

這家公司做了兩件看起來很無聊、但極度關鍵的事:

  1. 把「誰來決定什麼」設計清楚

  2. 把「什麼事情根本不該思考」寫死在系統裡

從技術名詞換回經營語言:

公司 A 在玩「中央集權式 AI」。
公司 B 在打造「授權式、分層式 AI 組織」。


二、為什麼純 LLM 架構常常是「Demo 成功、上線暴斃」?

來看三種企業裡最常看到的死亡方式。

1. 帳單型死亡:Demo 版毛利率很好,正式版直接賠錢

PoC 階段:

  • 一天 1,000 次請求

  • 開會時大家只看:「回答好厲害」、「語氣很自然」

上線後:

  • 一天 1,000,000 次請求

  • CFO 看的是:「這個月 AI 帳單多了一台工廠的成本」

純 LLM 架構有一個致命特性:

業務量成長是線性的,但 token 成本常常是「超線性」成長的。

  • 為了提升準確率,prompt 變長

  • 為了個人化,要塞更多 context

  • 為了多輪推理,要多次呼叫模型

最後變成:

「客戶數越多,毛利率越差。」

這在商業上是完全站不住腳的。


2. 黑盒型死亡:所有邏輯塞在 prompt 裡,誰都不敢改

很多純 LLM 專案,長這樣:

  • 建一個超長的 system prompt

  • 把規則、流程、口吻全部寫在裡面

  • 運氣好時,真的可以跑起來

但:

  • 想改一個流程,要改 prompt

  • 想試 A/B test,要複製 prompt

  • 一不小心改錯一行,整個系統行為變掉

技術上這叫高耦合,組織上則叫:

「我們有一個誰都不敢動的 AI 黑盒。」

黑盒只適合 Demo,不適合長期經營。


3. 組織型死亡:AI 團隊變成所有人的「服務櫃檯」

當所有能力都集中在「那個大模型」時,組織會變成這樣:

  • 每個部門的需求都要排隊給 AI 團隊

  • 每個流程調整都要「技術支援」

  • 產品部、營運部完全失去自主調整能力

你很快就會得到一個奇妙的結論:

「AI 是來幫忙的,但公司運作因為 AI,反而更慢。」

這跟模型多聰明完全無關,是架構與權責分配的問題。


三、財務與 TCO:你不是在選模型,是在選成本曲線

我們把術語暫時收掉,只看一件事:這條成本線以後會長成什麼形狀?

純 LLM 架構的現實

特徵基本上是:

  • 高固定單價 + 高複雜度成本

  • 使用越多 context,越貴

  • 做越多多輪推理,越貴

如果你不切分任務、不做分層,你會看到這畫面:

「今年營收成長 3 倍,結果 AI 成本成長 10 倍。」



混合架構的財務邏輯

把任務切開:

  • SLM 負責:

    • 意圖判斷

    • 簡單問答 / 查詢

    • 格式轉換、資料清洗
      → 可以跑在自家機房、邊緣設備、或便宜算力上

  • LLM 負責:

    • 例外狀況

    • 高價值對話

    • 多來源整合、策略決策

這樣做的結果是:

每多一個用戶,邊際成本接近 SLM 的成本,而不是 LLM 的成本。

這就是為什麼要一再強調:

這不是「模型組合問題」,而是「毛利率問題」。


四、AI 架構 = 公司權力結構:你在決定誰有決策權

這一段,是多數技術文件完全沒談,但對 CEO 反而是最重要的。

純 LLM 架構:所有事情都送去中央

  • 每個決策都送進同一個大腦

  • 每個流程、每個部門都要等它回覆

  • 任何延遲、任何異常都是「全域性」的

這跟一些公司很像:

「什麼都要董事長簽。」

你知道那種公司最後長什麼樣:

  • 速度慢

  • 責任模糊

  • 沒人敢決策


混合架構:授權、分層、局部自治

在混合架構裡,權力大致長這樣:

  • SOP / SLM:

    • 有權對 80% 的標準情境直接做決定

    • 不用報告、不用 escalate

  • LLM:

    • 只在「不確定」、「高風險」、「需要綜合判斷」時被召喚

    • 反而能把精力放在真正重要的地方

你可以把這一句話寫在架構圖旁邊:

「我們不是在選 AI 架構,我們在設計公司怎麼做決定。」

這才是 Agentic AI 真正的戰略意義。


五、時間與風險:企業最怕的是「三年後算不出來」

很多人做 AI 架構時,只看「現在的單價」和「現在的算力」,
但企業真正關心的是:這個東西三年後還玩得下去嗎?

純 LLM 架構的時間風險

未來幾年,有幾件事是你控制不了的:

  • 模型單價怎麼變 → 供應商決定

  • API 政策怎麼改 → 供應商決定

  • 敏感資料能不能出境 → 法規決定

如果你所有能力全綁在「雲端大模型 + 封閉 API」上,
那未來幾乎等同於:

「別人的 Roadmap 就是你的命運。」


混合架構的時間優勢

你做的事情其實只有一件:

把不可控的東西,擋在邊界之外。

  • 把敏感資料留在自己系統

  • 把穩定、高頻任務交給 SLM

  • 把可替換的部分,用清楚的介面包起來

這讓你三年後還有選項:

  • 換模型

  • 換供應商

  • 做私有化

而不需要把系統整個拆掉重寫。


六、護城河:最貴的不是模型,是「你自己的決策軌跡」

大家都在追:

  • 誰先用上最新的 GPT / Gemini / Claude

  • 誰的模型分數高幾個百分点

但從商業角度來看,這些差距幾乎不構成長期護城河,因為:

  • 競爭對手隔天也可以買同一個模型

  • Prompt 範本很快就被複製、學走

真正拉開距離的,是這些東西:

  • 你手上的資料長什麼樣

  • 你怎麼標記、怎麼拆解任務

  • 你怎麼把人類決策,轉成可以學習的軌跡

而這些,都發生在:

  • SLM 層

  • Workflow 層

  • Rule / Event / Log 層

換句話說:

護城河不在「你用了哪一家 LLM」,而在「你怎麼把公司變成一台可以持續學習的機器」。


七、給經營者的實戰檢查表:這個 AI 專案值不值得做下去?

如果你是 CEO / 董事 / BU Head,可以只看這幾個問題:

  1. 這個系統裡,有沒有明確區分:

    • 什麼是 SOP(不用思考)

    • 什麼是 SLM(便宜快速思考)

    • 什麼才丟給 LLM(昂貴、少量、需要綜合判斷)

  2. 團隊能不能算出來:

    • 一萬次任務的成本

    • 一百萬次任務的成本

    • 如果營收成長 10 倍,毛利率會怎麼變

  3. 架構圖裡,有沒有清楚標示:

    • 哪些資料永遠不會離開公司

    • 哪些東西可以換供應商不重寫

如果這三題問下去,技術團隊講不清楚,那你大概可以先把「上線時間」往後移幾個月。


結語:最聰明的公司,不是把 AI 用到滿,而是用得很克制

我越做越確定一件事:

成熟的系統,追求的不是「智慧最大化」,而是「智慧使用的最小化」。

  • 能不要想的,就寫死在規則裡

  • 能讓小模型處理的,就不要丟給大模型

  • 能讓流程自動跑的,就不要每次重新「推理」一次

SLM 不是 LLM 的備胎,
它是決定整個商業系統能不能活下去的獲利引擎。

純 LLM 架構,是昂貴且脆弱的玩具。
LLM + SLM + SOP 的混合架構,才是可以放進財報裡的商業機器。

在我目前正在開發的「AI 分析師」裡頭,第一版(Pure Vibe Coding)不管是寫投資策略、做回測分析,或是做台股分析,你會發現有許多資料是重複的,而且 input token 常常是巨量的:8k、10k、甚至 20k。如果把這些全部丟給 LLM,一個回答常常要花 1-3 分鐘才能完成一份投資分析。

你問 AI 為什麼要這麼久,AI 回你:「你把這幾萬字丟給人類分析師看看,看他 30 分鐘能不能寫出一份投資報告。」講得似乎也有點道理……

不過,當然不能這樣做。後來又跟 GPT 討論這個架構,開始改架構:改成 ReAct Agent、自建 router、運用不同的 AI(LLM+SLM)、給 AI 建 atomics(原子工具)讓 AI 有不同工具可以用、建 bullet system prompt 把 prompt 定型化,這樣可以做 input cache;簡單的數據計算,讓 SLM 引用工具、套用模板來做就好。

LLM 唯一要做的事,就是去判斷更難的問題:進出點、主力動向,以及有沒有洗盤、拉價、出貨這些動作。

一套弄下來,回答的時間總算降到 30 秒內,token 的使用也更有效率。GPT-5 nano 可以做的,就不需要交給 GPT-5 去做,對時間、成本都能得到很大的優化。


可以看下面的例子:

  • 資料原始長度:7,787 字元

  • 實際丟給 LLM 的長度:983 字元

  • output:2,273 字元

INFO:llm_investment_advisor:📨 Messages 總長度: 6051 字元(user 部分: 937)
INFO:llm_investment_advisor:🤖 使用模型: gpt-5-mini
--
INFO:llm_investment_advisor:📝 資料摘要長度: 7787 字元
INFO:llm_investment_advisor:✅ 技術分析報告已準備就緒(長度: 983 字元)
INFO:llm_investment_advisor:✅ 報告將包含在 prompt 中供 LLM 使用
INFO:llm_investment_advisor:🔀 LLM Router 決策: {'question': '最近的走勢如何,適合進場嗎?', 'question_type': '個股趨勢', 'model_tier': 'mini', 'model_name': 'gpt-5-mini', 'matched_keywords': ['走勢', '適合', '進場']}
INFO:llm_investment_advisor:🔀 Router 選擇模型: gpt-5-mini
INFO:llm_investment_advisor:🔫 準備使用 DYNAMIC_CONTEXT(動態子彈)模式...
INFO:industry_searcher:[SUCCESS] 股票 2382 的產業: 電腦及週邊設備業
INFO:industry_searcher:搜尋產業 '電腦及週邊設備業' 找到 10 檔股票
INFO:llm_investment_advisor:✅ 取得 5 家同業公司
INFO:llm_investment_advisor:📨 DYNAMIC_CONTEXT 準備完成(長度: 937 字元)
INFO:llm_investment_advisor:📨 Messages 總長度: 6051 字元(user 部分: 937)
INFO:llm_investment_advisor:🤖 使用模型: gpt-5-mini
INFO:llm_investment_advisor:📥 回答長度: 2273 字元,完成原因: stop


如果都不做預處理,這份 system prompt 就會變成 7787+6051+937,將近 15,000 個字元。

如果這個系統一天有 10,000 個問答,光 input 就要 1.5 億個字元了。

而透過分工,技術報告交給 SLM 去處理並彙整,input 變少了,而且 output 也只剩下 2,273 個字元(output token 貴很多,更要注意!)

========

本著作由大力士的AI天地創造 製作,以創用CC 姓名標示–非商業性– 禁止改作 4.0 國際授權條款釋出

Strategic Choices for Agentic AI

Why a “Pure LLM Architecture” Is Slow-Motion Commercial Suicide

— A Full Breakdown from Cost Curves and Decision Rights to Moats

Author: Dali-Shi


Preface: The Demo Is Gorgeous, the Bill Is Deadly

The story usually goes like this:

The team builds a jaw‑dropping demo on GPT‑5.
The CEO and the board see it and applaud: “Perfect, let’s turn this into a product!”

One year later, the project is quietly shut down. Nobody mentions it again.

The problem was never that “the AI isn’t smart enough.”
The real problem is:

Taking the way you build demos and deploying it at business scale
is a form of financial self‑destruction.

This is especially true for the pattern where everything is sent to one giant LLM, the so‑called pure LLM architecture.

From a business and operating standpoint, this architecture is almost guaranteed to fail.


I. Two Very Different Companies:

How You Use AI = How You Run the Company

Let’s temporarily forget model names and just look at organizational design.

Company A: Pure LLM Architecture —

“The Consultant Answers Every Call”

This company relies on one super‑smart, super‑expensive consultant (the LLM):

  • Customer asks: “Has my order shipped yet?” → Ask the consultant
  • Employee asks: “Should we approve this refund?” → Ask the consultant
  • Lawyer asks: “How should we revise this clause?” → Still ask the consultant

The consultant is brilliant and knows almost everything. But there are problems:

  • Every time you ask, it’s expensive (tokens)
  • Every question incurs latency
  • All decision‑making is bottlenecked at a single point (rate limits = organizational traffic jams)

You can summarize a pure LLM architecture in one sentence:

“We’ve decided to use the most expensive brain to handle all the trivial work.”


Company B: LLM + SLM + SOP —

A Normal Human Organization

This company also uses AI, but the “org chart” looks like this:

  • LLM = CEO / Senior Advisor

    • Handles a small number of high‑risk, highly uncertain decisions
    • Needs to aggregate diverse information and balance risk before deciding
  • SLM (Small Language Model) = Managers / Specialists

    • Handles 80% of routine work:
      • Queries and lookups
      • Classification and intent detection
      • Format transformation, data cleaning
    • Can run on‑premise, on the edge, or on cheap compute
    • Cost is stable, latency is in milliseconds
  • SOP / Rule‑based Logic = Standard Operations

    • Spending limits, anomaly detection, compliance checks
    • No “thinking” required — just “do exactly this”

This company does two seemingly boring but critical things:

  1. Designs clear decision boundaries — who decides what.
  2. Defines which tasks should never require thinking — they’re hard‑coded into the system.

Translated back into management language:

  • Company A is playing with centralized, command‑style AI.
  • Company B is building a delegated, layered AI organization.

II. Why Pure LLM Architectures Often Go

“Demo Success → Production Failure”

Here are three of the most common ways these systems die inside enterprises.

1. Cost-Curve Death:

Great Margins in the Demo, Bleeding Cash in Production

During the PoC:

  • ~1,000 requests per day
  • In meetings, everyone only looks at:
    • “The answers are amazing”
    • “The tone feels so natural”

After going live:

  • ~1,000,000 requests per day
  • The CFO looks at something else entirely:
    • “Why did our AI bill this month equal the cost of running an entire factory?”

Pure LLM architectures have a fatal financial property:

Traffic scales linearly, but token cost often scales super‑linearly.

Because:

  • To improve accuracy, prompts get longer
  • To personalize, you stuff in more context
  • To do multi‑step reasoning, you call the model multiple times

You eventually end up here:

“The more customers we have, the worse our gross margin gets.”

Commercially, this simply doesn’t hold.


2. Black-Box Death:

All Logic Buried in Prompts, Nobody Dares to Touch It

Many pure‑LLM projects look like this:

  • One enormous system prompt
  • All rules, workflows, and tone of voice baked into that prompt
  • With some luck, it runs surprisingly well in the beginning

But:

  • Want to change a workflow? → You edit the prompt
  • Want to A/B test? → You clone the prompt
  • Change the wrong line? → System behavior changes unpredictably

Technically, this is high coupling.
Organizationally, it becomes:

“We have a magic AI box that nobody dares to touch.”

Black boxes are great for demos.
They’re terrible as long‑term operational infrastructure.


3. Organizational Death:

The AI Team Turns into Everyone’s “Service Counter”

When all capabilities are concentrated in “that one big model”, the org drifts into this pattern:

  • Every department lines up in front of the AI team for changes
  • Every process adjustment requires “technical support”
  • Product and operations teams lose autonomy over their own workflows

Soon you reach a bizarre conclusion:

“AI was supposed to help us move faster,
but because of AI, the company is actually moving slower.”

This has nothing to do with model intelligence.
It’s a problem of architecture and decision‑rights design.


III. Finance & TCO:

You’re Not Choosing a Model, You’re Choosing a Cost Curve

Forget jargon for a moment and ask a single question:

What will this cost curve look like as the business grows?

The Reality of Pure LLM Architectures

Typical characteristics:

  • High unit cost per call
  • High complexity per task

And also:

  • More context → more expensive
  • More multi‑step reasoning → more expensive

If you don’t split tasks and add layers, you’ll see:

“Revenue grew 3× this year,
but AI cost grew 10×.”

This is not a technology issue.
It’s a gross margin issue.


The Financial Logic of Hybrid Architectures

You split the work:

  • SLM handles:

    • Intent detection
    • Simple Q&A and data lookup
    • Formatting, normalization, data cleaning
      → Runs on your own servers, on the edge, or on cheap compute
  • LLM handles:

    • Exceptions
    • High‑value conversations
    • Multi‑source integration and strategic judgment

The result:

Each new user pushes your marginal cost closer to the SLM cost,
not the LLM cost.

This is why we keep emphasizing:

This is not merely a “model selection problem”.
It is a unit economics and gross margin problem.
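The marginal-cost claim above is easy to check with back-of-the-envelope arithmetic. The per-call prices below are hypothetical placeholders, not vendor quotes; only the shape of the result matters.

```python
# Back-of-the-envelope unit economics for a hybrid architecture.
# Per-call prices are hypothetical placeholders, not real vendor pricing.

SLM_COST_PER_CALL = 0.0005   # e.g. a small model on cheap or owned compute
LLM_COST_PER_CALL = 0.05     # e.g. a frontier-model API call

def blended_cost_per_request(llm_share: float) -> float:
    """Average cost when `llm_share` of traffic escalates to the LLM."""
    return llm_share * LLM_COST_PER_CALL + (1 - llm_share) * SLM_COST_PER_CALL

pure_llm = blended_cost_per_request(1.0)   # 0.05 per request
hybrid   = blended_cost_per_request(0.2)   # 0.0104 per request, roughly 5x cheaper
```

With an 80/20 split, each new user costs close to the SLM rate; the LLM share is what drives the blended number.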


IV. AI Architecture = Power Structure:

You’re Deciding Who Gets to Decide

This is the part most technical documents gloss over,
but it’s exactly what CEOs care most about.

Pure LLM Architecture:

Every Decision Goes to Central Command

  • Every decision flows into the same giant brain
  • Every workflow and department waits for its response
  • Any latency or incident is global and impacts everyone

This is like a company where:

“The chairman personally signs off on everything.”

We already know what these companies look like in the end:

  • Slow
  • Blurry accountability
  • Nobody dares to make decisions

Hybrid Architecture:

Delegation, Layers, and Local Autonomy

In a hybrid design, the power structure looks more like this:

  • SOP / SLM:

    • Authorized to handle 80% of standard, predictable scenarios
    • No need to escalate in routine cases
  • LLM:

    • Only invoked for “uncertain”, “high‑risk”, or “requires synthesis” problems
    • Its capacity is reserved for what truly matters

You can write this sentence directly on your architecture diagram:

“We’re not just choosing an AI architecture.
We’re designing how this company makes decisions.”

That is the true strategic meaning of Agentic AI.


V. Time & Risk:

What Enterprises Fear Most Is “Not Being Able to Do the Math Three Years Out”

When people design AI architectures, they usually look at today’s:

  • Token prices
  • Compute availability

But enterprises really care about:

“Will this still be viable three years from now?”

Time Risk of Pure LLM Architectures

Over the next few years, several critical factors are outside your control:

  • Model prices → determined by vendors
  • API policies → determined by vendors
  • Data residency / cross‑border flows → determined by regulators

If all your capabilities are tied to a
cloud LLM + closed API, then your future is:

“Someone else’s roadmap is your destiny.”


Time Advantage of Hybrid Architectures

What you’re actually doing is:

Pushing the uncontrollable parts out to the boundaries.

Concretely:

  • Keep sensitive data inside your own systems
  • Let SLMs handle high‑frequency, stable workloads
  • Wrap all replaceable components behind clear, well‑defined interfaces

This gives you options three years from now:

  • Switch models
  • Change vendors
  • Move to private or hybrid deployments

All without ripping the entire system apart and rebuilding from scratch.
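"Wrap replaceable components behind clear interfaces" can be sketched in a few lines. The classes below are illustrative stand-ins, not real SDK clients:

```python
# Sketch of hiding replaceable model backends behind one interface, so
# switching vendors or going on-premise is a config change, not a rewrite.
# CloudLLM / LocalSLM are illustrative stand-ins, not real SDK clients.

from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class CloudLLM:
    def complete(self, prompt: str) -> str:
        return f"[cloud-llm] {prompt}"   # a real vendor API call would go here

class LocalSLM:
    def complete(self, prompt: str) -> str:
        return f"[local-slm] {prompt}"   # an on-premise small model would go here

def answer(model: ChatModel, prompt: str) -> str:
    # Business logic depends only on the interface, never on a vendor SDK.
    return model.complete(prompt)
```

Three years later, "switch vendors" means writing one new adapter class, not touching `answer` or anything built on it.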


VI. Moats: The Most Valuable Asset Isn’t the Model —

It’s Your Decision Trace

Everyone is chasing:

  • “Who uses the latest GPT / Gemini / Claude first?”
  • “Whose benchmark scores are a few points higher?”

From a business standpoint, these differences rarely become durable moats, because:

  • Competitors can buy the same models tomorrow
  • Prompt templates are easily copied and improved upon

What truly creates separation over time is:

  • What your data actually looks like
  • How you label data and decompose tasks
  • How you turn human decisions into machine‑learnable decision traces

And these all live in:

  • The SLM layer
  • The workflow layer
  • The rule / event / log layer

In other words:

Your moat is not “which LLM you use”.
Your moat is how you turn your company into a machine that continuously learns.


VII. A Practical Checklist for Executives:

Is This AI Project Worth Continuing?

If you’re a CEO, board member, or BU head, you can focus on just three questions:

  1. Does the system clearly distinguish:

    • What is SOP (no thinking required)
    • What is for SLM (cheap, fast thinking)
    • What truly goes to the LLM (expensive, rare, requiring synthesis)
  2. Can the team calculate:

    • The cost for 10,000 tasks
    • The cost for 1,000,000 tasks
    • What happens to gross margin if revenue grows 10×
  3. In the architecture diagram, is it clearly marked:

    • Which data will never leave the company
    • Which components can switch vendors without a full rewrite

If the technical team can’t answer these three questions clearly,
you likely need to push your “go‑live” date back by a few months.
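Question 2 of the checklist is literally a few lines of arithmetic. A sketch, with all prices, traffic mix, and revenue figures as hypothetical placeholders:

```python
# Sketch of the checklist math: cost at 10k and 1M tasks, and the effect
# on gross margin. All prices, traffic shares, and revenue are hypothetical.

COST_PER_TASK = {"sop": 0.0, "slm": 0.0005, "llm": 0.05}  # placeholder prices
TASK_MIX = {"sop": 0.5, "slm": 0.3, "llm": 0.2}           # share of traffic

def total_cost(n_tasks: int) -> float:
    # Expected AI cost = volume x blended cost per task across the layers
    return n_tasks * sum(TASK_MIX[k] * COST_PER_TASK[k] for k in TASK_MIX)

def gross_margin(revenue: float, n_tasks: int, other_costs: float) -> float:
    return (revenue - other_costs - total_cost(n_tasks)) / revenue

cost_10k = total_cost(10_000)      # cost for 10,000 tasks
cost_1m = total_cost(1_000_000)    # cost for 1,000,000 tasks
margin = gross_margin(revenue=1_000_000, n_tasks=1_000_000, other_costs=400_000)
```

If a team cannot fill in a table like `COST_PER_TASK` and `TASK_MIX` for their own system, they cannot answer the checklist, and that is the warning sign.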


VIII. A Concrete Case:

Building an AI Investment Analyst the Wrong Way — Then Fixing It

Let’s ground this in a real system:
an AI “investment analyst” I’ve been building.

Version 1: Pure LLM, Pure Vibe Coding

In the first version (pure vibe coding):

  • Writing investment strategies
  • Running backtests
  • Analyzing individual Taiwan stocks

all shared the same pattern:

  • A lot of the data was repeated
  • Input tokens were often huge: 8k, 10k, even 20k

If you just dump all of this into one LLM:

  • A single investment analysis could take 1–3 minutes.

If you asked the AI why it took so long, it was like the model was saying:

“If you throw tens of thousands of characters at a human analyst,
do you really expect a solid report in 30 minutes?”

Not entirely wrong — but completely unacceptable for a production system.


Version 2: From Monolithic LLM to Agentic, Layered Architecture

After some architecture discussions with GPT itself, I refactored the system into a ReAct‑style agent with a hybrid design:

  • Built a custom LLM router
  • Introduced multiple models (LLM + SLM)
  • Created atomic tools (atomics) for the AI to compose
  • Defined bullet‑style system prompts to standardize patterns
  • Added input caching and simple numerical calculations via SLM
  • Let SLMs generate structured technical reports by calling tools and templates

Under this design, the large LLM has exactly one job:

Make the hard calls —

  • determine entry/exit points
  • infer the main players’ intentions
  • detect accumulation, shakeouts, markups, and distribution patterns

Everything else is delegated downward.

After putting the system together:

  • Response time dropped to under 30 seconds
  • Token usage became far more efficient
  • Anything GPT‑5‑mini / nano can do is never escalated to full GPT‑5

This resulted in huge improvements in both latency and cost.
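The "atomic tools" idea above can be sketched as a small registry of deterministic functions. The tool names and formulas here are illustrative examples, not the actual tools in the author's system:

```python
# Sketch of "atomic tools": small deterministic functions the SLM invokes
# instead of reasoning from scratch. Tool names and formulas are illustrative.

TOOLS = {}

def tool(fn):
    """Register a function in the tool registry under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def moving_average(prices: list[float], window: int) -> float:
    # Simple numeric work stays in code; it never reaches the big LLM.
    return sum(prices[-window:]) / window

@tool
def pct_change(old: float, new: float) -> float:
    return (new - old) / old * 100

result = TOOLS["moving_average"]([100, 102, 104, 106], 2)  # 105.0
```

Because each tool is pure and tiny, its output can also be cached and templated, which is where the input-cache savings come from.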


Concrete Token Numbers: From 15,000 Characters to Under 1,000

Take this example:

  • Original technical data length: 7,787 characters
  • Actual text sent to the LLM: 983 characters
  • Final answer length: 2,273 characters

Here’s what the logs look like in English after refactoring:

INFO:llm_investment_advisor:📨 Total message length: 6051 characters (user content: 937)
INFO:llm_investment_advisor:🤖 Selected model: gpt-5-mini
--
INFO:llm_investment_advisor:📝 Raw technical analysis length: 7787 characters
INFO:llm_investment_advisor:✅ Compressed technical report prepared (length: 983 characters)
INFO:llm_investment_advisor:✅ Report will be injected into the prompt for the LLM
INFO:llm_investment_advisor:🔀 LLM Router decision: {"question": "How has the stock been moving recently, and is it a good time to enter?", "question_type": "single_stock_trend", "model_tier": "mini", "model_name": "gpt-5-mini", "matched_keywords": ["trend", "suitable", "entry"]}
INFO:llm_investment_advisor:🔀 Router selected model: gpt-5-mini
INFO:llm_investment_advisor:🔫 Using DYNAMIC_CONTEXT mode...
INFO:industry_searcher:[SUCCESS] Industry for stock 2382: Computer & Peripheral Equipment
INFO:industry_searcher:Found 10 stocks in industry 'Computer & Peripheral Equipment'
INFO:llm_investment_advisor:✅ Selected 5 peer companies for comparison
INFO:llm_investment_advisor:📨 DYNAMIC_CONTEXT assembled (length: 937 characters)
INFO:llm_investment_advisor:📨 Total message length: 6051 characters (user content: 937)
INFO:llm_investment_advisor:🤖 Final answering model: gpt-5-mini
INFO:llm_investment_advisor:📥 Answer length: 2273 characters, finish_reason: stop
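The "LLM Router decision" in the log can be sketched as a simple keyword-based tier selector. The keyword lists, tier names, and default route below are illustrative guesses, not the author's actual router configuration:

```python
# Sketch of a keyword-based router like the one behind the log's
# "LLM Router decision". Keywords and tier mapping are illustrative.

ROUTES = [
    # (keywords, question_type, model_tier, model_name)
    (["trend", "entry", "suitable"], "single_stock_trend", "mini", "gpt-5-mini"),
    (["shakeout", "distribution", "main players"], "smart_money", "full", "gpt-5"),
]

def route(question: str) -> dict:
    q = question.lower()
    for keywords, qtype, tier, name in ROUTES:
        matched = [k for k in keywords if k in q]
        if matched:
            return {"question_type": qtype, "model_tier": tier,
                    "model_name": name, "matched_keywords": matched}
    # Default: cheapest tier when nothing matches
    return {"question_type": "general", "model_tier": "nano",
            "model_name": "gpt-5-nano", "matched_keywords": []}

decision = route("How is the recent trend? Is it a good entry point?")
```

Cheap routing like this is what guarantees the expensive model only sees the questions that deserve it.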

If we did not do any preprocessing, the system prompt would look like this:

  • 7,787 (raw technical report)
  • + 6,051 (messages)
  • + 937 (user content)
  • ≈ 15,000 characters sent to the LLM

Now imagine this system handling 10,000 Q&A per day:

  • Input volume alone would be around 150 million characters daily.

By enforcing division of labor:

  • The technical report is generated and summarized by an SLM
  • Only the compressed report goes into the LLM prompt
  • The output is also kept concise at ~2,273 characters

And remember: output tokens are often more expensive than input tokens,
so keeping answers concise is crucial.

This is what it means to turn “Agentic AI” from a slide into a working, sustainable system.


Conclusion: The Smartest Companies Don’t Max Out AI —

They Use It with Extreme Discipline

The more real systems I build, the more convinced I am:

Mature systems don’t optimize for “maximum intelligence”.
They optimize for “minimal necessary intelligence”.

  • If something doesn’t need thinking, write it into rules
  • If a small model can handle it, never send it to a large model
  • If a workflow can be automated, don’t “re‑reason” from scratch every time

SLMs are not backups for LLMs.
They are the profit engine that determines whether your AI system can survive as a business.

A pure LLM architecture is an expensive, fragile toy.
A hybrid LLM + SLM + SOP architecture is a machine you can confidently
integrate into your financials and build a moat around.


License

This work is created by 大力士的AI天地創造
and released under the Creative Commons Attribution–NonCommercial–NoDerivatives 4.0 International (CC BY‑NC‑ND 4.0) license.


