From Edge Devices to the Data Lakehouse: What One Line of .to_dicts() Taught Me

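Summary: After learning log-granularity control and I/O discipline on the edge side of a Windows desktop application, I moved to the cloud and used .to_dicts() to blow Arrow's contiguous memory apart into 7.7 million Python objects: roughly 30x memory bloat and stop-the-world GC pauses. This post starts from a Webpack hack, moves through DeviceSync's edge logging design under privacy constraints, a low-level look at Python serialization overhead, and a 53.2x vectorization win, and ends at Apache Iceberg's partition evolution and Trino's four layers of pruning. One principle runs through all of it: do not make data take unnecessary trips.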
Disclaimer: This is a personal learning project. All data is synthetic, generated programmatically. No real patient data is involved.

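1. Starting with a Webpack Hack

Back when I was writing Server Actions in Next.js, I spent a lot of time on Webpack hacks. To get an ORM working inside a server component, you had to add a pile of externals to next.config.js and manually keep Node.js modules out of the client bundle. getServerSideProps more or less worked, but you knew it was not "design"; it was surviving in the cracks of the framework.

What finally made me give up was not a particular bug but a realization: a frontend framework should not be responsible for defining the data boundary.

The JSON returned by the API bloats as the UI grows, field names drift between frontend and backend, and nobody knows which field is the source of truth. The root cause of all of these is a missing data contract.

Django REST Framework's Serializer forces you to define each field's type, validation rules, and nested relationships. FastAPI does the same with Pydantic, and faster, because the core of Pydantic v2 is written in Rust.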

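I chose Python not because the language itself is fast, but because its ecosystem connects two worlds: on one end, data contracts defined with Pydantic; on the other, a data pipeline built from Polars, Apache Arrow, and Apache Iceberg. I have not found an equivalent end-to-end path in other language ecosystems.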

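2. Lessons from DeviceSync: When Logs Cannot Leave the User's Computer

DeviceSync is a logging system for a Windows desktop application. The first reality it faces is not performance but privacy.

Logs stay on the edge: a constraint, not a choice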

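This is a Windows application running on the user's computer. Logs are written to the application's data folder under %AppData% (that is, AppData\Roaming): not a database, not the cloud, just local text files.

Why not send them back to the server? The privacy policy does not allow it. The client application may not proactively transmit operation records to a server. At most it can keep a local SQLite database, but that does nothing for support feedback, because support staff cannot see what is on the user's machine.

So when you design the logging system you face a fundamental contradiction: you need enough logs to diagnose problems, but you cannot get hold of those logs.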

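Three layers of design: surviving within the constraint

Since logs can only live on the user's computer, every log entry has a cost: it takes disk space, it consumes I/O, and it can slow down the main program. So the strategy is not "log as much as possible" but "control precisely how much gets logged".

Registry-based Log Level Control: the log granularity (Verbose / Info / Warning / Error) is set through the Windows Registry. This is not only a technical decision; it is a trade-off:

Too fine (Verbose): every operation is logged, log files explode, I/O becomes frequent, and the main program slows down
Too coarse (Error only): no impact day to day, but when something breaks and support asks the user to send the log, there is nothing in it
The balance point: default to Info; when support needs more, they remotely guide the user to switch to Verbose, reproduce the problem, then switch back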

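Size-based Rotation: once a log file exceeds a size threshold it is automatically compressed and archived, and once the number of archives exceeds the retention count the oldest is deleted. The user's disk space is someone else's property; you cannot occupy it without limit.
Time-based Cleanup: logs older than N days are cleaned up automatically. Even if the user never opens the application, the log folder does not grow without bound.

These three designs (Log Level Control, Size Rotation, Time Cleanup) sound basic. But the spirit behind them is this: every write has a cost, and in a resource-constrained environment you must precisely control both the volume and the granularity of I/O.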
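To make the three layers concrete, here is a minimal Python sketch using only the standard library. It is not the DeviceSync implementation: the names `LEVELS`, `build_logger`, `cleanup_old_logs`, and the file name `app.log` are hypothetical, the Registry lookup is replaced by a plain string argument, and archive compression is left out.

```python
import logging
import time
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import List, Optional

# Hypothetical mapping from the configured level name to a Python logging level.
# The original system reads this value from the Windows Registry.
LEVELS = {
    "Verbose": logging.DEBUG,
    "Info": logging.INFO,
    "Warning": logging.WARNING,
    "Error": logging.ERROR,
}


def build_logger(log_dir: Path, level_name: str = "Info",
                 max_bytes: int = 1_000_000, backups: int = 5,
                 name: str = "devicesync") -> logging.Logger:
    """Layers 1 and 2: a level gate plus size-based rotation."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS.get(level_name, logging.INFO))
    logger.propagate = False
    if not logger.handlers:
        handler = RotatingFileHandler(
            log_dir / "app.log", maxBytes=max_bytes,
            backupCount=backups, encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def cleanup_old_logs(log_dir: Path, max_age_days: int,
                     now: Optional[float] = None) -> List[Path]:
    """Layer 3: delete log files last modified before the cutoff."""
    current = now if now is not None else time.time()
    cutoff = current - max_age_days * 86_400
    removed = []
    for path in log_dir.glob("app.log*"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

With the level set to "Info", `logger.debug(...)` calls are dropped before they reach the disk, which is the granularity control described above; `backups` caps how many rotated files are kept.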

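"For every log line, ask: is this record really worth writing? Once written, who can actually see it?" On the edge, this kind of fastidiousness is a condition of survival.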

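From edge to cloud: discipline carried over, and forgotten

The experience taught me one thing: when you do engineering in a resource-constrained environment, you become extremely sensitive to unnecessary I/O.

Strangely, the same engineers forget this discipline once they get to the cloud.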

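"There are plenty of machines; if it is slow, add instances." That sounds reasonable, but in a data pipeline headed for tens of billions of CGM glucose readings, a single unnecessary serialization step can stretch a run from 52 seconds to 46 minutes (the 100K-user projection in section 4). Adding instances does not fix that; it is waste at the architecture level.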

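Resource isolation means the same thing in the cloud as on the edge: control where data flows and at what granularity. On the edge, that means using the Registry to control the log level; in the cloud, it means keeping computation inside the Rust engine and not letting data escape to the Python heap.

The underlying principle never changed: do not make data take unnecessary trips.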

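3. The Performance Trap of Turning Data into Python Objects

This is the most important section of the post. If you remember only one thing, remember this: do not let data leave Arrow's memory format.

Arrow vs Python: the war over memory layout

Apache Arrow uses a columnar memory format. All values of a column are stored in one contiguous block of memory: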

Arrow (glucose_mg_dl column, 966,799 rows):
┌──────────────────────────────────────────────────┐
│ 120.5 | 118.2 | 135.7 | 122.1 | ... | 119.8    │  ← contiguous 8 bytes × N
└──────────────────────────────────────────────────┘
Total: ~7.7 MB (8 bytes × 966,799)

Contiguous memory: one 64-byte CPU cache line holds 8 float64 values
Zero-copy: Polars and Arrow share the same memory, so no conversion is needed
SIMD-friendly: the contiguous layout lets AVX2 process 4 float64 values at a time

Now look at what .to_dicts() does:

Python (after .to_dicts()):
Row 0: {"reading_id": PyUnicode(72B), "user_id": PyUnicode(64B),
        "glucose_mg_dl": PyFloat(24B), "timestamp": PyDateTime(56B), ...}
Row 1: {"reading_id": PyUnicode(72B), ...}
...
× 966,799 rows × 8 columns = 7,734,392 Python objects

The cost of each Python object:

Type	Arrow Size	Python Object Size	Bloat
int64	8 bytes	28 bytes (PyLongObject)	3.5x
float64	8 bytes	24 bytes (PyFloatObject)	3x
string	variable (dict-encoded)	49+ bytes (PyUnicodeObject)	5-10x
datetime	8 bytes	56 bytes (PyDateTime)	7x
dict (per row)	N/A	232 bytes (8-key PyDictObject)	--

Conservative estimate: 966,799 rows × 8 columns → 7.7 million Python objects → roughly 200 MB of heap. In Arrow format the same data takes only about 6.5 MB.

Memory bloat: 30x.
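The effect is easy to observe with the standard library alone. The sketch below is an illustration, not the project's measurement: it uses `array("d")` as a stand-in for one contiguous float64 column, a single-key dict per row instead of the 8-column rows above, and a smaller `N` so it runs quickly. Exact sizes vary by CPython version, so no expected output is shown.

```python
import sys
from array import array

N = 100_000  # smaller than the article's 966,799 rows, to keep the demo quick

# Contiguous buffer: 8 bytes per float64, analogous to one Arrow column.
column = array("d", (100.0 + (i % 50) * 0.5 for i in range(N)))
contiguous_bytes = column.itemsize * len(column)

# Row-oriented Python objects: one dict per row, one float object per value.
rows = [{"glucose_mg_dl": value} for value in column]
object_bytes = sys.getsizeof(rows)  # the list of pointers itself
object_bytes += sum(
    sys.getsizeof(row) + sys.getsizeof(row["glucose_mg_dl"]) for row in rows
)

print(f"contiguous: {contiguous_bytes / 1e6:.1f} MB")
print(f"objects:    {object_bytes / 1e6:.1f} MB")
print(f"bloat:      {object_bytes / contiguous_bytes:.1f}x")
```

Note that `sys.getsizeof` reports shallow sizes only, which is why the dict and its float value are added separately.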

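Garbage Collection: the invisible killer

CPython's memory management has two layers: reference counting plus a generational GC.

7.7 million objects means:

At least 7.7 million Py_INCREF / Py_DECREF operations
The Gen-0 GC threshold has long defaulted to 700; your .to_dicts() call creates about 11,000 times that many objects in one go (only container objects count toward the threshold, and the 966,799 row dicts alone are still roughly 1,400 times over)
The GC has to repeatedly traverse the tracked container objects (the dicts and lists) to detect circular references
During each GC scan, every Python thread is paused (stop-the-world)

.to_dicts() is not just "a bit slower"; it turns your Python process into a GC machine.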
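One way to see this for yourself is to count collector runs while a list of row dicts is being built. The sketch below uses the standard `gc.callbacks` hook; the function name `count_collections` and the two-column row shape are made up for illustration, and the threshold is read from the interpreter rather than assumed, since defaults differ between CPython versions.

```python
import gc


def count_collections(n_rows: int) -> dict:
    """Build n_rows dicts and count how many GC passes ran per generation."""
    runs = {}

    def on_gc(phase, info):
        # The collector invokes callbacks with phase "start" and "stop".
        if phase == "start":
            generation = info["generation"]
            runs[generation] = runs.get(generation, 0) + 1

    was_enabled = gc.isenabled()
    gc.enable()
    gc.callbacks.append(on_gc)
    try:
        # Keep every dict alive so allocations are not offset by deallocations.
        rows = [
            {"reading_id": i, "glucose_mg_dl": float(i % 400)}
            for i in range(n_rows)
        ]
    finally:
        gc.callbacks.remove(on_gc)
        if not was_enabled:
            gc.disable()
    return {"rows": len(rows), "threshold": gc.get_threshold(), "runs": runs}


if __name__ == "__main__":
    print(count_collections(200_000))
```

The number of passes, not their exact split across generations, is the interesting part: every pass is work that a columnar engine never has to do.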

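The for-loop: the row-by-row finishing blow

Even if you put up with the memory bloat from .to_dicts(), the for i in range(len(rows)) that follows is worse:

CPython's bytecode interpreter spends roughly 100 ns per iteration
About 967,000 iterations ≈ 0.1 s of pure loop overhead (doing nothing)
Every rows[i]["timestamp"] is a dict hash lookup (~50 ns)
Every if gap <= 120 is a Python-level comparison (~30 ns), not SIMD
The GIL is held throughout, so everything runs on a single thread

Three layers of waste stack up: serialization (Arrow → Python) + memory bloat (30x) + interpreted execution (no SIMD, no parallelism).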

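4. In Practice: A 53.2x Rebirth

The Glucose-AI-Lakehouse-POC project has a pipeline that imputes missing CGM glucose readings: 500 users, 966,799 raw readings, and imputed rows that must be inserted at gaps according to FDA/ATTD rules.

v1: Python for-loop (13.887s)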

import polars as pl

imputed_rows = []
user_ids = df["user_id"].unique().to_list()    # 500 users
for user_id in user_ids:                       # 500× full table scan
    user_df = df.filter(pl.col("user_id") == user_id)
    rows = user_df.to_dicts()                  # Arrow → Python objects (7.7M in total)
    for i in range(1, len(rows)):              # per-row interpreted loop
        # timedelta → minutes, so the comparison with 120 is well-defined
        gap = (rows[i]["timestamp"] - rows[i-1]["timestamp"]).total_seconds() / 60
        if gap <= 120:
            imputed_rows.append({...})         # Python dict allocation (body elided)

Profiling breakdown:

Step	Time	Share
Read Parquet	0.038s	0.3%
Detect Gaps (vectorized)	0.135s	0.9%
Imputation Loop	13.887s	95.2%
Write Parquet	0.293s	2.0%

95.2% of the time goes to Python object creation plus row-by-row processing. I/O accounts for only 2.3%. This is purely CPU-bound.

v2: Polars Vectorization (0.261s)

Three Polars expressions replace the entire for-loop:
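A breakdown like the one above can be collected with a few lines of standard-library Python. This is a sketch, not the project's profiler: `timed` and `share_table` are hypothetical helper names. The shares in the table appear to be relative to the 14.585 s Quality Layer total rather than to the sum of the four steps, so the helper accepts an explicit `total`.

```python
import time
from contextlib import contextmanager
from typing import Dict, List, Optional, Tuple

timings: Dict[str, float] = {}


@contextmanager
def timed(step: str):
    """Accumulate wall-clock seconds for a named pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[step] = timings.get(step, 0.0) + elapsed


def share_table(step_seconds: Dict[str, float],
                total: Optional[float] = None) -> List[Tuple[str, float, float]]:
    """Return (step, seconds, percent of total) rows."""
    denominator = total if total is not None else sum(step_seconds.values())
    return [
        (step, secs, 100.0 * secs / denominator)
        for step, secs in step_seconds.items()
    ]


# Reproducing the shares from the measured times quoted in this post.
measured = {
    "Read Parquet": 0.038,
    "Detect Gaps (vectorized)": 0.135,
    "Imputation Loop": 13.887,
    "Write Parquet": 0.293,
}
for step, secs, pct in share_table(measured, total=14.585):
    print(f"{step:<26} {secs:>7.3f}s {pct:>5.1f}%")
```

In a real run each step would be wrapped as `with timed("Read Parquet"): ...` and `timings` passed to `share_table`.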

# `gaps` and `linear_interpolation_expr` are defined elsewhere in the pipeline (not shown here)

# 1. shift().over(): fetch the previous reading inside the Rust engine, zero serialization
#    (assumes df is sorted by timestamp within each user_id)
df = df.with_columns([
    pl.col("timestamp").shift(1).over("user_id").alias("_prev_ts"),
    pl.col("glucose_mg_dl").shift(1).over("user_id").alias("_prev_glucose"),
])

# 2. int_ranges().explode(): gap fan-out in Rust, zero Python objects
imputed = gaps.with_columns(
    pl.int_ranges(1, pl.col("_num_missing") + 1).alias("_offset")
).explode("_offset")

# 3. when().then().otherwise(): vectorized conditional, SIMD-friendly
imputed = imputed.with_columns(
    pl.when(pl.col("gap_duration_minutes") <= 30)
      .then(linear_interpolation_expr)
    .when(pl.col("gap_duration_minutes") <= 120)
      .then(pl.col("_prev_glucose"))      # LOCF
    .otherwise(pl.lit(None))               # NULL for >2hr gaps
    .alias("glucose_mg_dl")
)

Results

Metric	v1 (for-loop)	v2 (Vectorized)	Improvement
Imputation Core	13.887s	0.261s	53.2x
Quality Layer	14.585s	0.694s	21.0x
Full Pipeline	14.943s	~1.285s	11.6x
Total Rows	1,079,888	1,079,888	Exact Match
LOCF Count	85,889	85,889	Exact Match
NULL Count	27,200	27,200	Exact Match

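All 6 validation metrics match exactly. Not approximately: an exact match.

The 53.2x is not only about "less serialization". Under the hood, shift().over() runs on Rust parallel iterators and uses multiple cores automatically, whereas the Python for-loop is single-threaded under the GIL. The speedup comes from three stacked dividends: zero serialization + Rust parallelism + SIMD vectorization.

Projection for production at 100K users (scaling the measured times linearly from 500 users): v1 would need about 46 minutes, v2 only about 52 seconds.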
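The projection is plain linear scaling of the measured Imputation Core times from 500 to 100,000 users. The helper name `project_seconds` is mine, and linear scaling is itself an assumption: v1 filters the whole table once per user, so its real cost may well grow faster than this.

```python
def project_seconds(measured_s: float, measured_users: int, target_users: int) -> float:
    """Scale a measured runtime linearly with the number of users."""
    return measured_s * target_users / measured_users


v1_seconds = project_seconds(13.887, 500, 100_000)
v2_seconds = project_seconds(0.261, 500, 100_000)

print(f"v1: {v1_seconds / 60:.1f} min")  # v1: 46.3 min
print(f"v2: {v2_seconds:.1f} s")         # v2: 52.2 s
```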

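5. Iceberg: From Edge I/O to Lakehouse Governance

Carry the spirit of resource isolation from DeviceSync over to the lakehouse: do not read data you do not need.

The problem Apache Iceberg solves is strikingly similar to the one we faced on the edge. In DeviceSync we learned "do not write what is unnecessary, and control the granularity and lifecycle of what you do write". In Iceberg, the idea becomes hidden partitioning.

Partition Pruning: a continuation of I/O optimization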

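31.5 billion CGM readings, spread across 4 countries and 36 months. Without partitioning, every query scans the full 1.41 TB. After partitioning by country + month(timestamp):

Total partitions: 144 (4 × 36), inside the 100-1,000 sweet spot
A query for "TW + January" reads only 1/144 of the partitions, skipping 99.3%
The largest single partition has 327M rows, within the 100M-500M range recommended for Trino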
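The partition arithmetic behind these bullets, spelled out. Only the average partition size can be derived from the totals; the 327M maximum is a property of the data distribution and is not reproduced here.

```python
countries = 4
months = 36
total_rows = 31_500_000_000

partitions = countries * months            # 144
skip_fraction = 1 - 1 / partitions         # share skipped by a one-partition query
avg_rows_per_partition = total_rows / partitions

print(f"partitions: {partitions}")
print(f"skipped by a single-partition query: {skip_fraction:.1%}")
print(f"average rows per partition: {avg_rows_per_partition / 1e6:.2f}M")
```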

Schema & Partition Evolution

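Remember the DRF Serializer? When an API contract changes, you need versioning and backward compatibility.

Iceberg's schema evolution does the same job: adding columns, renaming them, and widening types (for example int to long) are metadata-only changes that do not require rewriting data.

Partition evolution is even more powerful. Changing the partition strategy in traditional Hive means rewriting all 31.5 billion rows (an estimated 2-4 hours of downtime). Iceberg needs only a single DDL statement: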

-- Trino Iceberg connector syntax
ALTER TABLE cgm_readings SET PROPERTIES partitioning = ARRAY[
    'country',
    'day(timestamp)'  -- changed from month(timestamp)
];

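Old data stays where it is, and new data automatically uses the new partition spec. Zero downtime, zero rewrite.

This is not just a feature; it is a governance philosophy: a system should be able to evolve without interrupting service.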

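6. Trino: Distributed Computation Without Moving Data

Trino solves a different problem: compute-storage separation.

PostgreSQL ties data and compute to the same machine. Want it faster? Add CPU, add memory, add replicas. Vertical scaling has a ceiling, and horizontal scaling brings consistency problems.

Trino's approach: data stays in S3, and compute is released as soon as it is done.

Why can a query over 31.5 billion rows return in under a second?

The answer is not "Trino is fast"; that is too vague. The concrete speedup comes from four layers of pruning: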

Layer	Mechanism	Effect
Partition Pruning	只讀 `country=TW, month=2026-01`	Skip 99.3%
File-level Min/Max	Parquet Footer Statistics	Skip ~80%
Row Group Pruning	Row Group 內的 min/max	Skip ~50%
Column Projection	只讀需要的 2/8 columns	Save 75% I/O

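With the four layers stacked, a sufficiently selective query over 31.5 billion rows ends up scanning only about 3,500 rows. (The per-layer ratios in the table are rough; multiplied on their own they would still leave about 22 million rows, so the remaining reduction presumably comes from how selective the query's predicates are.) This is not magic; it is structured laziness: whatever does not have to be read is never read.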

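Consistency from Edge to Lakehouse

DeviceSync (Edge)	Iceberg + Trino (Lakehouse)	Shared principle
Registry Log Level Control	Column Projection (read only the columns you need)	Control granularity: do not produce information you do not need
Size-based Rotation	Partition Pruning (read only the partitions you need)	Control I/O: do not read data you do not need
Time-based Cleanup	Iceberg Snapshot Expiry	Control lifecycle: do not keep expired data

The underlying logic never changed. What changed is the scale: from MB-scale logs in AppData\Roaming on a single machine to a distributed, TB-scale lakehouse.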

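7. Data Contract: FastAPI + Pydantic

The problem is not that Python cannot do the job. The problem is that you made Python do what it should not be doing.

The right design philosophy: Python is the orchestrator, Rust is the executor.

Boundary: Pydantic Validation

At data ingestion and at the API output, use FastAPI + Pydantic to define the data contract: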

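from datetime import datetime
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class CGMReading(BaseModel):
    user_id: str
    timestamp: datetime
    glucose_mg_dl: float = Field(ge=20, le=600)
    country: Literal["TW", "JP", "KR", "SG"]

    model_config = ConfigDict(strict=True)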

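Pydantic v2's validation core is written in Rust (pydantic-core) and is 5-50x faster than v1. It validates once, at the boundary; after that the data enters Arrow format and never needs to become Python objects again.

Inside the pipeline: Arrow all the way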

Layer	Tool	Runtime	Data Format
API Boundary	FastAPI + Pydantic	Rust (core)	Python Object (once only)
Read/Write	Polars read/write_parquet	Rust	Parquet ↔ Arrow (in memory)
Computation	Polars Expressions	Rust	Arrow Columnar
Storage	Parquet on S3	C++ (Arrow)	Columnar + Compressed

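Python's role along this path: define the DAG, call APIs, handle errors. All computation that touches the data happens in Rust.

One prohibition

If you see any of the following lines in a data pipeline, refactor immediately: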

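# Every line below is a performance killer
df.to_dicts()                       # Arrow → list of Python dicts (30x memory bloat)
df.to_pandas()                      # Arrow → NumPy (copy) → pandas
df.iter_rows()                      # per-row iteration = giving up vectorization
for row in df.rows(): ...           # same as above
[row for row in df.iter_rows()]     # same thing as a comprehension (iterating df directly yields columns, not rows)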

The alternative is always the same: express the logic as Polars expressions and let Rust run it.

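8. Closing

From Next.js Webpack hacks to Polars shift().over(), the tech stack changed but the underlying principle did not:

"When data crosses a boundary, it needs an explicit contract. While data is inside the boundary, do not let it leave its native format."

Pydantic guards the contract at the boundary
Polars guards the memory format inside the boundary
Iceberg guards governance at the storage layer
Trino guards resource isolation at the compute layer

The value of a senior data engineer is not knowing that .to_dicts() is slow; anyone can see that from a single benchmark. The value lies in being able to explain why it is slow (serialization + memory bloat + GC + GIL), to predict how large the impact is (at 100K users, 52 seconds becomes 46 minutes), and to give a decision recommendation (replace the whole for-loop with three Polars expressions, for a 53.2x speedup with 6 metrics matching exactly).

"Do not just run the numbers; be able to explain them." That is the most important thing I learned on the way from edge devices to the data lakehouse.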

All figures come from the personal learning project Glucose-AI-Lakehouse-POC, which uses programmatically generated synthetic data.