TensorFlow Lite Model Quantization

Sep 14, 2023

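TensorFlow Lite is the lightweight version of TensorFlow, suited to running on all kinds of edge devices, phones included. If you want to deploy a model in a product, TensorFlow Lite is the option with the fewest side effects, and it can be sped up further through quantization. The rest of this post explains how.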

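Introduction

AI has found more and more applications in recent years, but the key question for deploying an AI model quickly is still whether any device can actually run it. Models now routinely have millions of parameters, and an ordinary edge device simply cannot run them. That is why people have come to recognize the importance of model quantization, and why more and more methods for accelerating and slimming down models have appeared. Three are seen most often.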

TensorRT

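A toolkit developed by Nvidia for optimizing models. TensorRT only runs on Nvidia's own GPUs and relies on CUDA to run the optimized model. The edge devices it can be deployed to are mostly on the AGX Xavier platform. A desktop PC works too, of course, but nobody deploys a desktop as an edge device, and it would still need an Nvidia graphics card.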

A detailed beginner's guide to TensorRT (zhuanlan.zhihu.com)

OpenVINO

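A toolkit developed by Intel. It supports hardware acceleration and can run on CPUs with limited compute power. It targets Intel hardware: Nvidia GPUs are not supported, and AMD CPUs are not officially supported.

The model can also be run on a compute stick made by Intel, which gives any device basic AI compute capability.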

Intel® Distribution of OpenVINO™ Toolkit (www.intel.com)

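TensorFlow Lite

Next comes the focus of this post, TensorFlow Lite. A TFLite model is what you get after optimizing a TensorFlow model. It supports hardware acceleration on CPUs, and its biggest selling point is how widely it can be used: it runs on any Android device, is easy to run inside a mobile app, and can also be converted for JavaScript and deployed on a website.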

TensorFlow (www.tensorflow.org)

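The rest of this post covers the various ways to quantize a TensorFlow Lite model.

Model Formats

Before getting into quantization, it helps to sort out the model formats you will run into with TensorFlow.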

.onnx

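ONNX is an open format for machine learning and deep learning models. It lets models from different frameworks (for example TensorFlow, PyTorch, MATLAB, Caffe, Keras) be converted into a single format. Note that an ONNX model has to be converted to TensorFlow first before it can be converted to TensorFlow Lite.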

.pb

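A binary storage format. It can hold only the model architecture, in which case the weights have to be loaded from a separate .ckpt file, or the weights can be stored inside the .pb as well, which is called a frozen pb. The model can also be saved as .pbtxt: the upside is that the contents can be read in a text editor, the downside is a larger file.

saved_model

An API built into TensorFlow, and TensorFlow's standard way of saving a model. TensorFlow puts everything about the model into a given directory, including the .pb (saved_model.pb) and the weight files (variables/variables.index, variables/variables.data-*). A model saved this way is easy to convert into other formats.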

.tflite

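The TensorFlow Lite file format, commonly used on all kinds of edge devices, where it speeds up inference on CPUs. The easiest way to get one is to convert from a saved_model; a frozen pb can be converted as well.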
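The saved_model to .tflite path can be sketched roughly as follows. This is a minimal sketch rather than the author's own code: `TinyModel`, its layer sizes, and the temporary export directory are hypothetical stand-ins for a real trained model and path.

```python
import os
import tempfile

import tensorflow as tf


class TinyModel(tf.Module):
    """Hypothetical stand-in for a trained model: one dense layer with ReLU."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([256, 256]), name="w")

    @tf.function(input_signature=[tf.TensorSpec([1, 256], tf.float32)])
    def __call__(self, x):
        return tf.nn.relu(tf.matmul(x, self.w))


model = TinyModel()

# Export as a SavedModel directory (saved_model.pb plus a variables/ folder).
saved_model_dir = tempfile.mkdtemp()
tf.saved_model.save(
    model, saved_model_dir, signatures=model.__call__.get_concrete_function()
)

# Convert the SavedModel into a float32 .tflite flatbuffer (returned as bytes).
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

# Write the flatbuffer to disk so it can be shipped to a device.
tflite_path = os.path.join(saved_model_dir, "model.tflite")
with open(tflite_path, "wb") as f:
    f.write(tflite_model)
```

With no optimization flags set, the weights stay float32, so the file is roughly as large as the weights themselves.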

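Model Optimization

The official TensorFlow site lists four ways to quantize a model: post-training float16 quantization, post-training dynamic range quantization, post-training integer quantization, and quantization-aware training. Every one of them makes the model file smaller at some cost in accuracy, and each suits different devices.

Post-training here means that the model does not need to be retrained to make it smaller. Once training is finished, optimizations such as lowering the precision are enough to speed it up. The advantage is that a model trained by someone else can be optimized without the training code (a very common situation: you have the model but not the code).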

Model optimization | TensorFlow Lite (www.tensorflow.org)

The methods from the official site are described in detail below.

Post-training float16 quantization

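As the name suggests, after training the tflite weights go from 32-bit to 16-bit precision. The weights become float16 and the model shrinks to about half its size, but when the model actually runs a forward pass on a CPU, the float16 weights are still dequantized to float32 for the computation.

The conversion is simple: converting a tf model to tflite gives a 32-bit tflite model by default, and a few extra lines of converter settings make it output a 16-bit tflite model instead.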
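Those few lines might look like the sketch below, which converts the same SavedModel twice, once as float32 and once as float16, and compares sizes. `TinyModel` and `export_saved_model` are hypothetical helpers standing in for a real trained model.

```python
import tempfile

import tensorflow as tf


class TinyModel(tf.Module):
    """Hypothetical stand-in for a trained model: one dense layer with ReLU."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([256, 256]), name="w")

    @tf.function(input_signature=[tf.TensorSpec([1, 256], tf.float32)])
    def __call__(self, x):
        return tf.nn.relu(tf.matmul(x, self.w))


def export_saved_model():
    """Save a TinyModel to a temporary SavedModel directory and return the path."""
    model = TinyModel()
    export_dir = tempfile.mkdtemp()
    tf.saved_model.save(
        model, export_dir, signatures=model.__call__.get_concrete_function()
    )
    return export_dir


saved_model_dir = export_saved_model()

# Baseline: plain float32 conversion.
fp32_model = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir).convert()

# float16: enable the default optimizations and restrict weights to float16.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()

print(f"float32: {len(fp32_model)} bytes, float16: {len(fp16_model)} bytes")
```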

Post-training dynamic range quantization

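This method quantizes the weights from float to 8-bit integers at conversion time. At inference time, activations are quantized to 8 bits on the fly, based on the range they actually take, so supported operators compute with 8-bit weights and activations while the outputs are still stored as float. Because the ranges are determined dynamically, no calibration data is needed. The accuracy loss is usually small, though typically not smaller than with post-training float16 quantization, which keeps more bits per weight.

Dynamic range quantization only sets converter.optimizations, and that same line appears in the examples for every other method, so it is reasonable to treat dynamic range quantization as the baseline form of quantization.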
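A minimal sketch of dynamic range quantization, where `converter.optimizations` is the only setting added to a plain conversion. `TinyModel` and `export_saved_model` are hypothetical helpers standing in for a real trained model.

```python
import tempfile

import tensorflow as tf


class TinyModel(tf.Module):
    """Hypothetical stand-in for a trained model: one dense layer with ReLU."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([256, 256]), name="w")

    @tf.function(input_signature=[tf.TensorSpec([1, 256], tf.float32)])
    def __call__(self, x):
        return tf.nn.relu(tf.matmul(x, self.w))


def export_saved_model():
    """Save a TinyModel to a temporary SavedModel directory and return the path."""
    model = TinyModel()
    export_dir = tempfile.mkdtemp()
    tf.saved_model.save(
        model, export_dir, signatures=model.__call__.get_concrete_function()
    )
    return export_dir


saved_model_dir = export_saved_model()

# Baseline: plain float32 conversion.
fp32_model = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir).convert()

# Dynamic range quantization: only the optimizations flag is set.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_model = converter.convert()

print(f"float32: {len(fp32_model)} bytes, dynamic range: {len(dynamic_model)} bytes")
```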

Post-training integer quantization

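This is the most aggressive post-training method: floating-point numbers are converted to integers for the computation.

Going from float to int requires a mapping. To work out the mapping, you need to provide some of the original training data (a representative dataset), from which the converter finds the maximum and minimum floating-point values that the model's activations take during computation. With those ranges, the model's floating-point values can be mapped to integers sensibly without losing much accuracy.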
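The mapping itself can be illustrated in plain Python. This sketch assumes the affine scheme real = scale * (q - zero_point) with an int8 target range; the function names and the sample activations are made up for illustration, and the converter's actual calibration is more involved.

```python
def quantization_params(min_val, max_val, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping [min_val, max_val] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        return 1.0, 0
    zero_point = int(round(qmin - min_val / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to the nearest integer on the grid, clamped to [qmin, qmax]."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer back to the float value it represents."""
    return scale * (q - zero_point)


# Hypothetical activation values observed while running calibration data.
activations = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = quantization_params(min(activations), max(activations))

quantized = [quantize(a, scale, zero_point) for a in activations]
restored = [dequantize(q, scale, zero_point) for q in quantized]

for a, q, r in zip(activations, quantized, restored):
    print(f"{a:+.4f} -> {q:4d} -> {r:+.4f}")
```

The observed minimum and maximum land on the two ends of the int8 range, and every restored value is within one quantization step (`scale`) of the original.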

Integer quantization comes in two flavors: regular integer quantization and full integer quantization.

Integer quantization

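A model that has gone through regular integer quantization still takes float32 inputs and produces float32 outputs, while the weights and computation are all int8. If it hits an operator with no integer implementation, it de-quantizes to float, runs that operator, and converts back to int before passing the result on.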
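A minimal sketch of regular integer quantization, assuming random arrays as calibration data. In practice `representative_dataset` would yield around a hundred real training samples; `TinyModel` and `export_saved_model` are hypothetical helpers standing in for a real trained model.

```python
import tempfile

import numpy as np
import tensorflow as tf


class TinyModel(tf.Module):
    """Hypothetical stand-in for a trained model: one dense layer with ReLU."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([256, 256]), name="w")

    @tf.function(input_signature=[tf.TensorSpec([1, 256], tf.float32)])
    def __call__(self, x):
        return tf.nn.relu(tf.matmul(x, self.w))


def export_saved_model():
    """Save a TinyModel to a temporary SavedModel directory and return the path."""
    model = TinyModel()
    export_dir = tempfile.mkdtemp()
    tf.saved_model.save(
        model, export_dir, signatures=model.__call__.get_concrete_function()
    )
    return export_dir


def representative_dataset():
    """Yield calibration inputs; random data here, real samples in practice."""
    rng = np.random.default_rng(0)
    for _ in range(100):
        yield [rng.random((1, 256), dtype=np.float32)]


saved_model_dir = export_saved_model()

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# The calibration data lets the converter measure activation ranges.
converter.representative_dataset = representative_dataset
int8_model = converter.convert()

print(f"integer quantized: {len(int8_model)} bytes")
```

Because the input and output types are left at their defaults, the converted model still accepts and returns float32 tensors.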

Full integer quantization

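Full integer means everything, including the inputs and outputs, has to be int8. This exists for devices (such as the Edge TPU) that only support integer arithmetic, so when quantizing you need to check that every op supports full integer computation.

Note in particular that because the input has to be int8, images must be converted to the right type before they go into the model. An example is linked below.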

Post-training integer quantization | TensorFlow Lite (www.tensorflow.org)
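The full-integer settings and the input type conversion can be sketched as follows. This is an illustration under assumptions, not the official example: `TinyModel`, `export_saved_model`, and `quantize_input` are hypothetical helpers, the calibration data is random, and the `scale` and `zero_point` values passed at the end are made up. On a real model they would be read from the interpreter's input details.

```python
import tempfile

import numpy as np
import tensorflow as tf


class TinyModel(tf.Module):
    """Hypothetical stand-in for a trained model: one dense layer with ReLU."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([256, 256]), name="w")

    @tf.function(input_signature=[tf.TensorSpec([1, 256], tf.float32)])
    def __call__(self, x):
        return tf.nn.relu(tf.matmul(x, self.w))


def export_saved_model():
    """Save a TinyModel to a temporary SavedModel directory and return the path."""
    model = TinyModel()
    export_dir = tempfile.mkdtemp()
    tf.saved_model.save(
        model, export_dir, signatures=model.__call__.get_concrete_function()
    )
    return export_dir


def representative_dataset():
    """Yield calibration inputs; random data here, real samples in practice."""
    rng = np.random.default_rng(0)
    for _ in range(100):
        yield [rng.random((1, 256), dtype=np.float32)]


def quantize_input(x, scale, zero_point):
    """Convert a float input to int8 using the model's input scale and zero point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)


converter = tf.lite.TFLiteConverter.from_saved_model(export_saved_model())
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Fail the conversion if any op lacks an int8 implementation.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Make the model's input and output tensors int8 as well.
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
full_int8_model = converter.convert()

# Assumed values mapping floats in [0, 1] onto the int8 range.
float_input = np.random.default_rng(1).random((1, 256), dtype=np.float32)
int8_input = quantize_input(float_input, scale=1 / 255, zero_point=-128)
```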

Quantization-aware training

As we all know, the post-training quantization methods above lose some accuracy. To get higher accuracy after quantization, Google researchers published a paper:

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (arxiv.org)

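The main idea is to add layers to the model during training that repeatedly quantize and de-quantize, simulating the loss of precision, so that the model still holds up when it is quantized later.

Concretely, take an already trained model, use the tensorflow_model_optimization API to add simulated (fake) quantization to it, then sample a small amount of the original training data to fine-tune the model. A model trained this way keeps reasonable accuracy under any of the quantization methods.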
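The quantize-then-de-quantize step can be illustrated without any framework. This is a minimal sketch in plain Python, assuming a symmetric int8 scheme; `fake_quantize` is a hypothetical helper and not part of the tensorflow_model_optimization API, and the weights are made-up numbers.

```python
def fake_quantize(values, num_bits=8):
    """Quantize and immediately de-quantize: floats in, slightly coarser floats out."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    simulated = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # snap to the integer grid
        simulated.append(q * scale)                  # map back to float
    return simulated


# Hypothetical float weights from one layer.
weights = [0.013, -0.42, 0.2071, 0.9]
simulated = fake_quantize(weights)
errors = [abs(w - s) for w, s in zip(weights, simulated)]

for w, s, e in zip(weights, simulated, errors):
    print(f"{w:+.4f} -> {s:+.4f} (error {e:.5f})")
```

During fine-tuning the forward pass sees values like `simulated` instead of `weights`, so the model learns to tolerate the rounding error it will face after real quantization.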

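(a) the original network, (b) with weight quantization and activation quantization added, (c) accuracy with int is even higher than with float

Comparing results with and without quantization-aware training shows that with it, accuracy comes very close to that of the original model.

The official site has example code: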

Quantization aware training in Keras example | TensorFlow Model Optimization (www.tensorflow.org)

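Comparison

The figure shows what processing each kind of quantization requires. The deeper you go, the faster inference becomes and the more accuracy is lost.

Quantization Tree: each node passed makes the model faster and costs more accuracy. The fastest option is quantizing to int8; to improve accuracy, int16 can be used.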

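Conclusion

No TensorFlow Lite quantization method is absolutely better or worse than another. It all depends on the task at hand and on what the model contains. A model containing operators that do not yet support integer arithmetic, for example, can only use the quantization methods that accommodate them.

Back when I was writing papers and doing research, I mostly used PyTorch. After starting work, I found that when deep learning has to be tightly integrated with an application, TensorFlow offers more supported routes. That is why I put together this summary of quantization methods: so that others can get up to speed quickly, and so that I can review it when I forget.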

Written by How哥
