
6.7.4 Model Compression [Elective]

Section Overview

Model compression is a deployment trade-off, not a magic shrink button. You compress because memory, latency, throughput, or device limits force a decision.

Learning Objectives

  • Explain quantization, pruning, and distillation by what they change.
  • Estimate model size from parameter count and numeric precision.
  • Measure quantization error in a tiny example.
  • Choose a compression path from a deployment bottleneck.
  • Avoid judging compression by size alone.

Start from the Deployment Bottleneck

Model compression trade-off map

Bottleneck | First method to consider | Why
memory too high | quantization | same parameter count, fewer bits per value
many redundant weights/channels | pruning | remove structure that contributes little
a large teacher exists and retraining is possible | distillation | train a smaller student to imitate behavior
latency still high after compression | profiling first | bottleneck may be data transfer or unsupported kernels

The important habit:

measure bottleneck -> choose method -> remeasure size, latency, and metric
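A minimal sketch of the "remeasure latency" step is shown below, using a toy model and an arbitrary batch size; swap in the model and inputs you actually deploy, and remember the numbers only mean something on the target hardware and runtime.

import time

import torch
from torch import nn

# Toy stand-in; replace with the model you actually deploy.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
x = torch.randn(32, 128)  # arbitrary batch for illustration

# Warm up so one-time costs do not distort the measurement.
with torch.no_grad():
    for _ in range(10):
        model(x)

# Time repeated forward passes and report mean latency per batch.
n_runs = 100
start = time.perf_counter()
with torch.no_grad():
    for _ in range(n_runs):
        model(x)
elapsed = time.perf_counter() - start

print(f"mean latency per batch: {elapsed / n_runs * 1000:.3f} ms")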

Three Compression Paths

Method | Changes | Typical benefit | Main risk
Quantization | numeric precision | smaller memory, sometimes faster inference | accuracy drop, hardware support issues
Pruning | weights, channels, or blocks | less computation if structure is actually removed | sparse speedup may not appear on all hardware
Distillation | training objective | smaller model with teacher-like behavior | requires retraining and teacher outputs

Compression is not done until you have verified that the task still works on the compressed model.

Lab 1: Quantization Error

weights = [0.12, -1.87, 3.44, -0.03]


def fake_quantize(values, scale):
    # Round each value to the nearest multiple of 1/scale.
    return [round(v * scale) / scale for v in values]


def mae(a, b):
    # Mean absolute error between two equal-length lists.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


q8_like = fake_quantize(weights, scale=16)  # finer grid (step 1/16)
q4_like = fake_quantize(weights, scale=4)   # coarser grid (step 1/4)

print("quant_error_lab")
print("original:", weights)
print("q8_like:", q8_like)
print("q4_like:", q4_like)
print("q8_mae:", round(mae(weights, q8_like), 4))
print("q4_mae:", round(mae(weights, q4_like), 4))

Expected output:

quant_error_lab
original: [0.12, -1.87, 3.44, -0.03]
q8_like: [0.125, -1.875, 3.4375, 0.0]
q4_like: [0.0, -1.75, 3.5, 0.0]
q8_mae: 0.0106
q4_mae: 0.0825

More aggressive quantization usually creates more numerical error. The practical question is whether the downstream task metric stays acceptable.
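To make the trend visible, here is a small sketch that reuses weights, fake_quantize, and mae from Lab 1 and sweeps a few scales; the specific scale values are arbitrary choices for illustration, and coarser scales should produce larger MAE.

# Assumes weights, fake_quantize, and mae from Lab 1 are still defined.
for scale in [2, 4, 8, 16, 32, 64]:
    q = fake_quantize(weights, scale)
    print(f"scale={scale:>2}  mae={mae(weights, q):.4f}")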

Lab 2: Estimate Model Size

import torch
from torch import nn

# Small two-layer MLP used only to count parameters.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

param_count = sum(p.numel() for p in model.parameters())

print("model_size_lab")
print("params:", param_count)

# Parameter memory = parameter count * bits per parameter, converted to MB.
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    mb = param_count * bits / 8 / 1024 / 1024
    print(f"{name:>4}: {mb:.4f} MB")

Expected output:

model_size_lab
params: 8906
fp32: 0.0340 MB
fp16: 0.0170 MB
int8: 0.0085 MB
int4: 0.0042 MB


This is an estimate for parameters only. Real deployed size can also include metadata, tokenizer files, runtime overhead, and engine-specific packaging.
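If you want to compare the estimate with an actual artifact, one option is to serialize the Lab 2 model and check the file on disk. The sketch below assumes the model from Lab 2 is still in scope; the file name is an arbitrary scratch path.

import os

import torch

# Hypothetical scratch path; the saved state_dict is typically slightly larger
# than the parameter-only fp32 estimate because of serialization overhead.
path = "model_size_lab.pt"
torch.save(model.state_dict(), path)
size_mb = os.path.getsize(path) / 1024 / 1024
print(f"state_dict on disk: {size_mb:.4f} MB")
os.remove(path)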

Choosing a Path

Situation | Good first action
model does not fit in memory | try quantization first
model fits but latency is high | profile latency before pruning
most channels appear redundant | consider structured pruning
a smaller model must preserve behavior | distill from a teacher model
metric drops too much after compression | reduce compression strength or fine-tune

For pruning, prefer structured pruning for deployment because removing whole channels or blocks is easier for hardware to exploit than random sparse weights.
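One way to experiment is torch.nn.utils.prune; the sketch below applies L2-norm structured pruning to half of a Linear layer's output channels (layer sizes and pruning amount are arbitrary). Note that this masks channels to zero rather than physically shrinking the tensor, so an actual speedup still requires rebuilding a smaller layer or a runtime that exploits the structure.

from torch import nn
from torch.nn.utils import prune

layer = nn.Linear(64, 32)  # arbitrary toy layer

# Zero out 50% of output channels (rows of weight), ranked by L2 norm.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

# Fold the mask into the weight tensor and drop the reparametrization.
prune.remove(layer, "weight")

zero_rows = int((layer.weight.abs().sum(dim=1) == 0).sum())
print(f"zeroed output channels: {zero_rows} / {layer.weight.shape[0]}")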

For distillation, the common pattern is:

teacher logits or outputs -> student learns labels + teacher behavior
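A minimal sketch of that pattern follows, assuming a classification setup where both teacher and student output logits; the temperature, loss weighting, and tiny stand-in networks are illustrative choices, not prescribed values.

import torch
import torch.nn.functional as F
from torch import nn

temperature = 2.0  # softens both distributions
alpha = 0.5        # balance between hard-label loss and imitation loss

# Tiny stand-ins for a real teacher/student pair.
teacher = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
student = nn.Linear(16, 10)

x = torch.randn(8, 16)
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

# Hard-label loss: student vs. ground-truth labels.
ce_loss = F.cross_entropy(student_logits, labels)

# Soft-label loss: student imitates the teacher's softened output distribution.
kd_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * (temperature ** 2)

loss = alpha * ce_loss + (1 - alpha) * kd_loss
print("distillation loss:", round(loss.item(), 4))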

What to Report in a Compression Experiment

Metric | Before | After | Why it matters
model size | required | required | did memory improve?
latency | required | required | did inference actually speed up?
throughput | useful | useful | can the service handle more requests?
task metric | required | required | did quality remain acceptable?
hardware/runtime | required | required | compression depends on deployment stack

Never report “int8 works” without task metric and latency. Smaller is not automatically better.

Common Mistakes

Mistake | Fix
compressing before measuring bottlenecks | measure memory, latency, and metric first
assuming quantization always speeds things up | verify hardware and runtime support
counting only parameter size | include tokenizer, runtime, and packaging where relevant
using unstructured pruning and expecting automatic speedup | benchmark on target hardware
ignoring accuracy after compression | compare task metric before and after

Exercises

  1. Change scale=16 to scale=32 in Lab 1. Does MAE decrease?
  2. Add a third Linear layer to Lab 2 and recompute model size.
  3. Choose a compression strategy for a model that fits in memory but is too slow.
  4. Write a before/after report template with size, latency, throughput, and metric.
  5. Explain why structured pruning is usually easier to deploy than unstructured pruning.

Key Takeaways

  • Compression starts from deployment constraints.
  • Quantization changes numeric precision.
  • Pruning changes model structure.
  • Distillation changes the training process.
  • Compression is successful only if the deployed task still meets quality and latency requirements.