6.7.4 Model Compression [Elective]

  • Explain quantization, pruning, and distillation by what they change.
  • Estimate model size from parameter count and numeric precision.
  • Measure quantization error in a tiny example.
  • Choose a compression path from a deployment bottleneck.
  • Avoid judging compression by size alone.

[Figure: Model compression trade-off map]

| Bottleneck | First method to consider | Why |
| --- | --- | --- |
| memory too high | quantization | same parameter count, fewer bits per value |
| many redundant weights/channels | pruning | remove structure that contributes little |
| large teacher but retraining is possible | distillation | train a smaller student to imitate behavior |
| latency still high after compression | profiling first | bottleneck may be data transfer or unsupported kernels |

The important habit:

measure bottleneck -> choose method -> remeasure size, latency, and metric

| Method | Changes | Typical benefit | Main risk |
| --- | --- | --- | --- |
| Quantization | numeric precision | smaller memory, sometimes faster inference | accuracy drop, hardware support issues |
| Pruning | weights, channels, or blocks | less computation if structure is actually removed | sparse speedup may not appear on all hardware |
| Distillation | training objective | smaller model with teacher-like behavior | requires retraining and teacher outputs |

Compression is not complete until you have confirmed that the task still works on the compressed model.

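# Lab 1: quantization error
weights = [0.12, -1.87, 3.44, -0.03]

def fake_quantize(values, scale):
    # Snap each value to the nearest multiple of 1/scale.
    return [round(v * scale) / scale for v in values]

def mae(a, b):
    # Mean absolute error between two equal-length lists.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A larger scale means finer steps (1/16 vs. 1/4).
# The names are loose labels, not real int8/int4 formats.
q8_like = fake_quantize(weights, scale=16)
q4_like = fake_quantize(weights, scale=4)

print("quant_error_lab")
print("original:", weights)
print("q8_like:", q8_like)
print("q4_like:", q4_like)
print("q8_mae:", round(mae(weights, q8_like), 4))
print("q4_mae:", round(mae(weights, q4_like), 4))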

Expected output:

quant_error_lab
original: [0.12, -1.87, 3.44, -0.03]
q8_like: [0.125, -1.875, 3.4375, 0.0]
q4_like: [0.0, -1.75, 3.5, 0.0]
q8_mae: 0.0106
q4_mae: 0.0825

More aggressive quantization usually creates more numerical error. The practical question is whether the downstream task metric still stays acceptable.
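The lab above rounds to a fixed grid. As a minimal sketch of how integer quantization is often set up, the block below assumes a symmetric, per-tensor scheme: one shared step size derived from the largest absolute weight, and signed integer codes clamped to the range the bit width allows. The helper names `quantize_symmetric` and `dequantize` are hypothetical, and real runtimes may use other schemes (per-channel scales, zero points, calibration data).

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes using one shared step size."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    step = max_abs / qmax if max_abs > 0 else 1.0   # real-valued width of one code
    codes = [max(-qmax, min(qmax, round(v / step))) for v in values]
    return codes, step


def dequantize(codes, step):
    """Recover approximate floats from integer codes."""
    return [c * step for c in codes]


def mae(a, b):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


weights = [0.12, -1.87, 3.44, -0.03]

results = {}
for bits in (8, 4):
    codes, step = quantize_symmetric(weights, bits)
    restored = dequantize(codes, step)
    results[bits] = (codes, mae(weights, restored))
    print(f"int{bits}: codes={codes} step={step:.4f} mae={results[bits][1]:.4f}")
```

With fewer bits there are fewer available codes, so small weights collapse to zero sooner. That is the numerical error the task metric has to tolerate.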

# Lab 2: model size from parameter count
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Count every weight and bias value.
param_count = sum(p.numel() for p in model.parameters())

print("model_size_lab")
print("params:", param_count)
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    # bits -> bytes -> units of 1024 * 1024 bytes
    mb = param_count * bits / 8 / 1024 / 1024
    print(f"{name:>4}: {mb:.4f} MB")

Expected output:

model_size_lab
params: 8906
fp32: 0.0340 MB
fp16: 0.0170 MB
int8: 0.0085 MB
int4: 0.0042 MB

[Figure: Model compression quantization and size result map]

This is an estimate for parameters only. Real deployed size can also include metadata, tokenizer files, runtime overhead, and engine-specific packaging.
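The same estimate can be reproduced without PyTorch, which makes the arithmetic explicit. This is a rough sketch: `linear_params` and `estimate_size_mib` are hypothetical helper names, and `overhead_bytes` is an assumed placeholder for whatever fixed extras (metadata, packaging) apply to your deployment. Note that dividing by 1024 * 1024 gives mebibytes, which the lab labels as MB.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def estimate_size_mib(param_count, bits_per_param, overhead_bytes=0):
    """Parameter storage in MiB (1024 * 1024 bytes), plus optional fixed overhead."""
    return (param_count * bits_per_param / 8 + overhead_bytes) / (1024 * 1024)


# Same layer shapes as the lab model: Linear(128, 64) and Linear(64, 10).
layer_shapes = [(128, 64), (64, 10)]
param_count = sum(linear_params(i, o) for i, o in layer_shapes)
print("params:", param_count)

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>4}: {estimate_size_mib(param_count, bits):.4f} MiB")
```

Halving the bits per value halves the parameter storage. Any fixed overhead does not shrink, so its share of the total grows as precision drops.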

| Situation | Good first action |
| --- | --- |
| model does not fit in memory | try quantization first |
| model fits but latency is high | profile latency before pruning |
| most channels appear redundant | consider structured pruning |
| a smaller model must preserve behavior | distill from a teacher model |
| metric drops too much after compression | reduce compression strength or fine-tune |

For pruning, prefer structured pruning for deployment: removing whole channels or blocks produces smaller dense tensors, which hardware can exploit more easily than individual zeroed weights scattered through a matrix (unstructured sparsity).
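To make the difference concrete, here is a small pure-Python sketch. It assumes each row of the matrix is one output channel and ranks channels by L1 norm, which is one common heuristic rather than the only option. The function names and the toy weights are hypothetical.

```python
def structured_prune_rows(matrix, keep):
    """Keep the `keep` rows (output channels) with the largest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])              # preserve original channel order
    return [matrix[i] for i in kept]


def unstructured_prune(matrix, sparsity):
    """Zero out the smallest-magnitude weights; the shape stays the same."""
    flat = sorted(abs(w) for row in matrix for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in matrix]
    threshold = flat[n_prune - 1]             # ties at the threshold are pruned too
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


# Toy 4x3 weight matrix, made up for illustration.
weights = [
    [0.90, -0.80, 0.70],
    [0.01, 0.02, -0.01],
    [-0.60, 0.50, 0.40],
    [0.03, -0.02, 0.01],
]

structured = structured_prune_rows(weights, keep=2)
unstructured = unstructured_prune(weights, sparsity=0.5)

print("structured shape:", len(structured), "x", len(structured[0]))
print("unstructured shape:", len(unstructured), "x", len(unstructured[0]))
print("unstructured zeros:", sum(w == 0.0 for row in unstructured for w in row))
```

Both variants discard the same six small weights here, but only the structured result is a smaller dense matrix. The unstructured result still occupies, and multiplies through, the full 4x3 shape unless the runtime has sparse kernels.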

For distillation, the common pattern is:

teacher logits or outputs -> student learns labels + teacher behavior
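One way to turn this pattern into a loss is sketched below in plain Python. It assumes the widely used formulation that mixes cross-entropy on the true label with a KL term between temperature-softened teacher and student distributions. The `temperature` and `alpha` values, the `temperature ** 2` scaling, and the example logits are illustrative assumptions, not settings taken from this lesson.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * hard-label loss + (1 - alpha) * T^2 * soft-target loss."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Made-up logits for a 3-class problem where class 0 is correct.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 3.0, 1.0]

loss_close = distillation_loss(close_student, teacher_logits, label=0)
loss_far = distillation_loss(far_student, teacher_logits, label=0)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

A student whose logits track the teacher's gets a lower loss than one that disagrees, even before accuracy on the hard labels is considered.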

What to Report in a Compression Experiment

| Metric | Before | After | Why it matters |
| --- | --- | --- | --- |
| model size | required | required | did memory improve? |
| latency | required | required | did inference actually speed up? |
| throughput | useful | useful | can the service handle more requests? |
| task metric | required | required | did quality remain acceptable? |
| hardware/runtime | required | required | compression depends on deployment stack |

Never report “int8 works” without task metric and latency. Smaller is not automatically better.
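Latency is the number most often skipped, so here is a minimal timing sketch using only the standard library. `measure_latency_ms` is a hypothetical helper and `fake_inference` is a stand-in for a real model call. The warmup and run counts are arbitrary assumptions, and a real benchmark should run on the target hardware with realistic inputs and batch sizes.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, runs=50):
    """Time repeated calls to fn and return (p50, p95) in milliseconds."""
    for _ in range(warmup):                   # warm caches before timing
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples, n=100)   # 99 percentile cut points
    return statistics.median(samples), cuts[94]   # 50th and 95th percentile


def fake_inference():
    # Hypothetical stand-in for a real model call.
    return sum(i * i for i in range(10_000))


p50, p95 = measure_latency_ms(fake_inference)
print(f"latency_p50: {p50:.3f} ms, latency_p95: {p95:.3f} ms")
```

Run the same harness on the baseline and the compressed model, on the same machine, and report both pairs of numbers.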

Save compression results as a before/after report:

Baseline Size: To be filled
Compressed Size: To be filled
Baseline Latency: To be filled
Compressed Latency: To be filled
Baseline Metric: To be filled
Compressed Metric: To be filled
Runtime Hardware: To be filled
Decision: keep, tune, or reject compression

This protects you from a common mistake: reducing file size while making the actual product slower or less accurate.
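If you keep these reports in code, a small record type can hold the fields and suggest a decision. The sketch below is only one possible rule set: the `CompressionReport` class, the `max_metric_drop` threshold, and the keep/tune/reject logic are assumptions for illustration, and the numbers are made up. It also assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass
class CompressionReport:
    baseline_size_mb: float
    compressed_size_mb: float
    baseline_latency_ms: float
    compressed_latency_ms: float
    baseline_metric: float        # higher is better, e.g. accuracy
    compressed_metric: float
    runtime_hardware: str

    def decision(self, max_metric_drop=0.01):
        """Return 'keep', 'tune', or 'reject' under the assumed rules below."""
        smaller = self.compressed_size_mb < self.baseline_size_mb
        latency_ok = self.compressed_latency_ms <= self.baseline_latency_ms
        quality_ok = self.baseline_metric - self.compressed_metric <= max_metric_drop
        if smaller and latency_ok and quality_ok:
            return "keep"
        if smaller and (latency_ok or quality_ok):
            return "tune"         # one regression, possibly fixable
        return "reject"           # no size win, or both latency and quality regressed


# Made-up numbers: the model is smaller and still accurate, but slower.
report = CompressionReport(
    baseline_size_mb=100.0, compressed_size_mb=25.0,
    baseline_latency_ms=40.0, compressed_latency_ms=44.0,
    baseline_metric=0.91, compressed_metric=0.905,
    runtime_hardware="hypothetical CPU server",
)
print("decision:", report.decision())
```

The example lands on "tune" because the size win came with a latency regression, which is exactly the case a size-only report would hide.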

| Mistake | Fix |
| --- | --- |
| compressing before measuring bottlenecks | measure memory, latency, and metric first |
| assuming quantization always speeds things up | verify hardware and runtime support |
| counting only parameter size | include tokenizer, runtime, and packaging where relevant |
| using unstructured pruning and expecting automatic speedup | benchmark on target hardware |
| ignoring accuracy after compression | compare task metric before and after |

  1. Change scale=16 to scale=32 in Lab 1. Does MAE decrease?
  2. Add a third Linear layer to Lab 2 and recompute model size.
  3. Choose a compression strategy for a model that fits in memory but is too slow.
  4. Write a before/after report template with size, latency, throughput, and metric.
  5. Explain why structured pruning is usually easier to deploy than unstructured pruning.

Reference implementation and walkthrough

  1. Increasing scale to 32 usually reduces quantization error because the step size shrinks from 1/16 to 1/32. For the Lab 1 weights, MAE drops from 0.0106 to 0.0034. Verify with MAE instead of guessing.
  2. A third Linear layer adds both weight and bias parameters. Recompute each layer as in_features * out_features + out_features.
  3. If memory is acceptable but latency is too high, profile first to find where the time goes. Then consider quantization (if the hardware has fast low-precision kernels), batching/runtime optimization, or distillation. Pruning helps only if the deployment runtime can exploit it.
  4. A useful report compares model_size, latency_p50/p95, throughput, task_metric, hardware, batch size, and the exact compression method.
  5. Structured pruning removes whole channels, heads, or blocks, so common runtimes can speed it up. Unstructured sparsity often needs special kernels to become faster.

  • Compression starts from deployment constraints.
  • Quantization changes numeric precision.
  • Pruning changes model structure.
  • Distillation changes the training process.
  • Compression is successful only if the deployed task still meets quality and latency requirements.