
E.A.4 Inference Engines

Inference engine and hardware adaptation diagram

Inference engine selection matrix diagram

An inference engine is the runtime layer between a trained model and real hardware. The model says what to compute; the engine decides how to execute that graph efficiently on CPU, GPU, NPU, or edge hardware.

Use this lesson as a first selection drill. Do not treat any one engine as always best; match the engine to the deployment constraints.

Prerequisites

  • Python 3.10+
  • No external packages
  • Five minutes to run and edit the scoring script

Key terms

  • Latency: how long one request waits for a result.
  • Throughput: how many requests the system can finish per second.
  • Backend: the hardware-specific execution path, such as CPU, CUDA, TensorRT, or OpenVINO.
  • ONNX: a common model exchange format.
  • Operator: one model graph operation, such as matrix multiplication, convolution, or normalization.

Create engine_selector.py:

# Candidate engines with the traits used for scoring.
engines = [
    {
        "name": "ONNX Runtime",
        "hardware": ["cpu", "nvidia"],
        "formats": ["onnx"],
        "latency": "medium",
        "ops": "easy",
    },
    {
        "name": "TensorRT",
        "hardware": ["nvidia"],
        "formats": ["onnx", "engine"],
        "latency": "low",
        "ops": "hard",
    },
    {
        "name": "OpenVINO",
        "hardware": ["cpu", "intel"],
        "formats": ["onnx", "ir"],
        "latency": "low",
        "ops": "medium",
    },
]

# The deployment constraint to match against.
need = {"hardware": "nvidia", "format": "onnx", "latency": "low"}

for engine in engines:
    score = 0
    # Hardware and format mismatches are penalized more than matches are rewarded.
    score += 2 if need["hardware"] in engine["hardware"] else -3
    score += 2 if need["format"] in engine["formats"] else -2
    score += 1 if need["latency"] == engine["latency"] else 0
    # Engines that are hard to operate lose one point.
    score -= 1 if engine["ops"] == "hard" else 0
    engine["score"] = score

# max() returns the first engine with the highest score, so list order breaks ties.
best = max(engines, key=lambda item: item["score"])

for engine in engines:
    print(engine["name"], engine["score"])
print("selected:", best["name"])

Run it:

python engine_selector.py

Expected output:

ONNX Runtime 4
TensorRT 4
OpenVINO 0
selected: ONNX Runtime

The script gives ONNX Runtime and TensorRT the same score. Python's max() returns the first of the tied items, so ONNX Runtime is selected. That is intentional: in a real deployment, if a faster path adds extra operational cost, the simpler path can be the better first release.
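The tie here is resolved by list order, which is easy to change by accident. Below is a minimal sketch of making the tie-break explicit instead. The scores are copied from the run above, the list is deliberately reordered so TensorRT comes first, and the OPS_COST mapping is an assumption introduced for illustration; it is not part of engine_selector.py.

```python
# Scores from the NVIDIA run above, with TensorRT moved to the front on purpose.
scored = [
    {"name": "TensorRT", "score": 4, "ops": "hard"},
    {"name": "ONNX Runtime", "score": 4, "ops": "easy"},
    {"name": "OpenVINO", "score": 0, "ops": "medium"},
]

# Hypothetical ranking of operational effort (lower means simpler to run).
OPS_COST = {"easy": 0, "medium": 1, "hard": 2}

# max() returns the first maximal item, so list order decides the tie.
by_order = max(scored, key=lambda e: e["score"])

# Tuple key: compare score first, then prefer the lower operational cost.
by_simplicity = max(scored, key=lambda e: (e["score"], -OPS_COST[e["ops"]]))

print("list order picks:", by_order["name"])
print("explicit tie-break picks:", by_simplicity["name"])
```

With the tuple key, reordering the list no longer changes which engine wins a tie.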

Now change the need line in engine_selector.py from:

need = {"hardware": "nvidia", "format": "onnx", "latency": "low"}

to:

need = {"hardware": "intel", "format": "onnx", "latency": "low"}

To confirm the edit took effect, you can add print(need) right after the assignment. It should print:

{'hardware': 'intel', 'format': 'onnx', 'latency': 'low'}

Run the script again. Expected result:

ONNX Runtime -1
TensorRT -1
OpenVINO 5
selected: OpenVINO

This is the core idea: engine choice changes when hardware changes.
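Editing the need line by hand works for one change, but it is easier to compare scenarios side by side. The following is a minimal sketch that wraps the same scoring rules in functions and loops over hardware targets. The names score_engine and select are hypothetical helpers, not part of the original script.

```python
ENGINES = [
    {"name": "ONNX Runtime", "hardware": ["cpu", "nvidia"], "formats": ["onnx"],
     "latency": "medium", "ops": "easy"},
    {"name": "TensorRT", "hardware": ["nvidia"], "formats": ["onnx", "engine"],
     "latency": "low", "ops": "hard"},
    {"name": "OpenVINO", "hardware": ["cpu", "intel"], "formats": ["onnx", "ir"],
     "latency": "low", "ops": "medium"},
]


def score_engine(engine, need):
    """Apply the same scoring rules as engine_selector.py without mutating the engine."""
    score = 2 if need["hardware"] in engine["hardware"] else -3
    score += 2 if need["format"] in engine["formats"] else -2
    score += 1 if need["latency"] == engine["latency"] else 0
    score -= 1 if engine["ops"] == "hard" else 0
    return score


def select(need):
    """Return (best engine name, {name: score}); ties go to the earlier engine."""
    scores = {e["name"]: score_engine(e, need) for e in ENGINES}
    best = max(scores, key=scores.get)
    return best, scores


for hardware in ["nvidia", "intel"]:
    need = {"hardware": hardware, "format": "onnx", "latency": "low"}
    best, scores = select(need)
    print(hardware, scores, "->", best)
```

Keeping the scoring in a pure function also makes it straightforward to add more scenarios later without touching the engine list.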

Use this order before trying advanced tuning:

  1. Confirm the target hardware.
  2. Confirm the model format the engine can load.
  3. Check whether unsupported operators exist.
  4. Compare latency and throughput with the same input size.
  5. Choose the simplest engine that meets the target.
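The checklist above can be sketched as a series of gates applied in order, followed by a "simplest wins" choice. Everything in this example is hypothetical: the engine names (EngineA, EngineB, EngineC), the operator names, the latency figures, and the ops_effort ranking are placeholders and say nothing about what any real engine supports.

```python
# Hypothetical operators used by the model (illustrative names only).
MODEL_OPS = {"MatMul", "Conv", "LayerNorm", "CustomNMS"}

# Hypothetical candidates; latency_ms stands for a measurement at one fixed input size.
CANDIDATES = [
    {"name": "EngineA", "hardware": {"cpu", "nvidia"}, "formats": {"onnx"},
     "supported_ops": {"MatMul", "Conv", "LayerNorm", "CustomNMS"},
     "latency_ms": 18.0, "ops_effort": 1},
    {"name": "EngineB", "hardware": {"nvidia"}, "formats": {"onnx"},
     "supported_ops": {"MatMul", "Conv", "LayerNorm"},
     "latency_ms": 7.0, "ops_effort": 3},
    {"name": "EngineC", "hardware": {"nvidia"}, "formats": {"onnx"},
     "supported_ops": {"MatMul", "Conv", "LayerNorm", "CustomNMS"},
     "latency_ms": 9.0, "ops_effort": 2},
]


def choose(candidates, hardware, model_format, model_ops, target_ms):
    """Apply the checklist in order and return (choice, rejection reasons)."""
    passed, rejected = [], {}
    for c in candidates:
        missing = model_ops - c["supported_ops"]
        if hardware not in c["hardware"]:            # step 1: target hardware
            rejected[c["name"]] = "hardware mismatch"
        elif model_format not in c["formats"]:       # step 2: loadable format
            rejected[c["name"]] = "format not loadable"
        elif missing:                                # step 3: unsupported operators
            rejected[c["name"]] = "unsupported ops: " + ", ".join(sorted(missing))
        elif c["latency_ms"] > target_ms:            # step 4: measured against target
            rejected[c["name"]] = "misses latency target"
        else:
            passed.append(c)
    # Step 5: among engines that meet the target, prefer the least effort.
    best = min(passed, key=lambda c: c["ops_effort"]) if passed else None
    return best, rejected


best, rejected = choose(CANDIDATES, "nvidia", "onnx", MODEL_OPS, target_ms=20.0)
print("selected:", best["name"] if best else None)
print("rejected:", rejected)
```

The design choice is that the first three steps are pass/fail rather than point penalties: an engine that cannot load the model or lacks an operator is out of the running no matter how fast it is.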

Keep this page’s proof of learning as a small evidence card:

Deployment Target: local inference, edge device, model server, or optimization experiment
Artifact: C++ snippet, benchmark, model artifact, serving config, or deployment note
Metric: latency, memory, throughput, model size, accuracy drop, or reliability
Failure Check: ABI/build issue, hardware mismatch, quantization loss, or serving bottleneck
Expected Output: reproducible deployment or optimization evidence, not only theory notes

Common mistakes:

  • Choosing TensorRT only because it is fast, even when the team cannot maintain the engine build pipeline.
  • Testing with a tiny input, then discovering production input is much slower.
  • Forgetting unsupported operators until the final deployment week.
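
The tiny-input mistake in the list above is easy to reproduce with a small timing harness. This is a minimal sketch using only the standard library: fake_infer is a stand-in for a real model call, the input sizes are assumptions chosen for illustration, and throughput is estimated as one request at a time (no batching or concurrency).

```python
import statistics
import time


def fake_infer(batch):
    """Stand-in for a model call: cost grows with the number of input values."""
    return sum(x * x for x in batch)


def benchmark(fn, batch, warmup=3, runs=20):
    """Return median latency in ms and sequential throughput in requests per second."""
    for _ in range(warmup):            # warm-up runs are not timed
        fn(batch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(batch)
        samples.append(time.perf_counter() - start)
    median_s = statistics.median(samples)
    # Guard against a zero reading on very coarse timers.
    throughput = 1.0 / median_s if median_s > 0 else float("inf")
    return {"latency_ms": median_s * 1000.0, "throughput_rps": throughput}


tiny = [0.5] * 100              # convenient test input
production = [0.5] * 200_000    # hypothetical production-sized input

for label, batch in [("tiny", tiny), ("production", production)]:
    result = benchmark(fake_infer, batch)
    print(f"{label:10s} n={len(batch):7d} "
          f"latency={result['latency_ms']:.3f} ms "
          f"throughput={result['throughput_rps']:.1f} req/s")
```

The exact numbers depend on the machine, so none are shown here. The point is to record the input size next to every measurement and to compare engines only at the same size.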

Add a memory field to each engine and subtract one point if it uses more memory than your device allows. Then rerun the selector for CPU-only, NVIDIA GPU, and Intel device scenarios.

Reference implementation and walkthrough

A good answer treats memory as a real constraint that changes the score, not as a decorative column. For example, if a target device allows memory_limit=1024, an engine with memory=1800 (in the same unit) should lose a point or be marked risky, even if its latency score is strong.

Expected reasoning:

  • CPU-only usually favors the engine that supports CPU well and stays within memory.
  • NVIDIA GPU may favor TensorRT if format and operators fit.
  • Intel hardware should usually push the selector toward OpenVINO.
  • The final choice should mention both score and deployment risk, not just the fastest engine.
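
One possible way to implement the reasoning above is sketched here. It is not the only valid answer. The memory values and the memory_limit of 1024 are placeholders taken from or modeled on the example in the text, not measurements of the real engines, and score_engine and run_scenario are hypothetical helper names.

```python
# Memory figures are illustrative placeholders, in the same unit as memory_limit.
ENGINES = [
    {"name": "ONNX Runtime", "hardware": ["cpu", "nvidia"], "formats": ["onnx"],
     "latency": "medium", "ops": "easy", "memory": 900},
    {"name": "TensorRT", "hardware": ["nvidia"], "formats": ["onnx", "engine"],
     "latency": "low", "ops": "hard", "memory": 1800},
    {"name": "OpenVINO", "hardware": ["cpu", "intel"], "formats": ["onnx", "ir"],
     "latency": "low", "ops": "medium", "memory": 700},
]

SCENARIOS = {
    "cpu-only": {"hardware": "cpu", "format": "onnx", "latency": "low", "memory_limit": 1024},
    "nvidia-gpu": {"hardware": "nvidia", "format": "onnx", "latency": "low", "memory_limit": 1024},
    "intel-device": {"hardware": "intel", "format": "onnx", "latency": "low", "memory_limit": 1024},
}


def score_engine(engine, need):
    """Return (score, over_memory) using the lesson's rules plus a memory penalty."""
    score = 2 if need["hardware"] in engine["hardware"] else -3
    score += 2 if need["format"] in engine["formats"] else -2
    score += 1 if need["latency"] == engine["latency"] else 0
    score -= 1 if engine["ops"] == "hard" else 0
    over_memory = engine["memory"] > need["memory_limit"]
    score -= 1 if over_memory else 0       # the exercise's one-point penalty
    return score, over_memory


def run_scenario(need):
    """Score every engine and return (best name, {name: (score, over_memory)})."""
    results = {e["name"]: score_engine(e, need) for e in ENGINES}
    best = max(results, key=lambda name: results[name][0])
    return best, results


for label, need in SCENARIOS.items():
    best, results = run_scenario(need)
    print(label)
    for name, (score, over_memory) in results.items():
        flag = " (risky: over memory limit)" if over_memory else ""
        print(f"  {name}: {score}{flag}")
    print("  selected:", best)
```

Returning the over_memory flag alongside the score lets the final write-up mention deployment risk as well as the number, which is what the last bullet asks for.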