
13.1 Compute Routes: Local CPU, Free Colab, Rented GPU

Open-source LLM compute route selector

Before choosing a model name, choose the place where the experiment will run. A good compute route tells you what can be proven today, what should wait, what evidence to keep, and how to stop before cost or complexity runs away.

This page gives you three routes:

  • Local CPU: safest first loop, no rental, proves code and evidence.
  • Free Colab: useful when a free GPU is available, but not guaranteed.
  • Rented GPU: best for vLLM-style serving or 7B-class models, but only with a written stop plan.

Local CPU

  • Use when: You want the safest first run on your own machine.
  • First target: sshleifer/tiny-gpt2, quantized small model, evaluation script, local API skeleton.
  • Not for: Proving 7B quality, high throughput, or long-context serving.
  • Evidence: environment_report.txt, first_run.md, eval_results.csv.

Free Colab

  • Use when: You need a temporary notebook and a GPU may be available.
  • First target: Small instruct model, tokenizer checks, short evaluation, tiny LoRA dry run.
  • Not for: Private data, long jobs, public services, or guaranteed-GPU planning.
  • Evidence: Notebook copy, runtime type, nvidia-smi or CPU note, saved outputs.

Rented GPU

  • Use when: You need predictable VRAM, SSH, serving, or a 7B-class test.
  • First target: vLLM/SGLang server, fixed eval set, latency and memory check.
  • Not for: Starting without budget, exposing a public port, or training before eval.
  • Evidence: gpu_plan.md, environment_report.txt, request/response log, shutdown proof.

Colab is a good learning route, but treat it as opportunistic. Google’s Colab FAQ says free compute resources can include GPUs and TPUs, but resources are not guaranteed or unlimited and usage limits can fluctuate. Write your plan so the lab still works on CPU if the free GPU is unavailable.

Choose the route by the question you are trying to answer:

Question | Route
“Can my Python environment load a model and generate text?” | Local CPU
“Can I run the same notebook on a temporary hosted machine?” | Free Colab
“Can this model serve requests with known VRAM, latency, and shutdown?” | Rented GPU
“Should I fine-tune?” | None yet; run fixed eval cases first

The first useful proof is not a clever answer. It is a reproducible trace: environment -> model -> prompt -> output -> evaluation -> stop.
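The evaluation step in that trace can be very small. The sketch below is one possible shape, not the hands-on lab's eval_openllm.py: the generate(prompt) callable and the case fields (id, prompt, expect_substring) are hypothetical names chosen for illustration. It runs each fixed case and writes the rows to eval_results.csv.

```python
import csv


def run_eval(generate, cases, path="eval_results.csv"):
    """Run every fixed case through generate() and save one CSV row per case.

    generate: hypothetical callable taking a prompt string, returning text.
    cases: list of dicts with "id", "prompt", optional "expect_substring".
    """
    rows = []
    for case in cases:
        output = generate(case["prompt"])
        expected = case.get("expect_substring", "")
        # A case passes when the output is non-empty and contains the
        # expected substring (an empty expectation only checks non-empty).
        passed = bool(output.strip()) and expected.lower() in output.lower()
        rows.append(
            {"id": case["id"], "prompt": case["prompt"],
             "output": output, "passed": passed}
        )
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["id", "prompt", "output", "passed"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With a tiny smoke-test model, cases that only check for non-empty output are usually the honest choice; substring checks become meaningful once a real instruct model is loaded.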

Before running commands, write this file:

# Compute Route
goal: prove the open-source LLM deployment loop for one small project
route: local_cpu / free_colab / rented_gpu
selected_model:
runtime:
expected_runtime_limit:
privacy_level:
budget_limit:
stop_time:
fallback_route:
## Why this route
## What this route can prove
## What this route cannot prove yet
## Evidence to copy back
## Stop or rollback step

If stop_time, fallback_route, or evidence to copy back is empty, do not rent a GPU yet.
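That rule can be checked mechanically before any rental. The following is a minimal sketch, assuming the file keeps the `key: value` lines and `##` section headings shown in the template above; the function name missing_before_rental is hypothetical.

```python
REQUIRED_FIELDS = ("stop_time", "fallback_route")
REQUIRED_SECTION = "Evidence to copy back"


def missing_before_rental(text):
    """Return the required entries that are still empty in the route file."""
    fields = {}
    sections = {}
    current = None  # name of the "##" section being read, if any
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            current = stripped[3:].strip()
            sections[current] = []
        elif stripped.startswith("#"):
            current = None  # top-level title, not a section body
        elif current is not None:
            if stripped:
                sections[current].append(stripped)
        elif ":" in stripped:
            # Split on the first colon only, so "stop_time: 18:30" works.
            key, _, value = stripped.partition(":")
            fields[key.strip()] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if not sections.get(REQUIRED_SECTION):
        missing.append(REQUIRED_SECTION)
    return missing
```

An empty list means the three entries are filled in; anything else names what to write before renting.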

Use the local CPU route first. It is enough to complete most of 13.2 Hands-on: Run and Serve an Open-Source LLM with the default tiny model.

mkdir openllm_lab
cd openllm_lab
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install "torch" "transformers>=4.41" "accelerate" "safetensors" "sentencepiece" "fastapi" "uvicorn"

Then run the lab with the default smoke-test model:

python environment_report.py
python run_local_llm.py
python eval_openllm.py
uvicorn serve_openai_like:app --host 127.0.0.1 --port 8000

Stop with Ctrl+C. Your pass condition is not quality; it is whether the environment, inference, evaluation, API, and stop path all work.
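The environment check is the first piece of evidence. The lab's own environment_report.py lives on the hands-on page; what follows is only a rough sketch of what such a report might collect, using the standard library and treating torch as optional. The function names and the report fields are assumptions for illustration.

```python
import importlib.util
import platform
import shutil
import sys


def collect_environment():
    """Gather basic facts about the machine without requiring torch."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "free_disk_gb": round(shutil.disk_usage(".").free / 1e9, 1),
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the report works without it

        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


def write_report(path="environment_report.txt"):
    """Write one "key: value" line per fact and return the collected dict."""
    report = collect_environment()
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in report.items():
            handle.write(f"{key}: {value}\n")
    return report
```

Calling write_report() in the lab folder produces a plain-text file that can be copied back as evidence regardless of which route produced it.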

Use the local CPU route when you want to change code quickly. Leave model quality claims for a better model and a fixed evaluation set.

Use the free Colab route when you want a hosted notebook and a GPU may be available. Do not assume a GPU will always be assigned.

In the notebook:

!python -V
!nvidia-smi || true
!python -m pip install -U pip
!python -m pip install "torch" "transformers>=4.41" "accelerate" "safetensors" "sentencepiece"

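Then copy the local inference and evaluation code from the hands-on page into the notebook. The commands below assume the code is saved as script files in the runtime's working directory; the leading ! runs each line as a shell command from a notebook cell. Start with:

!MODEL_ID="sshleifer/tiny-gpt2" python run_local_llm.py
!python eval_openllm.py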

If GPU is available and the notebook is stable, try a small instruct model:

!MODEL_ID="Qwen/Qwen2.5-0.5B-Instruct" python run_local_llm.py
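These commands switch models through the MODEL_ID environment variable. The real run_local_llm.py is on the hands-on page; the sketch below only illustrates how a script might read that variable and fall back to the smoke-test model. The function names and the max_new_tokens default are assumptions, and transformers is imported lazily so the helper can be tested without it.

```python
import os

DEFAULT_MODEL_ID = "sshleifer/tiny-gpt2"


def resolve_model_id(default=DEFAULT_MODEL_ID):
    """Return MODEL_ID from the environment, or the tiny default model."""
    return os.environ.get("MODEL_ID") or default


def generate(prompt, max_new_tokens=20):
    """Generate a short continuation with the selected model."""
    from transformers import pipeline  # requires the pip installs above

    generator = pipeline("text-generation", model=resolve_model_id())
    # do_sample=False uses greedy decoding, so repeated runs give the
    # same text for the same model and prompt.
    result = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    return result[0]["generated_text"]
```

Keeping the default in one place means the same script runs unchanged on CPU, in Colab, and on a rented GPU; only the environment variable differs.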

Keep these Colab-specific notes:

runtime_type:
gpu_visible: yes/no
notebook_url_or_copy:
install_cells:
first_run_output:
files_downloaded_back:
what_would_break_if_runtime_resets:
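Some of these notes can be filled in from code rather than by hand. As a rough sketch, assuming nvidia-smi on the PATH is an acceptable signal for a visible GPU, the helper below records the first three fields; gpu_visible and colab_notes are hypothetical names.

```python
import shutil
import subprocess


def gpu_visible():
    """Return "yes" if nvidia-smi lists at least one GPU, else "no"."""
    if shutil.which("nvidia-smi") is None:
        return "no"
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return "no"
    listed = result.returncode == 0 and result.stdout.strip()
    return "yes" if listed else "no"


def colab_notes(runtime_type, notebook_copy=""):
    """Build the first lines of the Colab notes as plain text."""
    lines = [
        f"runtime_type: {runtime_type}",
        f"gpu_visible: {gpu_visible()}",
        f"notebook_url_or_copy: {notebook_copy}",
    ]
    return "\n".join(lines)
```

The remaining fields, especially what would break if the runtime resets, still need a human answer.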

Do not put private documents, secrets, or long-running serving workloads into a free notebook. If you need stable serving, use the rented GPU route or a controlled local/server environment.

Rent a GPU only after the local CPU or Colab path has produced a working evidence bundle. A rented machine should answer one bounded question, such as:

  • Can a 7B-class instruct model serve through vLLM?
  • Does the fixed eval set pass on a larger model?
  • What latency and memory do we observe for this route?

Write gpu_plan.md first:

# GPU Plan
goal:
model:
runtime:
instance_vram:
disk:
region:
hourly_budget:
hard_stop_time:
ports_to_open:
access_method: SSH key
evidence_to_copy_back:
shutdown_proof:
fallback_if_oom:
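The hourly price and the hard stop time together bound the spend, and that arithmetic is worth doing before the instance starts. A minimal sketch, assuming a flat hourly rate with no storage or egress charges; the function names and any numbers used with them are hypothetical.

```python
from datetime import datetime


def worst_case_cost(hourly_rate, start, hard_stop):
    """Cost if the instance runs from start until the hard stop."""
    hours = (hard_stop - start).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("hard_stop_time must be after the start time")
    return round(hourly_rate * hours, 2)


def within_budget(hourly_rate, start, hard_stop, budget_limit):
    """True when the worst-case cost does not exceed the written budget."""
    return worst_case_cost(hourly_rate, start, hard_stop) <= budget_limit
```

For example, a hypothetical rate of 1.20 per hour from 14:00 to 16:30 gives 1.20 × 2.5 = 3.00 as the worst case, which is the number to compare against the budget in the plan.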

On the remote machine:

python -V
nvidia-smi
df -h
python -m pip install -U pip
python -m pip install "vllm"

Bind to localhost first:

vllm serve Qwen/Qwen2.5-0.5B-Instruct --host 127.0.0.1 --port 8000

From your local machine, connect through an SSH tunnel:

ssh -L 8000:127.0.0.1:8000 user@your-gpu-host

Then test the OpenAI-compatible endpoint:

curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"messages": [{"role": "user", "content": "Give one deployment rule for a rented GPU."}]
}'
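The same request can be sent from Python when you also want a latency figure and a request/response log to copy back. This is a sketch using only the standard library; it assumes the SSH tunnel above is open, and the helper names, the max_tokens value, and the log file name request_response_log.jsonl are illustrative choices.

```python
import json
import time
import urllib.request

ENDPOINT = "http://127.0.0.1:8000/v1/chat/completions"


def build_payload(model, prompt, max_tokens=64):
    """Build a chat-completions request body for one user message."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def timed_request(payload, url=ENDPOINT, timeout=60):
    """POST the payload and return request, response, and wall-clock latency."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    started = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    latency = time.perf_counter() - started
    return {
        "request": payload,
        "response": reply,
        "latency_seconds": round(latency, 3),
    }


def append_log(record, path="request_response_log.jsonl"):
    """Append one JSON record per line so the log survives partial runs."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

A typical use is append_log(timed_request(build_payload("Qwen/Qwen2.5-0.5B-Instruct", "Give one deployment rule for a rented GPU."))). The latency here includes the tunnel round trip, so treat it as an end-to-end number rather than pure model time.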

After the test, copy back the evidence and stop or destroy the instance. A successful model demo that keeps billing silently is still a failed engineering run.

Fill this before continuing:

I will use _____ because _____.
This route can prove _____.
This route cannot prove _____ yet.
I will stop or fall back when _____.
The evidence I must copy back is _____.

How to judge your answer

A strong answer names constraints instead of enthusiasm. For example: local CPU can prove the code path but not service throughput; Colab can test a notebook path but cannot guarantee GPU availability; rented GPU can test serving but needs budget, SSH, ports, and shutdown proof. If the answer only says “because it is faster,” the route decision is not complete.

  • Compute route: local_cpu / free_colab / rented_gpu and why
  • Environment: Python, torch, CUDA/MPS/CPU, disk, runtime reset risk
  • Budget or limit: free quota caveat or rental stop time
  • Security: private data policy, secrets policy, exposed ports
  • First run: model, command, prompt, output, latency or memory note
  • Stop proof: Ctrl+C, notebook saved, or rented instance stopped
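Before moving on, it can help to confirm that the evidence files actually exist and are not empty. A small sketch, assuming the local CPU evidence names listed earlier on this page and a single lab folder; missing_evidence is a hypothetical helper.

```python
from pathlib import Path

# Evidence file names taken from the local CPU route on this page.
LOCAL_CPU_EVIDENCE = (
    "environment_report.txt",
    "first_run.md",
    "eval_results.csv",
)


def missing_evidence(folder, required=LOCAL_CPU_EVIDENCE):
    """Return the required files that are absent or empty in the folder."""
    root = Path(folder)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty result only shows that the files are present; whether they contain a usable trace is still a matter of reading them.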

You pass this lesson when you can choose one compute route, explain what it can and cannot prove, run the environment check, and name the exact stop or fallback step before moving to the hands-on lab.