
6.4.3 LSTM and GRU

  • Explain why plain RNNs struggle with long dependencies.
  • Understand LSTM cell state c_t and hidden state h_t.
  • Interpret forget, input, output, update, and reset gates.
  • Run PyTorch nn.LSTM and nn.GRU shape checks.
  • Train a tiny gated recurrent model on a memory task.

[Figure: LSTM gated memory flow diagram]

Read the picture like this:

old memory → gate decides what stays → new information enters → output exposes part of memory

A gate is a learned value between 0 and 1.

Gate value    Meaning
close to 0    mostly block the information
close to 1    mostly let the information pass
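
One common way to produce such a value is to pass a real-valued score through a sigmoid, which is also what the scalar labs in this section do. The sketch below is only an illustration: the helper name `gate`, the scores, and the `signal` value are made up for the example.

```python
import math

def gate(score: float) -> float:
    """Squash any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

signal = 2.0  # illustrative value that the gate scales
for score in (-6.0, 0.0, 6.0):
    g = gate(score)
    # Multiplying by the gate blocks or passes the signal softly.
    print(f"score={score:+.1f} gate={g:.4f} passed={g * signal:.4f}")
```

A strongly negative score gives a gate near 0 (almost nothing passes), and a strongly positive score gives a gate near 1 (almost everything passes).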

This is the practical difference from a plain RNN: memory is no longer just overwritten at every step.

Plain RNNs summarize the past into one hidden state. That works for short sequences, but long sequences create two problems:

Problem                              Intuition
early information gets washed out    the hidden state is rewritten many times
gradients vanish                     the training signal becomes weak when backpropagated far back in time
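
The vanishing-gradient problem can be made concrete with a scalar RNN. The sketch below assumes the recurrence h_t = tanh(w * h_{t-1}) with no inputs and an illustrative weight w = 0.9; the function name `gradient_through_time` is hypothetical. It multiplies the per-step chain-rule factors to get the size of the gradient that reaches the first step.

```python
import math

def gradient_through_time(w: float, steps: int, h0: float = 0.5) -> float:
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)

for steps in (1, 5, 20, 50):
    print(f"steps={steps:>2} grad={gradient_through_time(0.9, steps):.6f}")
```

Each factor is at most 0.9 here, so the product shrinks at least geometrically with the number of steps. That is the weak training signal the table describes.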

LSTM and GRU are not “deeper RNNs.” They are memory-control designs.

[Figure: LSTM gated information flow control diagram]

An LSTM keeps two states:

State    Role
c_t      cell state, the longer-term memory path
h_t      hidden state, the output exposed at the current step

The three main gates:

Gate           Question it answers
forget gate    how much old memory should stay?
input gate     how much new information should be written?
output gate    how much memory should be exposed now?
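
For reference, the three gates and the two states are commonly written in the following form. The weight and bias symbols (W, U, b) are notation introduced here, not names used elsewhere in this section; σ is the sigmoid and ⊙ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```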

This small scalar version keeps the idea visible without matrix notation.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

c_prev = 0.8                  # previous cell state c_{t-1}
forget_gate = sigmoid(1.0)    # how much of c_prev to keep
input_gate = sigmoid(0.2)     # how much of the candidate to write
output_gate = sigmoid(0.7)    # how much memory to expose as h_t
c_tilde = np.tanh(0.9)        # candidate memory

c_t = forget_gate * c_prev + input_gate * c_tilde
h_t = output_gate * np.tanh(c_t)

print("scalar_lstm_lab")
for name, value in [
    ("forget_gate", forget_gate),
    ("input_gate", input_gate),
    ("output_gate", output_gate),
    ("c_t", c_t),
    ("h_t", h_t),
]:
    print(f"{name:<12} {float(value):.4f}")

Expected output:

scalar_lstm_lab
forget_gate  0.7311
input_gate   0.5498
output_gate  0.6682
c_t          0.9787
h_t          0.5028

Read the update as:

new cell memory = keep part of old memory + write part of new candidate

That is the core of LSTM.
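
To see why this update helps over many steps, the same rule can be applied repeatedly. The sketch below is a simplification under stated assumptions: the gates and the candidate are held fixed (in a real LSTM they are recomputed at every step), and the function name `run_cell` and all numbers are illustrative.

```python
def run_cell(forget_gate: float, input_gate: float, steps: int = 20) -> float:
    """Apply c_t = f * c_{t-1} + i * candidate for `steps` steps with fixed gates."""
    c = 1.0            # memory written at step 0
    candidate = -0.5   # illustrative distractor written at every later step
    for _ in range(steps):
        c = forget_gate * c + input_gate * candidate
    return c

kept = run_cell(forget_gate=0.99, input_gate=0.01)
overwritten = run_cell(forget_gate=0.10, input_gate=0.90)
print(f"mostly keep  : c = {kept:.4f}")
print(f"mostly write : c = {overwritten:.4f}")
```

With a forget gate near 1 and an input gate near 0, most of the original memory survives 20 steps. With the gates reversed, the cell state collapses onto the distractor value, which is roughly how a plain RNN treats its hidden state.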

GRU has fewer moving parts than LSTM. It does not keep a separate cell state. The hidden state carries the memory.

Gate           Role
update gate    controls how much old state and new candidate are mixed
reset gate     controls how much old state is used when making the candidate
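
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

h_prev = 0.7                   # previous hidden state h_{t-1}
x_t = 1.1                      # current input
update_gate = sigmoid(0.8)     # how much of the candidate to mix in
reset_gate = sigmoid(-0.3)     # how much of h_prev the candidate sees

h_candidate = np.tanh(x_t + reset_gate * h_prev)
# Convention used here: update_gate weights the candidate.
# Some references, including PyTorch's nn.GRU docs, swap the two weights.
h_t = (1 - update_gate) * h_prev + update_gate * h_candidate

print("scalar_gru_lab")
for name, value in [
    ("update_gate", update_gate),
    ("reset_gate", reset_gate),
    ("h_candidate", h_candidate),
    ("h_t", h_t),
]:
    print(f"{name:<12} {float(value):.4f}")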

Expected output:

scalar_gru_lab
update_gate  0.6900
reset_gate   0.4256
h_candidate  0.8849
h_t          0.8276
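
In vector form, the scalar GRU lab corresponds to the equations below. The weight and bias symbols (W, U, b) are notation introduced here. The last line follows this section's convention, where the update gate weights the candidate; some references, including PyTorch's nn.GRU documentation, swap the roles of z_t and 1 - z_t.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{update gate} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{reset gate} \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) && \text{candidate} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{hidden state}
\end{aligned}
```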

Quick memory aid:

LSTM = more explicit memory management
GRU = lighter gated memory management
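
"Lighter" can be put into numbers by counting parameters. The sketch below assumes a single-layer, one-direction cell laid out the way PyTorch stores it (input weights, hidden weights, and two bias vectors per gate block); the helper name `recurrent_param_count` is hypothetical, and the sizes are chosen only for illustration.

```python
def recurrent_param_count(input_size: int, hidden_size: int, num_gate_blocks: int) -> int:
    """Count parameters for one recurrent layer.

    Assumes each block has input weights, hidden weights, and two bias vectors.
    """
    per_block = input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
    return num_gate_blocks * per_block

# LSTM: forget, input, and output gates plus the candidate -> 4 blocks.
# GRU: update and reset gates plus the candidate -> 3 blocks.
lstm_params = recurrent_param_count(8, 16, num_gate_blocks=4)
gru_params = recurrent_param_count(8, 16, num_gate_blocks=3)
print("lstm_params", lstm_params)  # 1664
print("gru_params ", gru_params)   # 1248
```

At the same sizes, the GRU has three quarters of the LSTM's parameters, which is one reason it is a common pick when the model budget is small.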

import torch
from torch import nn

torch.manual_seed(42)

# batch of 4 sequences, 6 time steps each, 8 features per step
x = torch.randn(4, 6, 8)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# out holds the hidden state at every step; h and c are the final states
lstm_out, (lstm_h, lstm_c) = lstm(x)
gru_out, gru_h = gru(x)

print("shape_lab")
print("lstm_out:", tuple(lstm_out.shape))
print("lstm_h :", tuple(lstm_h.shape))
print("lstm_c :", tuple(lstm_c.shape))
print("gru_out :", tuple(gru_out.shape))
print("gru_h :", tuple(gru_h.shape))

Expected output:

shape_lab
lstm_out: (4, 6, 16)
lstm_h : (1, 4, 16)
lstm_c : (1, 4, 16)
gru_out : (4, 6, 16)
gru_h : (1, 4, 16)

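The visible API difference is in the second return value. Both modules return the per-step output first:

  • nn.LSTM returns (out, (h, c)): the final hidden state and the final cell state;
  • nn.GRU returns (out, h): the final hidden state only.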
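
A quick way to keep out and h apart is to check how they relate. The sketch below assumes the same single-layer, one-direction, batch_first=True setup as the shape lab; with stacked or bidirectional layers the relationship is different.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 6, 8)  # (batch, seq_len, features)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

lstm_out, (lstm_h, lstm_c) = lstm(x)
gru_out, gru_h = gru(x)

# With one layer and one direction, the last time step of `out`
# holds the same values as the final hidden state.
lstm_match = torch.allclose(lstm_out[:, -1, :], lstm_h[-1])
gru_match = torch.allclose(gru_out[:, -1, :], gru_h[-1])
print("lstm last step equals h:", lstm_match)
print("gru last step equals h :", gru_match)
```

The cell state lstm_c has the same shape as lstm_h, but it is the internal memory and does not appear in out.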

The label depends on the first value in the sequence. The middle values are noisy, so the model must keep early information.

import torch
from torch import nn

torch.manual_seed(42)

def build_dataset(n=160, seq_len=10):
    X, y = [], []
    for _ in range(n):
        # The first value carries the label; the rest is low-amplitude noise.
        first = 1.0 if torch.rand(1).item() > 0.5 else -1.0
        seq = torch.randn(seq_len, 1) * 0.25
        seq[0, 0] = first
        X.append(seq)
        y.append(1 if first > 0 else 0)
    return torch.stack(X), torch.tensor(y)

X, y = build_dataset()

class GRUClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=8, batch_first=True)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        out, h = self.gru(x)
        return self.fc(h[-1])  # classify from the final hidden state

model = GRUClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)

# Full-batch training: one optimizer step per epoch.
for epoch in range(1, 81):
    logits = model(X)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch == 1 or epoch % 20 == 0:
        acc = (logits.argmax(1) == y).float().mean().item()
        print(f"memory epoch={epoch:02d} loss={loss.item():.4f} acc={acc:.3f}")

with torch.no_grad():
    final_acc = (model(X).argmax(1) == y).float().mean().item()
print("final_acc", round(final_acc, 3))

Expected output (exact numbers can differ across PyTorch versions and hardware):

memory epoch=01 loss=0.7465 acc=0.431
memory epoch=20 loss=0.6691 acc=0.569
memory epoch=40 loss=0.0023 acc=1.000
memory epoch=60 loss=0.0001 acc=1.000
memory epoch=80 loss=0.0001 acc=1.000
final_acc 1.0

[Figure: LSTM and GRU memory lab result map]

This toy task is small, but it captures the reason gated recurrent models exist: the model needs to preserve useful early information through noisy later steps.
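
The accuracy above is measured on the training sequences. A held-out check could look like the sketch below, which rebuilds the same task with a vectorized `build_dataset` and samples a second, unseen set. The split sizes and the seed are assumptions, so the printed numbers will not match the lab output above.

```python
import torch
from torch import nn

torch.manual_seed(0)

def build_dataset(n, seq_len=10):
    """Label is 1 when the first value is +1.0, else 0; later steps are noise."""
    first = torch.rand(n).gt(0.5).float() * 2 - 1   # +1.0 or -1.0 per sequence
    X = torch.randn(n, seq_len, 1) * 0.25
    X[:, 0, 0] = first
    return X, (first > 0).long()

X_train, y_train = build_dataset(160)
X_val, y_val = build_dataset(80)   # fresh sequences the model never trains on

class GRUClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=8, batch_first=True)
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        _, h = self.gru(x)
        return self.fc(h[-1])

model = GRUClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)

for _ in range(80):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

def accuracy(X, y):
    with torch.no_grad():
        return (model(X).argmax(1) == y).float().mean().item()

train_acc = accuracy(X_train, y_train)
val_acc = accuracy(X_val, y_val)
print(f"train_acc={train_acc:.3f} val_acc={val_acc:.3f}")
```

If val_acc stays close to train_acc, the model has likely learned the rule rather than memorized the 160 training sequences.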

Keep one gated-memory note:

  • LSTM state: returns hidden state h and cell state c
  • GRU state: returns hidden state h only
  • Gate meaning: values near 0 block, values near 1 pass
  • Memory task: label depends on the first time step
  • Result: final_acc reaches 1.0 on the toy memory task
  • Limit: validate on held-out sequences before trusting the architecture

Situation                                 Good starting point
quick baseline                            GRU
small model budget                        GRU
long dependency is central                LSTM and GRU are both worth trying
you need explicit cell state intuition    LSTM
modern long text tasks                    often a Transformer instead

In practice, compare validation results. Architecture names are less important than whether the model fits the data and deployment constraints.

Mistake                                           Fix
thinking LSTM/GRU are just deeper RNNs            think “memory control,” not depth
confusing out, h, and c                           out is per step, h is the final hidden state, c is the LSTM cell state
assuming gates never forget important info        gates are learned and can still fail
using high learning rate on unstable sequences    lower the LR, clip gradients if needed
using only training accuracy                      validate on held-out sequences
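
The "lower the LR, clip gradients" fix can be sketched with PyTorch's torch.nn.utils.clip_grad_norm_. Everything else below is illustrative: the deliberately large inputs, the learning rate of 0.003, and max_norm=1.0 are assumptions, not tuned values.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.GRU(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 2)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=0.003)  # lower than the 0.03 used in the memory task

X = torch.randn(16, 10, 1) * 50.0   # deliberately large inputs (illustrative)
y = torch.randint(0, 2, (16,))

optimizer.zero_grad()
_, h = model(X)
loss = nn.functional.cross_entropy(head(h[-1]), y)
loss.backward()

max_norm = 1.0
# clip_grad_norm_ rescales gradients in place and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
optimizer.step()
print(f"grad norm before={norm_before.item():.4f} after={norm_after.item():.4f}")
```

Clipping goes between loss.backward() and optimizer.step(), so the optimizer only ever sees the rescaled gradients.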

  1. Change forget_gate in Lab 1 (the scalar LSTM lab) by replacing sigmoid(1.0) with sigmoid(-1.0). How does c_t change?
  2. Change the memory task so the label depends on the last value. Is it easier?
  3. Replace GRUClassifier with an LSTMClassifier and compare the output API.
  4. Increase seq_len from 10 to 30. Does training become harder?
  5. Explain why GRU has fewer states than LSTM but can still work well.
Reference implementation and walkthrough
  1. sigmoid(-1.0) ≈ 0.27 is smaller than sigmoid(1.0) ≈ 0.73, so less previous cell memory is kept. c_t drops from 0.9787 to about 0.609 and relies more on the new candidate.
  2. If the label depends on the last value, the task is usually easier because the model does not need to preserve early information for many steps.
  3. GRU returns an output sequence and final hidden state; LSTM returns an output sequence plus (h_n, c_n). The classifier must unpack the LSTM tuple correctly.
  4. Longer sequences can make training harder because memory must be preserved longer and gradients travel through more steps.
  5. GRU combines memory control into a lighter state design. It can work well when the task does not need the extra separation between cell state and hidden state.
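
For exercise 3, one possible LSTM counterpart of GRUClassifier is sketched below. The class name LSTMClassifier comes from the exercise; the hidden_size argument and the random test batch are assumptions added for the example.

```python
import torch
from torch import nn

class LSTMClassifier(nn.Module):
    """LSTM version of the memory-task classifier (illustrative sketch)."""

    def __init__(self, hidden_size=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, x):
        # nn.LSTM returns (out, (h_n, c_n)); nn.GRU returns (out, h_n).
        out, (h_n, c_n) = self.lstm(x)
        return self.fc(h_n[-1])   # classify from the final hidden state

model = LSTMClassifier()
logits = model(torch.randn(4, 10, 1))   # 4 sequences, 10 steps, 1 feature
print(tuple(logits.shape))              # (4, 2)
```

The only change from the GRU version is the tuple unpacking in forward. The training loop can stay the same because both classifiers return logits of shape (batch, 2).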
  • LSTM and GRU add gates to control memory flow.
  • LSTM has both c_t and h_t; GRU uses a lighter hidden-state design.
  • Gates are learned soft switches between 0 and 1.
  • Use validation results to choose between LSTM and GRU.
  • Gated recurrent models are an important bridge from plain RNNs to attention-based sequence modeling.