LSTM and GRU: the Gates That Let a Network Remember

Vanilla RNNs forget the distant past: gradients vanish over long sequences, so early inputs stop influencing what the network learns. LSTMs mitigate this with a protected "cell state" and three gates. Drop a memory in early, feed noise, and watch the LSTM still hold it 20 steps later while a plain RNN bleeds it away.
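
To see the vanishing gradient in numbers, here is a minimal sketch of a one-unit vanilla RNN. The recurrent weight (0.9), the input weight (1.0), and the sine-based "noise" are hypothetical values picked for illustration, not taken from the demo. The function tracks how much the final hidden state still depends on the initial one.

```javascript
// Scalar vanilla RNN: h_t = tanh(w * h_{t-1} + u * x_t).
// By the chain rule, d h_t / d h_0 is the product of w * (1 - h_k^2) over steps 1..t.
function rnnGradientThroughTime(w, u, inputs) {
  let h = 0;
  let grad = 1; // d h_0 / d h_0
  const history = [];
  for (const x of inputs) {
    h = Math.tanh(w * h + u * x);
    grad *= w * (1 - h * h); // tanh'(z) = 1 - tanh(z)^2
    history.push(grad);
  }
  return history;
}

// Assumed setup: 20 steps of small deterministic pseudo-noise.
const noiseInputs = Array.from({ length: 20 }, (_, t) => 0.5 * Math.sin(t * 12.9898));
const gradHistory = rnnGradientThroughTime(0.9, 1.0, noiseInputs);
console.log(gradHistory.map((g) => g.toExponential(2)).join(" "));
```

Every factor is at most 0.9 in size, so after 20 steps the gradient can be no larger than 0.9^20 (about 0.12), and the tanh derivative usually shrinks it much further.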

🚪 See who remembers: https://dev48v.infy.uk/dl/day11-lstm.html

The cell state: a conveyor belt

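The LSTM adds a separate cell state c that flows down the sequence almost untouched. It is edited only by gentle, gated operations (an elementwise multiply and an add) rather than being pushed through a weight matrix and a squashing nonlinearity at every step. That is why information (and gradients) can survive long gaps.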
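
Written as equations, the conveyor belt looks like this. The notation is assumed here rather than taken from the post: f_t, i_t and g_t are the forget gate, input gate and candidate values described in the next section, ⊙ is elementwise multiplication, and W_h, W_x are a plain RNN's recurrent and input weights. The first gradient is an approximation that follows only the direct path through the cell state and treats the gates as constants.

```latex
% LSTM cell state: gated multiply-and-add
c_t = f_t \odot c_{t-1} + i_t \odot g_t
\quad\Longrightarrow\quad
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)

% Plain RNN hidden state: weight matrix and tanh at every step
h_t = \tanh(W_h h_{t-1} + W_x x_t)
\quad\Longrightarrow\quad
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^{2}\right) W_h
```

Over many steps the LSTM path multiplies forget-gate values, which the network can hold near 1, while the RNN path multiplies by W_h and a tanh derivative each time.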

The three gates (each a sigmoid 0→1)

// Pseudocode for one time step: [h,x] is the previous hidden state joined with the current input; biases omitted.
const f = sigmoid(Wf·[h,x]);  c = f * c;           // FORGET: keep (~1) or erase (~0)
const i = sigmoid(Wi·[h,x]);  const g = tanh(Wg·[h,x]);
c = c + i * g;                                       // INPUT: write new info selectively
const o = sigmoid(Wo·[h,x]);  h = o * tanh(c);       // OUTPUT: reveal a filtered view

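Forget gate ≈ 1 (with the input gate ≈ 0) → the memory rides along almost untouched (no decay). Gradients flowing back along the cell state are scaled by that same gate value, which is the main remedy for vanishing gradients.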
Input gate decides what new fact to latch (a subject, a flag) and what to ignore.
Output gate lets the cell hold a fact quietly for many steps and surface it only when relevant.
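
The gate behaviour above can be tried out with a runnable sketch of a single scalar LSTM unit. Everything specific here is an assumption made for illustration: the parameters are hand-picked rather than trained, the input is a hypothetical pair [value, writeFlag], the plain RNN uses a recurrent weight of 0.5, and the noise is a deterministic sine sequence. The sketch stores either 1 or 0 at step 0, feeds 20 steps of identical noise, and measures how far apart the two runs end up.

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// One LSTM step for a single scalar unit. Input x = [value, writeFlag].
// Each gate's parameters are [weight on h, weight on value, weight on writeFlag, bias].
function lstmStep(state, x, p) {
  const v = [state.h, x[0], x[1], 1]; // [h, x] plus a constant 1 for the bias
  const pre = (w) => w.reduce((sum, wk, k) => sum + wk * v[k], 0);
  const f = sigmoid(pre(p.f));   // forget gate
  const i = sigmoid(pre(p.i));   // input gate
  const g = Math.tanh(pre(p.g)); // candidate value
  const o = sigmoid(pre(p.o));   // output gate
  const c = f * state.c + i * g; // gated multiply-and-add on the cell state
  const h = o * Math.tanh(c);    // filtered view of the cell
  return { h, c };
}

// Hand-picked (not trained) parameters, chosen so each gate's role is visible.
const params = {
  f: [0, 0, 0, 6],   // forget gate stays near 1 (sigmoid(6) ≈ 0.9975)
  i: [0, 0, 12, -6], // input gate opens only when writeFlag = 1
  g: [0, 2, 0, 0],   // candidate is tanh(2 * value)
  o: [0, 0, 0, 6],   // output gate stays open
};

// Plain RNN for comparison (assumed recurrent weight 0.5, input weight 1).
const rnnStep = (h, value) => Math.tanh(0.5 * h + value);

function finalStates(memory) {
  let lstm = lstmStep({ h: 0, c: 0 }, [memory, 1], params); // step 0: write the memory
  let rnn = rnnStep(0, memory);
  for (let t = 1; t <= 20; t++) {
    const noise = Math.sin(t * 12.9898); // deterministic pseudo-noise in [-1, 1]
    lstm = lstmStep(lstm, [noise, 0], params); // writeFlag = 0: input gate nearly shut
    rnn = rnnStep(rnn, noise);
  }
  return { lstmCell: lstm.c, rnnHidden: rnn };
}

const withMemory = finalStates(1);
const withoutMemory = finalStates(0);
const lstmGap = Math.abs(withMemory.lstmCell - withoutMemory.lstmCell);
const rnnGap = Math.abs(withMemory.rnnHidden - withoutMemory.rnnHidden);
console.log({ lstmGap, rnnGap });
```

With these particular numbers the LSTM's cell state still separates the two runs by roughly 0.9 after 20 noisy steps, while the plain RNN's hidden state differs by less than a millionth: the stored value has effectively been overwritten. A trained network would have to learn gate weights that behave this way; the sketch only shows that the mechanism allows it.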

And GRU

A leaner cousin: two gates (update and reset) instead of three, with the cell and hidden state merged into one. It is often as good as an LSTM with fewer parameters and less compute. LSTMs and GRUs powered translation, speech, and text generation for years, and were the bridge from RNNs to attention (the Transformer).
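
For comparison with the LSTM step, here is a minimal sketch of one GRU step for a single scalar unit. The parameter layout and the hand-picked values are assumptions for illustration. Note that sources differ on which side of the blend the update gate z multiplies; this sketch uses the convention where z near 1 means "overwrite".

```javascript
const logistic = (z) => 1 / (1 + Math.exp(-z));

// One GRU step for a single scalar unit.
// Each parameter triple is [weight on h, weight on x, bias].
function gruStep(h, x, p) {
  const z = logistic(p.z[0] * h + p.z[1] * x + p.z[2]); // update gate
  const r = logistic(p.r[0] * h + p.r[1] * x + p.r[2]); // reset gate
  const candidate = Math.tanh(p.c[0] * (r * h) + p.c[1] * x + p.c[2]);
  // Single merged state: z near 0 keeps the old value, z near 1 overwrites it.
  return (1 - z) * h + z * candidate;
}

// Hand-picked (not trained) parameters: same reset/candidate weights, different update bias.
const base = { r: [0, 0, 0], c: [1, 1, 0] };
const holdParams = { ...base, z: [0, 0, -8] };     // update gate nearly closed
const overwriteParams = { ...base, z: [0, 0, 8] }; // update gate nearly open

let held = 0.9;
for (let t = 1; t <= 20; t++) {
  held = gruStep(held, Math.sin(t * 12.9898), holdParams); // 20 steps of pseudo-noise
}
const overwritten = gruStep(0.9, -1, overwriteParams);
console.log({ held, overwritten });
```

With the update gate nearly closed, the state stays close to its starting value of 0.9 through all 20 noisy steps; with it nearly open, a single step replaces the state with the candidate. One blend does the work of the LSTM's separate forget and input gates.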