Vanilla RNNs forget the distant past: gradients vanish over long sequences. LSTMs fix it with a protected "cell state" and three gates. Drop a memory early, feed noise, and watch the LSTM still hold it 20 steps later while a plain RNN bleeds it away.
See who remembers: https://dev48v.infy.uk/dl/day11-lstm.html
The cell state: a conveyor belt
The LSTM adds a separate cell state c that flows down the sequence almost untouched. It is edited only by gentle, gated operations (an elementwise scale and an add), not repeatedly multiplied by a weight matrix, so information (and gradients) survive long gaps.
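To make the conveyor-belt idea concrete, here is a minimal scalar sketch. Everything in it is an assumption for illustration: the function names `runPlainRnn` and `runCellState`, the weights `w` and `u`, and the fixed gate values `forget` and `write` (a real LSTM computes its gates from the input at every step). It measures how much the final state still depends on a memory stored at step 0, after 20 steps of noise.

```javascript
// Toy scalar models; w, u, forget, and write are illustrative values.

// Plain RNN: the whole state is re-weighted and squashed at every step.
function runPlainRnn(h0, inputs, w = 0.5, u = 0.5) {
  let h = h0;
  for (const x of inputs) h = Math.tanh(w * h + u * x);
  return h;
}

// Gated cell state: scaled by a forget gate near 1, plus a small gated write.
function runCellState(c0, inputs, forget = 0.99, write = 0.01) {
  let c = c0;
  for (const x of inputs) c = forget * c + write * Math.tanh(x);
  return c;
}

// Deterministic stand-in for noise, values in [-1, 1].
const noiseInputs = Array.from({ length: 20 }, (_, t) => Math.sin(12.9898 * (t + 1)));

// Same noise, two different starting memories (1 vs 0):
// the gap between the final states is what survives of the memory.
const rnnGap = Math.abs(runPlainRnn(1, noiseInputs) - runPlainRnn(0, noiseInputs));
const cellGap = Math.abs(runCellState(1, noiseInputs) - runCellState(0, noiseInputs));

console.log({ rnnGap, cellGap });
```

With these values the plain RNN's gap can be at most 0.5^20 (under a millionth), because tanh never stretches differences and the weight halves them each step. The cell state's gap is 0.99^20, about 0.82: most of the memory is still there.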
The three gates (each a sigmoid, 0 to 1)
// Pseudocode: W·[h,x] applies a weight matrix to the previous hidden state h joined with the input x; * is elementwise. Biases omitted.
const f = sigmoid(Wf·[h,x]); c = f * c; // FORGET: keep (~1) or erase (~0)
const i = sigmoid(Wi·[h,x]); const g = tanh(Wg·[h,x]);
c = c + i * g; // INPUT: write new info selectively
const o = sigmoid(Wo·[h,x]); h = o * tanh(c); // OUTPUT: reveal a filtered view
- Forget gate ≈ 1: the memory rides along untouched (no decay). That's the fix for vanishing gradients: along the cell state, the gradient is scaled by f at each step, so it barely shrinks.
- Input gate decides what new fact to latch (a subject, a flag) and what to ignore.
- Output gate lets the cell hold a fact quietly for many steps and surface it only when relevant.
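The three gates can be put together into a runnable step function. This is a sketch under stated assumptions, not a reference implementation: the helper `affine`, the function `lstmStep`, the parameter object `params`, and all weight values are hypothetical, and bias terms are added here although the pseudocode above has none. The weights are hand-set (not trained) so that the forget gate sits near 1 and the input gate near 0, which is the "hold the memory" regime described in the first bullet.

```javascript
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Multiply matrix W (array of rows) by vector v and add bias b.
function affine(W, b, v) {
  return W.map((row, r) => row.reduce((sum, w, k) => sum + w * v[k], b[r]));
}

// One LSTM step: takes previous h and c plus input x, returns the new h and c.
function lstmStep(p, h, c, x) {
  const hx = [...h, ...x];                                     // [h, x] joined
  const f = affine(p.Wf, p.bf, hx).map((v) => sigmoid(v));     // forget gate
  const i = affine(p.Wi, p.bi, hx).map((v) => sigmoid(v));     // input gate
  const g = affine(p.Wg, p.bg, hx).map((v) => Math.tanh(v));   // candidate values
  const o = affine(p.Wo, p.bo, hx).map((v) => sigmoid(v));     // output gate
  const cNext = c.map((cj, j) => f[j] * cj + i[j] * g[j]);     // keep, then write
  const hNext = cNext.map((cj, j) => o[j] * Math.tanh(cj));    // reveal
  return { h: hNext, c: cNext };
}

// Hand-set parameters for one hidden unit and one input (illustrative only).
const params = {
  Wf: [[0, 0]], bf: [6],   // forget gate ~ 1: keep the memory
  Wi: [[0, 0]], bi: [-6],  // input gate ~ 0: ignore the noise
  Wg: [[0, 1]], bg: [0],   // candidate is tanh(x)
  Wo: [[0, 0]], bo: [6],   // output gate ~ 1: always reveal
};

// Store a memory of 1 in the cell, then feed 20 steps of deterministic noise.
let lstmState = { h: [0], c: [1] };
for (let t = 0; t < 20; t++) {
  const x = [Math.sin(12.9898 * (t + 1))];
  lstmState = lstmStep(params, lstmState.h, lstmState.c, x);
}
console.log(lstmState.c[0]); // stays above 0.9 with these hand-set gates
```

Flipping `bf` to a large negative value erases the cell within a step or two, and raising `bi` lets the noise overwrite it. That is a quick way to feel what each gate contributes.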
And GRU
A leaner cousin: two gates, with the cell and hidden state merged into one, often as good with less compute. LSTMs and GRUs powered translation, speech, and text generation for years, forming the bridge from RNNs to attention (the Transformer).
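For comparison, a GRU step can be sketched the same way. The names `gruStep`, `gruParams`, and the helper `affine` are hypothetical, the weights are hand-set rather than trained, and the update convention assumed here is h = (1 - z) * h + z * candidate (some references swap the roles of z and 1 - z). There is no separate cell state: the update gate z does the work of both the forget and input gates.

```javascript
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Multiply matrix W (array of rows) by vector v and add bias b.
function affine(W, b, v) {
  return W.map((row, r) => row.reduce((sum, w, k) => sum + w * v[k], b[r]));
}

// One GRU step: takes previous h and input x, returns the new h.
function gruStep(p, h, x) {
  const hx = [...h, ...x];
  const z = affine(p.Wz, p.bz, hx).map((v) => sigmoid(v));      // update gate
  const r = affine(p.Wr, p.br, hx).map((v) => sigmoid(v));      // reset gate
  const rhx = [...h.map((hj, j) => r[j] * hj), ...x];           // [r * h, x]
  const cand = affine(p.Wh, p.bh, rhx).map((v) => Math.tanh(v)); // candidate state
  return h.map((hj, j) => (1 - z[j]) * hj + z[j] * cand[j]);    // blend old and new
}

// Hand-set parameters for one hidden unit and one input (illustrative only).
const gruParams = {
  Wz: [[0, 0]], bz: [-6],  // update gate ~ 0: keep the old state
  Wr: [[0, 0]], br: [0],   // reset gate at 0.5
  Wh: [[0, 1]], bh: [0],   // candidate is tanh(x)
};

// Store a memory of 1, then feed 20 steps of deterministic noise.
let gruH = [1];
for (let t = 0; t < 20; t++) {
  gruH = gruStep(gruParams, gruH, [Math.sin(12.9898 * (t + 1))]);
}
console.log(gruH[0]); // stays above 0.9 with these hand-set gates
```

One gate pair and one state vector instead of three gates and two states is where the compute saving comes from.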
The takeaway
A gated conveyor belt: keep, write, reveal. Memory that lasts. Test it.













