LSTM stands for Long Short-Term Memory. It is a type of Recurrent Neural Network (RNN), which places it firmly within the field of deep learning — a subset of machine learning that uses multi-layered neural networks to learn patterns from data.
```
Artificial Intelligence
└── Machine Learning
    └── Deep Learning
        ├── CNNs (images, spatial data)
        ├── Transformers (language models, attention-based)
        └── RNNs (sequential / time-series data)
            ├── Vanilla RNN (simple, suffers from vanishing gradients)
            ├── GRU (simplified gating, faster)
            └── LSTM (full gating mechanism, best for long sequences)
```
Standard neural networks treat each input independently — they have no concept of order or memory. But many real-world problems are sequential: stock prices, weather, language, sensor readings, music. The current value depends on what came before it.
Vanilla RNNs attempted to solve this by feeding each step's hidden state back into the network as input to the next step, but they suffer from the vanishing gradient problem — during training, gradients shrink exponentially as they propagate backward through time, making it practically impossible to learn long-range dependencies. An RNN trained on 365 days of data can effectively "forget" what happened 60+ days ago.
LSTM was introduced by Hochreiter & Schmidhuber in 1997 specifically to fix this. It uses a gating mechanism that allows it to selectively remember or forget information over long sequences — hundreds or even thousands of time steps.
Each LSTM cell has three gates and a cell state. Think of the cell state as a conveyor belt — information flows along it largely unchanged, and the gates decide what to add or remove.
```
                ┌───────────────────────────────────────────────┐
                │                   LSTM Cell                   │
                │                                               │
Cell State ─────┼──── × ────────────────── + ───────────────────┼──── Cell State
(long-term      │     │                    │                    │     (updated)
 memory)        │     │                    │                    │
                │  ┌──┴──┐        ┌────────┴────────┐  ┌──────┐ │
                │  │ Fg  │        │    Ig × Cand    │  │  Og  │ │
                │  │gate │        │   gate   gate   │  │ gate │ │
                │  └──┬──┘        └───┬─────────┬───┘  └──┬───┘ │
                │     │               │         │         │     │
                │     └───────┬───────┘         │   ┌─────┘     │
                │             │                 │   │           │
Hidden State ───┼─────────────┴─────────────────┴───┼───────────┼──── Hidden State
(short-term     │      [h(t-1), x(t)] concat        │ tanh()    │     (output)
 memory)        │             │                     │           │
                └─────────────┴─────────────────────────────────┘
                        Input: x(t)

Fg   = Forget Gate  → "What old info should I discard?"      σ (0 to 1)
Ig   = Input Gate   → "What new info is worth storing?"      σ (0 to 1)
Cand = Candidate    → "What are the new candidate values?"   tanh (-1 to 1)
Og   = Output Gate  → "What part of cell state to output?"   σ (0 to 1)
```
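To make the legend concrete, here is a minimal NumPy sketch of a single LSTM step. The weight matrix W, bias b, and the dimensions are illustrative, not taken from any library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps the concat [h(t-1), x(t)] to the
    four gate pre-activations; b is the matching bias."""
    z = np.concatenate([h_prev, x_t]) @ W + b   # shape: (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)             # forget gate: what old info to discard
    i = sigmoid(i)             # input gate: how much new info to store
    g = np.tanh(g)             # candidate: the new content itself
    o = sigmoid(o)             # output gate: what to expose
    c_t = f * c_prev + i * g   # additive cell-state update (the conveyor belt)
    h_t = o * np.tanh(c_t)     # hidden state: a gated view of the cell
    return h_t, c_t

# Illustrative dimensions: 8 input features, 4 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(4 + 8, 4 * 4))
b = np.zeros(4 * 4)
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=8), h, c, W, b)
```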
Here is a simple LSTM for time-series prediction using TensorFlow/Keras:
```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Suppose X_train has shape (samples, timesteps, features),
# e.g. 1000 samples, 60-day lookback, 8 features per day.
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(60, 8)),
    Dropout(0.2),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(128, return_sequences=False),  # last LSTM layer returns a single vector
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1),  # predict one value (e.g. tomorrow's price)
])

model.compile(optimizer='adam', loss='huber', metrics=['mae'])
model.fit(X_train, y_train, epochs=100, batch_size=64,
          validation_split=0.1)
```
Note that return_sequences=True is set on every LSTM layer except the last. Intermediate layers must emit their full output sequence so the next LSTM has a sequence to consume; the final layer emits only its last hidden state, which feeds the dense head.
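A quick shape check makes the distinction visible; this is a standalone sketch with a dummy input:

```python
import tensorflow as tf

x = tf.random.normal((1, 60, 8))   # (batch, timesteps, features)

seq_layer = tf.keras.layers.LSTM(128, return_sequences=True)
last_layer = tf.keras.layers.LSTM(128, return_sequences=False)

print(seq_layer(x).shape)    # (1, 60, 128): full sequence for the next LSTM
print(last_layer(x).shape)   # (1, 128): single vector for the Dense head
```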
Below are five targeted improvements that significantly impact prediction quality, especially when working with long lookback windows and volatile time-series data.
Mean Squared Error (MSE) squares every error, which means large price moves (crashes, spikes) dominate the loss function. The model learns to "play it safe" and revert toward historical averages to minimize those squared penalties.
Huber loss behaves like MSE for small errors but switches to linear (MAE-like) behavior for large errors, controlled by a delta threshold. This makes the model robust to outlier moves without ignoring them entirely.
```python
import tensorflow as tf

# MSE:   loss = (y_true - y_pred)²   ← large errors get squared
# MAE:   loss = |y_true - y_pred|    ← linear, but not smooth at 0
# Huber: best of both worlds

# delta controls the switchover point
loss_fn = tf.keras.losses.Huber(delta=1.0)
model.compile(optimizer='adam', loss=loss_fn, metrics=['mae'])
```
```
Loss
 │       MSE ╱
 │          ╱
 │         ╱           Huber: quadratic near zero,
 │        ╱            linear for large errors
 │       ╱    ╱ Huber
 │      ╱   ╱
 │     ╱  ╱
 │    ╱  ╱      ╱ MAE
 │╱╱╱╱ ╱╱   ╱╱╱╱
 └─────┬───────────── Error
       δ
```
With a long lookback (e.g. 1460 days), the model treats all historical data equally. But market conditions from 3 years ago may be irrelevant today. Sample weighting lets you tell the model: "recent data matters more."
```python
import numpy as np

def make_sample_weights(n_samples, method='linear'):
    """Weight recent training samples more heavily."""
    t = np.linspace(0, 1, n_samples)
    if method == 'linear':
        w = 0.5 + 0.5 * t      # range: 0.5 → 1.0
    elif method == 'exponential':
        w = np.exp(3 * t)      # ~20× heavier at the end
    else:
        return None
    w /= w.mean()              # normalise so mean weight = 1
    return w

weights = make_sample_weights(len(X_train), method='linear')
model.fit(X_train, y_train,
          sample_weight=weights,   # ← pass to .fit()
          epochs=100, batch_size=64)
```
```
Weight
 1.0 │                          ╱╱
     │                      ╱╱╱
     │                  ╱╱╱        linear
     │              ╱╱╱
 0.5 │ ╱╱╱╱╱╱╱╱╱╱╱╱
     │
     └──────────────────────────────── Time
        oldest                  newest
        sample                  sample
```
When data grows exponentially over time, MinMaxScaler compresses early values into a tiny band near zero and recent values near one. The LSTM struggles to learn from this distorted distribution. Log returns transform multiplicative price changes into additive ones, producing a roughly stationary series that the network can learn from much more effectively.
```python
import numpy as np
import pandas as pd

# Raw price:          [100, 110, 105, 120, 115]
# After MinMaxScaler: compressed, non-stationary
# Log returns:        capture the *rate of change* regardless of price level
df['log_returns'] = np.log(df['price'] / df['price'].shift(1))

# A 10% gain at $100 and a 10% gain at $100,000
# both produce log_return ≈ 0.0953.
# Without log returns, the $100 move is invisible to the scaler.
```
```
Raw price (exponential growth):       Log returns (stationary):

 │              ╱                      0.1 │  ╷    ╷    ╷
 │             ╱                           │  │╷   │╷  ╷╷   │╷
 │           ╱╱                        0.0 │──┼┼───┼┼──┼┼───┼┼──
 │        ╱╱                               │   ╵│   ╵│  │╵   ╵
 │    ╱╱╱                             -0.1 │    ╵    ╵   ╵
 │╱╱╱╱                                     └────────────── Time
 └────────── Time
 Hard to learn from                    Easy to learn from
```
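One practical consequence: the model now predicts log returns, not prices, so you invert the transform by compounding. A minimal sketch with illustrative numbers:

```python
import numpy as np

last_price = 115.0
predicted_log_returns = np.array([0.01, -0.005, 0.02])

# price(t) = price(t-1) * exp(r_t), so a cumulative sum of log returns
# compounds back into price levels.
predicted_prices = last_price * np.exp(np.cumsum(predicted_log_returns))
print(predicted_prices)   # ≈ [116.16, 115.58, 117.91]
```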
Standard LSTM prediction gives you a single line — no indication of how confident the model is. Monte Carlo Dropout runs the prediction multiple times (e.g. 50 runs) with dropout kept ON during inference. Each run produces a slightly different forecast. The spread of those forecasts gives you a real, data-driven confidence interval.
```python
import numpy as np

def predict_monte_carlo(model, input_sequence, n_days, n_runs=50):
    """Run n_runs stochastic forward passes with dropout active."""
    all_forecasts = []
    for _ in range(n_runs):
        predictions = []
        seq = input_sequence.copy()
        for _ in range(n_days):
            # training=True keeps dropout ON → stochastic output
            pred = model(seq.reshape(1, *seq.shape), training=True).numpy()
            predictions.append(pred[0, 0])
            # Roll the window forward: drop the oldest row, append a copy of
            # the newest row with the target feature (column 0) set to the
            # prediction just made.
            new_row = seq[-1].copy()
            new_row[0] = pred[0, 0]
            seq = np.vstack([seq[1:], new_row])
        all_forecasts.append(predictions)
    all_forecasts = np.array(all_forecasts)  # shape: (n_runs, n_days)
    median = np.median(all_forecasts, axis=0)
    lower = np.percentile(all_forecasts, 5, axis=0)    # 5th percentile
    upper = np.percentile(all_forecasts, 95, axis=0)   # 95th percentile
    return median, lower, upper
```
```
Price
 │                     ╱╱╱╱  ← upper 95th percentile
 │               ╱╱╱╱╱╱
 │          ╱╱╱╱
 │     ╱────────── median ──────────
 │          ╲╲╲╲
 │               ╲╲╲╲╲╲
 │                     ╲╲╲╲  ← lower 5th percentile
 │    │
 │────┤ ← today
 └──────────────────── Time

Wider band = more uncertainty
(grows further into the future)
```
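Hypothetical usage, assuming the model and the (60, 8) input windows from earlier:

```python
last_window = X_train[-1]   # shape: (60, 8), the most recent lookback window
median, lower, upper = predict_monte_carlo(model, last_window,
                                           n_days=30, n_runs=50)
print(upper - lower)        # band width per day; widens with the horizon
```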
The Bollinger Band position feature calculates where the current price sits relative to the upper and lower bands. When volatility drops to near zero (flat price action), the bands collapse and the denominator approaches zero, producing inf or NaN values. These poison the MinMaxScaler and corrupt the entire training dataset.
```python
import numpy as np

sma_20 = df['price'].rolling(20).mean()
std_20 = df['price'].rolling(20).std()
upper_band = sma_20 + 2 * std_20
lower_band = sma_20 - 2 * std_20
band_width = upper_band - lower_band

# BEFORE (dangerous):
#   bb_position = (price - lower) / (upper - lower)
#   → inf when upper == lower

# AFTER (safe):
bb_position = np.where(
    band_width > 0,
    (df['price'] - lower_band) / band_width,
    0.5  # neutral position when bands collapse
)
```
```
Normal bands:                     Collapsed bands:

 upper  ─────────                 upper ═══════════
       ╱         ╲                lower ═══════════
 price ───────────                price ═══════════
       ╲         ╱
 lower  ─────────                 band_width ≈ 0
                                  division → inf ✗
 band_width > 0 ✓                 use 0.5 instead ✓
```
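The same guard generalizes. Any ratio feature can blow up, so a blanket scrub before scaling is cheap insurance. A sketch, assuming `features` is the DataFrame of engineered columns:

```python
import numpy as np

# Convert stray infinities to NaN, inspect, then drop (or impute).
features = features.replace([np.inf, -np.inf], np.nan)
print(features.isna().sum())   # which features produced bad values?
features = features.dropna()
```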
| Improvement | Problem Solved | Impact |
|---|---|---|
| Huber Loss | MSE over-penalizes large moves | Model stops reverting to mean |
| Sample Weighting | Old data drowns out recent trends | Learns current market structure |
| Log Returns | Scaler compresses exponential data | Captures multiplicative patterns |
| MC Dropout | No confidence measure | Real uncertainty bands |
| BB Zero Guard | inf values corrupt training | Stable feature engineering |
The forget gate decides what to drop from the cell state — a sigmoid over (previous hidden, current input) producing values 0 to 1, multiplied element-wise into the cell. The input gate decides what new information to write — a sigmoid for "how much" times a tanh for "what content". The output gate decides what part of the cell state to expose as the next hidden state — a sigmoid masking a tanh of the cell. Together they let the network learn to selectively remember, update, and reveal information across long sequences. The cell state is the highway; the gates are the on-ramps and off-ramps.
Vanilla RNNs apply a non-linear transformation at every time step, so gradients backpropagating through long sequences shrink exponentially toward zero (or explode). LSTM's cell state has an additive update path — c_t = f_t * c_{t-1} + i_t * g_t — with no activation function on the recurrent connection itself. Gradients flow back through the cell state largely unattenuated when the forget gate is near 1. The gates can still saturate and cause vanishing in pathological cases, but in practice LSTMs train successfully on sequences of hundreds of steps where vanilla RNNs fail by step 20.
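A back-of-the-envelope illustration, with per-step factors made up purely for scale:

```python
# Vanilla RNN: the gradient through T steps scales roughly like a product
# of per-step Jacobian factors. With a typical factor of 0.9:
T = 100
print(0.9 ** T)     # ≈ 2.7e-05: effectively gone long before step 100

# LSTM cell-state path: the gradient scales like the product of
# forget-gate activations. With the gate near 1 (say 0.999):
print(0.999 ** T)   # ≈ 0.905: the signal survives
```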
The cell state (c_t) is the long-term memory — it flows through the sequence largely unchanged via the additive update, gated by forget/input. The hidden state (h_t) is the short-term, exposed memory — computed as h_t = o_t * tanh(c_t), it's what the next layer (or output head) sees and what feeds into the next time step's gate computations. In code you carry both as the recurrent state. Conceptually: cell = what the network remembers, hidden = what it's currently saying. The split is what gives LSTM its memory advantage over GRU (which collapses both into one state and is faster but slightly less expressive on hard sequence tasks).
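In Keras you can inspect both states directly with return_state=True; a standalone sketch with a dummy batch:

```python
import tensorflow as tf

x = tf.random.normal((2, 10, 8))   # (batch, timesteps, features)
output, state_h, state_c = tf.keras.layers.LSTM(64, return_state=True)(x)

print(state_h.shape)   # (2, 64): hidden state, what the layer is "saying"
print(state_c.shape)   # (2, 64): cell state, what the layer "remembers"
```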
Honestly, rarely. A few niches survive: (1) very long streaming sequences where Transformer's O(n²) attention is prohibitive and you don't have access to a state-space model like Mamba; (2) tiny-footprint deployments (microcontrollers, embedded sensor fusion) where a small LSTM fits in 100KB and a Transformer doesn't; (3) classical time-series forecasting on small datasets where a Transformer overfits and an LSTM with proper regularization generalizes better. For NLP and most sequence modeling, Transformers (or hybrid SSM/Transformer architectures) have eaten the field; LSTM is increasingly a "I know this works and I don't need to retrain" choice rather than a "this is best" choice.
Standard practice: orthogonal initialization for recurrent weights (preserves gradient norm through repeated multiplication), Xavier/Glorot for input weights, and — this is the key one — initialize the forget-gate bias to +1 (or higher). At initialization with bias 0, the forget gate sigmoid is ~0.5, so half the memory is lost per step. With bias +1, the gate starts near 1 (remember everything) and the network learns when to forget rather than learning when to remember. This single trick dramatically improves training stability and is built into most modern LSTM implementations by default.
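In tf.keras these are already the defaults; spelling them out makes the intent visible:

```python
import tensorflow as tf

layer = tf.keras.layers.LSTM(
    128,
    kernel_initializer='glorot_uniform',   # input weights: Xavier/Glorot
    recurrent_initializer='orthogonal',    # recurrent weights: orthogonal
    unit_forget_bias=True,                 # forget-gate bias initialized to +1
)
```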
Pad batches to the max length with a reserved PAD token, then mask. In PyTorch the idiomatic pattern is pack_padded_sequence before the LSTM and pad_packed_sequence after — this skips actual computation on PAD positions, so longer-padded shorter-content batches don't waste GPU time. Sort by length descending within each batch (the packing API requires it) or use enforce_sorted=False. For attention or pooling on top of the LSTM output, apply the mask so PAD positions don't contribute. Forgetting to mask is a classic bug that makes the model learn to predict PAD frequencies.
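A minimal PyTorch sketch of the pack/unpack pattern; the shapes and lengths are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three sequences of true lengths 5, 3, 2, padded to length 5, 8 features each.
padded = torch.randn(3, 5, 8)           # (batch, max_len, features)
lengths = torch.tensor([5, 3, 2])

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Pack so the LSTM skips computation on PAD positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor for downstream layers, then mask as needed.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)   # torch.Size([3, 5, 16])
```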