LSTM stands for Long Short-Term Memory. It is a type of Recurrent Neural Network (RNN), which places it firmly within the field of deep learning — a subset of machine learning that uses multi-layered neural networks to learn patterns from data.
```
Artificial Intelligence
└── Machine Learning
    └── Deep Learning
        ├── CNNs (images, spatial data)
        ├── Transformers (language models, attention-based)
        └── RNNs (sequential / time-series data)
            ├── Vanilla RNN (simple, suffers from vanishing gradients)
            ├── GRU (simplified gating, faster)
            └── LSTM (full gating mechanism, best for long sequences)
```
Standard neural networks treat each input independently — they have no concept of order or memory. But many real-world problems are sequential: stock prices, weather, language, sensor readings, music. The current value depends on what came before it.
Vanilla RNNs attempted to solve this by feeding each step's hidden state back into the network as input to the next step, but they suffer from the vanishing gradient problem — during training, gradients shrink exponentially as they propagate backward through time, making it practically impossible to learn long-range dependencies. An RNN trained on 365 days of data can effectively "forget" what happened 60+ days ago.
LSTM was introduced by Hochreiter & Schmidhuber in 1997 specifically to fix this. It uses a gating mechanism that allows it to selectively remember or forget information over long sequences — hundreds or even thousands of time steps.
Each LSTM cell has three gates and a cell state. Think of the cell state as a conveyor belt — information flows along it largely unchanged, and the gates decide what to add or remove.
```
                ┌───────────────────────────────────────────────┐
                │                   LSTM Cell                   │
                │                                               │
Cell State ─────┼──── × ────────────────── + ───────────────────┼──── Cell State
(long-term      │     │                    │                    │     (updated)
 memory)        │     │                    │                    │
                │  ┌──┴──┐        ┌────────┴────────┐  ┌──────┐ │
                │  │ Fg  │        │    Ig × Cand    │  │  Og  │ │
                │  │gate │        │   gate   gate   │  │ gate │ │
                │  └──┬──┘        └───┬─────────┬───┘  └──┬───┘ │
                │     │               │         │         │     │
                │     └───────┬───────┘         │   ┌─────┘     │
                │             │                 │   │           │
Hidden State ───┼─────────────┴─────────────────┴───┼───────────┼──── Hidden State
(short-term     │      [h(t-1), x(t)] concat        │ tanh()    │     (output)
 memory)        │             │                     │           │
                └─────────────┴─────────────────────────────────┘
                        Input: x(t)

Fg   = Forget Gate  → "What old info should I discard?"      σ (0 to 1)
Ig   = Input Gate   → "What new info is worth storing?"      σ (0 to 1)
Cand = Candidate    → "What are the new candidate values?"   tanh (-1 to 1)
Og   = Output Gate  → "What part of cell state to output?"   σ (0 to 1)
```
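To make the legend concrete, here is a minimal NumPy sketch of a single LSTM step. The weight matrix W, bias b, and the dimensions are illustrative, not taken from any library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps the concat [h(t-1), x(t)] to the
    four gate pre-activations; b is the matching bias."""
    z = np.concatenate([h_prev, x_t]) @ W + b   # shape: (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)             # forget gate: what old info to discard
    i = sigmoid(i)             # input gate: how much new info to store
    g = np.tanh(g)             # candidate: the new content itself
    o = sigmoid(o)             # output gate: what to expose
    c_t = f * c_prev + i * g   # additive cell-state update (the conveyor belt)
    h_t = o * np.tanh(c_t)     # hidden state: a gated view of the cell
    return h_t, c_t

# Illustrative dimensions: 8 input features, 4 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(4 + 8, 4 * 4))
b = np.zeros(4 * 4)
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=8), h, c, W, b)
```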
Here is a simple LSTM for time-series prediction using TensorFlow/Keras:
```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Suppose X_train has shape (samples, timesteps, features),
# e.g. 1000 samples, 60-day lookback, 8 features per day.
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(60, 8)),
    Dropout(0.2),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(128, return_sequences=False),  # last LSTM layer returns a single vector
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1),  # predict one value (e.g. tomorrow's price)
])

model.compile(optimizer='adam', loss='huber', metrics=['mae'])
model.fit(X_train, y_train, epochs=100, batch_size=64,
          validation_split=0.1)
```
Note that return_sequences=True is set on every LSTM layer except the last. Intermediate layers must emit their full output sequence so the next LSTM has a sequence to consume; the final layer emits only its last hidden state, which feeds the dense head.
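A quick shape check makes the distinction visible; this is a standalone sketch with a dummy input:

```python
import tensorflow as tf

x = tf.random.normal((1, 60, 8))   # (batch, timesteps, features)

seq_layer = tf.keras.layers.LSTM(128, return_sequences=True)
last_layer = tf.keras.layers.LSTM(128, return_sequences=False)

print(seq_layer(x).shape)    # (1, 60, 128): full sequence for the next LSTM
print(last_layer(x).shape)   # (1, 128): single vector for the Dense head
```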
Below are five targeted improvements that significantly impact prediction quality, especially when working with long lookback windows and volatile time-series data.
Mean Squared Error (MSE) squares every error, which means large price moves (crashes, spikes) dominate the loss function. The model learns to "play it safe" and revert toward historical averages to minimize those squared penalties.
Huber loss behaves like MSE for small errors but switches to linear (MAE-like) behavior for large errors, controlled by a delta threshold. This makes the model robust to outlier moves without ignoring them entirely.
```python
import tensorflow as tf

# MSE:   loss = (y_true - y_pred)²   ← large errors get squared
# MAE:   loss = |y_true - y_pred|    ← linear, but not smooth at 0
# Huber: best of both worlds

# delta controls the switchover point
loss_fn = tf.keras.losses.Huber(delta=1.0)
model.compile(optimizer='adam', loss=loss_fn, metrics=['mae'])
```
```
Loss
 │       MSE ╱
 │          ╱
 │         ╱           Huber: quadratic near zero,
 │        ╱            linear for large errors
 │       ╱    ╱ Huber
 │      ╱   ╱
 │     ╱  ╱
 │    ╱  ╱      ╱ MAE
 │╱╱╱╱ ╱╱   ╱╱╱╱
 └─────┬───────────── Error
       δ
```
With a long lookback (e.g. 1460 days), the model treats all historical data equally. But market conditions from 3 years ago may be irrelevant today. Sample weighting lets you tell the model: "recent data matters more."
```python
import numpy as np

def make_sample_weights(n_samples, method='linear'):
    """Weight recent training samples more heavily."""
    t = np.linspace(0, 1, n_samples)
    if method == 'linear':
        w = 0.5 + 0.5 * t      # range: 0.5 → 1.0
    elif method == 'exponential':
        w = np.exp(3 * t)      # ~20× heavier at the end
    else:
        return None
    w /= w.mean()              # normalise so mean weight = 1
    return w

weights = make_sample_weights(len(X_train), method='linear')
model.fit(X_train, y_train,
          sample_weight=weights,   # ← pass to .fit()
          epochs=100, batch_size=64)
```
```
Weight
 1.0 │                          ╱╱
     │                      ╱╱╱
     │                  ╱╱╱        linear
     │              ╱╱╱
 0.5 │ ╱╱╱╱╱╱╱╱╱╱╱╱
     │
     └──────────────────────────────── Time
        oldest                  newest
        sample                  sample
```
When data grows exponentially over time, MinMaxScaler compresses early values into a tiny band near zero and recent values near one. The LSTM struggles to learn from this distorted distribution. Log returns transform multiplicative price changes into additive ones, producing a roughly stationary series that the network can learn from much more effectively.
```python
import numpy as np
import pandas as pd

# Raw price:          [100, 110, 105, 120, 115]
# After MinMaxScaler: compressed, non-stationary
# Log returns:        capture the *rate of change* regardless of price level
df['log_returns'] = np.log(df['price'] / df['price'].shift(1))

# A 10% gain at $100 and a 10% gain at $100,000
# both produce log_return ≈ 0.0953.
# Without log returns, the $100 move is invisible to the scaler.
```
```
Raw price (exponential growth):       Log returns (stationary):

 │              ╱                      0.1 │  ╷    ╷    ╷
 │             ╱                           │  │╷   │╷  ╷╷   │╷
 │           ╱╱                        0.0 │──┼┼───┼┼──┼┼───┼┼──
 │        ╱╱                               │   ╵│   ╵│  │╵   ╵
 │    ╱╱╱                             -0.1 │    ╵    ╵   ╵
 │╱╱╱╱                                     └────────────── Time
 └────────── Time
 Hard to learn from                    Easy to learn from
```
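One practical consequence: the model now predicts log returns, not prices, so you invert the transform by compounding. A minimal sketch with illustrative numbers:

```python
import numpy as np

last_price = 115.0
predicted_log_returns = np.array([0.01, -0.005, 0.02])

# price(t) = price(t-1) * exp(r_t), so a cumulative sum of log returns
# compounds back into price levels.
predicted_prices = last_price * np.exp(np.cumsum(predicted_log_returns))
print(predicted_prices)   # ≈ [116.16, 115.58, 117.91]
```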
Standard LSTM prediction gives you a single line — no indication of how confident the model is. Monte Carlo Dropout runs the prediction multiple times (e.g. 50 runs) with dropout kept ON during inference. Each run produces a slightly different forecast. The spread of those forecasts gives you a real, data-driven confidence interval.
```python
import numpy as np

def predict_monte_carlo(model, input_sequence, n_days, n_runs=50):
    """Run n_runs stochastic forward passes with dropout active."""
    all_forecasts = []
    for _ in range(n_runs):
        predictions = []
        seq = input_sequence.copy()
        for _ in range(n_days):
            # training=True keeps dropout ON → stochastic output
            pred = model(seq.reshape(1, *seq.shape), training=True).numpy()
            predictions.append(pred[0, 0])
            # Roll the window forward: drop the oldest row, append a copy of
            # the newest row with the target feature (column 0) set to the
            # prediction just made.
            new_row = seq[-1].copy()
            new_row[0] = pred[0, 0]
            seq = np.vstack([seq[1:], new_row])
        all_forecasts.append(predictions)
    all_forecasts = np.array(all_forecasts)  # shape: (n_runs, n_days)
    median = np.median(all_forecasts, axis=0)
    lower = np.percentile(all_forecasts, 5, axis=0)    # 5th percentile
    upper = np.percentile(all_forecasts, 95, axis=0)   # 95th percentile
    return median, lower, upper
```
```
Price
 │                     ╱╱╱╱  ← upper 95th percentile
 │               ╱╱╱╱╱╱
 │          ╱╱╱╱
 │     ╱────────── median ──────────
 │          ╲╲╲╲
 │               ╲╲╲╲╲╲
 │                     ╲╲╲╲  ← lower 5th percentile
 │    │
 │────┤ ← today
 └──────────────────── Time

Wider band = more uncertainty
(grows further into the future)
```
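Hypothetical usage, assuming the model and the (60, 8) input windows from earlier:

```python
last_window = X_train[-1]   # shape: (60, 8), the most recent lookback window
median, lower, upper = predict_monte_carlo(model, last_window,
                                           n_days=30, n_runs=50)
print(upper - lower)        # band width per day; widens with the horizon
```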
The Bollinger Band position feature calculates where the current price sits relative to the upper and lower bands. When volatility drops to near zero (flat price action), the bands collapse and the denominator approaches zero, producing inf or NaN values. These poison the MinMaxScaler and corrupt the entire training dataset.
```python
import numpy as np

sma_20 = df['price'].rolling(20).mean()
std_20 = df['price'].rolling(20).std()
upper_band = sma_20 + 2 * std_20
lower_band = sma_20 - 2 * std_20
band_width = upper_band - lower_band

# BEFORE (dangerous):
#   bb_position = (price - lower) / (upper - lower)
#   → inf when upper == lower

# AFTER (safe):
bb_position = np.where(
    band_width > 0,
    (df['price'] - lower_band) / band_width,
    0.5  # neutral position when bands collapse
)
```
```
Normal bands:                     Collapsed bands:

 upper  ─────────                 upper ═══════════
       ╱         ╲                lower ═══════════
 price ───────────                price ═══════════
       ╲         ╱
 lower  ─────────                 band_width ≈ 0
                                  division → inf ✗
 band_width > 0 ✓                 use 0.5 instead ✓
```
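The same guard generalizes. Any ratio feature can blow up, so a blanket scrub before scaling is cheap insurance. A sketch, assuming `features` is the DataFrame of engineered columns:

```python
import numpy as np

# Convert stray infinities to NaN, inspect, then drop (or impute).
features = features.replace([np.inf, -np.inf], np.nan)
print(features.isna().sum())   # which features produced bad values?
features = features.dropna()
```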
| Improvement | Problem Solved | Impact |
|---|---|---|
| Huber Loss | MSE over-penalizes large moves | Model stops reverting to mean |
| Sample Weighting | Old data drowns out recent trends | Learns current market structure |
| Log Returns | Scaler compresses exponential data | Captures multiplicative patterns |
| MC Dropout | No confidence measure | Real uncertainty bands |
| BB Zero Guard | inf values corrupt training | Stable feature engineering |
The forget gate decides what to drop from the cell state — a sigmoid over (previous hidden, current input) producing values 0 to 1, multiplied element-wise into the cell. The input gate decides what new information to write — a sigmoid for "how much" times a tanh for "what content". The output gate decides what part of the cell state to expose as the next hidden state — a sigmoid masking a tanh of the cell. Together they let the network learn to selectively remember, update, and reveal information across long sequences. The cell state is the highway; the gates are the on-ramps and off-ramps.
Vanilla RNNs apply a non-linear transformation at every time step, so gradients backpropagating through long sequences shrink exponentially toward zero (or explode). LSTM's cell state has an additive update path — c_t = f_t * c_{t-1} + i_t * g_t — with no activation function on the recurrent connection itself. Gradients flow back through the cell state largely unattenuated when the forget gate is near 1. The gates can still saturate and cause vanishing in pathological cases, but in practice LSTMs train successfully on sequences of hundreds of steps where vanilla RNNs fail by step 20.
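A back-of-the-envelope illustration, with per-step factors made up purely for scale:

```python
# Vanilla RNN: the gradient through T steps scales roughly like a product
# of per-step Jacobian factors. With a typical factor of 0.9:
T = 100
print(0.9 ** T)     # ≈ 2.7e-05: effectively gone long before step 100

# LSTM cell-state path: the gradient scales like the product of
# forget-gate activations. With the gate near 1 (say 0.999):
print(0.999 ** T)   # ≈ 0.905: the signal survives
```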
The cell state (c_t) is the long-term memory — it flows through the sequence largely unchanged via the additive update, gated by forget/input. The hidden state (h_t) is the short-term, exposed memory — computed as h_t = o_t * tanh(c_t), it's what the next layer (or output head) sees and what feeds into the next time step's gate computations. In code you carry both as the recurrent state. Conceptually: cell = what the network remembers, hidden = what it's currently saying. The split is what gives LSTM its memory advantage over GRU (which collapses both into one state and is faster but slightly less expressive on hard sequence tasks).
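In Keras you can inspect both states directly with return_state=True; a standalone sketch with a dummy batch:

```python
import tensorflow as tf

x = tf.random.normal((2, 10, 8))   # (batch, timesteps, features)
output, state_h, state_c = tf.keras.layers.LSTM(64, return_state=True)(x)

print(state_h.shape)   # (2, 64): hidden state, what the layer is "saying"
print(state_c.shape)   # (2, 64): cell state, what the layer "remembers"
```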
Honestly, rarely. A few niches survive: (1) very long streaming sequences where Transformer's O(n²) attention is prohibitive and you don't have access to a state-space model like Mamba; (2) tiny-footprint deployments (microcontrollers, embedded sensor fusion) where a small LSTM fits in 100KB and a Transformer doesn't; (3) classical time-series forecasting on small datasets where a Transformer overfits and an LSTM with proper regularization generalizes better. For NLP and most sequence modeling, Transformers (or hybrid SSM/Transformer architectures) have eaten the field; LSTM is increasingly a "I know this works and I don't need to retrain" choice rather than a "this is best" choice.
Standard practice: orthogonal initialization for recurrent weights (preserves gradient norm through repeated multiplication), Xavier/Glorot for input weights, and — this is the key one — initialize the forget-gate bias to +1 (or higher). At initialization with bias 0, the forget gate sigmoid is ~0.5, so half the memory is lost per step. With bias +1, the gate starts near 1 (remember everything) and the network learns when to forget rather than learning when to remember. This single trick dramatically improves training stability and is built into most modern LSTM implementations by default.
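In tf.keras these are already the defaults; spelling them out makes the intent visible:

```python
import tensorflow as tf

layer = tf.keras.layers.LSTM(
    128,
    kernel_initializer='glorot_uniform',   # input weights: Xavier/Glorot
    recurrent_initializer='orthogonal',    # recurrent weights: orthogonal
    unit_forget_bias=True,                 # forget-gate bias initialized to +1
)
```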
Pad batches to the max length with a reserved PAD token, then mask. In PyTorch the idiomatic pattern is pack_padded_sequence before the LSTM and pad_packed_sequence after — this skips actual computation on PAD positions, so longer-padded shorter-content batches don't waste GPU time. Sort by length descending within each batch (the packing API requires it) or use enforce_sorted=False. For attention or pooling on top of the LSTM output, apply the mask so PAD positions don't contribute. Forgetting to mask is a classic bug that makes the model learn to predict PAD frequencies.
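A minimal PyTorch sketch of the pack/unpack pattern; the shapes and lengths are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three sequences of true lengths 5, 3, 2, padded to length 5, 8 features each.
padded = torch.randn(3, 5, 8)           # (batch, max_len, features)
lengths = torch.tensor([5, 3, 2])

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Pack so the LSTM skips computation on PAD positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor for downstream layers, then mask as needed.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)   # torch.Size([3, 5, 16])
```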