Definition: Algorithm that computes the gradient of the loss with respect to every weight in a neural network
Advantages: Efficient gradient computation • Enables training of deep networks
Disadvantages: Can be computationally expensive • Susceptible to vanishing/exploding gradients
Happens when: Every training iteration, as the backward pass of gradient-based optimization
Why it works: Applies chain rule to compute gradients efficiently
Usage: All gradient-based neural network training
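A minimal NumPy sketch of backpropagation on a tiny two-layer network (one sigmoid hidden layer, squared-error loss); the shapes, seed, and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # batch of 4 inputs, 3 features
y = rng.normal(size=(4, 1))    # regression targets
W1 = rng.normal(size=(3, 5))   # hidden-layer weights
W2 = rng.normal(size=(5, 1))   # output-layer weights

# Forward pass
z1 = x @ W1                    # hidden pre-activation
a1 = 1 / (1 + np.exp(-z1))     # sigmoid activation
y_hat = a1 @ W2                # linear output
loss = np.mean((y_hat - y) ** 2)

# Backward pass: the chain rule applied layer by layer, from the loss back
dL_dyhat = 2 * (y_hat - y) / y.shape[0]
dL_dW2 = a1.T @ dL_dyhat             # gradient w.r.t. W2
dL_da1 = dL_dyhat @ W2.T             # propagate through the output layer
dL_dz1 = dL_da1 * a1 * (1 - a1)      # multiply by the sigmoid derivative
dL_dW1 = x.T @ dL_dz1                # gradient w.r.t. W1
```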
Definition: Normalizes layer inputs for each mini-batch
Advantages: Faster training • Reduces internal covariate shift • Regularization effect
Disadvantages: Adds complexity • Can amplify small perturbations
Why it works: Stabilizes input distributions, allowing higher learning rates
Usage: Most deep networks, especially CNNs
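A minimal NumPy sketch of the training-time forward pass, assuming inputs of shape (batch, features); the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift
    with the learnable parameters gamma and beta."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta

x = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```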
Definition: Measures divergence between predicted and true distributions
Advantages: Well suited to classification • Provides strong gradients when predictions are confidently wrong
Disadvantages: Sensitive to outliers
Usage: Classification tasks (binary and multi-class)
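A minimal NumPy sketch of mean cross-entropy over a batch, assuming the model already outputs class probabilities (e.g. after a softmax); the example values are illustrative.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy; probs holds predicted class probabilities
    (rows sum to 1), labels holds integer class indices."""
    n = labels.shape[0]
    picked = probs[np.arange(n), labels]   # probability assigned to the true class
    return -np.mean(np.log(picked + eps))  # eps guards against log(0)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))  # ≈ 0.290
```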
Definition: Randomly deactivates neurons during training
Advantages: Reduces overfitting • Ensemble-like behavior
Disadvantages: Increased training time • Potential underfitting if overused
Why it works: Prevents co-adaptation of neurons, forcing robust feature learning
Usage: Most deep networks, adjust rate based on overfitting
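A minimal NumPy sketch of inverted dropout: a random mask zeroes activations during training and the survivors are rescaled, so nothing changes at inference time. The rate and shapes are illustrative.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero a fraction `rate` of activations during training and rescale
    the rest by 1/(1-rate); return x unchanged at inference time."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

rng = np.random.default_rng(0)
h = np.ones((2, 8))
print(dropout(h, rate=0.5, training=True, rng=rng))   # mix of zeros and 2.0s
print(dropout(h, rate=0.5, training=False, rng=rng))  # unchanged at inference
```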
Definition: Gradients become extremely large, causing unstable updates
Disadvantages: Unstable training • Poor or diverging performance
Happens when: Deep networks • High learning rates • Poor weight initialization
Solutions: Gradient clipping • Proper weight initialization • Batch normalization • Lower learning rates
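A minimal NumPy sketch of one listed solution, gradient clipping by global norm; the gradient values and max_norm are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.full((3, 3), 50.0), np.full(3, 50.0)]     # artificially huge gradients
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ≈ 1.0
```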
Definition: Like ReLU, but outputs a small, non-zero value (e.g. 0.01·x) instead of zero for negative inputs
Advantages: Prevents dying ReLU problem
Disadvantages: Inconsistent benefits across tasks
Usage: Alternative to ReLU in hidden layers
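A minimal NumPy sketch of the function and its gradient, using the common negative slope of 0.01; the inputs are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """f(x) = x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not 0) for negative ones,
    so neurons keep receiving a learning signal."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.  2.]
print(leaky_relu_grad(x))  # [0.01  0.01  0.01  1.]
```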
Definition: Model performs well on training data but poorly on unseen data
Advantages: Very low error on the training set
Disadvantages: Poor generalization • High variance
Happens when: Complex model relative to data amount • Training too long
Solutions: Regularization (L1, L2) • More training data • Early stopping • Dropout
Why it works: Reduces model complexity, forcing it to learn general patterns
Usage: Avoid; balance with underfitting
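A minimal NumPy sketch of one listed remedy, L2 regularization, added to a least-squares loss; the data, weights, and penalty strength lam are illustrative.

```python
import numpy as np

def l2_regularized_loss_and_grad(w, X, y, lam):
    """Mean squared error plus an L2 penalty; the penalty shrinks weights
    and discourages overly complex fits."""
    resid = X @ w - y
    loss = np.mean(resid ** 2) + lam * np.sum(w ** 2)
    grad = 2 * X.T @ resid / len(y) + 2 * lam * w
    return loss, grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)
w = rng.normal(size=5)
print(l2_regularized_loss_and_grad(w, X, y, lam=0.1))
```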
Definition: Returns input for positive values, zero otherwise
Advantages: Reduces vanishing gradients • Computationally efficient
Disadvantages: "Dying ReLU" problem
Usage: Most common in hidden layers
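A minimal NumPy sketch of ReLU and its gradient; the zero gradient for negative inputs is what drives the "dying ReLU" issue noted above.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient is 1 for positive inputs and 0 otherwise; a neuron whose
    pre-activations stay negative receives no gradient and stops learning."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.1, 0.0, 3.0])
print(relu(x))       # [0.  0.  0.  3.]
print(relu_grad(x))  # [0.  0.  0.  1.]
```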
Definition: S-shaped function, outputs between 0 and 1
Advantages: Smooth gradient • Good for binary classification
Disadvantages: Vanishing gradients • Not zero-centered
Usage: Output layer for binary classification, rarely in hidden layers
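A minimal NumPy sketch showing the 0-to-1 output range and the gradient, which peaks at 0.25 and saturates for large |x|; the inputs are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # at most 0.25 (at x = 0), near 0 for large |x|

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # [~0.0  0.5  ~1.0]
print(sigmoid_grad(x))  # [~0.0  0.25 ~0.0] -> saturation causes vanishing gradients
```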
Definition: Hyperbolic tangent function, S-shaped, outputs between -1 and 1
Advantages: Zero-centered • Stronger gradients than sigmoid
Disadvantages: Still suffers from vanishing gradients
Usage: Hidden layers, but less common now
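A minimal NumPy sketch showing the zero-centered output range (-1 to 1) and the gradient, which is larger than sigmoid's but still saturates; the inputs are illustrative.

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # at most 1 (at x = 0), near 0 for large |x|

x = np.array([-5.0, 0.0, 5.0])
print(np.tanh(x))    # [~-1.0  0.0  ~1.0] -> zero-centered output
print(tanh_grad(x))  # [~0.0   1.0  ~0.0] -> still saturates at the tails
```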
Definition: Model fails to capture underlying patterns in data
Advantages: Avoids memorizing noise
Disadvantages: Poor performance on both training and test data • High bias
Happens when: Too simple model • Insufficient training
Solutions: Increase model complexity • Train longer • Feature engineering
Why it works: Allows model to capture more complex patterns
Usage: Avoid; balance with overfitting
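A minimal NumPy sketch of underfitting and one listed fix (feature engineering): a linear model on quadratic data keeps a high training error, while adding an x² feature removes the bias. The synthetic data and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x ** 2 + rng.normal(scale=0.3, size=x.size)   # quadratic data with noise

X_linear = np.column_stack([np.ones_like(x), x])          # features [1, x]
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])    # features [1, x, x^2]

for name, X in [("linear", X_linear), ("quadratic", X_quad)]:
    w, *_ = np.linalg.lstsq(X, y, rcond=None)             # least-squares fit
    mse = np.mean((X @ w - y) ** 2)
    print(f"{name} model training MSE: {mse:.3f}")
# The linear MSE stays high (high bias); the quadratic MSE drops to roughly
# the noise level.
```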
Definition: Gradients become extremely small, slowing learning in early layers
Disadvantages: Slow or stalled learning • Poor performance in deep networks
Happens when: Deep networks • Certain activation functions (e.g., sigmoid, tanh)
Solutions: ReLU activation • Residual connections • Proper weight initialization • Batch normalization
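A minimal NumPy sketch of the mechanism: the backpropagated signal includes a product of per-layer activation-derivative factors, and with sigmoid each factor is at most 0.25, so the product shrinks exponentially with depth. Weight terms are ignored here, and the depth and pre-activation value are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 20
z = 1.0                                      # illustrative pre-activation value
sig_factor = sigmoid(z) * (1 - sigmoid(z))   # ≈ 0.197 per layer
print("sigmoid path:", sig_factor ** depth)  # ≈ 7e-15: the gradient has vanished
print("relu path:", 1.0 ** depth)            # 1.0: an active ReLU path preserves it
```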