Uzay Macar

Backpropagation

Algorithm

Definition: Algorithm to compute gradients in neural networks

\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}

Advantages: Efficient gradient computation • Enables training of deep networks

Disadvantages: Can be computationally expensive • Susceptible to vanishing/exploding gradients

Happens when: The backward pass of every iteration of gradient-based training

Why it works: Applies chain rule to compute gradients efficiently

Usage: All gradient-based neural network training
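Example: a minimal NumPy sketch of the chain rule above for a single sigmoid layer with squared-error loss; the layer shape, loss, and variable names are illustrative assumptions, not from the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2])                  # input to the layer
W = np.array([[0.1, 0.4], [0.3, -0.6]])    # weights w_ij, shape (in, out)
y_true = np.array([1.0, 0.0])

z = x @ W                                  # z_j = sum_i x_i w_ij
a = sigmoid(z)                             # a_j = sigma(z_j)
loss = 0.5 * np.sum((a - y_true) ** 2)

# Chain rule: dL/dw_ij = dL/da_j * da_j/dz_j * dz_j/dw_ij
dL_da = a - y_true                         # dL/da_j for squared error
da_dz = a * (1 - a)                        # sigmoid derivative
grad_W = np.outer(x, dL_da * da_dz)        # dz_j/dw_ij = x_i
print(grad_W)                              # same shape as W
```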

Batch Normalization

Layer

Definition: Normalizes layer inputs for each mini-batch

\hat{x}^{(k)} = \frac{x^{(k)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y^{(k)} = \gamma \hat{x}^{(k)} + \beta

Advantages: Faster training • Reduces internal covariate shift • Regularization effect

Disadvantages: Adds complexity • Can amplify small perturbations

Why it works: Stabilizes input distributions, allowing higher learning rates

Usage: Most deep networks, especially CNNs
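Example: a minimal NumPy sketch of the training-time forward pass above; at inference, running averages of the batch mean and variance would replace the batch statistics (omitted here).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(32, 4)                   # mini-batch of 32, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))     # ~0 mean, ~1 std per feature
```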

Cross Entropy

Loss Function

Definition: Measures divergence between predicted and true distributions

L = -\sum_{i=1}^C y_i \log(\hat{y}_i)

Advantages: Suitable for classification, provides strong gradients

Disadvantages: Sensitive to outliers

Usage: Classification tasks (binary and multi-class)
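Example: a minimal NumPy sketch of the formula above for one-hot labels and predicted class probabilities; the clipping constant is an assumption to avoid log(0).

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy between a one-hot label and a predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)       # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])           # true class is index 1
y_pred = np.array([0.1, 0.7, 0.2])           # predicted probabilities
print(cross_entropy(y_true, y_pred))         # -log(0.7) ≈ 0.357
```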

Dropout

Layer

Definition: Randomly deactivates neurons during training

y = f(x) \odot \text{Bernoulli}(p)

Advantages: Reduces overfitting • Ensemble-like behavior

Disadvantages: Increased training time • Potential underfitting if overused

Why it works: Prevents co-adaptation of neurons, forcing robust feature learning

Usage: Most deep networks, adjust rate based on overfitting
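Example: a minimal NumPy sketch; it uses the common "inverted dropout" variant, which rescales kept activations during training so nothing changes at test time. The rescaling is an assumption beyond the bare formula above.

```python
import numpy as np

def dropout(a, keep_prob=0.8, training=True):
    """Inverted dropout: zero each activation with prob 1 - keep_prob, rescale the rest."""
    if not training:
        return a                              # identity at test time
    mask = np.random.binomial(1, keep_prob, size=a.shape)
    return a * mask / keep_prob               # keeps the expected activation unchanged

print(dropout(np.ones((2, 5)), keep_prob=0.8))
```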

Exploding Gradients

Training

Definition: Gradients become extremely large, causing unstable updates

\left\|\frac{\partial L}{\partial w}\right\| \rightarrow \infty

Disadvantages: Unstable training • Poor or diverging performance

Happens when: Deep networks • High learning rates • Poor weight initialization

Solutions: Gradient clipping • Proper weight initialization • Batch normalization • Lower learning rates
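Example: a minimal NumPy sketch of gradient clipping by norm, one of the solutions listed above; the threshold value is an illustrative assumption.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])                   # norm 50, far above the threshold
print(clip_by_norm(g, max_norm=1.0))          # rescaled to norm 1
```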

LeakyReLU

Activation Function

Definition: Like ReLU, but allows small negative values

f(x) = \max(\alpha x, x) \text{ where } \alpha \ll 1

Advantages: Prevents dying ReLU problem

Disadvantages: Inconsistent benefits across tasks

Usage: Alternative to ReLU in hidden layers
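Example: a minimal NumPy sketch of the function above; alpha = 0.01 is a common default, assumed here.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]
```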

Overfitting

Training

Definition: Model performs well on training data but poorly on unseen data

E_{train}(\theta) \ll E_{test}(\theta), \quad \text{where } E_{train} = \text{train error}, \; E_{test} = \text{test error}, \; \theta = \text{learnable parameters}

Advantages: Very low (near-zero) error on the training set

Disadvantages: Poor generalization • High variance

Happens when: Complex model relative to data amount • Training too long

Solutions: Regularization (L1, L2) • More training data • Early stopping • Dropout

Why it works: Reduces model complexity, forcing it to learn general patterns

Usage: Avoid in all scenarios; balance with underfitting
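Example: a minimal NumPy sketch of L2 regularization, one of the solutions listed above; the penalty weight and layer shapes are illustrative assumptions.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-2):
    """Add an L2 penalty on all weights to discourage overly complex fits."""
    return data_loss + lam * sum(np.sum(w ** 2) for w in weights)

weights = [np.random.randn(4, 8), np.random.randn(8, 2)]   # illustrative layer weights
print(l2_regularized_loss(data_loss=0.35, weights=weights))
```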

ReLU

Activation Function

Definition: Returns input for positive values, zero otherwise

f(x) = \max(0, x)

Advantages: Reduces vanishing gradients, computationally efficient

Disadvantages: "Dying ReLU" problem

Usage: Most common in hidden layers
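Example: a minimal NumPy sketch of ReLU and its gradient, which is 0 for non-positive inputs (the source of the "dying ReLU" problem noted above).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))                    # [0. 0. 3.]
print((x > 0).astype(float))      # gradient: 1 for positives, 0 otherwise
```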

Sigmoid

Activation Function

Definition: S-shaped function, outputs between 0 and 1

\sigma(x) = \frac{1}{1 + e^{-x}}

Advantages: Smooth gradient, good for binary classification

Disadvantages: Vanishing gradients, not zero-centered

Usage: Output layer for binary classification, rarely in hidden layers
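Example: a minimal NumPy sketch showing the (0, 1) output range and why gradients vanish for large |x|; the sample inputs are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
s = sigmoid(x)
print(s)               # outputs squashed into (0, 1)
print(s * (1 - s))     # derivative ~0 at the extremes -> vanishing gradients
```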

Tanh

Activation Function

Definition: Hyperbolic tangent function

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Advantages: Zero-centered, stronger gradients than sigmoid

Disadvantages: Still suffers from vanishing gradients

Usage: Hidden layers, but less common now
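Example: a minimal NumPy sketch showing the zero-centered (-1, 1) output range and the derivative, which still vanishes for large |x|; the sample inputs are illustrative.

```python
import numpy as np

x = np.array([-10.0, 0.0, 10.0])
t = np.tanh(x)
print(t)             # outputs in (-1, 1), zero-centered
print(1 - t ** 2)    # derivative; ~0 for large |x| (vanishing gradients)
```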

Underfitting

Training

Definition: Model fails to capture underlying patterns in data

E_{train}(\theta) \approx E_{test}(\theta) \gg E_{optimal}

Advantages: Avoids memorizing noise

Disadvantages: Poor performance on both training and test data • High bias

Happens when: Too simple model • Insufficient training

Solutions: Increase model complexity • Train longer • Feature engineering

Why it works: Allows model to capture more complex patterns

Usage: Avoid; balance with overfitting
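Example: a minimal NumPy sketch of the "increase model complexity" solution: a linear fit underfits quadratic data, while a degree-2 fit does not. The data and polynomial degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x ** 2 + rng.normal(scale=0.3, size=x.shape)   # quadratic data with noise

for degree in (1, 2):
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, mse)    # degree 1: high error even on its own training data
```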

Vanishing Gradients

Training

Definition: Gradients become extremely small, slowing learning in early layers

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \prod_{i=1}^n \frac{\partial a_i}{\partial a_{i-1}} \approx 0

Disadvantages: Slow or stalled learning • Poor performance in deep networks

Happens when: Deep networks • Certain activation functions (e.g., sigmoid, tanh)

Solutions: ReLU activation • Residual connections • Proper weight initialization • Batch normalization
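Example: a minimal NumPy sketch of why the product above shrinks: the sigmoid derivative is at most 0.25, so multiplying one such factor per layer drives the gradient toward zero as depth grows. The per-layer pre-activation value is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = 0.5                                     # illustrative pre-activation at each layer
per_layer = sigmoid(z) * (1 - sigmoid(z))   # sigmoid'(z) ≈ 0.235, always <= 0.25
for depth in (5, 20, 50):
    print(depth, per_layer ** depth)        # gradient factor -> 0 as depth grows
```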