- converge : hội tụ
-
gradient : độ dốc descent : sự hạ xuống
Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. It’s the core key to train a model in Machine Learning.
-
Objective: Minimize a function $f(x)$, where $x \in \mathbb R^{n}$ is a vector of parameters. The function $f(x)$ could represent the error or cost function in machine learning.
-
Gradient: The gradient of f with respect to x is denoted by:
\[\Delta f(x) = \Bigr[\frac{\partial f}{\partial x_{1}}, \frac{\partial f}{\partial x_{2}},.. ,\frac{\partial f}{\partial x_{n}} \Bigr]\]$\Delta f(x)$ is a vector that points in the direction of the greatest increase of $f$ (direction where $f$ increases most rapidly).
-
Update Rule: To minimize $f(x)$, we move in the direction opposite to the gradient:
\[x^{t+1} = x^{t} - \alpha \cdot \Delta f(x^{t})\]Where:
- $x^{t}$ is the parameter vector at iteration t,
- $\alpha > 0$ is the learning rate, which controls the step size,
- $\Delta f(x^{t})$ is the gradient of $f$ at $x^{t}$, denoting how $f(x)$ changes with respect to $x$.
Explanation : At each iteration, compute the $\Delta f(x^{t})$, update $x^{t+1}$, repeat until convergence (e.g. when the gradient magnitude is close to zero OR the change of $f(x)$ is smallest).
Figure 1: Gradient Descent
-
Convergence Conditions: Gradient descent converges to a local minimum if:
-
The learning rate $\alpha$ is set properly (neither too large nor too small).
-
Smoothness (Êm): $\Delta f(x)$ > 0 exists everywhere and should not change too abruptly.
-
Convexity (Cong): The Hessian matrix H (matrix of second derivatives) must have all eigenvalues > 0. To be simple, we all learnt at school:
- Poor initialization of $x_{0}$ can require many iterations to reach the minimum.
-
-
Variants (not yet):
-
Stochastic Gradient Descent (SGD): Uses a single or small batch of data to estimate the gradient, introducing randomness.
-
Mini-Batch Gradient Descent: Combines the benefits of full-batch and stochastic methods by using small batches.
-
Momentum, Adam, RMSProp: Modifications of gradient descent that adjust the learning rate or use momentum to accelerate convergence.
-