Learn Before
  • Mini-Batch Gradient Descent

  • Epoch in Gradient Descent

Mini-Batch Gradient Descent Algorithm

for $t = 1, 2, \ldots, N$ ($N$ is the number of mini-batches):

  • Forward propagate on $X^{\{t\}}$
  • Compute the cost function $J^{\{t\}}$
  • Backpropagate to compute the gradients with respect to $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$)
  • $W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$, $\; b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$

This is one pass through your training set using mini-batch gradient descent. It is also called doing one epoch of training.

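As a concrete illustration, below is a minimal sketch of one such epoch for a single-layer logistic-regression model written with NumPy. The data, layer size, learning rate, and mini-batch size are hypothetical choices made for this example, not values taken from the section above.

```python
# Minimal sketch: one epoch of mini-batch gradient descent for a
# single-layer logistic-regression model (hypothetical data and sizes).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: m examples, each with n_x features.
n_x, m = 4, 1000
X = rng.normal(size=(n_x, m))
Y = (rng.random(size=(1, m)) > 0.5).astype(float)

# Parameters of the single layer.
W = rng.normal(size=(1, n_x)) * 0.01
b = np.zeros((1, 1))

alpha = 0.1              # learning rate (hypothetical)
mini_batch_size = 64     # hypothetical mini-batch size


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# One epoch: one pass over all N mini-batches of the training set.
for t in range(0, m, mini_batch_size):
    X_t = X[:, t:t + mini_batch_size]   # X^{t}
    Y_t = Y[:, t:t + mini_batch_size]   # Y^{t}
    m_t = X_t.shape[1]

    # Forward propagate on X^{t}.
    A = sigmoid(W @ X_t + b)

    # Compute the cost J^{t} (cross-entropy averaged over the mini-batch).
    J_t = -np.mean(Y_t * np.log(A) + (1 - Y_t) * np.log(1 - A))

    # Backpropagate to compute the gradients of J^{t}.
    dZ = A - Y_t
    dW = (dZ @ X_t.T) / m_t
    db = np.sum(dZ, axis=1, keepdims=True) / m_t

    # Gradient-descent update: W := W - alpha*dW, b := b - alpha*db.
    W = W - alpha * dW
    b = b - alpha * db

print("cost on last mini-batch:", J_t)
```
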

Tags

Data Science

Related
  • An Example of Mini-Batches

  • Mini-Batch Gradient Descent Algorithm

  • Batch vs Stochastic vs Mini-Batch Gradient Descent

  • Example Using Mini-Batch Gradient Descent (Learning Rate Decay)

  • Mini-Batches Size

  • Which of these statements about mini-batch gradient descent do you agree with?

  • Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

  • Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:

  • Stochastic Gradient Descent Algorithm

  • Loss Gradient over a Mini-batch

  • Common Learning Rate Decay Implementation