This set of Machine Learning Multiple Choice Questions & Answers (MCQs) focuses on “Gradient Descent”.

1. Gradient descent is an optimization algorithm for finding the local minimum of a function.
a) True
b) False

Answer: a
Explanation: Gradient descent is an optimization algorithm for finding the local minimum of a function. It is used to find the values of the parameters of a function that minimize a cost function. The slope of this cost function curve tells us how to update our parameters to make the model more accurate.

2. Gradient descent is the best solution when the parameters cannot be calculated analytically.
a) False
b) True

Answer: b
Explanation: Gradient descent is best used when the parameters cannot be calculated analytically (i.e., using linear algebra). To solve a system of nonlinear equations numerically, we reformulate it as an optimization problem, which is then solved by an optimization algorithm such as gradient descent.

3. Which of the following statements is false about gradient descent?
a) It updates the weight to comprise a small step in the direction of the negative gradient
b) The learning rate parameter is η where η > 0
c) In each iteration, the gradient is re-evaluated for the new weight vector
d) In each iteration, the weight is updated in the direction of the positive gradient

Answer: d
Explanation: Gradient descent is an optimization algorithm, and in each iteration it updates the weight in the direction of the negative gradient, not the positive gradient. The gradient is then re-evaluated for the new weight vector, with a learning rate parameter η > 0.
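The update rule above can be sketched in a few lines of Python; the quadratic loss E(w) = w² and the learning rate η = 0.1 below are illustrative assumptions, not part of the question.

```python
def gradient_descent_step(w, grad, eta=0.1):
    """Take a small step in the direction of the NEGATIVE gradient (eta > 0)."""
    return w - eta * grad

# Illustrative loss E(w) = w^2, whose gradient is dE/dw = 2w.
w = 5.0
for _ in range(3):
    g = 2 * w                        # gradient re-evaluated at the current weight
    w = gradient_descent_step(w, g)
print(w)  # w moves toward the minimum at w = 0
```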

4. In batch gradient descent, each step requires that the entire training set be processed in order to evaluate the error function.
a) True
b) False

Answer: a
Explanation: Techniques that use the whole data set at once are called batch methods. So in batch gradient descent, each step requires that the entire training set be processed in order to evaluate the error function, which is defined with respect to that training set.
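A minimal sketch of the batch method, assuming a tiny made-up dataset: every update evaluates the error gradient over the entire training set.

```python
import numpy as np

# Assumed toy dataset: fit y ≈ w * x with squared error (true w = 2).
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, eta = 0.0, 0.01
for _ in range(200):
    # Each step processes the WHOLE training set to evaluate the gradient.
    grad = 2 * np.mean((w * X - y) * X)
    w -= eta * grad
print(round(w, 6))  # converges to 2.0
```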

5. Simple gradient descent is a better batch optimization method than conjugate gradients and quasi-Newton methods.
a) False
b) True

Answer: a
Explanation: Conjugate gradients and quasi-Newton methods are more robust and faster batch optimization methods than simple gradient descent. Unlike gradient descent, these algorithms guarantee that the error function decreases at each iteration unless the weight vector has arrived at a local or global minimum.

6. What is the gradient of the function 2x² – 3y² + 4y – 10 at the point (0, 0)?
a) 0i + 4j
b) 1i + 10j
c) 2i – 3j
d) -3i + 4j

Answer: a
Explanation: Given the function f = 2x² – 3y² + 4y – 10, the gradient at the point (0, 0) can be calculated as:
$$\frac {\partial f}{\partial x} = \frac {\partial (2x^2 – 3y^2 + 4y – 10)}{\partial x}$$
= 4x
= 4 * 0
= 0
$$\frac {\partial f}{\partial y} = \frac {\partial (2x^2 – 3y^2 + 4y – 10)}{\partial y}$$
= -6y + 4
= (-6 * 0) + 4
= 0 + 4
= 4
Gradient, ∇f = 0i + 4j
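The analytic result can be checked numerically with central differences; the step size h below is an arbitrary small assumption.

```python
def f(x, y):
    return 2 * x**2 - 3 * y**2 + 4 * y - 10

h = 1e-6  # small finite-difference step
fx = (f(h, 0) - f(-h, 0)) / (2 * h)   # ∂f/∂x at (0, 0)
fy = (f(0, h) - f(0, -h)) / (2 * h)   # ∂f/∂y at (0, 0)
print(fx, fy)  # fx ≈ 0, fy ≈ 4, matching ∇f = 0i + 4j
```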

7. The gradient is set to zero to find the minimum or the maximum of a function.
a) False
b) True

Answer: b
Explanation: The gradient is set to zero to find the minimum or the maximum of a function, because the gradient at an extremum (minimum or maximum) of a function is always zero; the derivative of the function is zero at any local maximum or minimum.

8. The main difference between gradient descent variants is the amount of data they use.
a) True
b) False

Answer: a
Explanation: There are mainly three types of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. The main difference between these algorithms is the amount of data they handle per update, and based on this their accuracy and the time taken for weight updates vary.
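The difference can be made concrete by counting how many updates one epoch produces under each variant; the dataset size N below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                # assumed training-set size

def batches_per_epoch(batch_size):
    """Index sets that one epoch of gradient descent would update on."""
    idx = rng.permutation(N)
    return [idx[i:i + batch_size] for i in range(0, N, batch_size)]

# Batch GD uses all N samples per update, SGD uses 1, mini-batch uses b.
print(len(batches_per_epoch(N)),    # 1 update per epoch  (batch)
      len(batches_per_epoch(1)),    # N updates per epoch (stochastic)
      len(batches_per_epoch(32)))   # ceil(N / 32) = 4    (mini-batch)
```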

9. Which of the following statements is false about choosing the learning rate in gradient descent?
a) A small learning rate leads to slow convergence
b) A large learning rate causes the loss function to fluctuate around the minimum
c) A large learning rate can cause divergence
d) A small learning rate causes the training to progress very fast

Answer: d
Explanation: If the learning rate is too small, training progresses very slowly because the weight updates are tiny, which leads to slow convergence. A large learning rate causes the loss function to fluctuate around the minimum and can even cause divergence.
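All three regimes can be observed on the simple quadratic y = (x + 4)² used later in this set; the specific learning rates below are illustrative assumptions.

```python
def run(eta, steps=50, x0=3.0):
    """Gradient descent on y = (x + 4)^2, whose derivative is dy/dx = 2(x + 4)."""
    x = x0
    for _ in range(steps):
        x -= eta * 2 * (x + 4)
    return x

print(run(0.01))  # small eta: still far from the minimum at -4 (slow convergence)
print(run(0.45))  # well-chosen eta: very close to -4
print(run(1.10))  # too-large eta: x blows up (divergence)
```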

10. Which of the following is not related to gradient descent?
a) AdaBoost
b) Adadelta
c) Adagrad
d) RMSprop

Answer: a
Explanation: AdaBoost is a meta-algorithm that combines base learners to form a final classifier, whereas Adadelta, Adagrad, and RMSprop are gradient descent optimization algorithms. These algorithms are widely used by the deep learning community to solve a number of challenges.

11. Given a function y = (x + 4)². What are the local minimum of the function and the value of x after the first iteration of gradient descent, starting from the point x = 3 (assume the learning rate is 0.01)?
a) 0, 3.02
b) 0, 4.08
c) -4, 2.86
d) 4, 3.8

Answer: c
Explanation: We know y = (x + 4)² reaches its minimum value when x = -4 (i.e., when x = -4, y = 0). Hence x = -4 is the local (and global) minimum of the function.
Let x0 = 3, learning rate = 0.01, and y = (x + 4)². Then using gradient descent,
$$\frac {dy}{dx} = \frac {d(x + 4)^2}{dx}$$
= 2 * (x + 4)
During the first iteration,
x1 = x0 – (learning rate * $$\frac {dy}{dx}$$)
= 3 – (0.01 * (2 * (3 + 4)))
= 3 – (0.01 * (2 * 7))
= 3 – (0.01 * 14)
= 3 – 0.14
= 2.86
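The single step above can be verified directly:

```python
x0, eta = 3.0, 0.01
grad = 2 * (x0 + 4)        # dy/dx of (x + 4)^2 evaluated at x0 = 3
x1 = x0 - eta * grad
print(x1)  # 2.86, while the minimum of (x + 4)^2 sits at x = -4
```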

12. Given a function y = (x + 30)². Starting from the point x = 1, how many iterations of gradient descent does it take for x to reach its first negative value (assume the learning rate is 0.01)?
a) 3
b) 4
c) 2
d) 5

Answer: c
Explanation: Let x0 = 1, learning rate = 0.01, and y = (x + 30)². Then using gradient descent,
$$\frac {dy}{dx} = \frac {d(x + 30)^2}{dx}$$
= 2 * (x + 30)
During the first iteration,
x1 = x0 – (learning rate * $$\frac {dy}{dx}$$)
= 1 – (0.01 * (2 * (1 + 30)))
= 1 – (0.01 * (2 * 31))
= 1 – (0.01 * 62)
= 1 – 0.62
= 0.38
During the second iteration,
x2 = x1 – (learning rate * $$\frac {dy}{dx}$$)
= 0.38 – (0.01 * (2 * (0.38 + 30)))
= 0.38 – (0.01 * (2 * 30.38))
= 0.38 – (0.01 * 60.76)
= 0.38 – 0.6076
= -0.2276
So, x reaches its first negative value after two iterations.
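The iteration count can be confirmed by looping until x turns negative:

```python
def first_negative_x(x0=1.0, eta=0.01):
    """Run gradient descent on y = (x + 30)^2 until x goes negative."""
    x, steps = x0, 0
    while x >= 0:
        x -= eta * 2 * (x + 30)   # dy/dx = 2(x + 30)
        steps += 1
    return steps, x

steps, x = first_negative_x()
print(steps)  # 2 iterations; x ≈ -0.23 at that point
```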

Sanfoundry Global Education & Learning Series – Machine Learning.
