This set of Machine Learning Multiple Choice Questions & Answers (MCQs) focuses on “Stochastic Gradient Descent”.

1. Stochastic gradient descent is also known as on-line gradient descent.

a) True

b) False

Answer: a

Explanation: Stochastic gradient descent is also known as online gradient descent. It is said to be online because it updates the coefficients as new samples arrive in the system, making an update to the weight vector based on one data point at a time.
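The online update described above can be sketched in a few lines of Python; the linear model, data stream and learning rate below are illustrative, not from the questions:

```python
# One-sample-at-a-time (online) SGD for a linear model y' = w.x + b,
# with squared-error loss L = (y' - y)^2. Data and learning rate are made up.
def sgd_step(w, b, x, y, lr):
    y_pred = sum(wi * xi for wi, xi in zip(w, x)) + b   # prediction
    grad = 2 * (y_pred - y)                             # dL/dy'
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]   # update each weight
    b = b - lr * grad                                   # update the bias
    return w, b

stream = [([1.0, 2.0], 5.0), ([2.0, 0.0], 1.0)]  # (features, target) pairs
w, b = [0.0, 0.0], 0.0
for x, y in stream:          # coefficients are updated as each sample arrives
    w, b = sgd_step(w, b, x, y, lr=0.05)
```

Note that the weight vector changes after every single example; a batch method would instead accumulate the gradient over the whole stream before making one update.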

2. Stochastic gradient descent (SGD) methods handle redundancy in the data much more efficiently than batch methods.

a) True

b) False

Answer: a

Explanation: Stochastic gradient descent methods handle redundancy in the data much more efficiently than batch methods. If we double the dataset size by duplicating every point, the error function is simply multiplied by a factor of 2, so each SGD update is unchanged; batch methods, however, must process twice as much data, and hence need double the computation, for every single update.

3. Which of the following statements is true about stochastic gradient descent?

a) It processes all the training examples for each iteration of gradient descent

b) It is computationally very expensive, if the number of training examples is large

c) It processes one training example per iteration

d) It is not preferred, if the number of training examples is large

Answer: c

Explanation: Stochastic gradient descent processes one training example per iteration; that is, it updates the weight vector based on one data point at a time. The other three statements describe batch gradient descent.

4. Which of the following statements is not true about the stochastic gradient descent?

a) The parameters are being updated after one iteration

b) It is quite faster than batch gradient descent

c) Stochastic gradient descent is faster than mini batch gradient descent

d) When the number of training examples is large, it can be additional overhead for the system

Answer: c

Explanation: Stochastic gradient descent is not faster than mini-batch gradient descent; it is slower. It is, however, faster than batch gradient descent, and its parameters are updated after each iteration. When the number of training examples is large, the number of iterations grows and becomes an additional overhead for the system.

5. Stochastic gradient descent falls under Non-convex optimization.

a) True

b) False

Answer: a

Explanation: Stochastic gradient descent falls under non-convex optimization. A non-convex optimization problem is any problem where the objective or any of the constraints is non-convex.

6. Which of the following statements is not true about stochastic gradient descent?

a) Due to the frequent updates, there can be so many noisy steps

b) It may take longer to achieve convergence to the minima of the loss function

c) Frequent updates are computationally expensive

d) It is computationally slower

Answer: d

Explanation: Stochastic gradient descent (SGD) is not computationally slower; it is faster, as only one sample is processed at a time. The other three statements are disadvantages of SGD: the frequent updates produce noisy steps, slow the convergence to the minima of the loss function, and are computationally expensive.

7. In stochastic gradient descent the high variance frequent parameter updates causes the loss function to fluctuate heavily.

a) False

b) True

Answer: b

Explanation: In stochastic gradient descent the frequent parameter updates have high variance and cause the loss function (objective function) to fluctuate with varying intensity. The high-variance parameter updates help to discover better local minima, but at the same time they destabilize convergence to the exact minimum.

8. Stochastic gradient descent has the possibility of escaping from local minima.

a) False

b) True

Answer: b

Explanation: One of the properties of stochastic gradient descent is the possibility of escaping from local minima, since a stationary point of the error function for the whole data set will generally not be a stationary point for each data point individually.

9. Given an example from a dataset (x_{1}, x_{2}) = (4, 1), observed value y = 2, and initial weights w_{1}, w_{2} and bias b of -0.015, -0.038 and 0 respectively, what will be the prediction y'?

a) 0.01

b) 0.03

c) 0.05

d) -0.1

Answer: d

Explanation: Given x_{1} = 4, x_{2} = 1, w_{1} = -0.015, w_{2} = -0.038, y = 2 and b = 0.

Then prediction y' = w_{1}x_{1} + w_{2}x_{2} + b

= (-0.015 * 4) + (-0.038 * 1) + 0

= -0.06 - 0.038 + 0

= -0.098

≈ -0.1
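The arithmetic above can be verified with a quick check in plain Python:

```python
# Worked check of the prediction above: y' = w1*x1 + w2*x2 + b
x1, x2 = 4, 1
w1, w2, b = -0.015, -0.038, 0.0
y_pred = w1 * x1 + w2 * x2 + b   # (-0.015 * 4) + (-0.038 * 1) + 0
print(round(y_pred, 3))          # -0.098, i.e. roughly -0.1
```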

10. Given an example from a dataset (x_{1}, x_{2}) = (2,8) and the dependent variable y = -14, and the model prediction y’ = -11. What will be the loss function if we are using a squared difference method?

a) 6

b) -3

c) 9

d) 3

Answer: c

Explanation: Given the observed value y = -14 and the predicted value y' = -11 (the inputs x_{1} = 2, x_{2} = 8 are not needed for this calculation).

Then, using the squared-difference method, loss L = (y' - y)^{2}

= (-11 - (-14))^{2}

= (-11 + 14)^{2}

= 3^{2}

= 9
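The same loss can be checked in a couple of lines of Python:

```python
# Worked check of the loss above: squared difference L = (y' - y)^2
y, y_pred = -14, -11
loss = (y_pred - y) ** 2   # (-11 - (-14))^2 = 3^2
print(loss)                # 9
```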

11. Given the current bias b = 0, learning rate = 0.01 and gradient = -4.2. What will be the b’ value after the update?

a) -0.42

b) 0.042

c) 0.42

d) -0.042

Answer: b

Explanation: Given b = 0, learning rate η = 0.01 and gradient = -4.2.

Then the bias value after the update, b' = b - (η * gradient)

= 0 - (0.01 * -4.2)

= 0 + 0.042

= 0.042
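The update rule translates directly into Python for a quick check:

```python
# Worked check of the bias update above: b' = b - lr * gradient
b, lr, gradient = 0.0, 0.01, -4.2
b_new = b - lr * gradient   # 0 - (0.01 * -4.2)
print(round(b_new, 3))      # 0.042
```

Note that subtracting a negative gradient moves the bias upward, which is exactly why gradient descent follows the direction of steepest decrease.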

12. Given the example from a data set x_{1} = 3, x_{2} = 1, observed value y = 2 and predicted value y’ = -0.05. What will be the gradient if you are using a squared difference method?

a) -4.1

b) -2.05

c) 4.1

d) 2.05

Answer: a

Explanation: Given x_{1} = 3, x_{2} = 1, y = 2 and y' = -0.05.

Then gradient = 2(y' - y), since we are taking the partial derivative of the loss (y' - y)^{2} with respect to y'.

Gradient = 2(y' - y)

= 2(-0.05 - 2)

= 2 * -2.05

= -4.1
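The derivative computation above checks out numerically:

```python
# Worked check of the gradient above: d/dy' of (y' - y)^2 is 2 * (y' - y)
y, y_pred = 2, -0.05
grad = 2 * (y_pred - y)   # 2 * (-0.05 - 2)
print(round(grad, 2))     # -4.1
```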

13. Given the example from a data set x_{1} = 4, x_{2} = 1, weights w_{1} = -0.02, w_{2} = -0.03, bias b = 0, observed value y = 2, predicted value y’ = -0.11 and learning rate = 0.05. What will be the next weight updating values if you are using a squared difference approach?

a) -0.902, -0.314

b) 0.824, 0.181

c) -0.594, -0.324

d) -0.625, -0.524

Answer: b

Explanation: Given x_{1} = 4, x_{2} = 1, w_{1} = -0.02, w_{2} = -0.03, bias b = 0, y = 2, y' = -0.11 and η = 0.05.

Then w_{1}' = w_{1} - η(2(y' - y) * x_{1})

= -0.02 - 0.05 * (2 * (-0.11 - 2) * 4)

= -0.02 - 0.05 * (-16.88)

= -0.02 + 0.844

= 0.824

Then w_{2}' = w_{2} - η(2(y' - y) * x_{2})

= -0.03 - 0.05 * (2 * (-0.11 - 2) * 1)

= -0.03 - 0.05 * (-4.22)

= -0.03 + 0.211

= 0.181
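Putting the pieces of questions 9 through 13 together, a full single-example SGD step (prediction, gradient, then weight and bias updates) can be sketched as follows; the numbers here are illustrative and do not come from the questions above:

```python
# One full SGD update for a linear model with squared-difference loss.
# All values are made up for illustration.
x = [2.0, 1.0]          # input features x1, x2
w = [0.1, -0.2]         # current weights w1, w2
b, y, lr = 0.0, 1.0, 0.1

y_pred = w[0] * x[0] + w[1] * x[1] + b   # prediction: 0.2 - 0.2 + 0 = 0.0
grad = 2 * (y_pred - y)                  # d(y' - y)^2 / dy' = -2.0
w = [wi - lr * grad * xi for wi, xi in zip(w, x)]  # weight update
b = b - lr * grad                        # bias update
```

Each weight moves by lr * grad * x_i and the bias by lr * grad, which is exactly the chain of steps worked through in questions 9 to 13.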


**Sanfoundry Global Education & Learning Series – Machine Learning**.
