Gradient Descent for Machine Learning



Optimization is a big part of machine learning. Almost every machine learning algorithm has an optimization algorithm at its core.

Summary

In this post you discovered gradient descent for machine learning. You learned that:

  • Optimization is a big part of machine learning.
  • Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.


Kick-start your project with my new book Master Machine Learning Algorithms, including step-by-step tutorials and the Excel Spreadsheet files for all examples.



Let’s get started.

Gradient Descent For Machine Learning. Photo by Grand Canyon National Park.

Gradient Descent

Gradient descent is an optimization algorithm used to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost).

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

Intuition for Gradient Descent

Think of a large bowl like what you would eat cereal out of or store fruit in. This bowl is a plot of the cost function (f).

Large Bowl. Photo by William Warby.

A random position on the surface of the bowl is the cost of the current values of the coefficients (cost).

The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function.

The goal is to continue to try different values for the coefficients, evaluate their cost and select new coefficients that have a slightly better (lower) cost.

Repeating this process enough times will lead to the bottom of the bowl and you will know the values of the coefficients that result in the minimum cost.

Gradient Descent Procedure

The procedure starts off with initial values for the coefficient or coefficients for the function. These could be 0.0 or a small random value.

coefficient = 0.0

The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.

cost = f(coefficient)

or

cost = evaluate(f(coefficient))

The derivative of the cost is calculated. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.

delta = derivative(cost)

Now that we know from the derivative which direction is downhill, we can update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.

coefficient = coefficient - (alpha * delta)

This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be good enough.
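
To make the procedure concrete, here is a minimal sketch in Python (my own illustration, not code from the post) that applies exactly these steps to a hypothetical one-parameter cost function cost(w) = (w - 3)^2, whose derivative is 2 * (w - 3). The cost function, learning rate, and iteration count are all assumptions chosen for demonstration.

# Gradient descent on a made-up one-parameter cost function.
def cost(coefficient):
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    # Slope of the cost function at the current coefficient value.
    return 2.0 * (coefficient - 3.0)

coefficient = 0.0   # initial value; a small random value also works
alpha = 0.1         # learning rate
n_iterations = 50   # or stop once the cost is close enough to zero

for i in range(n_iterations):
    delta = derivative(coefficient)
    coefficient = coefficient - (alpha * delta)
    print(i, coefficient, cost(coefficient))

Each iteration prints a coefficient that moves closer to 3.0, the minimum of this particular cost function.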

You can see how simple gradient descent is. It does require you to know the gradient of your cost function or the function you are optimizing, but besides that, it’s very straightforward. Next we will see how we can use this in machine learning algorithms.

Batch Gradient Descent for Machine Learning

The goal of all supervised machine learning algorithms is to best estimate a target function (f) that maps input data (X) onto output variables (Y). This describes all classification and regression problems.

Some machine learning algorithms have coefficients that characterize the algorithm's estimate of the target function (f). Different algorithms have different representations and different coefficients, but many of them require a process of optimization to find the set of coefficients that result in the best estimate of the target function.

Common examples of algorithms with coefficients that can be optimized using gradient descent are Linear Regression and Logistic Regression.

How closely a machine learning model fits the target function can be evaluated in a number of different ways, often specific to the machine learning algorithm. The cost function evaluates the coefficients in the machine learning model by calculating a prediction for each training instance in the dataset, comparing the predictions to the actual output values, and calculating a sum or average error (such as the Sum of Squared Residuals, or SSR, in the case of linear regression).

From the cost function a derivative can be calculated for each coefficient so that it can be updated using exactly the update equation described above.

The cost is calculated for a machine learning algorithm over the entire training dataset for each iteration of the gradient descent algorithm. One iteration of the algorithm is called one batch and this form of gradient descent is referred to as batch gradient descent.

Batch gradient descent is the most common form of gradient descent described in machine learning.
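
As a rough illustration of the batch idea (again my own sketch, not code from the post), the snippet below fits a simple linear regression y = b0 + b1 * x by gradient descent on a sum-of-squared-error cost. The five-point dataset, learning rate, and number of epochs are made up, and the gradient expressions are written up to the constant factor in the SSR derivative.

# Batch gradient descent for simple linear regression on a made-up dataset.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1.0, 3.0, 3.0, 2.0, 5.0]

b0, b1 = 0.0, 0.0   # coefficients: intercept and slope
alpha = 0.01        # learning rate
n_epochs = 100      # each epoch is one batch, i.e. one pass over all data

for epoch in range(n_epochs):
    grad_b0, grad_b1, sse = 0.0, 0.0, 0.0
    # Accumulate error over ALL training instances before updating.
    for x, y in zip(X, Y):
        error = (b0 + b1 * x) - y
        grad_b0 += error        # contribution to d(cost)/d(b0)
        grad_b1 += error * x    # contribution to d(cost)/d(b1)
        sse += error ** 2
    # One update per pass over the dataset: this is what makes it "batch".
    b0 = b0 - alpha * grad_b0
    b1 = b1 - alpha * grad_b1

print(b0, b1, sse)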

Stochastic Gradient Descent for Machine Learning

Gradient descent can be slow to run on very large datasets.

Because one iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, it can take a long time when you have many millions of instances.

In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent.

In this variation, the gradient descent procedure described above is run but the update to the coefficients is performed for each training instance, rather than at the end of the batch of instances.

The first step of the procedure requires that the order of the training dataset is randomized. This is to mix up the order in which updates are made to the coefficients. Because the coefficients are updated after every training instance, the updates will be noisy, jumping all over the place, and so will the corresponding cost function. Mixing up the order of the updates to the coefficients harnesses this random walk and keeps it from getting distracted or stuck.

The update procedure for the coefficients is the same as that above, except the cost is not summed over all training patterns, but instead calculated for one training pattern.

The learning can be much faster with stochastic gradient descent for very large training datasets and often you only need a small number of passes through the dataset to reach a good or good enough set of coefficients, e.g. 1-to-10 passes through the dataset.
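
A comparable sketch of the stochastic variant follows (same caveats: illustrative data and hyperparameters, not code from the post). The only structural change from the batch sketch above is that the coefficients are updated immediately after each training instance, and the dataset order is shuffled before every pass.

import random

# Stochastic gradient descent for the same made-up linear regression.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1.0, 3.0, 3.0, 2.0, 5.0]

b0, b1 = 0.0, 0.0   # coefficients: intercept and slope
alpha = 0.01        # learning rate
n_epochs = 5        # often only a small number of passes is needed

data = list(zip(X, Y))
for epoch in range(n_epochs):
    random.shuffle(data)   # randomize the order of the updates
    for x, y in data:
        error = (b0 + b1 * x) - y
        # Update from this single instance, rather than the whole batch.
        b0 = b0 - alpha * error
        b1 = b1 - alpha * error * x

print(b0, b1)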

Tips for Gradient Descent

This section lists some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm each iteration. The expectation for a well performing gradient descent run is a decrease in cost each iteration. If it does not decrease, try reducing your learning rate.

  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different values for your problem and see which works best.

  • Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1] (a small sketch follows this list).

  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.

  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.
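
As a small illustration of the Rescale Inputs tip (an assumption-laden sketch of min-max scaling, not code from the post), the following maps a made-up input variable into the range [0, 1]:

# Min-max rescaling of a made-up input variable into the range [0, 1].
X = [20.0, 35.0, 50.0, 80.0]

x_min, x_max = min(X), max(X)
X_scaled = [(x - x_min) / (x_max - x_min) for x in X]

print(X_scaled)   # every value now lies between 0.0 and 1.0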

Do you have any questions about gradient descent for machine learning or this post? Leave a comment and ask your question and I will do my best to answer it.

By Jason Brownlee on March 23, 2016, in Machine Learning Algorithms


Last Updated on August 12, 2019