Chapter 6: Decision Trees

Reference: the author's Jupyter Notebook
Chapter 6 – Decision Trees

  1. Saving figures
    from __future__ import division, print_function, unicode_literals
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import os
    np.random.seed(42)

    mpl.rc('axes', labelsize=14)
    mpl.rc('xtick', labelsize=12)
    mpl.rc('ytick', labelsize=12)

    # Where to save the figures
    PROJECT_ROOT_DIR = "images"
    CHAPTER_ID = "decision_trees"

    def save_fig(fig_id, tight_layout=True):
        path = os.path.join(PROJECT_ROOT_DIR, CHAPTER_ID, fig_id + ".png")
        os.makedirs(os.path.dirname(path), exist_ok=True)  # make sure the target directory exists
        print("Saving figure", fig_id)
        if tight_layout:
            plt.tight_layout()
        plt.savefig(path, format='png', dpi=600)

Training and Visualizing a Decision Tree

  1. To understand decision trees, let's start by building one and looking at how it makes predictions. The following code trains a DecisionTreeClassifier on the iris dataset (see Chapter 4):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X = iris.data[:, 2:] # petal length and width
    y = iris.target

    tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
    tree_clf.fit(X, y)
    #print(tree_clf.fit(X, y))
  2. To visualize the trained decision tree, first use the export_graphviz() method to output a graph definition file named iris_tree.dot:

    from sklearn.tree import export_graphviz

    # The author's notebook defines an image_path() helper; it is not defined in
    # these notes, so the output path is built directly instead:
    export_graphviz(
        tree_clf,
        out_file=os.path.join(PROJECT_ROOT_DIR, CHAPTER_ID, "iris_tree.dot"),
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )
    # The following shell command converts the .dot file to a .png image:
    # $ dot -Tpng iris_tree.dot -o iris_tree.png

Making Predictions

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)

plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf, X, y)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "Depth=0", fontsize=15)
plt.text(3.2, 1.80, "Depth=1", fontsize=13)
plt.text(4.05, 0.5, "(Depth=2)", fontsize=11)

save_fig("decision_tree_decision_boundaries_plot")
plt.show()

Estimating Class Probabilities

  1. A decision tree can also estimate the probability that an instance belongs to a particular class k: it traverses the tree to find the instance's leaf node, then returns the ratio of training instances of class k in that node.
    #print(tree_clf.predict_proba([[5, 1.5]]))
    #print(tree_clf.predict([[5, 1.5]]))

The CART Training Algorithm

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train decision trees (a process also referred to as "growing" trees).
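
For classification, the cost function that CART minimizes when it splits a node can be stated as follows (a standard formulation; here m is the number of instances in the node, k a feature, t_k a threshold, and G the impurity of each resulting subset):

    J(k, t_k) = \frac{m_{\mathrm{left}}}{m} G_{\mathrm{left}} + \frac{m_{\mathrm{right}}}{m} G_{\mathrm{right}}

The algorithm picks the pair (k, t_k) that produces the purest subsets, weighted by their size, then splits the subsets recursively until it reaches the maximum depth or can no longer reduce impurity.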

Computational Complexity

Making a prediction means traversing the tree from the root down to a leaf, which touches roughly O(log2(m)) nodes and is independent of the number of features, so predictions are very fast even on large training sets. Training, on the other hand, compares all features on all samples at each level, giving a training complexity of roughly O(n × m log(m)).

Gini Impurity or Entropy?
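
For reference, these are the two impurity measures being compared, for a node i with class proportions p_{i,k}:

    G_i = 1 - \sum_{k=1}^{K} p_{i,k}^2 \qquad \text{(Gini impurity)}

    H_i = -\sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{K} p_{i,k} \log_2(p_{i,k}) \qquad \text{(entropy)}

In practice the two usually lead to similar trees; Gini impurity is slightly faster to compute.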

Regularization Hyperparameters

from sklearn.datasets import make_moons
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)

save_fig("min_samples_leaf_plot正则化超参数")
plt.show()

The tree on the left is trained with the default hyperparameters (i.e., no restrictions), while the one on the right is trained with min_samples_leaf=4. It is quite clear that the model on the left is overfitting and the one on the right generalizes better.

Regression

  1. Decision trees can also perform regression tasks. Let's build a regression tree with Scikit-Learn's DecisionTreeRegressor, training it on a noisy quadratic dataset with max_depth=2:

    np.random.seed(42)
    m = 200
    X = np.random.rand(m, 1)
    y = 4 * (X - 0.5) ** 2
    y = y + np.random.randn(m, 1) / 10

    from sklearn.tree import DecisionTreeRegressor

    tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree_reg.fit(X, y)
    #print(tree_reg.fit(X, y))
  2. Comparing the predictions of two decision tree regression models:

    from sklearn.tree import DecisionTreeRegressor

    tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
    tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)
    tree_reg1.fit(X, y)
    tree_reg2.fit(X, y)

    def plot_regression_predictions(tree_reg, X, y, axes=[0, 1, -0.2, 1], ylabel="$y$"):
        x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
        y_pred = tree_reg.predict(x1)
        plt.axis(axes)
        plt.xlabel("$x_1$", fontsize=18)
        if ylabel:
            plt.ylabel(ylabel, fontsize=18, rotation=0)
        plt.plot(X, y, "b.")
        plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")

    plt.figure(figsize=(11, 4))
    plt.subplot(121)
    plot_regression_predictions(tree_reg1, X, y)
    for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
        plt.plot([split, split], [-0.2, 1], style, linewidth=2)
    plt.text(0.21, 0.65, "Depth=0", fontsize=15)
    plt.text(0.01, 0.2, "Depth=1", fontsize=13)
    plt.text(0.65, 0.8, "Depth=1", fontsize=13)
    plt.legend(loc="upper center", fontsize=18)
    plt.title("max_depth=2", fontsize=14)

    plt.subplot(122)
    plot_regression_predictions(tree_reg2, X, y, ylabel=None)
    for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
        plt.plot([split, split], [-0.2, 1], style, linewidth=2)
    for split in (0.0458, 0.1298, 0.2873, 0.9040):
        plt.plot([split, split], [-0.2, 1], "k:", linewidth=1)
    plt.text(0.3, 0.5, "Depth=2", fontsize=13)
    plt.title("max_depth=3", fontsize=14)

    save_fig("tree_regression_plot")
    plt.show()

Instability

  1. Sensitivity to rotation of the training data

    np.random.seed(6)
    Xs = np.random.rand(100, 2) - 0.5
    ys = (Xs[:, 0] > 0).astype(np.float32) * 2

    angle = np.pi / 4
    rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
    Xsr = Xs.dot(rotation_matrix)

    tree_clf_s = DecisionTreeClassifier(random_state=42)
    tree_clf_s.fit(Xs, ys)
    tree_clf_sr = DecisionTreeClassifier(random_state=42)
    tree_clf_sr.fit(Xsr, ys)

    plt.figure(figsize=(11, 4))
    plt.subplot(121)
    plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)
    plt.subplot(122)
    plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)

    save_fig("sensitivity_to_rotation_plot对数据旋转敏感")
    plt.show()
  2. Sensitivity to small variations in the training set

    X[(X[:, 1]==X[:, 1][y==1].max()) & (y==1)] # widest Iris-Versicolor flower

    not_widest_versicolor = (X[:, 1]!=1.8) | (y==2)
    X_tweaked = X[not_widest_versicolor]
    y_tweaked = y[not_widest_versicolor]

    tree_clf_tweaked = DecisionTreeClassifier(max_depth=2, random_state=40)
    tree_clf_tweaked.fit(X_tweaked, y_tweaked)

    plt.figure(figsize=(8, 4))
    plot_decision_boundary(tree_clf_tweaked, X_tweaked, y_tweaked, legend=False)
    plt.plot([0, 7.5], [0.8, 0.8], "k-", linewidth=2)
    plt.plot([0, 7.5], [1.75, 1.75], "k--", linewidth=2)
    plt.text(1.0, 0.9, "Depth=0", fontsize=15)
    plt.text(1.0, 1.80, "Depth=1", fontsize=13)

    save_fig("decision_tree_inst ability_plot对训练集细节敏感")
    plt.show()

Chapter 5: Support Vector Machines

Reference: the author's Jupyter Notebook
Chapter 5 – Support Vector Machines

A support vector machine (SVM) is a powerful and versatile machine learning model, capable of performing linear and nonlinear classification, regression, and even outlier detection. It is one of the most popular models in machine learning, and anyone interested in the field should have it in their toolbox. SVMs are particularly well suited to classifying complex but small or medium-sized datasets.

  1. Saving figures
    from __future__ import division, print_function, unicode_literals
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import os
    np.random.seed(42)

    mpl.rc('axes', labelsize=14)
    mpl.rc('xtick', labelsize=12)
    mpl.rc('ytick', labelsize=12)

    # Where to save the figures
    PROJECT_ROOT_DIR = "images"
    CHAPTER_ID = "traininglinearmodels"

    def save_fig(fig_id, tight_layout=True):
        path = os.path.join(PROJECT_ROOT_DIR, CHAPTER_ID, fig_id + ".png")
        os.makedirs(os.path.dirname(path), exist_ok=True)  # make sure the target directory exists
        print("Saving figure", fig_id)
        if tight_layout:
            plt.tight_layout()
        plt.savefig(path, format='png', dpi=600)

Linear SVM Classification

  1. Load the iris dataset, scale the features, and then train a linear SVM model (using the LinearSVC class with C=0.1 and the hinge loss function, described shortly) to detect Iris-Virginica flowers.
    from sklearn.svm import SVC
    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris["data"][:, (2, 3)] # petal length, petal width
    y = iris["target"]

    setosa_or_versicolor = (y == 0) | (y == 1)
    X = X[setosa_or_versicolor]
    y = y[setosa_or_versicolor]

    # SVM Classifier model
    svm_clf = SVC(kernel="linear", C=float("inf"))
    print(svm_clf.fit(X, y))

    # Bad models
    x0 = np.linspace(0, 5.5, 200)
    pred_1 = 5*x0 - 20
    pred_2 = x0 - 1.8
    pred_3 = 0.1 * x0 + 0.5

    def plot_svc_decision_boundary(svm_clf, xmin, xmax):
        w = svm_clf.coef_[0]
        b = svm_clf.intercept_[0]

        # At the decision boundary, w0*x0 + w1*x1 + b = 0
        # => x1 = -w0/w1 * x0 - b/w1
        x0 = np.linspace(xmin, xmax, 200)
        decision_boundary = -w[0]/w[1] * x0 - b/w[1]

        margin = 1/w[1]
        gutter_up = decision_boundary + margin
        gutter_down = decision_boundary - margin

        svs = svm_clf.support_vectors_
        plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')
        plt.plot(x0, decision_boundary, "k-", linewidth=2)
        plt.plot(x0, gutter_up, "k--", linewidth=2)
        plt.plot(x0, gutter_down, "k--", linewidth=2)

    plt.figure(figsize=(12,2.7))

    plt.subplot(121)
    plt.plot(x0, pred_1, "g--", linewidth=2)
    plt.plot(x0, pred_2, "m-", linewidth=2)
    plt.plot(x0, pred_3, "r-", linewidth=2)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
    plt.xlabel("Petal length", fontsize=14)
    plt.ylabel("Petal width", fontsize=14)
    plt.legend(loc="upper left", fontsize=14)
    plt.axis([0, 5.5, 0, 2])

    plt.subplot(122)
    plot_svc_decision_boundary(svm_clf, 0, 5.5)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs")
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo")
    plt.xlabel("Petal length", fontsize=14)
    plt.axis([0, 5.5, 0, 2])

    save_fig("large_margin_classification_plot较少间隔违例和大间隔对比")
    plt.show()

Nonlinear SVM Classification

  1. Making a dataset linearly separable by adding features

    X1D = np.linspace(-4, 4, 9).reshape(-1, 1)
    X2D = np.c_[X1D, X1D**2]
    y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])
    plt.figure(figsize=(11, 4))

    plt.subplot(121)
    plt.grid(True, which='both')
    plt.axhline(y=0, color='k')
    plt.plot(X1D[:, 0][y==0], np.zeros(4), "bs")
    plt.plot(X1D[:, 0][y==1], np.zeros(5), "g^")
    plt.gca().get_yaxis().set_ticks([])
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.axis([-4.5, 4.5, -0.2, 0.2])

    plt.subplot(122)
    plt.grid(True, which='both')
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.plot(X2D[:, 0][y==0], X2D[:, 1][y==0], "bs")
    plt.plot(X2D[:, 0][y==1], X2D[:, 1][y==1], "g^")
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
    plt.gca().get_yaxis().set_ticks([0, 4, 8, 12, 16])
    plt.plot([-4.5, 4.5], [6.5, 6.5], "r--", linewidth=3)
    plt.axis([-4.5, 4.5, -1, 17])

    plt.subplots_adjust(right=1)

    save_fig("higher_dimensions_plot", tight_layout=False)
    plt.show()
  2. To implement this idea with Scikit-Learn, build a pipeline containing a PolynomialFeatures transformer, followed by a StandardScaler and a LinearSVC. Let's test this on the moons dataset:

    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC
    from sklearn.datasets import make_moons
    X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

    def plot_dataset(X, y, axes):
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
        plt.axis(axes)
        plt.grid(True, which='both')
        plt.xlabel(r"$x_1$", fontsize=20)
        plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

    #plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
    #plt.show()

    from sklearn.datasets import make_moons
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures

    polynomial_svm_clf = Pipeline([
        ("poly_features", PolynomialFeatures(degree=3)),
        ("scaler", StandardScaler()),
        ("svm_clf", LinearSVC(C=10, loss="hinge", random_state=42))
    ])

    polynomial_svm_clf.fit(X, y)

    def plot_predictions(clf, axes):
        x0s = np.linspace(axes[0], axes[1], 100)
        x1s = np.linspace(axes[2], axes[3], 100)
        x0, x1 = np.meshgrid(x0s, x1s)
        X = np.c_[x0.ravel(), x1.ravel()]
        y_pred = clf.predict(X).reshape(x0.shape)
        y_decision = clf.decision_function(X).reshape(x0.shape)
        plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
        plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

    plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])

    save_fig("moons_polynomial_svc_plot")
    plt.show()
  3. Polynomial kernel

    from sklearn.svm import SVC

    poly_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
    ])
    poly_kernel_svm_clf.fit(X, y)
    #print(poly_kernel_svm_clf.fit(X, y))

    poly100_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=10, coef0=100, C=5))
    ])
    poly100_kernel_svm_clf.fit(X, y)
    #print(poly100_kernel_svm_clf.fit(X, y))

    plt.figure(figsize=(11, 4))

    plt.subplot(121)
    plot_predictions(poly_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
    plt.title(r"$d=3, r=1, C=5$", fontsize=18)

    plt.subplot(122)
    plot_predictions(poly100_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
    plt.title(r"$d=10, r=100, C=5$", fontsize=18)

    save_fig("moons_kernelized_polynomial_svc_plot")
    plt.show()
  4. Adding similarity features

    def gaussian_rbf(x, landmark, gamma):
        return np.exp(-gamma * np.linalg.norm(x - landmark, axis=1)**2)

    gamma = 0.3

    x1s = np.linspace(-4.5, 4.5, 200).reshape(-1, 1)
    x2s = gaussian_rbf(x1s, -2, gamma)
    x3s = gaussian_rbf(x1s, 1, gamma)

    XK = np.c_[gaussian_rbf(X1D, -2, gamma), gaussian_rbf(X1D, 1, gamma)]
    yk = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

    plt.figure(figsize=(11, 4))

    plt.subplot(121)
    plt.grid(True, which='both')
    plt.axhline(y=0, color='k')
    plt.scatter(x=[-2, 1], y=[0, 0], s=150, alpha=0.5, c="red")
    plt.plot(X1D[:, 0][yk==0], np.zeros(4), "bs")
    plt.plot(X1D[:, 0][yk==1], np.zeros(5), "g^")
    plt.plot(x1s, x2s, "g--")
    plt.plot(x1s, x3s, "b:")
    plt.gca().get_yaxis().set_ticks([0, 0.25, 0.5, 0.75, 1])
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"Similarity", fontsize=14)
    plt.annotate(r'$\mathbf{x}$',
    xy=(X1D[3, 0], 0),
    xytext=(-0.5, 0.20),
    ha="center",
    arrowprops=dict(facecolor='black', shrink=0.1),
    fontsize=18,
    )
    plt.text(-2, 0.9, "$x_2$", ha="center", fontsize=20)
    plt.text(1, 0.9, "$x_3$", ha="center", fontsize=20)
    plt.axis([-4.5, 4.5, -0.1, 1.1])

    plt.subplot(122)
    plt.grid(True, which='both')
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.plot(XK[:, 0][yk==0], XK[:, 1][yk==0], "bs")
    plt.plot(XK[:, 0][yk==1], XK[:, 1][yk==1], "g^")
    plt.xlabel(r"$x_2$", fontsize=20)
    plt.ylabel(r"$x_3$ ", fontsize=20, rotation=0)
    plt.annotate(r'$\phi\left(\mathbf{x}\right)$',
    xy=(XK[3, 0], XK[3, 1]),
    xytext=(0.65, 0.50),
    ha="center",
    arrowprops=dict(facecolor='black', shrink=0.1),
    fontsize=18,
    )
    plt.plot([-0.1, 1.1], [0.57, -0.1], "r--", linewidth=3)
    plt.axis([-0.1, 1.1, -0.1, 1.1])

    plt.subplots_adjust(right=1)

    #save_fig("kernel_method_plot")
    #plt.show()
  5. The Gaussian RBF (its formula is given after the code below)

    x1_example = X1D[3, 0]
    for landmark in (-2, 1):
        k = gaussian_rbf(np.array([[x1_example]]), np.array([[landmark]]), gamma)
        print("Phi({}, {}) = {}".format(x1_example, landmark, k))
  6. Trying the Gaussian RBF kernel with the SVC class

    rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
    ])
    rbf_kernel_svm_clf.fit(X, y)
    print(rbf_kernel_svm_clf.fit(X, y))

    from sklearn.svm import SVC

    gamma1, gamma2 = 0.1, 5
    C1, C2 = 0.001, 1000
    hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

    svm_clfs = []
    for gamma, C in hyperparams:
        rbf_kernel_svm_clf = Pipeline([
            ("scaler", StandardScaler()),
            ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))
        ])
        rbf_kernel_svm_clf.fit(X, y)
        svm_clfs.append(rbf_kernel_svm_clf)

    plt.figure(figsize=(11, 7))

    for i, svm_clf in enumerate(svm_clfs):
        plt.subplot(221 + i)
        plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
        plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
        gamma, C = hyperparams[i]
        plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=16)

    # SVM classifiers using an RBF kernel
    save_fig("moons_rbf_svc_plot")
    plt.show()

SVM Regression

  1. SVM regression

    np.random.seed(42)
    m = 50
    X = 2 * np.random.rand(m, 1)
    y = (4 + 3 * X + np.random.randn(m, 1)).ravel()

    from sklearn.svm import LinearSVR

    svm_reg = LinearSVR(epsilon=1.5, random_state=42)
    svm_reg.fit(X, y)

    svm_reg1 = LinearSVR(epsilon=1.5, random_state=42)
    svm_reg2 = LinearSVR(epsilon=0.5, random_state=42)
    svm_reg1.fit(X, y)
    svm_reg2.fit(X, y)

    def find_support_vectors(svm_reg, X, y):
        y_pred = svm_reg.predict(X)
        off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
        return np.argwhere(off_margin)

    svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)
    svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

    eps_x1 = 1
    eps_y_pred = svm_reg1.predict([[eps_x1]])

    def plot_svm_regression(svm_reg, X, y, axes):
        x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
        y_pred = svm_reg.predict(x1s)
        plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
        plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
        plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
        plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
        plt.plot(X, y, "bo")
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.legend(loc="upper left", fontsize=18)
        plt.axis(axes)

    plt.figure(figsize=(9, 4))
    plt.subplot(121)
    plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
    plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
    plt.ylabel(r"$y$", fontsize=18, rotation=0)
    #plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
    plt.annotate(
    '', xy=(eps_x1, eps_y_pred), xycoords='data',
    xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon),
    textcoords='data', arrowprops={'arrowstyle': '<->', 'linewidth': 1.5}
    )
    plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
    plt.subplot(122)
    plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
    plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
    save_fig("svm_regression_plot使用二阶多项式核的SVM回归")
    plt.show()
  2. SVM regression with a 2nd-degree polynomial kernel

    np.random.seed(42)
    m = 100
    X = 2 * np.random.rand(m, 1) - 1
    y = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(m, 1)/10).ravel()
    # The SVR class is the regression equivalent of the SVC class, and LinearSVR is the
    # regression equivalent of LinearSVC. LinearSVR scales linearly with the size of the
    # training set (like LinearSVC), while SVR gets much too slow when the training set
    # grows large (like SVC).

    from sklearn.svm import SVR

    svm_poly_reg1 = SVR(kernel="poly", degree=2, C=100, epsilon=0.1, gamma="auto")
    svm_poly_reg2 = SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1, gamma="auto")
    svm_poly_reg1.fit(X, y)
    svm_poly_reg2.fit(X, y)

    plt.figure(figsize=(9, 4))
    plt.subplot(121)
    plot_svm_regression(svm_poly_reg1, X, y, [-1, 1, 0, 1])
    plt.title(r"$degree={}, C={}, \epsilon = {}$".format(svm_poly_reg1.degree, svm_poly_reg1.C, svm_poly_reg1.epsilon), fontsize=18)
    plt.ylabel(r"$y$", fontsize=18, rotation=0)
    plt.subplot(122)
    plot_svm_regression(svm_poly_reg2, X, y, [-1, 1, 0, 1])
    plt.title(r"$degree={}, C={}, \epsilon = {}$".format(svm_poly_reg2.degree, svm_poly_reg2.C, svm_poly_reg2.epsilon), fontsize=18)
    save_fig("svm_with_polynomial_kernel_plot使用二阶多项式核的SVM回归")
    plt.show()
  3. The decision function on the iris dataset

    from sklearn import datasets

    # Load the data first, so that X and y exist before the classifiers are fitted
    iris = datasets.load_iris()
    X = iris["data"][:, (2, 3)] # petal length, petal width
    y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica

    scaler = StandardScaler()
    svm_clf1 = LinearSVC(C=1, loss="hinge", random_state=42)
    svm_clf2 = LinearSVC(C=100, loss="hinge", random_state=42)

    scaled_svm_clf1 = Pipeline([
        ("scaler", scaler),
        ("linear_svc", svm_clf1),
    ])
    scaled_svm_clf2 = Pipeline([
        ("scaler", scaler),
        ("linear_svc", svm_clf2),
    ])

    scaled_svm_clf1.fit(X, y)
    scaled_svm_clf2.fit(X, y)

    # Convert to unscaled parameters
    b1 = svm_clf1.decision_function([-scaler.mean_ / scaler.scale_])
    b2 = svm_clf2.decision_function([-scaler.mean_ / scaler.scale_])
    w1 = svm_clf1.coef_[0] / scaler.scale_
    w2 = svm_clf2.coef_[0] / scaler.scale_
    svm_clf1.intercept_ = np.array([b1])
    svm_clf2.intercept_ = np.array([b2])
    svm_clf1.coef_ = np.array([w1])
    svm_clf2.coef_ = np.array([w2])

    # Find support vectors (LinearSVC does not do this automatically)
    t = y * 2 - 1
    support_vectors_idx1 = (t * (X.dot(w1) + b1) < 1).ravel()
    support_vectors_idx2 = (t * (X.dot(w2) + b2) < 1).ravel()
    svm_clf1.support_vectors_ = X[support_vectors_idx1]
    svm_clf2.support_vectors_ = X[support_vectors_idx2]

    from mpl_toolkits.mplot3d import Axes3D

    def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
        x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
        X_crop = X[x1_in_bounds]
        y_crop = y[x1_in_bounds]
        x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
        x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
        x1, x2 = np.meshgrid(x1s, x2s)
        xs = np.c_[x1.ravel(), x2.ravel()]
        df = (xs.dot(w) + b).reshape(x1.shape)
        m = 1 / np.linalg.norm(w)
        boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
        margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
        margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
        ax.plot_surface(x1s, x2, np.zeros_like(x1),
                        color="b", alpha=0.2, cstride=100, rstride=100)
        ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
        ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
        ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
        ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
        ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
        ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
        ax.axis(x1_lim + x2_lim)
        ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
        ax.set_xlabel(r"Petal length", fontsize=15)
        ax.set_ylabel(r"Petal width", fontsize=15)
        ax.set_zlabel(r"$h = \mathbf{w}^T \mathbf{x} + b$", fontsize=18)
        ax.legend(loc="upper left", fontsize=16)

    fig = plt.figure(figsize=(11, 6))
    ax1 = fig.add_subplot(111, projection='3d')
    plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])

    #save_fig("iris_3D_plot")
    #plt.show()
  4. A smaller weight vector results in a larger margin

    def plot_2D_decision_function(w, b, ylabel=True, x1_lim=[-3, 3]):
        x1 = np.linspace(x1_lim[0], x1_lim[1], 200)
        y = w * x1 + b
        m = 1 / w

        plt.plot(x1, y)
        plt.plot(x1_lim, [1, 1], "k:")
        plt.plot(x1_lim, [-1, -1], "k:")
        plt.axhline(y=0, color='k')
        plt.axvline(x=0, color='k')
        plt.plot([m, m], [0, 1], "k--")
        plt.plot([-m, -m], [0, -1], "k--")
        plt.plot([-m, m], [0, 0], "k-o", linewidth=3)
        plt.axis(x1_lim + [-2, 2])
        plt.xlabel(r"$x_1$", fontsize=16)
        if ylabel:
            plt.ylabel(r"$w_1 x_1$ ", rotation=0, fontsize=16)
        plt.title(r"$w_1 = {}$".format(w), fontsize=16)

    plt.figure(figsize=(12, 3.2))
    plt.subplot(121)
    plot_2D_decision_function(1, 0)
    plt.subplot(122)
    plot_2D_decision_function(0.5, 0, ylabel=False)
    save_fig("small_w_large_margin_plot")
    plt.show()

    from sklearn.svm import SVC
    from sklearn import datasets

    iris = datasets.load_iris()
    X = iris["data"][:, (2, 3)] # petal length, petal width
    y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica

    svm_clf = SVC(kernel="linear", C=1)
    svm_clf.fit(X, y)
    svm_clf.predict([[5.3, 1.3]])
    #print(svm_clf.predict([[5.3, 1.3]]))

    #Hinge loss
    t = np.linspace(-2, 4, 200)
    h = np.where(1 - t < 0, 0, 1 - t) # max(0, 1-t)

    plt.figure(figsize=(5,2.8))
    plt.plot(t, h, "b-", linewidth=2, label="$max(0, 1 - t)$")
    plt.grid(True, which='both')
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.yticks(np.arange(-1, 2.5, 1))
    plt.xlabel("$t$", fontsize=16)
    plt.axis([-2, 4, -1, 2.5])
    plt.legend(loc="upper right", fontsize=16)
    save_fig("hinge_plot")
    plt.show()

Chapter 4: Training Models

Reference: the author's Jupyter Notebook
Chapter 4 – Training Linear Models

  1. Generating and saving figures
    from __future__ import division, print_function, unicode_literals
    import numpy as np
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import os
    np.random.seed(42)

    mpl.rc('axes', labelsize=14)
    mpl.rc('xtick', labelsize=12)
    mpl.rc('ytick', labelsize=12)

    # Where to save the figures
    PROJECT_ROOT_DIR = "images"
    CHAPTER_ID = "traininglinearmodels"

    def save_fig(fig_id, tight_layout=True):
        path = os.path.join(PROJECT_ROOT_DIR, CHAPTER_ID, fig_id + ".png")
        os.makedirs(os.path.dirname(path), exist_ok=True)  # make sure the target directory exists
        print("Saving figure", fig_id)
        if tight_layout:
            plt.tight_layout()
        plt.savefig(path, format='png', dpi=600)

Linear Regression

  1. Generate some linear data to test this formula (the Normal Equation):

    import numpy as np
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)

    plt.plot(X, y, "b.")
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.axis([0, 2, 0, 15])
    #save_fig("generated_data_plot")
  2. Use the inv() function from NumPy's linear algebra module (np.linalg) to invert the matrix, and the dot() method to compute the matrix product (the Normal Equation itself is restated after the code):

    X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
    theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    #print(theta_best)
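
    For reference, the Normal Equation implemented by the line above is:

        \hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

    where \hat{\boldsymbol{\theta}} is the value of \boldsymbol{\theta} that minimizes the cost function and \mathbf{y} is the vector of target values.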
  3. Now you can make predictions using θ̂:

    X_new = np.array([[0], [2]])
    X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
    y_predict = X_new_b.dot(theta_best)
    #print(y_predict)

    # Plot the model's predictions
    plt.plot(X_new, y_predict, "r-", linewidth=2, label="Predictions")
    plt.plot(X, y, "b.")
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.legend(loc="upper left", fontsize=14)
    plt.axis([0, 2, 0, 15])
    #save_fig("linear_model_predictions")
    #plt.show()
  4. The equivalent code using Scikit-Learn looks like this:

    from sklearn.linear_model import LinearRegression
    lin_reg = LinearRegression()
    lin_reg.fit(X, y)
    print(lin_reg.intercept_, lin_reg.coef_)
    print(lin_reg.predict(X_new))
    theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
    print(theta_best_svd)
    print(np.linalg.pinv(X_b).dot(y))

Gradient Descent

The central idea of gradient descent is to tweak the parameters iteratively in order to minimize a cost function. If the learning rate is too low, the algorithm needs many iterations to converge, which takes a long time. If the learning rate is too high, you may jump across the valley and end up on the other side, possibly even higher than where you started; this can make the algorithm diverge, with the values growing larger and larger, so it never finds a good solution.
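
For reference, each step computes the gradient of the MSE cost function and moves the parameters in the opposite direction, scaled by the learning rate \eta:

    \nabla_{\boldsymbol{\theta}} \mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m} \mathbf{X}^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y}), \qquad \boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}} \mathrm{MSE}(\boldsymbol{\theta})

This is exactly what the gradients line in the batch implementation below computes.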

  1. Batch gradient descent: a quick implementation of the algorithm:

    eta = 0.1
    n_iterations = 1000
    m = 100
    theta = np.random.randn(2,1)

    for iteration in range(n_iterations):
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    #print(theta)
    #print(X_new_b.dot(theta))
  2. The first ten steps of gradient descent with three different learning rates:

    theta_path_bgd = []
    def plot_gradient_descent(theta, eta, theta_path=None):
        m = len(X_b)
        plt.plot(X, y, "b.")
        n_iterations = 1000
        for iteration in range(n_iterations):
            if iteration < 10:
                y_predict = X_new_b.dot(theta)
                style = "b-" if iteration > 0 else "r--"
                plt.plot(X_new, y_predict, style)
            gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
            theta = theta - eta * gradients
            if theta_path is not None:
                theta_path.append(theta)
        plt.xlabel("$x_1$", fontsize=18)
        plt.axis([0, 2, 0, 15])
        plt.title(r"$\eta = {}$".format(eta), fontsize=16)

    np.random.seed(42)
    theta = np.random.randn(2,1) # random initialization
    plt.figure(figsize=(10,4))
    plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)
    plt.subplot(133); plot_gradient_descent(theta, eta=0.5)
    #save_fig("gradient_descent_plot")
    #plt.show()
  3. The following code implements stochastic gradient descent using a simple learning schedule:

    theta_path_sgd = []
    m = len(X_b)
    np.random.seed(42)

    n_epochs = 50
    t0, t1 = 5, 50 # learning schedule hyperparameters

    def learning_schedule(t):
        return t0 / (t + t1)

    theta = np.random.randn(2,1) # random initialization

    for epoch in range(n_epochs):
        for i in range(m):
            if epoch == 0 and i < 20:               # not shown in the book
                y_predict = X_new_b.dot(theta)      # not shown
                style = "b-" if i > 0 else "r--"    # not shown
                plt.plot(X_new, y_predict, style)   # not shown
            random_index = np.random.randint(m)
            xi = X_b[random_index:random_index+1]
            yi = y[random_index:random_index+1]
            gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
            eta = learning_schedule(epoch * m + i)
            theta = theta - eta * gradients
            theta_path_sgd.append(theta)            # not shown

    plt.plot(X, y, "b.") # not shown
    plt.xlabel("$x_1$", fontsize=18) # not shown
    plt.ylabel("$y$", rotation=0, fontsize=18) # not shown
    plt.axis([0, 2, 0, 15]) # not shown
    #save_fig("sgd_plot") # not shown
    #plt.show()

    from sklearn.linear_model import SGDRegressor
    sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)  # n_iter was renamed to max_iter in newer Scikit-Learn versions
    sgd_reg.fit(X, y.ravel())
    print(sgd_reg.intercept_, sgd_reg.coef_)
  4. Mini-batch gradient descent

    # theta_path_bgd and theta_path_sgd already hold the paths recorded by the two
    # previous cells; re-initializing them here would leave the "Batch" and
    # "Stochastic" curves in the plot below empty.
    theta_path_mgd = []
    m = 100
    n_iterations = 50
    minibatch_size = 20

    np.random.seed(42)
    theta = np.random.randn(2,1) # random initialization

    t0, t1 = 200, 1000
    def learning_schedule(t):
        return t0 / (t + t1)

    t = 0
    for epoch in range(n_iterations):
        shuffled_indices = np.random.permutation(m)
        X_b_shuffled = X_b[shuffled_indices]
        y_shuffled = y[shuffled_indices]
        for i in range(0, m, minibatch_size):
            t += 1
            xi = X_b_shuffled[i:i+minibatch_size]
            yi = y_shuffled[i:i+minibatch_size]
            gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
            eta = learning_schedule(t)
            theta = theta - eta * gradients
            theta_path_mgd.append(theta)

    theta_path_bgd = np.array(theta_path_bgd)
    theta_path_sgd = np.array(theta_path_sgd)
    theta_path_mgd = np.array(theta_path_mgd)

    plt.figure(figsize=(7,4))
    plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
    plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
    plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")
    plt.legend(loc="upper left", fontsize=16)
    plt.xlabel(r"$\theta_0$", fontsize=20)
    plt.ylabel(r"$\theta_1$ ", fontsize=20, rotation=0)
    plt.axis([2.5, 4.5, 2.3, 3.9])
    save_fig("gradient_descent_paths_plot")
    plt.show()

Polynomial Regression

  1. Generate some nonlinear data based on a simple quadratic equation (of the form y = ax² + bx + c), plus some random noise:

    import numpy as np
    import numpy.random as rnd
    np.random.seed(42)

    m = 100
    X = 6 * np.random.rand(m, 1) - 3
    y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

    plt.plot(X, y, "b.")
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.axis([-3, 3, 0, 10])
    save_fig("quadratic_data_plot")
    plt.show()
  2. Use Scikit-Learn's PolynomialFeatures class to transform the training data:

    from sklearn.preprocessing import PolynomialFeatures
    poly_features = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly_features.fit_transform(X)
    #print(X[0])
    #print(X_poly[0])
  3. Fit a LinearRegression model to this extended training set:

    from sklearn.linear_model import LinearRegression
    lin_reg = LinearRegression()
    lin_reg.fit(X_poly, y)
    #print(lin_reg.intercept_, lin_reg.coef_)

    X_new=np.linspace(-3, 3, 100).reshape(100, 1)
    X_new_poly = poly_features.transform(X_new)
    y_new = lin_reg.predict(X_new_poly)
    plt.plot(X, y, "b.")
    plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.legend(loc="upper left", fontsize=14)
    plt.axis([-3, 3, 0, 10])
    #save_fig("quadratic_predictions_plot(多项式回归模型预测)")
    #plt.show()

Learning Curves

  1. High-degree polynomial regression

    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline

    for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
        polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
        std_scaler = StandardScaler()
        lin_reg = LinearRegression()
        polynomial_regression = Pipeline([
            ("poly_features", polybig_features),
            ("std_scaler", std_scaler),
            ("lin_reg", lin_reg),
        ])
        polynomial_regression.fit(X, y)
        y_newbig = polynomial_regression.predict(X_new)
        plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)

    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left")
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.axis([-3, 3, 0, 10])
    #save_fig("high_degree_polynomials_plot(高阶多项式回归)")
    #plt.show()
  2. Learning curves for a plain linear regression model

    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    def plot_learning_curves(model, X, y):
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
        train_errors, val_errors = [], []
        for m in range(1, len(X_train)):
            model.fit(X_train[:m], y_train[:m])
            y_train_predict = model.predict(X_train[:m])
            y_val_predict = model.predict(X_val)
            train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
            val_errors.append(mean_squared_error(y_val, y_val_predict))

        plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
        plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
        plt.legend(loc="upper right", fontsize=14)   # not shown in the book
        plt.xlabel("Training set size", fontsize=14) # not shown
        plt.ylabel("RMSE", fontsize=14)              # not shown
    lin_reg = LinearRegression()
    plot_learning_curves(lin_reg, X, y)
    plt.axis([0, 80, 0, 3]) # not shown in the book
    #save_fig("underfitting_learning_curves_plot(学习曲线)") # not shown
    #plt.show() # not shown
  3. Learning curves for a 10th-degree polynomial regression model

    polynomial_regression = Pipeline([
        ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
        ("lin_reg", LinearRegression()),
    ])

    plot_learning_curves(polynomial_regression, X, y)
    plt.axis([0, 80, 0, 3])           # not shown
    save_fig("learning_curves_plot")  # not shown
    plt.show()

Regularized Linear Models

  1. Ridge regression (also called Tikhonov regularization) is a regularized version of linear regression; its cost function is shown after the code below.

    from sklearn.linear_model import Ridge

    np.random.seed(42)
    m = 20
    X = 3 * np.random.rand(m, 1)
    y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
    X_new = np.linspace(0, 3, 100).reshape(100, 1)

    def plot_model(model_class, polynomial, alphas, **model_kargs):
        for alpha, style in zip(alphas, ("b-", "g--", "r:")):
            model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
            if polynomial:
                model = Pipeline([
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ])
            model.fit(X, y)
            y_new_regul = model.predict(X_new)
            lw = 2 if alpha > 0 else 1
            plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
        plt.plot(X, y, "b.", linewidth=3)
        plt.legend(loc="upper left", fontsize=15)
        plt.xlabel("$x_1$", fontsize=18)
        plt.axis([0, 3, 0, 4])

    plt.figure(figsize=(8,4))
    plt.subplot(121)
    plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.subplot(122)
    plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)

    save_fig("ridge_regression_plot(岭回归)")
    plt.show()
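
    For reference, ridge regression adds an \ell_2 penalty to the MSE cost function (the bias term \theta_0 is not regularized):

        J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2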
  2. Performing closed-form ridge regression with Scikit-Learn:

    from sklearn.linear_model import Ridge
    ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
    ridge_reg.fit(X, y)
    ridge_reg.predict([[1.5]])

    ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
    ridge_reg.fit(X, y)
    ridge_reg.predict([[1.5]])

    # Using stochastic gradient descent
    from sklearn.linear_model import SGDRegressor
    sgd_reg = SGDRegressor(max_iter=50, tol=-np.infty, penalty="l2", random_state=42)
    sgd_reg.fit(X, y.ravel())
    sgd_reg.predict([[1.5]])
  3. Lasso regression: another regularized version of linear regression, called Least Absolute Shrinkage and Selection Operator Regression (Lasso regression for short); its cost function is shown after the code below.

    from sklearn.linear_model import Lasso

    plt.figure(figsize=(8,4))
    plt.subplot(121)
    plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.subplot(122)
    plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), tol=1, random_state=42)
    #save_fig("lasso_regression_plot(套索回归Lasso回归)")
    #plt.show()
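
    For reference, the Lasso cost function uses an \ell_1 penalty on the weights instead, which tends to drive the weights of the least important features all the way to zero:

        J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} |\theta_i|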
  4. A small example using Scikit-Learn's Lasso class:

    from sklearn.linear_model import Lasso
    lasso_reg = Lasso(alpha=0.1)
    lasso_reg.fit(X, y)
    #print(lasso_reg.predict([[1.5]]))
  5. Elastic net: a small example using Scikit-Learn's ElasticNet class

    from sklearn.linear_model import ElasticNet
    elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
    elastic_net.fit(X, y)
    #print(elastic_net.predict([[1.5]]))
  6. Early stopping

    from sklearn.linear_model import SGDRegressor
    np.random.seed(42)
    m = 100
    X = 6 * np.random.rand(m, 1) - 3
    y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

    X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

    poly_scaler = Pipeline([
        ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
        ("std_scaler", StandardScaler()),
    ])

    X_train_poly_scaled = poly_scaler.fit_transform(X_train)
    X_val_poly_scaled = poly_scaler.transform(X_val)

    sgd_reg = SGDRegressor(max_iter=1,
                           tol=-np.infty,
                           penalty=None,
                           eta0=0.0005,
                           warm_start=True,
                           learning_rate="constant",
                           random_state=42)

    n_epochs = 500
    train_errors, val_errors = [], []
    for epoch in range(n_epochs):
        sgd_reg.fit(X_train_poly_scaled, y_train)
        y_train_predict = sgd_reg.predict(X_train_poly_scaled)
        y_val_predict = sgd_reg.predict(X_val_poly_scaled)
        train_errors.append(mean_squared_error(y_train, y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))

    best_epoch = np.argmin(val_errors)
    best_val_rmse = np.sqrt(val_errors[best_epoch])

    plt.annotate('Best model',
                 xy=(best_epoch, best_val_rmse),
                 xytext=(best_epoch, best_val_rmse + 1),
                 ha="center",
                 arrowprops=dict(facecolor='black', shrink=0.05),
                 fontsize=16,
                 )

    best_val_rmse -= 0.03 # just to make the graph look better
    plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
    plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Epoch", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
    save_fig("early_stopping_plot(早期停止法)")
    plt.show()
  7. Lasso regression versus ridge regression

    t1a, t1b, t2a, t2b = -1, 3, -1.5, 1.5

    # ignoring bias term
    t1s = np.linspace(t1a, t1b, 500)
    t2s = np.linspace(t2a, t2b, 500)
    t1, t2 = np.meshgrid(t1s, t2s)
    T = np.c_[t1.ravel(), t2.ravel()]
    Xr = np.array([[-1, 1], [-0.3, -1], [1, 0.1]])
    yr = 2 * Xr[:, :1] + 0.5 * Xr[:, 1:]

    J = (1/len(Xr) * np.sum((T.dot(Xr.T) - yr.T)**2, axis=1)).reshape(t1.shape)

    N1 = np.linalg.norm(T, ord=1, axis=1).reshape(t1.shape)
    N2 = np.linalg.norm(T, ord=2, axis=1).reshape(t1.shape)

    t_min_idx = np.unravel_index(np.argmin(J), J.shape)
    t1_min, t2_min = t1[t_min_idx], t2[t_min_idx]

    t_init = np.array([[0.25], [-1]])

    def bgd_path(theta, X, y, l1, l2, core = 1, eta = 0.1, n_iterations = 50):
        path = [theta]
        for iteration in range(n_iterations):
            gradients = core * 2/len(X) * X.T.dot(X.dot(theta) - y) + l1 * np.sign(theta) + 2 * l2 * theta
            theta = theta - eta * gradients
            path.append(theta)
        return np.array(path)

    plt.figure(figsize=(12, 8))
    for i, N, l1, l2, title in ((0, N1, 0.5, 0, "Lasso"), (1, N2, 0, 0.1, "Ridge")):
        JR = J + l1 * N1 + l2 * N2**2

        tr_min_idx = np.unravel_index(np.argmin(JR), JR.shape)
        t1r_min, t2r_min = t1[tr_min_idx], t2[tr_min_idx]

        levelsJ = (np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(J) - np.min(J)) + np.min(J)
        levelsJR = (np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(JR) - np.min(JR)) + np.min(JR)
        levelsN = np.linspace(0, np.max(N), 10)

        path_J = bgd_path(t_init, Xr, yr, l1=0, l2=0)
        path_JR = bgd_path(t_init, Xr, yr, l1, l2)
        path_N = bgd_path(t_init, Xr, yr, np.sign(l1)/3, np.sign(l2), core=0)

        plt.subplot(221 + i * 2)
        plt.grid(True)
        plt.axhline(y=0, color='k')
        plt.axvline(x=0, color='k')
        plt.contourf(t1, t2, J, levels=levelsJ, alpha=0.9)
        plt.contour(t1, t2, N, levels=levelsN)
        plt.plot(path_J[:, 0], path_J[:, 1], "w-o")
        plt.plot(path_N[:, 0], path_N[:, 1], "y-^")
        plt.plot(t1_min, t2_min, "rs")
        plt.title(r"$\ell_{}$ penalty".format(i + 1), fontsize=16)
        plt.axis([t1a, t1b, t2a, t2b])
        if i == 1:
            plt.xlabel(r"$\theta_1$", fontsize=20)
        plt.ylabel(r"$\theta_2$", fontsize=20, rotation=0)

        plt.subplot(222 + i * 2)
        plt.grid(True)
        plt.axhline(y=0, color='k')
        plt.axvline(x=0, color='k')
        plt.contourf(t1, t2, JR, levels=levelsJR, alpha=0.9)
        plt.plot(path_JR[:, 0], path_JR[:, 1], "w-o")
        plt.plot(t1r_min, t2r_min, "rs")
        plt.title(title, fontsize=16)
        plt.axis([t1a, t1b, t2a, t2b])
        if i == 1:
            plt.xlabel(r"$\theta_1$", fontsize=20)

    save_fig("lasso_vs_ridge_plot")
    plt.show()
  8. Logistic regression (its cost function is shown after the code below)

    # The logistic function
    t = np.linspace(-10, 10, 100)
    sig = 1 / (1 + np.exp(-t))
    plt.figure(figsize=(9, 3))
    plt.plot([-10, 10], [0, 0], "k-")
    plt.plot([-10, 10], [0.5, 0.5], "k:")
    plt.plot([-10, 10], [1, 1], "k:")
    plt.plot([0, 0], [-1.1, 1.1], "k-")
    plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
    plt.xlabel("t")
    plt.legend(loc="upper left", fontsize=20)
    plt.axis([-10, 10, -0.1, 1.1])
    save_fig("logistic_function_plot")
    plt.show()
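
    For reference, training minimizes the log loss over the whole training set, where \hat{p}^{(i)} = \sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)}) is the probability estimated by the logistic function plotted above:

        J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right) \right]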
  9. Decision boundaries

    # Build a classifier to detect Iris-Virginica flowers.
    from sklearn import datasets
    iris = datasets.load_iris()
    list(iris.keys())
    #print(list(iris.keys()))
    #print(iris.DESCR)

    X = iris["data"][:, 3:] # petal width
    y = (iris["target"] == 2).astype(int) # 1 if Iris-Virginica, else 0 (np.int is deprecated in recent NumPy)
  10. Training a logistic regression model

    from sklearn.linear_model import LogisticRegression
    log_reg = LogisticRegression(solver="liblinear", random_state=42)
    log_reg.fit(X, y)

    # Short version
    X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
    y_proba = log_reg.predict_proba(X_new)

    plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris-Virginica")
    plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris-Virginica")
    plt.show()

    # Full version
    X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
    y_proba = log_reg.predict_proba(X_new)
    decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]

    plt.figure(figsize=(8, 3))
    plt.plot(X[y==0], y[y==0], "bs")
    plt.plot(X[y==1], y[y==1], "g^")
    plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
    plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris-Virginica")
    plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris-Virginica")
    plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
    plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')
    plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')
    plt.xlabel("Petal width (cm)", fontsize=14)
    plt.ylabel("Probability", fontsize=14)
    plt.legend(loc="center left", fontsize=14)
    plt.axis([0, 3, -0.02, 1.02])
    #save_fig("logistic_regression_plot(估算概率和决策边界)")
    #plt.show()
    print(decision_boundary)
    print(log_reg.predict([[1.7], [1.5]]))
  11. Softmax regression (multinomial logistic regression); the softmax function is given after the code below

    from sklearn.linear_model import LogisticRegression

    X = iris["data"][:, (2, 3)] # petal length, petal width
    y = (iris["target"] == 2).astype(np.int)

    log_reg = LogisticRegression(solver="liblinear", C=10**10, random_state=42)
    log_reg.fit(X, y)

    x0, x1 = np.meshgrid(
    np.linspace(2.9, 7, 500).reshape(-1, 1),
    np.linspace(0.8, 2.7, 200).reshape(-1, 1),
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]

    y_proba = log_reg.predict_proba(X_new)

    plt.figure(figsize=(10, 4))
    plt.plot(X[y==0, 0], X[y==0, 1], "bs")
    plt.plot(X[y==1, 0], X[y==1, 1], "g^")

    zz = y_proba[:, 1].reshape(x0.shape)
    contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)


    left_right = np.array([2.9, 7])
    boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

    plt.clabel(contour, inline=1, fontsize=12)
    plt.plot(left_right, boundary, "k--", linewidth=3)
    plt.text(3.5, 1.5, "Not Iris-Virginica", fontsize=14, color="b", ha="center")
    plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")
    plt.xlabel("Petal length", fontsize=14)
    plt.ylabel("Petal width", fontsize=14)
    plt.axis([2.9, 7, 0.8, 2.7])
    save_fig("logistic_regression_contour_plot")
    plt.show()

    X = iris["data"][:, (2, 3)] # petal length, petal width
    y = iris["target"]

    softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10, random_state=42)
    softmax_reg.fit(X, y)

    x0, x1 = np.meshgrid(
    np.linspace(0, 8, 500).reshape(-1, 1),
    np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]

    y_proba = softmax_reg.predict_proba(X_new)
    y_predict = softmax_reg.predict(X_new)

    zz1 = y_proba[:, 1].reshape(x0.shape)
    zz = y_predict.reshape(x0.shape)

    plt.figure(figsize=(10, 4))
    plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
    plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
    plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")

    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

    plt.contourf(x0, x1, zz, cmap=custom_cmap)
    contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
    plt.clabel(contour, inline=1, fontsize=12)
    plt.xlabel("Petal length", fontsize=14)
    plt.ylabel("Petal width", fontsize=14)
    plt.legend(loc="center left", fontsize=14)
    plt.axis([0, 7, 0, 3.5])
    save_fig("softmax_regression_contour_plot")
    plt.show()

    print(softmax_reg.predict([[5, 2]]))
    print(softmax_reg.predict_proba([[5, 2]]))
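
    For reference, softmax regression first computes a score s_k(\mathbf{x}) = \boldsymbol{\theta}_k^T \mathbf{x} for each class k, then estimates each class probability with the softmax function:

        \hat{p}_k = \sigma(\mathbf{s}(\mathbf{x}))_k = \frac{\exp(s_k(\mathbf{x}))}{\sum_{j=1}^{K} \exp(s_j(\mathbf{x}))}

    The model predicts the class with the highest estimated probability, which is what predict() returns above.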

Scraping Images from a Web Page with Python

The complete code is as follows:

import os
import re
import urllib
import urllib.request


def get_html(url):
    page = urllib.request.urlopen(url)
    html_a = page.read()
    return html_a.decode('utf-8')

def get_img(url):
    x = 1  # counter used to number the downloaded images
    html_b = get_html(url)
    reg = r'https://[^\s]*?\.jpg'
    imgre = re.compile(reg)  # compile the pattern into a regex object
    imglist = imgre.findall(html_b)  # collect every image URL found in the page
    path = os.getcwd() + os.sep + 'image'  # directory where the images will be saved
    if not os.path.isdir(path):
        os.makedirs(path)  # create the directory if it does not exist
    paths = path + os.sep  # save everything under the image directory
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl, '{0}{1}.jpg'.format(paths, x))  # download each image locally
        print('Downloading image %s' % x)
        x = x + 1

url = input("Enter a URL: ")
#url =
get_img(url)

Chapter 3: Classification

Reference: the author's Jupyter Notebook
Chapter 3 – Classification

  1. Code to fetch the MNIST dataset:

    def sort_by_target(mnist):
        reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
        reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
        mnist.data[:60000] = mnist.data[reorder_train]
        mnist.target[:60000] = mnist.target[reorder_train]
        mnist.data[60000:] = mnist.data[reorder_test + 60000]
        mnist.target[60000:] = mnist.target[reorder_test + 60000]

    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True)
    mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
    sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
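
    Note: recent scikit-learn versions return a pandas DataFrame from fetch_openml() by default, which breaks the array indexing used below. A hedged sketch of the workaround (the as_frame parameter exists only in newer scikit-learn):

    from sklearn.datasets import fetch_openml
    # as_frame=False keeps mnist.data and mnist.target as NumPy arrays
    mnist = fetch_openml('mnist_784', version=1, as_frame=False, cache=True)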
  2. Take a look at these arrays

    #print(mnist["data"], mnist["target"])
    #print(mnist.data.shape)
    X, y = mnist["data"], mnist["target"]
    #print(X.shape)
    #print(y.shape)

    some_digit = X[36000]
    some_digit_image = some_digit.reshape(28, 28)
    plt.imshow(some_digit_image, cmap = mpl.cm.binary,
    interpolation="nearest")
    plt.axis("off")
    #plt.show()
    #print(y[36000])
  3. Some digit images from the MNIST dataset

    def plot_digits(instances, images_per_row=10, **options):
        size = 28
        images_per_row = min(len(instances), images_per_row)
        images = [instance.reshape(size, size) for instance in instances]
        n_rows = (len(instances) - 1) // images_per_row + 1
        row_images = []
        n_empty = n_rows * images_per_row - len(instances)
        images.append(np.zeros((size, size * n_empty)))
        for row in range(n_rows):
            rimages = images[row * images_per_row : (row + 1) * images_per_row]
            row_images.append(np.concatenate(rimages, axis=1))
        image = np.concatenate(row_images, axis=0)
        plt.imshow(image, cmap = mpl.cm.binary, **options)
        plt.axis("off")

    plt.figure(figsize=(9,9))
    example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
    plot_digits(example_images, images_per_row=10)
    #save_fig("more_digits_plot")
    #plt.show()
  4. Shuffle the dataset

    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
  5. Train a binary classifier; first create the target vectors for this classification task:

    y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits.
    y_test_5 = (y_test == 5)
  6. Create an SGDClassifier (stochastic gradient descent classifier) and train it on the whole training set:

    from sklearn.linear_model import SGDClassifier
    sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=42)  # np.infty is removed in NumPy 2.x; np.inf is equivalent
    sgd_clf.fit(X_train, y_train_5)
    #print(sgd_clf.fit(X_train, y_train_5))
    #Now it can be used to detect images of the digit 5:
    sgd_clf.predict([some_digit])
    #print(sgd_clf.predict([some_digit]))
  7. Cross-validation

    from sklearn.model_selection import cross_val_score
    cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
    print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))

    #The following code does roughly the same thing as cross_val_score() above and prints the same results:
    from sklearn.model_selection import StratifiedKFold
    from sklearn.base import clone
    skfolds = StratifiedKFold(n_splits=3, random_state=42)  # newer scikit-learn also requires shuffle=True when random_state is set
    for train_index, test_index in skfolds.split(X_train, y_train_5):
        clone_clf = clone(sgd_clf)
        X_train_folds = X_train[train_index]
        y_train_folds = (y_train_5[train_index])
        X_test_fold = X_train[test_index]
        y_test_fold = (y_train_5[test_index])

        clone_clf.fit(X_train_folds, y_train_folds)
        y_pred = clone_clf.predict(X_test_fold)
        n_correct = sum(y_pred == y_test_fold)
        print(n_correct / len(y_pred))
  8. A dumb classifier (not my words) that classifies every single image as "not-5":

    from sklearn.base import BaseEstimator
    class Never5Classifier(BaseEstimator):
        def fit(self, X, y=None):
            pass
        def predict(self, X):
            return np.zeros((len(X), 1), dtype=bool)
    #accuracy
    never_5_clf = Never5Classifier()
    print(cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy"))
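
    The roughly 90% accuracy is just the base rate: only about 10% of the training images are 5s, so always predicting "not-5" is right about 90% of the time. A quick check:

    print(1 - y_train_5.mean())  # fraction of non-5s, roughly 0.9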
  9. Confusion matrix: a much better way to evaluate a classifier's performance is the confusion matrix.

    from sklearn.model_selection import cross_val_predict
    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
    #cross_val_predict() also performs K-fold cross-validation, but instead of returning evaluation scores it returns the predictions made on each fold, so you get a clean prediction for every instance
    from sklearn.metrics import confusion_matrix
    #confusion_matrix(y_train_5, y_train_pred)
    #print(confusion_matrix(y_train_5, y_train_pred))
    y_train_perfect_predictions = y_train_5
    #print(confusion_matrix(y_train_5, y_train_perfect_predictions))
  10. Precision and recall

    #precision = TP / (TP + FP): TP is the number of true positives, FP the number of false positives.
    #recall = TP / (TP + FN): FN is the number of false negatives.
    from sklearn.metrics import precision_score, recall_score
    print(precision_score(y_train_5, y_train_pred)) #precision: 4344 / (4344 + 1307)
    print(recall_score(y_train_5, y_train_pred)) #recall: 4344 / (4344 + 1077)

    #F1 score: F1 = 2 / (1/precision + 1/recall) = TP / (TP + (FN + FP) / 2)
    from sklearn.metrics import f1_score
    print(f1_score(y_train_5, y_train_pred))
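
    Using the counts quoted in the comments above (TP=4344, FP=1307, FN=1077), the arithmetic works out as follows (a sketch; the exact counts depend on the random state):

    TP, FP, FN = 4344, 1307, 1077
    precision = TP / (TP + FP)             # ~0.769
    recall = TP / (TP + FN)                # ~0.801
    f1 = 2 / (1 / precision + 1 / recall)  # ~0.785
    print(precision, recall, f1)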
  11. The precision/recall trade-off: thresholds

    y_scores = sgd_clf.decision_function([some_digit])
    print(y_scores)
    threshold = 0
    y_some_digit_pred = (y_scores > threshold)
    print(y_some_digit_pred)
    #Raise the threshold
    threshold = 200000
    y_some_digit_pred_a = (y_scores > threshold)
    print(y_some_digit_pred_a)
  12. Deciding which threshold to use

    #Get the scores of all instances in the training set
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
    #Compute precision and recall for all possible thresholds
    from sklearn.metrics import precision_recall_curve
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
    #Use Matplotlib to plot precision and recall as functions of the threshold
    def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
        plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
        plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
        plt.xlabel("Threshold")
        plt.legend(loc="upper left")
        plt.ylim([0, 1])

    plt.figure(figsize=(8, 4))
    plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
    plt.xlim([-700000, 700000])
    plt.show()
    #print((y_train_pred == (y_scores > 0)).all())
    y_train_pred_90 = (y_scores > 70000)
    from sklearn.metrics import precision_score, recall_score
    print(precision_score(y_train_5, y_train_pred_90)) #precision
    print(recall_score(y_train_5, y_train_pred_90)) #recall
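
    Rather than hard-coding 70000, you can search the curve for the lowest threshold that reaches a target precision; a minimal sketch using the arrays computed above:

    idx = np.argmax(precisions >= 0.90)  # first index where precision reaches 90%
    threshold_90_precision = thresholds[idx]
    y_train_pred_90 = (y_scores >= threshold_90_precision)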
  13. Plotting precision directly against recall (the PR curve)

    def plot_precision_vs_recall(precisions, recalls):
        plt.plot(recalls, precisions, "b-", linewidth=2)
        plt.xlabel("Recall", fontsize=16)
        plt.ylabel("Precision", fontsize=16)
        plt.axis([0, 1, 0, 1])

    plt.figure(figsize=(8, 6))
    plot_precision_vs_recall(precisions, recalls)
    plt.show()
  14. The ROC curve (receiver operating characteristic): true positive rate versus false positive rate

    from sklearn.metrics import roc_curve
    fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
    def plot_roc_curve(fpr, tpr, label=None):
        plt.plot(fpr, tpr, linewidth=2, label=label)
        plt.plot([0, 1], [0, 1], 'k--')
        plt.axis([0, 1, 0, 1])
        plt.xlabel('False Positive Rate', fontsize=16)
        plt.ylabel('True Positive Rate', fontsize=16)

    plt.figure(figsize=(8, 6))
    plot_roc_curve(fpr, tpr)
    plt.show()
    from sklearn.metrics import roc_auc_score
    print(roc_auc_score(y_train_5, y_scores))
  15. Train a RandomForestClassifier and compare its ROC curve and ROC AUC score with those of the SGDClassifier.

    from sklearn.ensemble import RandomForestClassifier
    forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
    y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
    fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
    plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
    plt.legend(loc="lower right", fontsize=16)
    plt.show()
    from sklearn.metrics import roc_auc_score
    print(roc_auc_score(y_train_5, y_scores_forest))
  16. Multiclass classifiers: trying SGDClassifier

    #Try SGDClassifier:
    sgd_clf.fit(X_train, y_train)
    sgd_clf.predict([some_digit])
    #print(sgd_clf.predict([some_digit]))
    some_digit_scores = sgd_clf.decision_function([some_digit])
    #print(some_digit_scores)
    #print(np.argmax(some_digit_scores))
    #print(sgd_clf.classes_)
    #print(sgd_clf.classes_[5])

    #The following code uses the OvO strategy to build a multiclass classifier on top of SGDClassifier:
    from sklearn.multiclass import OneVsOneClassifier
    ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, tol=-np.inf, random_state=42))
    ovo_clf.fit(X_train, y_train)
    ovo_clf.predict([some_digit])
    len(ovo_clf.estimators_)
    #print(ovo_clf.predict([some_digit]))
    #print(len(ovo_clf.estimators_))
  17. Training a RandomForestClassifier

    from sklearn.model_selection import cross_val_score
    forest_clf.fit(X_train, y_train)
    #print(forest_clf.predict([some_digit]))
    #print(forest_clf.predict_proba([some_digit])) #list of class probabilities
    #print(cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")) #accuracy
    #Simply scale the inputs
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
    #print(cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy"))
  18. Use Matplotlib's matshow() function to look at the confusion matrix

    y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
    conf_mx = confusion_matrix(y_train, y_train_pred)
    #print(conf_mx)
    #Use Matplotlib's matshow() function to view an image representation of the confusion matrix
    #plt.matshow(conf_mx, cmap=plt.cm.gray)
    #save_fig("confusion_matrix_plot", tight_layout=False)
  19. Divide each value in the confusion matrix by the number of images in the corresponding class, so that you compare error rates rather than absolute numbers of errors

    row_sums = conf_mx.sum(axis=1, keepdims=True)
    norm_conf_mx = conf_mx / row_sums
    #Fill the diagonal with zeros to keep only the errors, then replot the result:
    np.fill_diagonal(norm_conf_mx, 0)
    plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
    #save_fig("confusion_matrix_errors_plot", tight_layout=False)
  20. Look at some examples of 3s and 5s:

    cl_a, cl_b = 3, 5
    X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
    X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
    X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
    X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

    plt.figure(figsize=(8,8))
    plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
    plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
    plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
    plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
    #save_fig("error_analysis_digits_plot")
  21. Multilabel classification

    #This code creates a y_multilabel array containing two target labels for each digit image: the first indicates whether the digit is large (7, 8, or 9), the second whether it is odd.
    from sklearn.neighbors import KNeighborsClassifier
    y_train_large = (y_train >= 7)
    y_train_odd = (y_train % 2 == 1)
    y_multilabel = np.c_[y_train_large, y_train_odd]

    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train, y_multilabel)
    #print(knn_clf.fit(X_train, y_multilabel))
    #The lines above create a KNeighborsClassifier instance (it supports multilabel classification; not all classifiers do), then train it
    #on the multiple-target array. Now make a prediction and note that it outputs two labels:
    knn_clf.predict([some_digit]) #the digit 5 is indeed not large (False) and is odd (True).
    #print(knn_clf.predict([some_digit]))
  22. The following code computes the average F1 score across all labels:

    from sklearn.metrics import f1_score
    y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
    f1_score(y_multilabel, y_train_knn_pred, average="macro")
    #print(f1_score(y_multilabel, y_train_knn_pred, average="macro"))
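
    Macro averaging gives every label equal weight; if the labels are unbalanced, each label's score can instead be weighted by its support (a one-line variant):

    # weight each label's F1 by its number of positive instances (its support)
    print(f1_score(y_multilabel, y_train_knn_pred, average="weighted"))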
  23. Multioutput classification (multioutput-multiclass classification)

    #Start by creating the training and test sets: use NumPy's randint() function
    #to add noise to the MNIST images' pixel intensities. The goal is to restore each image to its original:
    noise = np.random.randint(0, 100, (len(X_train), 784))
    X_train_mod = X_train + noise
    noise = np.random.randint(0, 100, (len(X_test), 784))
    X_test_mod = X_test + noise
    y_train_mod = X_train
    y_test_mod = X_test

    some_index = 5500
    #plt.subplot(121); plot_digit(X_test_mod[some_index])
    #plt.subplot(122); plot_digit(y_test_mod[some_index])
    #save_fig("noisy_digit_example_plot")
  24. Clean up this image:

    knn_clf.fit(X_train_mod, y_train_mod)
    clean_digit = knn_clf.predict([X_test_mod[some_index]])
    plot_digit(clean_digit)
    save_fig("cleaned_digit_example_plot")
    plt.show()

Chapter 2 An End-to-End Machine Learning Project

Reference: the author's Jupyter Notebook
Chapter 2 – End-to-end Machine Learning project

  1. Download the data

    • Open VS Code, create a new Python file, and enter the following code to download the housing.tgz file and extract housing.csv into this directory
      import os
      import tarfile
      from six.moves import urllib

      DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
      HOUSING_PATH = "datasets/housing"
      HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

      def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
          if not os.path.isdir(housing_path):
              os.makedirs(housing_path)
          tgz_path = os.path.join(housing_path, "housing.tgz")
          urllib.request.urlretrieve(housing_url, tgz_path)
          housing_tgz = tarfile.open(tgz_path)
          housing_tgz.extractall(path=housing_path)
          housing_tgz.close()

      fetch_housing_data()
      After the download completes, you can comment out this call
  2. Take a quick look at the data structure

    • Load the data with pandas

      import pandas as pd
      def load_housing_data(housing_path=HOUSING_PATH):
          csv_path = os.path.join(housing_path, "housing.csv")
          return pd.read_csv(csv_path)

      The function returns a pandas DataFrame containing all the data

    • Call the DataFrame's head() method to look at the first five rows (since this is VS Code rather than a notebook, the output differs slightly from the book); comment it out once you've looked

      housing = load_housing_data()
      print(housing.head())

      There are 10 attributes in total

    • The info() method gives a quick description of the data, in particular the total number of rows and each attribute's type and number of non-null values
      print(housing.info())

    • Use the value_counts() method to see what categories exist and how many districts belong to each category
      print(housing["ocean_proximity"].value_counts())

    • The describe() method shows a summary of the numerical attributes
      print(housing.describe())

    • Call hist() on the whole dataset to plot a histogram for each attribute

      import matplotlib.pyplot as plt
      housing.hist(bins=50, figsize=(50,15))
      plt.show()
  3. Create a test set

    • In theory, creating a test set is simple: just pick some instances at random, typically 20% of the dataset, and set them aside:
      import numpy as np
      def split_train_test(data, test_ratio):
          shuffled_indices = np.random.permutation(len(data))
          test_set_size = int(len(data) * test_ratio)
          test_indices = shuffled_indices[:test_set_size]
          train_indices = shuffled_indices[test_set_size:]
          return data.iloc[train_indices], data.iloc[test_indices]

      train_set, test_set = split_train_test(housing, 0.2)
      print(len(train_set), "train +", len(test_set), "test")
    • But this is not perfect: run it again and it will produce a different test set! Over time, you (or your machine learning algorithms) will get to see the whole dataset, which is exactly what creating a test set is meant to avoid. A common solution is to use each instance's identifier to decide whether it goes into the test set (assuming every instance has a unique, immutable identifier)
      import hashlib
      def test_set_check(identifier, test_ratio, hash):
          return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

      def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
          ids = data[id_column]
          in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
          return data.loc[~in_test_set], data.loc[in_test_set]

      #housing_with_id = housing.reset_index()
      #housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
      #train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
      from sklearn.model_selection import train_test_split
      train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
    • Stratified sampling
      housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
      housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
      from sklearn.model_selection import StratifiedShuffleSplit
      split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
      for train_index, test_index in split.split(housing, housing["income_cat"]):
      strat_train_set = housing.loc[train_index]
      strat_test_set = housing.loc[test_index]
      print(housing["income_cat"].value_counts() / len(housing))
      for set in (strat_train_set, strat_test_set):
      set.drop(["income_cat"], axis=1, inplace=True)
  4. Explore and visualize the data

    • Make a copy: housing = strat_train_set.copy()
    • Visualize the geographical data
      #housing.plot(kind="scatter", x="longitude", y="latitude")
      #housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
      housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
      s=housing["population"] / 100, label="population",
      c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,)
      plt.legend()
      plt.show()
    • Look for correlations
      #corr_matrix = housing.corr()
      #print(corr_matrix["median_house_value"].sort_values(ascending=False))
      from pandas.plotting import scatter_matrix #pandas.tools no longer exists; import from pandas.plotting
      attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
      scatter_matrix(housing[attributes], figsize=(12, 8))
      housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
      plt.show()
  5. Experiment with attribute combinations

    housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
    housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
    housing["population_per_household"]=housing["population"]/housing["households"]
    corr_matrix = housing.corr()
    print(corr_matrix["median_house_value"].sort_values(ascending=False))
  6. Prepare the data for machine learning algorithms

    housing = strat_train_set.drop("median_house_value", axis=1)
    housing_labels = strat_train_set["median_house_value"].copy()
  7. Data cleaning (choose one of four options)

    #housing.dropna(subset=["total_bedrooms"])    # option 1
    #housing.drop("total_bedrooms", axis=1) # option 2
    #median = housing["total_bedrooms"].median()
    #housing["total_bedrooms"].fillna(median) # option 3

    #option 4: Scikit-Learn's imputer; tell it to replace each attribute's missing values with that attribute's median
    from sklearn.impute import SimpleImputer #differs from the book; the API has evolved
    imputer = SimpleImputer(strategy="median") #create an imputer instance
    housing_num = housing.drop("ocean_proximity", axis=1) #copy of the data without the text attribute ocean_proximity
    imputer.fit(housing_num) #fit the imputer instance to the training data with fit()
    #print(imputer.statistics_)
    #print(housing_num.median().values)
    X = imputer.transform(housing_num) #replace the missing values
    housing_tr = pd.DataFrame(X, columns=housing_num.columns) #put the result back into a pandas DataFrame
  8. Handle text and categorical attributes

    #First convert the text labels to numbers; Scikit-Learn provides the LabelEncoder transformer for this task:
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    housing_cat = housing["ocean_proximity"]
    housing_cat_encoded = encoder.fit_transform(housing_cat)
    #print(housing_cat_encoded)
    #print(encoder.classes_)

    #Scikit-Learn's OneHotEncoder converts integer categorical values into one-hot vectors
    from sklearn.preprocessing import OneHotEncoder
    encoder = OneHotEncoder()
    housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
    #print(housing_cat_1hot.toarray())

    #The LabelBinarizer class performs both transformations in one shot
    from sklearn.preprocessing import LabelBinarizer
    encoder = LabelBinarizer()
    housing_cat_1hot = encoder.fit_transform(housing_cat)
    print(housing_cat_1hot)
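
    In newer scikit-learn, OneHotEncoder accepts string categories directly, so the LabelEncoder step (really intended for target labels, not input features) can be skipped entirely; a minimal sketch:

    from sklearn.preprocessing import OneHotEncoder
    cat_encoder = OneHotEncoder()
    # note the double brackets: the encoder expects a 2D array, not a Series
    housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
    print(cat_encoder.categories_)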
  9. Custom transformers

    from sklearn.base import BaseEstimator, TransformerMixin
    rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
            self.add_bedrooms_per_room = add_bedrooms_per_room
        def fit(self, X, y=None):
            return self # nothing else to do
        def transform(self, X, y=None):
            rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
            population_per_household = X[:, population_ix] / X[:, household_ix]
            if self.add_bedrooms_per_room:
                bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
                return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
            else:
                return np.c_[X, rooms_per_household, population_per_household]

    attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
    housing_extra_attribs = attr_adder.transform(housing.values)
  10. Transformation pipelines

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
    housing_num_tr = num_pipeline.fit_transform(housing_num)
    #print(housing_num_tr)

    from sklearn.compose import ColumnTransformer
    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

    housing_prepared = full_pipeline.fit_transform(housing)
    #print(housing_prepared)
    #print(housing_prepared.shape)
  11. Select and train a model

    • Train a linear regression model:
      from sklearn.linear_model import LinearRegression
      lin_reg = LinearRegression()
      lin_reg.fit(housing_prepared, housing_labels)
      #print(lin_reg)
      #Try it on a few instances
      some_data = housing.iloc[:5]
      some_labels = housing_labels.iloc[:5]
      some_data_prepared = full_pipeline.transform(some_data)
      #print("Predictions:", lin_reg.predict(some_data_prepared))
      #print("Labels:", list(some_labels))
      #print(some_data_prepared)
    • Use Scikit-Learn's mean_squared_error function to measure the regression model's RMSE on the whole training set:
      from sklearn.metrics import mean_squared_error
      housing_predictions = lin_reg.predict(housing_prepared)
      lin_mse = mean_squared_error(housing_labels, housing_predictions)
      lin_rmse = np.sqrt(lin_mse)
      #print(lin_rmse)
      from sklearn.metrics import mean_absolute_error
      lin_mae = mean_absolute_error(housing_labels, housing_predictions)
      #print(lin_mae)
    • Let's train a DecisionTreeRegressor (a decision tree).
      from sklearn.tree import DecisionTreeRegressor
      tree_reg = DecisionTreeRegressor(random_state=42)
      tree_reg.fit(housing_prepared, housing_labels)
      housing_predictions = tree_reg.predict(housing_prepared)
      tree_mse = mean_squared_error(housing_labels, housing_predictions)
      tree_rmse = np.sqrt(tree_mse)
      #print(tree_rmse) #probably badly overfitting the data
    • Use cross-validation for a better evaluation
      from sklearn.model_selection import cross_val_score
      scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
      tree_rmse_scores = np.sqrt(-scores)

      def display_scores(scores):
          print("Scores:", scores)
          print("Mean:", scores.mean())
          print("Standard deviation:", scores.std())

      #display_scores(tree_rmse_scores)
    • Compute the same scores for the linear regression model
      lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
      lin_rmse_scores = np.sqrt(-lin_scores)
      #display_scores(lin_rmse_scores)
    • A random forest model: RandomForestRegressor
      from sklearn.ensemble import RandomForestRegressor
      forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
      forest_reg.fit(housing_prepared, housing_labels)
      housing_predictions = forest_reg.predict(housing_prepared)
      forest_mse = mean_squared_error(housing_labels, housing_predictions)
      forest_rmse = np.sqrt(forest_mse)
      #print(forest_rmse)
      from sklearn.model_selection import cross_val_score
      forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
      forest_rmse_scores = np.sqrt(-forest_scores)
      #display_scores(forest_rmse_scores)
      scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
      #print(pd.Series(np.sqrt(-scores)).describe())
  12. Fine-tune the model

  13. Grid search

    #You can let Scikit-Learn's GridSearchCV do the exploring for you. All you need to do is tell it which hyperparameters to experiment with and which values to try; it will use cross-validation to evaluate every possible combination of hyperparameter values.
    #This code searches for the best combination of hyperparameter values for a RandomForestRegressor:
    #When you have no idea what value a hyperparameter should take, a simple approach is to try consecutive powers of 10
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, # try 12 (3×4) combinations of hyperparameters
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}, # then try 6 (2×3) combinations with bootstrap set as False
    ]
    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(housing_prepared, housing_labels)
    #print(grid_search.best_params_)
    #print(grid_search.best_estimator_)

    cvres = grid_search.cv_results_
    for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
        print(np.sqrt(-mean_score), params)
    print(pd.DataFrame(grid_search.cv_results_))
    #Random search (see the sketch below)
    #Ensemble methods
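
    The random-search comment above has no code. When the hyperparameter search space is large, RandomizedSearchCV evaluates a fixed number of randomly sampled combinations instead of every combination; a sketch (the distributions are illustrative choices, not from these notes):

    from scipy.stats import randint
    from sklearn.model_selection import RandomizedSearchCV
    param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }
    rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                    param_distributions=param_distribs,
                                    n_iter=10, cv=5,
                                    scoring='neg_mean_squared_error',
                                    random_state=42)
    rnd_search.fit(housing_prepared, housing_labels)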
  14. Analyze the best models and their errors

    feature_importances = grid_search.best_estimator_.feature_importances_
    #print(feature_importances)
    #Display the importance scores next to their corresponding attribute names:
    extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
    #cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
    cat_encoder = full_pipeline.named_transformers_["cat"]
    cat_one_hot_attribs = list(cat_encoder.categories_[0])
    attributes = num_attribs + extra_attribs + cat_one_hot_attribs
    sorted(zip(feature_importances, attributes), reverse=True)
    #print(sorted(zip(feature_importances, attributes), reverse=True))
    #Evaluate the system on the test set
    from sklearn.metrics import mean_squared_error
    final_model = grid_search.best_estimator_
    X_test = strat_test_set.drop("median_house_value", axis=1)
    y_test = strat_test_set["median_house_value"].copy()
    X_test_prepared = full_pipeline.transform(X_test)
    final_predictions = final_model.predict(X_test_prepared)
    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)
    #print(final_rmse)
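
    A single point estimate of the test RMSE can be misleading; the author's notebook also computes a 95% confidence interval for it using scipy.stats (a sketch under that assumption):

    from scipy import stats
    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                        loc=squared_errors.mean(),
                                        scale=stats.sem(squared_errors)))
    print(interval)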
  15. Launch, monitor, and maintain the system

C language written-test preparation

  1. Given the declarations char x; float y; double z; the expression x-y+z has type (double).
    Conversions go from narrower to wider: char → short → int → float → double.
    Rule: when a type occupying fewer bytes is combined in an expression with a type occupying more bytes, it is converted to the type occupying more bytes.

  2. Given int a=3,b=5,m, after evaluating the expression m=a<=3&&a+b<8 the value of m is (0)
    Precedence: ! > arithmetic operators > relational operators > && > || > assignment
    m is an integer, so false is converted to the integer 0:
    m = 1 && 0
    m = 0

  3. When an array name is used as a function argument, the formal parameter can be a pointer variable of the same element type.
    For example, given the array
    int S[10];
    the array name S can be understood as a pointer to the array's first element, i.e. S is equivalent to &S[0] and *S to S[0]. When an array name is passed as an argument, the parameter type is pointer-to-int.

  4. Character data is stored internally as ASCII codes; digits, uppercase letters, and lowercase letters each occupy contiguous runs in the ASCII table. The digit characters '0'~'9' are 48~57, the uppercase letters A~Z are 65~90, and the lowercase letters a~z are 97~122.

  5. An array can be given an initial value as a whole when it is defined, but it cannot be assigned as a whole in an assignment statement.

  6. Inheritance modes and visibility

    • With public inheritance, classes derived from the derived class can access the base class's public and protected members. With private inheritance, classes derived from the derived class cannot access any base-class members. With protected inheritance, classes derived from the derived class can access the base class's public and protected members.
    • The default is private inheritance, but public inheritance is what is normally used.
    • The base class's private members are all invisible in the derived class; if a derived class needs to access private members declared in the base class, it can be declared a friend of the base class.
    • With public inheritance, the base class's private members are still inherited; the access levels of its public and protected members are unchanged. The derived class's new members can access the base class's public and protected members but not its private members. Objects of the derived class can only access the derived class's public members (including inherited public ones), not its protected or private members.
  7. z=x>y? x : y can be understood as

    if(x>y){
        z=x;
    }else{
        z=y;
    }

    For a conditional expression b ? x : y, the condition b is evaluated first. If b is true, x is evaluated and becomes the result of the expression; otherwise y is evaluated and becomes the result. A conditional expression never evaluates both x and y. The conditional operator is right-associative, i.e. it groups from right to left.

  8. auto is interpreted as the keyword for an automatic-storage variable, i.e. it declares a block of temporary variable memory.
    auto and register both correspond to automatic storage duration. A variable with automatic storage duration is created on entry to the block that declares it, exists while that block is active, and is destroyed on exit from the block.

  9. register: this keyword asks the compiler to keep the variable in a CPU register wherever possible, rather than accessing it through memory addressing, to improve efficiency. A variable qualified with register is treated as a register variable, so that access to it is as fast as possible.

  10. Global variables live in the static storage area; local variables live in the dynamic area (the stack); static variables live in the static storage area.

C language interview preparation

  1. A #define must not end with a semicolon, and macro parameters should be parenthesized

    • Write a "standard" macro that takes two arguments and returns the smaller: #define MIN(x, y) ((x) < (y) ? (x) : (y)) //no semicolon at the end
    • # is the operator that turns a macro parameter into a string; ## is the operator that pastes two macro parameters together.
    • To prevent the header my_head.h from being included twice, use conditional compilation inside it:
      #ifndef MY_HEAD_H
      #define MY_HEAD_H /* empty macro */
      /* other statements */
      #endif
  2. The preprocessor evaluates the value of constant expressions directly

  3. Infinite loops: while(1){} or for(;;){}

    • Compared with while(1), for(;;) has an empty condition, so the compiled code contains no test at all, whereas while(1) still evaluates the condition 1 = true on every iteration. On slow, memory-constrained targets such as microcontrollers, for(;;) is therefore the better choice: it executes no test and jumps straight back to the top of the loop.
  4. Using the variable a, write the following definitions

    • An integer: int a
    • A pointer to an integer: int *a
    • A pointer to a pointer to an integer: int **a
    • An array of 10 integers: int a[10]
    • An array of 10 pointers to integers: int *a[10]
    • A pointer to an array of 10 integers: int (*a)[10]
    • A pointer to a function that takes an integer argument and returns an integer: int (*a)(int)
    • An array of 10 pointers to functions that take an integer argument and return an integer: int (*a[10])(int)
  5. What the keyword static does

    • Inside a function body, a variable declared static keeps its value across calls to that function.
    • Within a module (but outside any function body), a variable declared static can be accessed by every function in the module but not by functions outside it. It is a local global variable.
    • Within a module, a function declared static can only be called by other functions in that module. That is, the function is restricted to the local scope of the module that declares it.
  6. What does the keyword const mean?

    • It specifies that a variable may not be modified, giving it a read-only character. Using const can improve a program's safety and reliability to some degree; and when reading other people's code, a clear understanding of what const does helps you understand their program.
    • const was originally introduced to replace preprocessor constants, eliminating their drawbacks while keeping their advantages.
    • It enables type checking, gives the compiler more information about what it is processing, and removes a class of pitfalls.
    • It avoids ambiguous magic numbers and makes parameters easy to adjust and modify. As with a macro definition, change it once and it changes everywhere!
    • It protects whatever it qualifies from accidental modification, making the program more robust and reducing bugs.
    • It can save space by avoiding unnecessary memory allocation.
    • It means read-only
      • const int a; or int const a; — a is a constant integer.
      • const int *a; — a is a pointer to a constant integer (the integer cannot be modified, but the pointer can).
      • int * const a; — a is a constant pointer to an integer (the integer it points to can be modified, but the pointer cannot).
      • const int * const a; — a is a constant pointer to a constant integer (neither the integer it points to nor the pointer itself can be modified).
      • int fun(const int a); int fun(const char *str); — qualifies a function parameter so it cannot be modified inside the function, marking it as an input parameter.
      • const char *getstr(void); usage: const char *str = getstr();
        const int getint(void); usage: const int a = getint(); — qualifies a function's return value so it cannot be modified.
  7. What does the keyword volatile mean?

    • A variable defined as volatile may change unexpectedly, so the compiler must not make assumptions about its value. Precisely: the optimizer must carefully re-read the variable's value every time it is used, rather than using a copy held in a register. Some examples of volatile variables:
      • Hardware registers of parallel devices (e.g. status registers)
      • Non-automatic variables accessed inside an interrupt service routine
      • Variables shared by several tasks in a multithreaded application
    • Embedded folks deal with hardware, interrupts, RTOSes, and so on all the time, and all of these call for volatile variables.
    • Questions
      1. Can a parameter be both const and volatile? Explain why.
        • An example is a read-only status register. It is volatile because it may change unexpectedly; it is const because the program should not try to modify it.
      2. Can a pointer be volatile? Explain why.
        • An example is when an interrupt service routine modifies a pointer to a buffer.
    • A variable qualified volatile may be changed by the system, the hardware, or another process/thread, forcing the compiler to fetch its value from memory each time rather than from an optimized register copy. Examples: a hardware clock; a variable shared by several tasks in a multithreaded program.
  8. What the keyword extern does:

    • It qualifies a variable or function to indicate that it is defined in another file, telling the compiler to look for the definition elsewhere.
    • extern "C" exists so that C++ code can correctly call code written in C.
  9. Set an integer variable at the absolute address 0x67a9 to the value 0xaa55. The compiler is a pure ANSI compiler.

    int *ptr; 
    ptr = (int *)0x67a9;
    *ptr = 0xaa55;
  10. About interrupts:

    • An ISR cannot return a value. If you don't understand this, you won't be hired.
    • An ISR cannot take parameters. If you miss this point, your chances of being hired match the first item.
    • On many processors/compilers, floating point is generally not reentrant. Some processors/compilers need extra registers pushed on the stack; some simply do not allow floating-point operations inside an ISR. Besides, an ISR should be short and efficient, so doing floating-point math in one is unwise.
    • In the same vein as the third point, printf() often has reentrancy and performance problems.