Machine Learning and Regression Model Fundamentals

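1 Data Preprocessing

1.1 Normalization

Normalization unifies the scale of the data and maps it to a fixed interval. It is suited to cases where feature values differ greatly in magnitude, or where the relative relationships among values should be preserved. It converts each feature to a common scale by mapping the data to [0, 1] or [-1, 1]; the mapping is determined only by the feature's extreme values (minimum and maximum). Min-max scaling is one form of normalization.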

import pandas as pd
import sklearn.preprocessing as pp

data = pd.DataFrame(
    [[-1, 2, 3],
     [4, 0, -2],
     [2, 5, -2]]
)
mm_scaler = pp.MinMaxScaler()
ret = mm_scaler.fit_transform(data)
print(ret)
org = mm_scaler.inverse_transform(ret)
print(org)

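1.2 Standardization

Standardization processes the data column by column of the feature matrix. It computes the z-score of each value, z = (x - mean) / standard deviation, so that every column has zero mean and unit variance. The shape of the distribution is unchanged, so the result is standard normal only if the original feature was normally distributed. Because the mean and standard deviation depend on the whole sample, every data point influences the result.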

std_scaler = pp.StandardScaler()
ret = std_scaler.fit_transform(data)
print(ret)
print(ret.mean(axis=0), ret.std(axis=0))  # per-column mean (about 0) and std (about 1)
org = std_scaler.inverse_transform(ret)
print(org)

1.3 Label Encoding (strings to integers)

str_data = pd.DataFrame(
    [[2, 'male'],
     [5, 'female'],
     [4, 'female'],
     [0, 'unknown'],
     [6, 'male']]
)
lab_scaler = pp.LabelEncoder()
# fit = lab_scaler.fit(str_data.iloc[:, 1])
# print(fit.classes_)
# ret = fit.transform(str_data.iloc[:, 1])
ret = lab_scaler.fit_transform(str_data.iloc[:, 1])
print(ret)
org = lab_scaler.inverse_transform(ret)
print(org)

1.4 one-hot

one_hot = pp.OneHotEncoder(categories='auto')
ret = one_hot.fit_transform(str_data.iloc[:, 1:])
print(one_hot.get_feature_names_out())
# The result is a sparse matrix; convert it to a dense array for display
print(ret.toarray())
org = one_hot.inverse_transform(ret)
print(org)

1.5 Binarization

biner = pp.Binarizer(threshold=4)
ret = biner.fit_transform(str_data.iloc[:, 0].values.reshape(-1, 1))
print(ret)

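2 Model Evaluation Metrics

Mean absolute error (MAE)
Mean squared error (MSE)
Median absolute error (MAD)
R2 score (coefficient of determination)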
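
The four metrics listed above can be computed with scikit-learn, as in the minimal sketch below. The sample values are made up for illustration, and mapping "MAD" to `median_absolute_error` is an assumption about which metric these notes intend.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)

# Illustrative true values and predictions (not taken from the notes)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)      # mean of |error|
mse = mean_squared_error(y_true, y_pred)       # mean of error squared
medae = median_absolute_error(y_true, y_pred)  # median of |error|
r2 = r2_score(y_true, y_pred)                  # 1 - SS_res / SS_tot

print(f'MAE: {mae}, MSE: {mse}, median AE: {medae}, R2: {r2:.4f}')
```

MAE, MSE, and the median absolute error are better when smaller; the R2 score is better the closer it is to 1.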

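3 Regression

3.1 Linear Regression

Linear regression can be viewed as a single-layer neural network with no activation function.

Gradient descent

Parameters are updated in the direction opposite to the gradient (the derivative or partial derivatives): x = x - learning rate * gradient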
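
The update rule can be tried on a one-variable function first. The sketch below minimizes f(x) = (x - 3)^2, a function chosen purely for illustration; the starting point, learning rate, and step count are likewise assumptions.

```python
def grad(x):
    # Derivative of f(x) = (x - 3) ** 2
    return 2 * (x - 3)

x = 0.0            # starting point (arbitrary)
learn_rate = 0.1   # learning rate
for _ in range(100):
    # Step against the gradient
    x = x - learn_rate * grad(x)

print(x)  # approaches the minimizer x = 3
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step. A rate that is too large (above 1.0 for this function) would make the iterates diverge.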

Linear regression: a hand-written gradient descent example

import matplotlib.pyplot as plt
import numpy as np

x = np.array([0.3, 0.6, 1.2, 1.5, 3, 5, 7, 8])
y = np.array([0.5, 1.0, 1.9, 2.8, 6.5, 8.5, 12, 15])

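# Model: y = w1 * x + w0
w1 = 1  # initial weight (could also be a random number)
w0 = 0  # initial bias (commonly 0 or 1)
learn_rate = 0.0001  # learning rate
epoch = 300  # number of training iterations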
w1s = []
w0s = []
losses = []
epoches = []
for i in range(epoch):
    # Compute the gradient (partial derivatives of loss = sum((w1 * x + w0 - y) ** 2) / 2)
    d0 = (w0 + w1 * x - y).sum()
    d1 = (x * (w1 * x + w0 - y)).sum()
    # Gradient descent step
    w0 = w0 - learn_rate * d0
    w1 = w1 - learn_rate * d1
    loss = ((w1 * x + w0 - y) ** 2).sum() / 2
    w1s.append(w1)
    w0s.append(w0)
    losses.append(loss)
    epoches.append(i+1)
    print(f'epoch: {i+1:>3}, w1: {w1:.8f}, w0: {w0:.8f}, loss: {loss:.8f}')

print(w1, w0)
predict = w1 * x + w0
plt.plot(x, predict, c='red')
plt.scatter(x, y, c='green')
plt.grid(linestyle=':')
plt.show()
plt.figure('Training params')
for i, values, color, label in zip(
        (1, 2, 3),
        (w1s, w0s, losses),
        ('blue', 'green', 'red'),
        ('w1', 'w0', 'loss')):
    plt.subplot(3, 1, i)
    plt.plot(epoches, values, color=color, label=label)
    plt.grid(linestyle=':')
    plt.legend()
plt.show()

Linear regression: scikit-learn implementation

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0.3, 0.6, 1.2, 1.5, 3, 5, 7, 8]).reshape(-1, 1)
y = np.array([0.5, 1.0, 1.9, 2.8, 6.5, 8.5, 12, 15])

lr = LinearRegression()
lr.fit(x, y)
print(lr.coef_, lr.intercept_)
predict = lr.predict(x)
plt.scatter(x, y, c='blue')
plt.plot(x, predict, c='red')
plt.grid(linestyle=':')
plt.show()

3.2 Polynomial Regression

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 * x ** 2 + np.random.randn(*x.shape) * 0.1
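# Create the polynomial feature transformer; a degree-2 polynomial is used here
poly_features = PolynomialFeatures(degree=2, include_bias=False)
# Create the linear regression model
linear_reg = LinearRegression()
# Chain the polynomial transform and the linear regression in a pipeline
pipeline = make_pipeline(poly_features, linear_reg)
# Train the model
pipeline.fit(x, y)
# Make predictions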
predict_test = np.linspace(0, 1, 100).reshape(-1, 1)
predict_ret = pipeline.predict(predict_test)
predict_test_real = 2 * predict_test ** 2 + np.random.randn(*predict_test.shape) * 0.1

print("Polynomial regression coefficients:", pipeline.steps[-1][1].coef_)
plt.scatter(x, y, c='green', label='train')
plt.scatter(predict_test, predict_test_real, c='blue', label='test')
plt.plot(predict_test, predict_ret, c='red', label='model')
plt.legend()
plt.grid()
plt.show()

3.3 Ridge Regression

Regularization helps reduce overfitting. Ridge regression uses L2 regularization.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge

x = np.array([0.3, 0.6, 1.2, 1.5, 3, 5, 7, 8]).reshape(-1, 1)
y = np.array([0.5, 1.0, 1.9, 2.8, 6.5, 8.5, 12, 15])
ridge = Ridge(alpha=0.1)
ridge.fit(x, y)
print(ridge.coef_, ridge.intercept_)
predict = ridge.predict(x)
plt.scatter(x, y)
plt.plot(x, predict, c='red')
plt.show()

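3.4 Lasso Regression

Regularization helps reduce overfitting. Lasso regression uses L1 regularization, which tends to drive many parameters to exactly 0, so the final approximate solution depends on only a few variables. Lasso therefore acts as a form of variable selection: it shrinks the coefficients of insignificant variables to 0 and so reduces the number of parameters. Ridge also shrinks the original coefficients to some extent, but in general none of them reaches exactly 0; it weakens the parameters while keeping all of them.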

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

x = np.array([0.3, 0.6, 1.2, 1.5, 3, 5, 7, 8]).reshape(-1, 1)
y = np.array([0.5, 1.0, 1.9, 2.8, 6.5, 8.5, 12, 15])
lasso = Lasso(alpha=0.1)
lasso.fit(x, y)
print(lasso.coef_, lasso.intercept_)
predict = lasso.predict(x)
print(mean_squared_error(y, predict))
plt.scatter(x, y)
plt.plot(x, predict, c='red')
plt.show()
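
The single-feature data above cannot show the sparsity contrast between Lasso and Ridge described in this section. The sketch below uses synthetic data in which only the first two of five features affect the target; the data, the seed, and alpha=0.1 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: y depends on features 0 and 1 only; features 2-4 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Lasso is expected to set the noise-feature coefficients to exactly 0,
# while Ridge keeps small non-zero values for them
print('Lasso:', lasso.coef_)
print('Ridge:', ridge.coef_)
```

Increasing `alpha` strengthens the penalty in both models. For Lasso it eventually zeroes out more coefficients, whereas Ridge only shrinks them further.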

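3.5 Decision Tree Regression (single tree)

Similar inputs produce similar outputs. The tree used here is a CART (Classification and Regression Tree).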

import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

x = np.array([0.3, 0.6, 1.2, 1.5, 3, 5, 7, 8]).reshape(-1, 1)
y = np.array([0.5, 1.0, 1.9, 2.8, 6.5, 8.5, 12, 15])
dec_tree = DecisionTreeRegressor(max_depth=5, random_state=8)
dec_tree.fit(x, y)
predict = dec_tree.predict(x)
print(mean_absolute_error(y, predict))
plt.scatter(x, y)
plt.plot(x, predict, c='red')
plt.show()

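3.6 Ensemble Learning (multiple trees)

Boosting

Each new decision tree is built to improve on the previous one.

AdaBoost (adaptive boosting) with decision trees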

import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate a regression dataset
x, y = make_regression(n_samples=100, n_features=1, n_informative=1, noise=50, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=2), n_estimators=100, random_state=0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
plt.scatter(x_train, y_train, c='blue')
plt.scatter(x_test, y_test, c='green')
plt.scatter(x_test, y_pred, c='red')
plt.show()

GBDT (gradient boosted decision trees)

Each new tree is fitted to the residuals of the trees before it.

from sklearn.ensemble import GradientBoostingRegressor
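
A minimal usage sketch for `GradientBoostingRegressor`, reusing the synthetic dataset from the AdaBoost example. The hyperparameters (`n_estimators=100`, `learning_rate=0.1`, `max_depth=2`) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic regression data as in the AdaBoost example
x, y = make_regression(n_samples=100, n_features=1, n_informative=1, noise=50, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

gbdt = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
gbdt.fit(x_train, y_train)
y_pred = gbdt.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Training error after each added tree: every tree fits the remaining
# residuals, so the training error goes down as trees are added
train_mse = [mean_squared_error(y_train, p) for p in gbdt.staged_predict(x_train)]
print(train_mse[0], train_mse[-1])
```

The falling training error does not by itself mean better generalization; the test-set MSE is the number to watch when choosing `n_estimators` and `learning_rate`.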

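Bagging

Bootstrap aggregating

Each base model is trained on a random bootstrap sample of the training set, which reduces the influence of dominant samples.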

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
# Initialize BaggingRegressor with DecisionTreeRegressor as the base model
bagging_regressor = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=500, max_samples=0.7, bootstrap=True
)
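
The following sketch fits a bagging ensemble and checks that its prediction is the average of its trees' predictions. It assumes the same synthetic dataset as the AdaBoost example and uses 50 estimators instead of 500 only to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

x, y = make_regression(n_samples=100, n_features=1, n_informative=1, noise=50, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Each tree is trained on a bootstrap sample holding 70% of the training rows
bagging = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=50, max_samples=0.7,
    bootstrap=True, random_state=0
)
bagging.fit(x_train, y_train)
y_pred = bagging.predict(x_test)
print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred)}')

# The ensemble prediction is the mean of the individual trees' predictions
manual_avg = np.mean([tree.predict(x_test) for tree in bagging.estimators_], axis=0)
print(np.allclose(manual_avg, y_pred))
```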

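Random forest

Each tree is trained on a random sample of the rows and considers only a random subset of the features, which reduces the influence of both dominant samples and dominant features, so the model generalizes better.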

from sklearn.ensemble import RandomForestRegressor
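
A minimal usage sketch for `RandomForestRegressor`. The five-feature synthetic dataset and the settings `n_estimators=100` and `max_features='sqrt'` are assumptions chosen so that feature subsampling actually has an effect.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Five features, of which only two carry signal
x, y = make_regression(n_samples=200, n_features=5, n_informative=2, noise=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# max_features='sqrt' makes each split consider a random subset of the features
forest = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(x_train, y_train)
y_pred = forest.predict(x_test)
print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred)}')

# Relative importance of each feature, normalized to sum to 1
print(forest.feature_importances_)
```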