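Machine Learning Classification Models
1 Logistic Regression
Used for binary classification
Sigmoid (logistic) function: maps a linear score to a probability in (0, 1)
Cross-entropy is used as the loss function
Cross-validation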
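To make the sigmoid function and the cross-entropy loss concrete, here is a minimal NumPy sketch. The function names `sigmoid` and `binary_cross_entropy`, the `eps` clipping value, and the sample scores and labels are illustrative assumptions; they are not taken from these notes or from scikit-learn's internals.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # keep log() away from 0
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


# Hypothetical linear scores (w.x + b) for four samples, with their true labels
scores = np.array([2.0, -1.0, 0.0, 3.0])
labels = np.array([1, 0, 1, 1])

probs = sigmoid(scores)                      # predicted P(y = 1)
loss = binary_cross_entropy(labels, probs)   # average loss over the samples
print(probs)
print(loss)
```

A confident, correct prediction (score 3.0 for a positive sample) adds very little to the loss, while an undecided one (score 0.0, probability 0.5) adds log 2.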
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
x, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Create the logistic regression model
model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
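The listing above scores the model on a single train/test split. Cross-validation, listed among the key points of this section, averages the score over several splits instead. The sketch below uses scikit-learn's `cross_val_score`; the choice of 5 folds is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as in the listing above
x, y = make_classification(n_samples=200, n_features=20, n_classes=2, random_state=0)

model = LogisticRegression()
# cv=5: split the data into 5 folds; each fold serves once as the validation set
scores = cross_val_score(model, x, y, cv=5, scoring="accuracy")
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean()}")
```

The mean over folds is usually a steadier estimate than one split, at the cost of fitting the model once per fold.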
2 Decision Trees
Information entropy
Information gain: ID3
Information gain ratio: C4.5
Gini index (Gini impurity): CART decision tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")
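The splitting criteria named at the start of this section (entropy, information gain, gain ratio, Gini index) can be computed by hand on a toy example. This is a minimal sketch: the helper names and the eight-sample dataset are hypothetical, and the code is not how `DecisionTreeClassifier` is implemented internally.

```python
import numpy as np


def entropy(values):
    """Shannon entropy (in bits) of a discrete array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def gini(labels):
    """Gini impurity: probability that two random draws have different labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def information_gain(labels, feature):
    """Entropy reduction from splitting on a categorical feature (ID3 criterion)."""
    weighted = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted


def gain_ratio(labels, feature):
    """Information gain divided by the entropy of the split itself (C4.5 criterion)."""
    split_info = entropy(feature)
    return information_gain(labels, feature) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 8 samples, one categorical feature with values "a" / "b"
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
feature = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

print(f"Entropy: {entropy(labels)}")
print(f"Gini impurity: {gini(labels)}")
print(f"Information gain: {information_gain(labels, feature)}")
print(f"Gain ratio: {gain_ratio(labels, feature)}")
```

In scikit-learn, `DecisionTreeClassifier(criterion="entropy")` switches the tree from its default Gini criterion to an entropy-based one.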
3 SVM (Support Vector Machine)
Finds the optimal decision boundary: the one with the largest margin between the classes
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Create a synthetic binary classification dataset
x, y = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=1,
                           random_state=1, n_clusters_per_class=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=42)
# Create and train the SVM classifier
svc = svm.SVC(kernel='linear', C=1.0)
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
# svc.score and accuracy_score report the same test accuracy
print(f'Model Accuracy: {svc.score(x_test, y_test)}')
print(f'Model Accuracy: {accuracy_score(y_test, y_pred)}')
3.1 Kernel Functions
Handle data that is not linearly separable
- Linear kernel
- Polynomial kernel
- Radial basis function (RBF) kernel
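To see why the choice of kernel matters, the three kernels listed above can be compared on data that no straight line can separate. This is a sketch under stated assumptions: the concentric-circles dataset from `make_circles`, its parameters, and `degree=2` for the polynomial kernel are choices made for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
x, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # degree only affects the polynomial kernel; the other kernels ignore it
    clf = SVC(kernel=kernel, degree=2, C=1.0)
    clf.fit(x_train, y_train)
    scores[kernel] = clf.score(x_test, y_test)
    print(f"{kernel}: {scores[kernel]:.3f}")
```

On this kind of data the linear kernel should stay near chance level, while the RBF kernel can draw a closed boundary around the inner ring.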
4 Naive Bayes
Bayes' theorem
Conditional independence assumption: features are treated as independent given the class
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x, y = make_classification(n_samples=600, n_features=50, n_informative=10, n_redundant=10,
                           n_classes=3, random_state=0, shuffle=False)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.6, random_state=0
)
gnb = GaussianNB()
gnb.fit(x_train, y_train)
y_pred = gnb.predict(x_test)
print(gnb.score(x_test, y_test))
print(accuracy_score(y_test, y_pred))
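Bayes' theorem and the conditional independence assumption can be spelled out in a few lines of NumPy. The sketch below scores each class as log prior plus the sum of per-feature Gaussian log-likelihoods; summing logs is the same as multiplying the per-feature probabilities, which is exactly what the independence assumption permits. The function name `gaussian_nb_predict` and the use of the iris dataset are assumptions for illustration, and the code omits the small variance smoothing that `GaussianNB` applies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB


def gaussian_nb_predict(x_train, y_train, x_new):
    """Predict classes with a hand-written Gaussian naive Bayes rule."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        xc = x_train[y_train == c]
        prior = xc.shape[0] / x_train.shape[0]   # P(class)
        mean = xc.mean(axis=0)                   # per-feature mean
        var = xc.var(axis=0)                     # per-feature variance
        # log N(x | mean, var) for every feature of every sample
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var)
        # Independence: the joint log-likelihood is the sum over features
        log_posteriors.append(np.log(prior) + log_lik.sum(axis=1))
    # Pick the class with the largest (unnormalized) log posterior
    return classes[np.argmax(np.column_stack(log_posteriors), axis=1)]


x, y = load_iris(return_X_y=True)
manual_pred = gaussian_nb_predict(x, y, x)
sklearn_pred = GaussianNB().fit(x, y).predict(x)
agreement = np.mean(manual_pred == sklearn_pred)
print(f"Agreement with GaussianNB: {agreement}")
```

The evidence term P(x) in Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed.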
5 Evaluation Metrics for Classification
- Accuracy and error rate
- Precision
- Recall
- F1 score
- Confusion matrix
- Classification report
- Cross-validation
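Most of the metrics listed above are available in `sklearn.metrics`. The sketch below computes them on a small hand-made example; the label vectors are hypothetical and chosen so that the numbers are easy to check by hand (3 true positives, 1 false negative, 1 false positive, 5 true negatives).

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions for 10 samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
error_rate = 1 - accuracy                    # share of wrong predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class

print(f"Accuracy: {accuracy}, error rate: {error_rate}")
print(f"Precision: {precision}, recall: {recall}, F1: {f1}")
print(cm)
print(classification_report(y_true, y_pred))
```

Here accuracy is 8/10, and precision, recall and F1 are all 3/4. The classification report prints the same per-class figures in a single table.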
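6 Model Optimization
Validation curve: shows how the score changes with one hyperparameter, to find its best value
Learning curve: shows how the score changes with training-set size, to help choose the train/test split ratio
Grid search: evaluates every hyperparameter combination in a given grid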
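The three tools above correspond to `validation_curve`, `learning_curve` and `GridSearchCV` in `sklearn.model_selection`. The following is a minimal sketch; the synthetic dataset, the SVC estimator, the parameter ranges and the 5-fold setting are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, learning_curve, validation_curve
from sklearn.svm import SVC

x, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Validation curve: vary one hyperparameter (gamma) and record the scores
param_range = np.logspace(-3, 2, 6)
train_scores, valid_scores = validation_curve(
    SVC(kernel="rbf"), x, y, param_name="gamma", param_range=param_range, cv=5
)
best_gamma = param_range[np.argmax(valid_scores.mean(axis=1))]
print(f"Best gamma on the validation curve: {best_gamma}")

# Learning curve: vary the amount of training data and record the scores
train_sizes, lc_train_scores, lc_valid_scores = learning_curve(
    SVC(kernel="rbf"), x, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5
)
print(f"Training sizes: {train_sizes}")
print(f"Mean validation score per size: {lc_valid_scores.mean(axis=1)}")

# Grid search: try every combination in the grid (3 x 3 = 9 candidates)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid, cv=5)
grid.fit(x, y)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validated score: {grid.best_score_}")
```

A validation curve tunes one hyperparameter at a time, while grid search covers combinations; the cost of grid search grows with the product of the grid sizes.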