{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Model Lab (Optional)\n",
"\n",
"> The goal of this lab is to classify Yelp review text with a Naive Bayes model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Load the data\n",
"\n",
"Read `yelp.csv` into a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>business_id</th>\n",
" <th>date</th>\n",
" <th>review_id</th>\n",
" <th>stars</th>\n",
" <th>text</th>\n",
" <th>type</th>\n",
" <th>user_id</th>\n",
" <th>cool</th>\n",
" <th>useful</th>\n",
" <th>funny</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9yKzy9PApeiPPOUJEtnvkg</td>\n",
" <td>2011-01-26</td>\n",
" <td>fWKvX83p0-ka4JS3dc6E5A</td>\n",
" <td>5</td>\n",
" <td>My wife took me here on my birthday for breakf...</td>\n",
" <td>review</td>\n",
" <td>rLtl8ZkDX5vH5nAx9C3q5Q</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ZRJwVLyzEJq1VAihDhYiow</td>\n",
" <td>2011-07-27</td>\n",
" <td>IjZ33sJrzXqU-0X6U8NwyA</td>\n",
" <td>5</td>\n",
" <td>I have no idea why some people give bad review...</td>\n",
" <td>review</td>\n",
" <td>0a2KyEL0d3Yb1V6aivbIuQ</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6oRAC4uyJCsJl1X0WZpVSA</td>\n",
" <td>2012-06-14</td>\n",
" <td>IESLBzqUCLdSzSqm0eCSxQ</td>\n",
" <td>4</td>\n",
" <td>love the gyro plate. Rice is so good and I als...</td>\n",
" <td>review</td>\n",
" <td>0hT2KtfLiobPvh6cDC8JQg</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>_1QQZuf4zZOyFCvXc0o6Vg</td>\n",
" <td>2010-05-27</td>\n",
" <td>G-WvGaISbqqaMHlNnByodA</td>\n",
" <td>5</td>\n",
" <td>Rosie, Dakota, and I LOVE Chaparral Dog Park!!...</td>\n",
" <td>review</td>\n",
" <td>uZetl9T0NcROGOyFfughhg</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6ozycU1RpktNG2-1BroVtw</td>\n",
" <td>2012-01-05</td>\n",
" <td>1uJFq2r5QfJG_6ExMRCaGw</td>\n",
" <td>5</td>\n",
" <td>General Manager Scott Petello is a good egg!!!...</td>\n",
" <td>review</td>\n",
" <td>vYmM4KTsC8ZfQBg-j5MWkw</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" business_id date review_id stars \\\n",
"0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n",
"1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 \n",
"2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 \n",
"3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 \n",
"4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 \n",
"\n",
" text type \\\n",
"0 My wife took me here on my birthday for breakf... review \n",
"1 I have no idea why some people give bad review... review \n",
"2 love the gyro plate. Rice is so good and I als... review \n",
"3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review \n",
"4 General Manager Scott Petello is a good egg!!!... review \n",
"\n",
" user_id cool useful funny \n",
"0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 \n",
"1 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 \n",
"2 0hT2KtfLiobPvh6cDC8JQg 0 1 0 \n",
"3 uZetl9T0NcROGOyFfughhg 1 2 0 \n",
"4 vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 "
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read csv\n",
"import pandas as pd\n",
"\n",
"url = \"e2.4_yelp.csv\"\n",
"yelp = pd.read_csv(url)\n",
"yelp.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new DataFrame that contains only the 5-star and 1-star reviews."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"# filter data\n",
"yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Define X and y\n",
"\n",
"Use the review text as the only feature and the star rating as the prediction target, then split the dataset into training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"# define X and y\n",
"X = yelp_best_worst.text\n",
"y = yelp_best_worst.stars\n"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"# split into training and testing sets\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Transform the data\n",
"\n",
"Use CountVectorizer to convert X_train and X_test into document-term matrices."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"# import and instantiate the vectorizer\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"vect = CountVectorizer()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"# fit and transform X_train, but only transform X_test\n",
"X_train_dtm = vect.fit_transform(X_train)\n",
"X_test_dtm = vect.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Train, predict, and evaluate\n",
"\n",
"Use Naive Bayes to predict the star ratings of the reviews in the testing set, and calculate the prediction accuracy."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB()"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import/instantiate/fit\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"nb = MultinomialNB()\n",
"nb.fit(X_train_dtm, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"# make class predictions\n",
"y_pred_class = nb.predict(X_test_dtm)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9187866927592955"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate accuracy\n",
"from sklearn import metrics\n",
"metrics.accuracy_score(y_test, y_pred_class)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the AUC. Note: y_test contains the values 1 and 5, so first convert it into a binary array, y_test_binary, with the values 0 and 1."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"# create y_test_binary from y_test, which contains ones and zeros instead\n",
"# of ones and fives\n",
"y_test_binary = y_test.map({5:1, 1:0})\n"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"# predict class probabilities\n",
"y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9391635104285566"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate the AUC using y_test_binary and y_pred_prob\n",
"metrics.roc_auc_score(y_test_binary, y_pred_prob)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the ROC curve."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# plot ROC curve using y_test_binary and y_pred_prob\n",
"fpr, tpr, thresholds = metrics.roc_curve(y_test_binary, y_pred_prob)\n",
"plt.plot(fpr, tpr)\n",
"plt.xlim([0.0, 1.0])\n",
"plt.ylim([0.0, 1.0])\n",
"plt.title('ROC curve for Yelp classifier')\n",
"plt.xlabel('False Positive Rate (1 - Specificity)')\n",
"plt.ylabel('True Positive Rate (Sensitivity)')\n",
"plt.grid(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the confusion matrix, calculate the sensitivity and specificity, and comment on the results."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[126, 58],\n",
" [ 25, 813]])"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the confusion matrix\n",
"metrics.confusion_matrix(y_test, y_pred_class)\n"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9701670644391408"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate sensitivity\n",
"tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred_class).ravel()\n",
"sensitivity = tp / float(tp + fn)\n",
"sensitivity"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6847826086956522"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate specificity\n",
"specificity = tn / float(tn + fp)\n",
"specificity\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comment on the model's sensitivity and specificity:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Error analysis\n",
"\n",
"Inspect some of the misclassified review texts in the testing set, that is, the false positives and false negatives. Try to explain why these reviews were misclassified."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2175 This has to be the worst restaurant in terms o...\n",
"1781 If you like the stuck up Scottsdale vibe this ...\n",
"2674 I'm sorry to be what seems to be the lone one ...\n",
"9984 Went last night to Whore Foods to get basics t...\n",
"3392 I found Lisa G's while driving through phoenix...\n",
"8283 Don't know where I should start. Grand opening...\n",
"2765 Went last week, and ordered a dozen variety. I...\n",
"2839 Never Again,\\nI brought my Mountain Bike in (w...\n",
"321 My wife and I live around the corner, hadn't e...\n",
"1919 D-scust-ing.\n",
"2490 Lazy Q CLOSED in 2010. New Owners cleaned up ...\n",
"9125 La Grande Orange Grocery has a problem. It can...\n",
"9185 For frozen yogurt quality, I give this place a...\n",
"436 this another place that i would give no stars ...\n",
"2051 Sadly with new owners comes changes on menu. ...\n",
"1721 This is the closest to a New York hipster styl...\n",
"3447 If you want a school that cares more about you...\n",
"842 Boy is the name a temptation.Seriously :) I'l...\n",
"6159 Really, if I could, I would give this place ze...\n",
"943 Don't waste your time...Arrowhead mall on the ...\n",
"5977 You want good food? You'd be better off smuggl...\n",
"8833 The owner has changed hands & this place isn't...\n",
"6584 Jimmy Johns is cheaper and better ... The Capr...\n",
"1899 Buca Di Beppo is literally, italian restaurant...\n",
"9953 \"Hipster,Trendy\" ????-I think NOT !!!! Very di...\n",
"2060 This place is closed. Good riddance.\n",
"3082 Currently having a liquidation sale, but it's ...\n",
"8220 Maybe I ate at a different restaurant than the...\n",
"3634 Seriously?! With grocery stores like Fresh & E...\n",
"3266 Absolutely awful... these guys have NO idea wh...\n",
"7397 This place sucks!! I moved to the valley and h...\n",
"4473 It is what you would expect from any themed pl...\n",
"5502 Angry Bro Bar ! Please go here if you wear si...\n",
"2615 Great in its day, now leaves a lot to be desir...\n",
"3413 I purchased the Enotria groupon when it was re...\n",
"2999 I can't even believe I actually went to this r...\n",
"1372 No offense to everyone who gave this place 5 s...\n",
"1291 Every time I come here the staff is so rude! I...\n",
"6222 My mother always told me, if I didn't have any...\n",
"9296 My boyfriend and I tried this place last year ...\n",
"7975 What are you all talking about?! This place is...\n",
"4630 I used to always go here for tires until my me...\n",
"7130 I was not impressed. The food was bad & expens...\n",
"5818 Most horrible buffet I have ever been to.\\n\\nM...\n",
"3704 Staff is nice, and that's it.\\nThey use very c...\n",
"8741 They served us stale rice. Average main dishe...\n",
"3938 We were so disappointed! We were on vacation ...\n",
"7631 this is a business located in the fry's grocer...\n",
"8681 As I promised myself, I'd go back again to try...\n",
"1532 Cold, under done chips. If a Mexican food rest...\n",
"113 Unless you are a regular or look like your wal...\n",
"4165 OMG! what is the rave about? this place is dis...\n",
"9299 The salad plates were not chilled... As they u...\n",
"4311 Donuts are really good, if they have any when ...\n",
"7035 Totally excited to try this place out, my gran...\n",
"8000 Still a place that is unacceptable in my book-...\n",
"3755 Have been going to LGO since 2003 and have alw...\n",
"507 HELLISH HELLISH SUMMER WEATHER (March thru Oct...\n",
"Name: text, dtype: object"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# false positives (1-star reviews that were incorrectly classified as 5-star reviews)\n",
"X_test[y_pred_class > y_test]\n"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7148 I now consider myself an Arizonian. If you dri...\n",
"4963 This is by far my favourite department store, ...\n",
"6318 Since I have ranted recently on poor customer ...\n",
"380 This is a must try for any Mani Pedi fan. I us...\n",
"5565 I`ve had work done by this shop a few times th...\n",
"3448 I was there last week with my sisters and whil...\n",
"6050 I went to sears today to check on a layaway th...\n",
"2504 I've passed by prestige nails in walmart 100s ...\n",
"2475 This place is so great! I am a nanny and had t...\n",
"241 I was sad to come back to lai lai's and they n...\n",
"3149 I was told to see Greg after a local shop diag...\n",
"423 These guys helped me out with my rear windshie...\n",
"763 Here's the deal. I said I was done with OT, bu...\n",
"8956 I took my computer to RedSeven recently when m...\n",
"750 This store has the most pleasant employees of ...\n",
"9765 You can't give anything less than 5 stars to a...\n",
"6334 I came here today for a manicure and pedicure....\n",
"1282 Loved my haircut. Walked in and waited for jus...\n",
"1266 I've never been to this location before. My hu...\n",
"402 Once again Wildflower proves why it's my favor...\n",
"4034 \"Fine dining\" is not just a setting. it isn't...\n",
"2444 EXCELLENT CUSTOMER SERVICE! \\n\\nEven with Happ...\n",
"2494 What a great surprise stumbling across this ba...\n",
"5736 Thank goodness for Sue at Mill Avenue Travel. ...\n",
"7903 First, I'm sorry this review is lengthy, but i...\n",
"Name: text, dtype: object"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# false negatives (5-star reviews that were incorrectly classified as 1-star reviews)\n",
"X_test[y_pred_class < y_test]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Multiclass prediction\n",
"\n",
"Make predictions using all of the reviews, not just the 1-star and 5-star ones."
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"# define X and y using the original DataFrame\n",
"X = yelp.text\n",
"y = yelp.stars\n"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"# split into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"# create document-term matrices\n",
"X_train_dtm = vect.fit_transform(X_train)\n",
"X_test_dtm = vect.transform(X_test)\n",
"\n"
]
},
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 88,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<pre>MultinomialNB()</pre>"
],
"text/plain": [
"MultinomialNB()"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fit a Naive Bayes model\n",
"nb.fit(X_train_dtm, y_train)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"# make class predictions\n",
"y_pred_class = nb.predict(X_test_dtm)\n"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4712"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate the testing accuracy\n",
"metrics.accuracy_score(y_test, y_pred_class)\n"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 55,  14,  24,  65,  27],\n",
"       [ 28,  16,  41, 122,  27],\n",
"       [  5,   7,  35, 281,  37],\n",
"       [  7,   0,  16, 629, 232],\n",
"       [  6,   4,   6, 373, 443]])"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the confusion matrix\n",
"metrics.confusion_matrix(y_test, y_pred_class)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comments on the results:"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.8 ('.venv': venv)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
},
"vscode": {
"interpreter": {
"hash": "1f0d395e06aa83586067b19165efc9b683889967164248deef4bbf1fa27cfb00"
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}