This repository has been archived on 2024-01-06. You can view files and clone it, but cannot push or open issues or pull requests.
justhomework/AIandML/e2_matchine_learning/e2.0_k-means.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# K-means Experiment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the beer dataset `e2.0_beer.txt` with pandas."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>calories</th>\n",
" <th>sodium</th>\n",
" <th>alcohol</th>\n",
" <th>cost</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Budweiser</td>\n",
" <td>144</td>\n",
" <td>15</td>\n",
" <td>4.7</td>\n",
" <td>0.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Schlitz</td>\n",
" <td>151</td>\n",
" <td>19</td>\n",
" <td>4.9</td>\n",
" <td>0.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lowenbrau</td>\n",
" <td>157</td>\n",
" <td>15</td>\n",
" <td>0.9</td>\n",
" <td>0.48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Kronenbourg</td>\n",
" <td>170</td>\n",
" <td>7</td>\n",
" <td>5.2</td>\n",
" <td>0.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Heineken</td>\n",
" <td>152</td>\n",
" <td>11</td>\n",
" <td>5.0</td>\n",
" <td>0.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Old_Milwaukee</td>\n",
" <td>145</td>\n",
" <td>23</td>\n",
" <td>4.6</td>\n",
" <td>0.28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Augsberger</td>\n",
" <td>175</td>\n",
" <td>24</td>\n",
" <td>5.5</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Srohs_Bohemian_Style</td>\n",
" <td>149</td>\n",
" <td>27</td>\n",
" <td>4.7</td>\n",
" <td>0.42</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Miller_Lite</td>\n",
" <td>99</td>\n",
" <td>10</td>\n",
" <td>4.3</td>\n",
" <td>0.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Budweiser_Light</td>\n",
" <td>113</td>\n",
" <td>8</td>\n",
" <td>3.7</td>\n",
" <td>0.40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Coors</td>\n",
" <td>140</td>\n",
" <td>18</td>\n",
" <td>4.6</td>\n",
" <td>0.44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Coors_Light</td>\n",
" <td>102</td>\n",
" <td>15</td>\n",
" <td>4.1</td>\n",
" <td>0.46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Michelob_Light</td>\n",
" <td>135</td>\n",
" <td>11</td>\n",
" <td>4.2</td>\n",
" <td>0.50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Becks</td>\n",
" <td>150</td>\n",
" <td>19</td>\n",
" <td>4.7</td>\n",
" <td>0.76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Kirin</td>\n",
" <td>149</td>\n",
" <td>6</td>\n",
" <td>5.0</td>\n",
" <td>0.79</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Pabst_Extra_Light</td>\n",
" <td>68</td>\n",
" <td>15</td>\n",
" <td>2.3</td>\n",
" <td>0.38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>Hamms</td>\n",
" <td>139</td>\n",
" <td>19</td>\n",
" <td>4.4</td>\n",
" <td>0.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>Heilemans_Old_Style</td>\n",
" <td>144</td>\n",
" <td>24</td>\n",
" <td>4.9</td>\n",
" <td>0.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>Olympia_Goled_Light</td>\n",
" <td>72</td>\n",
" <td>6</td>\n",
" <td>2.9</td>\n",
" <td>0.46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Schlitz_Light</td>\n",
" <td>97</td>\n",
" <td>7</td>\n",
" <td>4.2</td>\n",
" <td>0.47</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name calories sodium alcohol cost\n",
"0 Budweiser 144 15 4.7 0.43\n",
"1 Schlitz 151 19 4.9 0.43\n",
"2 Lowenbrau 157 15 0.9 0.48\n",
"3 Kronenbourg 170 7 5.2 0.73\n",
"4 Heineken 152 11 5.0 0.77\n",
"5 Old_Milwaukee 145 23 4.6 0.28\n",
"6 Augsberger 175 24 5.5 0.40\n",
"7 Srohs_Bohemian_Style 149 27 4.7 0.42\n",
"8 Miller_Lite 99 10 4.3 0.43\n",
"9 Budweiser_Light 113 8 3.7 0.40\n",
"10 Coors 140 18 4.6 0.44\n",
"11 Coors_Light 102 15 4.1 0.46\n",
"12 Michelob_Light 135 11 4.2 0.50\n",
"13 Becks 150 19 4.7 0.76\n",
"14 Kirin 149 6 5.0 0.79\n",
"15 Pabst_Extra_Light 68 15 2.3 0.38\n",
"16 Hamms 139 19 4.4 0.43\n",
"17 Heilemans_Old_Style 144 24 4.9 0.43\n",
"18 Olympia_Goled_Light 72 6 2.9 0.46\n",
"19 Schlitz_Light 97 7 4.2 0.47"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# beer dataset\n",
"import pandas as pd\n",
"url = 'e2.0_beer.txt'\n",
"beer = pd.read_csv(url, sep=' ')\n",
"beer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop the `name` column and keep `calories`, `sodium`, `alcohol`, and `cost` as the feature matrix `X`."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# define X\n",
"X = beer.drop('name', axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a K-means clusterer with 3 clusters and fit it to the data."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-3 {color: black;background-color: white;}#sk-container-id-3 pre{padding: 0;}#sk-container-id-3 div.sk-toggleable {background-color: white;}#sk-container-id-3 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-3 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-3 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-3 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-3 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-3 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-3 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-3 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-3 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-3 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-3 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-3 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-3 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-3 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-3 div.sk-label:hover 
label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-3 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-3 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-3 div.sk-item {position: relative;z-index: 1;}#sk-container-id-3 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-3 div.sk-item::before, #sk-container-id-3 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-3 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-3 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-3 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-3 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-3 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-3 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-3 div.sk-label-container {text-align: center;}#sk-container-id-3 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-3 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-3\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>KMeans(n_clusters=3)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" checked><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">KMeans</label><div class=\"sk-toggleable__content\"><pre>KMeans(n_clusters=3)</pre></div></div></div></div></div>"
],
"text/plain": [
"KMeans(n_clusters=3)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# K-means with 3 clusters\n",
"# note: the K-means estimator object must be named km\n",
"from sklearn.cluster import KMeans\n",
"km = KMeans(n_clusters=3)\n",
"km.fit(X)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Store the cluster labels in the pandas DataFrame and sort by cluster to see which cluster each beer was assigned to."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>calories</th>\n",
" <th>sodium</th>\n",
" <th>alcohol</th>\n",
" <th>cost</th>\n",
" <th>cluster</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Budweiser_Light</td>\n",
" <td>113</td>\n",
" <td>8</td>\n",
" <td>3.7</td>\n",
" <td>0.40</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Coors_Light</td>\n",
" <td>102</td>\n",
" <td>15</td>\n",
" <td>4.1</td>\n",
" <td>0.46</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Miller_Lite</td>\n",
" <td>99</td>\n",
" <td>10</td>\n",
" <td>4.3</td>\n",
" <td>0.43</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Schlitz_Light</td>\n",
" <td>97</td>\n",
" <td>7</td>\n",
" <td>4.2</td>\n",
" <td>0.47</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Heineken</td>\n",
" <td>152</td>\n",
" <td>11</td>\n",
" <td>5.0</td>\n",
" <td>0.77</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Old_Milwaukee</td>\n",
" <td>145</td>\n",
" <td>23</td>\n",
" <td>4.6</td>\n",
" <td>0.28</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Augsberger</td>\n",
" <td>175</td>\n",
" <td>24</td>\n",
" <td>5.5</td>\n",
" <td>0.40</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Srohs_Bohemian_Style</td>\n",
" <td>149</td>\n",
" <td>27</td>\n",
" <td>4.7</td>\n",
" <td>0.42</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Lowenbrau</td>\n",
" <td>157</td>\n",
" <td>15</td>\n",
" <td>0.9</td>\n",
" <td>0.48</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Coors</td>\n",
" <td>140</td>\n",
" <td>18</td>\n",
" <td>4.6</td>\n",
" <td>0.44</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Schlitz</td>\n",
" <td>151</td>\n",
" <td>19</td>\n",
" <td>4.9</td>\n",
" <td>0.43</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Michelob_Light</td>\n",
" <td>135</td>\n",
" <td>11</td>\n",
" <td>4.2</td>\n",
" <td>0.50</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Becks</td>\n",
" <td>150</td>\n",
" <td>19</td>\n",
" <td>4.7</td>\n",
" <td>0.76</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Kirin</td>\n",
" <td>149</td>\n",
" <td>6</td>\n",
" <td>5.0</td>\n",
" <td>0.79</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>Hamms</td>\n",
" <td>139</td>\n",
" <td>19</td>\n",
" <td>4.4</td>\n",
" <td>0.43</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>Heilemans_Old_Style</td>\n",
" <td>144</td>\n",
" <td>24</td>\n",
" <td>4.9</td>\n",
" <td>0.43</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Kronenbourg</td>\n",
" <td>170</td>\n",
" <td>7</td>\n",
" <td>5.2</td>\n",
" <td>0.73</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Budweiser</td>\n",
" <td>144</td>\n",
" <td>15</td>\n",
" <td>4.7</td>\n",
" <td>0.43</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>Olympia_Goled_Light</td>\n",
" <td>72</td>\n",
" <td>6</td>\n",
" <td>2.9</td>\n",
" <td>0.46</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Pabst_Extra_Light</td>\n",
" <td>68</td>\n",
" <td>15</td>\n",
" <td>2.3</td>\n",
" <td>0.38</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name calories sodium alcohol cost cluster\n",
"9 Budweiser_Light 113 8 3.7 0.40 0\n",
"11 Coors_Light 102 15 4.1 0.46 0\n",
"8 Miller_Lite 99 10 4.3 0.43 0\n",
"19 Schlitz_Light 97 7 4.2 0.47 0\n",
"4 Heineken 152 11 5.0 0.77 1\n",
"5 Old_Milwaukee 145 23 4.6 0.28 1\n",
"6 Augsberger 175 24 5.5 0.40 1\n",
"7 Srohs_Bohemian_Style 149 27 4.7 0.42 1\n",
"2 Lowenbrau 157 15 0.9 0.48 1\n",
"10 Coors 140 18 4.6 0.44 1\n",
"1 Schlitz 151 19 4.9 0.43 1\n",
"12 Michelob_Light 135 11 4.2 0.50 1\n",
"13 Becks 150 19 4.7 0.76 1\n",
"14 Kirin 149 6 5.0 0.79 1\n",
"16 Hamms 139 19 4.4 0.43 1\n",
"17 Heilemans_Old_Style 144 24 4.9 0.43 1\n",
"3 Kronenbourg 170 7 5.2 0.73 1\n",
"0 Budweiser 144 15 4.7 0.43 1\n",
"18 Olympia_Goled_Light 72 6 2.9 0.46 2\n",
"15 Pabst_Extra_Light 68 15 2.3 0.38 2"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# save the cluster labels and sort by cluster\n",
"beer['cluster'] = km.labels_\n",
"beer.sort_values(by='cluster')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Review the coordinates of each cluster center."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[102.75 , 10. , 4.075 , 0.44 ],\n",
" [150. , 17. , 4.52142857, 0.52071429],\n",
" [ 70. , 10.5 , 2.6 , 0.42 ]])"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# review the cluster centers\n",
"km.cluster_centers_"
]
},
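{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use pandas to compute the mean of each feature within each cluster, and state whether the result matches the cluster centers computed above.\n",
"\n",
"Answer: Yes. The per-cluster means match `km.cluster_centers_` above, because each K-means center is the mean of the samples assigned to it."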
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_50398/58857758.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n",
" beer.groupby('cluster').mean()\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>calories</th>\n",
" <th>sodium</th>\n",
" <th>alcohol</th>\n",
" <th>cost</th>\n",
" </tr>\n",
" <tr>\n",
" <th>cluster</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>102.75</td>\n",
" <td>10.0</td>\n",
" <td>4.075000</td>\n",
" <td>0.440000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>150.00</td>\n",
" <td>17.0</td>\n",
" <td>4.521429</td>\n",
" <td>0.520714</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>70.00</td>\n",
" <td>10.5</td>\n",
" <td>2.600000</td>\n",
" <td>0.420000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" calories sodium alcohol cost\n",
"cluster \n",
"0 102.75 10.0 4.075000 0.440000\n",
"1 150.00 17.0 4.521429 0.520714\n",
"2 70.00 10.5 2.600000 0.420000"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate the mean of each numeric feature for each cluster\n",
"beer.groupby('cluster').mean(numeric_only=True)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_50398/1501469021.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n",
" centers = beer.groupby('cluster').mean()\n"
]
}
],
"source": [
"# save the DataFrame of cluster centers (numeric columns only)\n",
"centers = beer.groupby('cluster').mean(numeric_only=True)"
]
},
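{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualizing the clustering results\n",
"\n",
"> **Requirement:** Run, read, and understand the following code, and add `comments` or `markdown cells` to explain what each block of code does."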
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# allow plots to appear in the notebook\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"# set the font size\n",
"plt.rcParams['font.size'] = 14\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# create a \"colors\" array for plotting\n",
"import numpy as np\n",
"colors = np.array(['red', 'green', 'blue', 'yellow'])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'alcohol')"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# scatter plot of calories versus alcohol, colored by cluster (0=red, 1=green, 2=blue)\n",
"plt.scatter(beer.calories, beer.alcohol, c=colors[beer.cluster], s=50)\n",
"\n",
"# cluster centers, marked by \"+\"\n",
"plt.scatter(centers.calories, centers.alcohol, linewidths=3, marker='+', s=300, c='black')\n",
"\n",
"# add labels\n",
"plt.xlabel('calories')\n",
"plt.ylabel('alcohol')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.19454664171120434"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
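],
"source": [
"# calculate the silhouette coefficient (SC) for K=3\n",
"# the features are standardized first so that each contributes equally to the distances\n",
"# note: km.labels_ come from the earlier fit on the unscaled X\n",
"from sklearn import metrics\n",
"from sklearn.preprocessing import StandardScaler\n",
"scaler = StandardScaler()\n",
"scaler.fit(X)\n",
"X_scaled=scaler.transform(X)\n",
"metrics.silhouette_score(X_scaled, km.labels_)"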
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# calculate SC for K=2 through K=19\n",
"k_range = range(2, 20)\n",
"scores = []\n",
"for k in k_range:\n",
" km = KMeans(n_clusters=k, random_state=1)\n",
" km.fit(X_scaled)\n",
" scores.append(metrics.silhouette_score(X_scaled, km.labels_))"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# plot the results\n",
"plt.plot(k_range, scores)\n",
"plt.xlabel('Number of clusters')\n",
"plt.ylabel('Silhouette Coefficient')\n",
"plt.grid(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PCA Experiment"
]
},
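{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use PCA to reduce the beer data `X` from above to 2 dimensions, plot the reduced data points, and compute the reconstruction error introduced by the reduction.\n",
"\n",
"See: [PCA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)\n",
"\n",
"> Hint: focus on the Examples section of the documentation."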
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-11.41071341 -18.7563276 -24.2856944 -36.55439065 -19.00636267\n",
" -13.15190363 -43.12530515 -17.5095788 33.85968231 20.12053057\n",
" -7.70749236 30.40940729 -2.06904 -17.75787186 -15.55267904\n",
" 64.28684296 -6.80198502 -12.25534761 61.13554293 36.13268615]\n"
]
},
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7fd3c9bb7ac0>"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
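],
"source": [
"from sklearn.decomposition import PCA\n",
"# project X onto its first two principal components\n",
"pca = PCA(n_components=2)\n",
"pca.fit(X)\n",
"X_pca = pca.transform(X)\n",
"# plot the 2-D projection, colored by the K=3 cluster labels stored in beer.cluster\n",
"plt.scatter(X_pca[:, 0], X_pca[:, 1], c=colors[beer.cluster], s=50)\n"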
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
},
"vscode": {
"interpreter": {
"hash": "1f0d395e06aa83586067b19165efc9b683889967164248deef4bbf1fa27cfb00"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}