{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# K-means Experiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the beer dataset `e2.0_beer.txt` with pandas." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>name</th><th>calories</th><th>sodium</th><th>alcohol</th><th>cost</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Budweiser</td><td>144</td><td>15</td><td>4.7</td><td>0.43</td></tr>\n",
"<tr><th>1</th><td>Schlitz</td><td>151</td><td>19</td><td>4.9</td><td>0.43</td></tr>\n",
"<tr><th>2</th><td>Lowenbrau</td><td>157</td><td>15</td><td>0.9</td><td>0.48</td></tr>\n",
"<tr><th>3</th><td>Kronenbourg</td><td>170</td><td>7</td><td>5.2</td><td>0.73</td></tr>\n",
"<tr><th>4</th><td>Heineken</td><td>152</td><td>11</td><td>5.0</td><td>0.77</td></tr>\n",
"<tr><th>5</th><td>Old_Milwaukee</td><td>145</td><td>23</td><td>4.6</td><td>0.28</td></tr>\n",
"<tr><th>6</th><td>Augsberger</td><td>175</td><td>24</td><td>5.5</td><td>0.40</td></tr>\n",
"<tr><th>7</th><td>Srohs_Bohemian_Style</td><td>149</td><td>27</td><td>4.7</td><td>0.42</td></tr>\n",
"<tr><th>8</th><td>Miller_Lite</td><td>99</td><td>10</td><td>4.3</td><td>0.43</td></tr>\n",
"<tr><th>9</th><td>Budweiser_Light</td><td>113</td><td>8</td><td>3.7</td><td>0.40</td></tr>\n",
"<tr><th>10</th><td>Coors</td><td>140</td><td>18</td><td>4.6</td><td>0.44</td></tr>\n",
"<tr><th>11</th><td>Coors_Light</td><td>102</td><td>15</td><td>4.1</td><td>0.46</td></tr>\n",
"<tr><th>12</th><td>Michelob_Light</td><td>135</td><td>11</td><td>4.2</td><td>0.50</td></tr>\n",
"<tr><th>13</th><td>Becks</td><td>150</td><td>19</td><td>4.7</td><td>0.76</td></tr>\n",
"<tr><th>14</th><td>Kirin</td><td>149</td><td>6</td><td>5.0</td><td>0.79</td></tr>\n",
"<tr><th>15</th><td>Pabst_Extra_Light</td><td>68</td><td>15</td><td>2.3</td><td>0.38</td></tr>\n",
"<tr><th>16</th><td>Hamms</td><td>139</td><td>19</td><td>4.4</td><td>0.43</td></tr>\n",
"<tr><th>17</th><td>Heilemans_Old_Style</td><td>144</td><td>24</td><td>4.9</td><td>0.43</td></tr>\n",
"<tr><th>18</th><td>Olympia_Goled_Light</td><td>72</td><td>6</td><td>2.9</td><td>0.46</td></tr>\n",
"<tr><th>19</th><td>Schlitz_Light</td><td>97</td><td>7</td><td>4.2</td><td>0.47</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " name calories sodium alcohol cost\n", "0 Budweiser 144 15 4.7 0.43\n", "1 Schlitz 151 19 4.9 0.43\n", "2 Lowenbrau 157 15 0.9 0.48\n", "3 Kronenbourg 170 7 5.2 0.73\n", "4 Heineken 152 11 5.0 0.77\n", "5 Old_Milwaukee 145 23 4.6 0.28\n", "6 Augsberger 175 24 5.5 0.40\n", "7 Srohs_Bohemian_Style 149 27 4.7 0.42\n", "8 Miller_Lite 99 10 4.3 0.43\n", "9 Budweiser_Light 113 8 3.7 0.40\n", "10 Coors 140 18 4.6 0.44\n", "11 Coors_Light 102 15 4.1 0.46\n", "12 Michelob_Light 135 11 4.2 0.50\n", "13 Becks 150 19 4.7 0.76\n", "14 Kirin 149 6 5.0 0.79\n", "15 Pabst_Extra_Light 68 15 2.3 0.38\n", "16 Hamms 139 19 4.4 0.43\n", "17 Heilemans_Old_Style 144 24 4.9 0.43\n", "18 Olympia_Goled_Light 72 6 2.9 0.46\n", "19 Schlitz_Light 97 7 4.2 0.47" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# beer dataset\n", "import pandas as pd\n", "url = 'e2.0_beer.txt'\n", "beer = pd.read_csv(url, sep=' ')\n", "beer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "去掉`name`项,保留`calories`、`sodium`、`alcohol`和`cost`数据,作为特征`X`。" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# define X\n", "X = beer.drop('name', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "建立K-means聚类器,使类别数为3,并进行数据拟合。" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
KMeans(n_clusters=3)
" ], "text/plain": [ "KMeans(n_clusters=3)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# K-means with 3 clusters\n", "# 注意使K-means聚类器的对象名称为 km\n", "from sklearn.cluster import KMeans\n", "km = KMeans(n_clusters=3)\n", "km.fit(X)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "将聚类结果传递给pandas数据框,并按类别排序,查看各个啤酒参与聚类的结果。" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>name</th><th>calories</th><th>sodium</th><th>alcohol</th><th>cost</th><th>cluster</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>9</th><td>Budweiser_Light</td><td>113</td><td>8</td><td>3.7</td><td>0.40</td><td>0</td></tr>\n",
"<tr><th>11</th><td>Coors_Light</td><td>102</td><td>15</td><td>4.1</td><td>0.46</td><td>0</td></tr>\n",
"<tr><th>8</th><td>Miller_Lite</td><td>99</td><td>10</td><td>4.3</td><td>0.43</td><td>0</td></tr>\n",
"<tr><th>19</th><td>Schlitz_Light</td><td>97</td><td>7</td><td>4.2</td><td>0.47</td><td>0</td></tr>\n",
"<tr><th>4</th><td>Heineken</td><td>152</td><td>11</td><td>5.0</td><td>0.77</td><td>1</td></tr>\n",
"<tr><th>5</th><td>Old_Milwaukee</td><td>145</td><td>23</td><td>4.6</td><td>0.28</td><td>1</td></tr>\n",
"<tr><th>6</th><td>Augsberger</td><td>175</td><td>24</td><td>5.5</td><td>0.40</td><td>1</td></tr>\n",
"<tr><th>7</th><td>Srohs_Bohemian_Style</td><td>149</td><td>27</td><td>4.7</td><td>0.42</td><td>1</td></tr>\n",
"<tr><th>2</th><td>Lowenbrau</td><td>157</td><td>15</td><td>0.9</td><td>0.48</td><td>1</td></tr>\n",
"<tr><th>10</th><td>Coors</td><td>140</td><td>18</td><td>4.6</td><td>0.44</td><td>1</td></tr>\n",
"<tr><th>1</th><td>Schlitz</td><td>151</td><td>19</td><td>4.9</td><td>0.43</td><td>1</td></tr>\n",
"<tr><th>12</th><td>Michelob_Light</td><td>135</td><td>11</td><td>4.2</td><td>0.50</td><td>1</td></tr>\n",
"<tr><th>13</th><td>Becks</td><td>150</td><td>19</td><td>4.7</td><td>0.76</td><td>1</td></tr>\n",
"<tr><th>14</th><td>Kirin</td><td>149</td><td>6</td><td>5.0</td><td>0.79</td><td>1</td></tr>\n",
"<tr><th>16</th><td>Hamms</td><td>139</td><td>19</td><td>4.4</td><td>0.43</td><td>1</td></tr>\n",
"<tr><th>17</th><td>Heilemans_Old_Style</td><td>144</td><td>24</td><td>4.9</td><td>0.43</td><td>1</td></tr>\n",
"<tr><th>3</th><td>Kronenbourg</td><td>170</td><td>7</td><td>5.2</td><td>0.73</td><td>1</td></tr>\n",
"<tr><th>0</th><td>Budweiser</td><td>144</td><td>15</td><td>4.7</td><td>0.43</td><td>1</td></tr>\n",
"<tr><th>18</th><td>Olympia_Goled_Light</td><td>72</td><td>6</td><td>2.9</td><td>0.46</td><td>2</td></tr>\n",
"<tr><th>15</th><td>Pabst_Extra_Light</td><td>68</td><td>15</td><td>2.3</td><td>0.38</td><td>2</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " name calories sodium alcohol cost cluster\n", "9 Budweiser_Light 113 8 3.7 0.40 0\n", "11 Coors_Light 102 15 4.1 0.46 0\n", "8 Miller_Lite 99 10 4.3 0.43 0\n", "19 Schlitz_Light 97 7 4.2 0.47 0\n", "4 Heineken 152 11 5.0 0.77 1\n", "5 Old_Milwaukee 145 23 4.6 0.28 1\n", "6 Augsberger 175 24 5.5 0.40 1\n", "7 Srohs_Bohemian_Style 149 27 4.7 0.42 1\n", "2 Lowenbrau 157 15 0.9 0.48 1\n", "10 Coors 140 18 4.6 0.44 1\n", "1 Schlitz 151 19 4.9 0.43 1\n", "12 Michelob_Light 135 11 4.2 0.50 1\n", "13 Becks 150 19 4.7 0.76 1\n", "14 Kirin 149 6 5.0 0.79 1\n", "16 Hamms 139 19 4.4 0.43 1\n", "17 Heilemans_Old_Style 144 24 4.9 0.43 1\n", "3 Kronenbourg 170 7 5.2 0.73 1\n", "0 Budweiser 144 15 4.7 0.43 1\n", "18 Olympia_Goled_Light 72 6 2.9 0.46 2\n", "15 Pabst_Extra_Light 68 15 2.3 0.38 2" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# save the cluster labels and sort by cluster\n", "beer['cluster'] = km.labels_\n", "beer.sort_values(by='cluster')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "查看聚类结果中各个簇的中心点坐标" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[102.75 , 10. , 4.075 , 0.44 ],\n", " [150. , 17. , 4.52142857, 0.52071429],\n", " [ 70. , 10.5 , 2.6 , 0.42 ]])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# review the cluster centers\n", "km.cluster_centers_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "用pandas查看各类别样本的坐标均值,并回答是否和先前计算相同?\n", "\n", "答:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_50398/58857758.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. 
Either specify numeric_only or select only columns which should be valid for the function.\n", " beer.groupby('cluster').mean()\n" ] }, { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th>cluster</th><th>calories</th><th>sodium</th><th>alcohol</th><th>cost</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>102.75</td><td>10.0</td><td>4.075000</td><td>0.440000</td></tr>\n",
"<tr><th>1</th><td>150.00</td><td>17.0</td><td>4.521429</td><td>0.520714</td></tr>\n",
"<tr><th>2</th><td>70.00</td><td>10.5</td><td>2.600000</td><td>0.420000</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " calories sodium alcohol cost\n", "cluster \n", "0 102.75 10.0 4.075000 0.440000\n", "1 150.00 17.0 4.521429 0.520714\n", "2 70.00 10.5 2.600000 0.420000" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the mean of each feature for each cluster\n", "beer.groupby('cluster').mean()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_50398/1501469021.py:2: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.\n", " centers = beer.groupby('cluster').mean()\n" ] } ], "source": [ "# save the DataFrame of cluster centers\n", "centers = beer.groupby('cluster').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "聚类结果可视化\n", "\n", "> **要求**: 请运行、阅读和理解以下程序,并通过添加`注释`或者`markdown cell`,以说明每段代码的功能。" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# allow plots to appear in the notebook\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "# set the font size\n", "plt.rcParams['font.size'] = 14\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# create a \"colors\" array for plotting\n", "import numpy as np\n", "colors = np.array(['red', 'green', 'blue', 'yellow'])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'alcohol')" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# scatter plot of calories versus alcohol, colored by cluster (0=red, 1=green, 2=blue)\n", "plt.scatter(beer.calories, beer.alcohol, c=colors[beer.cluster], s=50)\n", "\n", "# cluster centers, marked by \"+\"\n", "plt.scatter(centers.calories, centers.alcohol, linewidths=3, marker='+', s=300, c='black')\n", "\n", "# add labels\n", "plt.xlabel('calories')\n", "plt.ylabel('alcohol')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.19454664171120434" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate SC for K=3\n", "from sklearn import metrics\n", "from sklearn.preprocessing import StandardScaler\n", "scaler = StandardScaler()\n", "scaler.fit(X)\n", "X_scaled=scaler.transform(X)\n", "metrics.silhouette_score(X_scaled, km.labels_)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# calculate SC for K=2 through K=19\n", "k_range = range(2, 20)\n", "scores = []\n", "for k in k_range:\n", " km = KMeans(n_clusters=k, random_state=1)\n", " km.fit(X_scaled)\n", " scores.append(metrics.silhouette_score(X_scaled, km.labels_))" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the results\n", "plt.plot(k_range, scores)\n", "plt.xlabel('Number of clusters')\n", "plt.ylabel('Silhouette Coefficient')\n", "plt.grid(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PCA实验" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "请使用PCA算法将前文中的啤酒数据`X`降维到2维空间,并绘制出降维之后的数据点,并且计算降维导致的重建误差。\n", "\n", "参见:[PCA算法文档](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)\n", "\n", "> 提示:着重看文档中的示例Examples" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[-11.41071341 -18.7563276 -24.2856944 -36.55439065 -19.00636267\n", " -13.15190363 -43.12530515 -17.5095788 33.85968231 20.12053057\n", " -7.70749236 30.40940729 -2.06904 -17.75787186 -15.55267904\n", " 64.28684296 -6.80198502 -12.25534761 61.13554293 36.13268615]\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "X_pca = pca.transform(X)\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=colors[beer.cluster], s=50)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.8 ('.venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "1f0d395e06aa83586067b19165efc9b683889967164248deef4bbf1fa27cfb00" } } }, "nbformat": 4, "nbformat_minor": 2 }