統計學習方法筆記（李航）———第三章（k近鄰法）

時間 2020-09-22 標籤統計學習方法筆記李航第三章 k 近鄰法

k 近鄰法 (k-NN) 是一種基於實例的學習方法，沒法轉化爲對參數空間的搜索問題（參數最優化問題）。它的特色是對特徵空間進行搜索。除了k近鄰法，本章還對如下幾個問題進行較深刻的討
論:html

切比雪夫距離 $L_{\infty}\left(x_{i}, x_{j}\right)$ 的計算
「近似偏差」與「估計偏差" 的含義
k-d樹搜索算法圖解

1、算法

輸入：訓練集 $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}, \quad x_{i} \in X \subseteq R^{n}$ 爲實例特徵向 $y_{i} \in Y=\left\{c_{1}, c_{2}, \ldots, c_{K}\right\}$ 爲實例的類別, $\quad i=1,2, \ldots, N$ python

輸出：實例 $x$ 所屬的類 $y$ web

設在給定距離度量下，涵蓋最近k個點的鄰域爲 $N_{K}(x)$
$y=\arg \max _{c_{j}} \sum_{x_{i} \in N_{K}(x)} I\left(y_{i}=c_{j}\right), \quad i=1,2, \ldots, K$ 算法

其中示性函數 $I(x)=\left\{\begin{array}{l}1, x \text { 爲真 }\\ 0, x \text { 爲假}\end{array}\right.$ svg

尋找使得函數 $\sum_{x_{i} \in N_{K}(x)} I\left(y_{i}=c_{j}\right)$ 取得最大值的變量 $c_{j},$ 也就是說, 看看距離 $x_{i}$ 最近的k個點裏面哪一類別最多，以此做爲輸出。函數

2、模型

根據模型的分類, k-NN模型屬於非機率模型。學習

觀察 $y=\arg \max _{c_{j}} \sum_{x_{i} \in N_{K}(x)} I\left(y_{i}=c_{j}\right)$ 可發現它與感知機不一樣的之處, 做爲決策函數, 它並不須要任何未知參數（感知機須要肯定W和b），直接從訓練集的數據獲得輸出。優化

1. 距離度量

k-NN的基本思想是，特徵空間中的距離反映了兩個點的類似程度, 所以「距離" 是做出分類判斷的基本依據。向量空間 $R^{n}$ 的距離有多種度量方式：ui

(1) 不一樣距離度量spa

通常形式是閔可夫斯基距離（ $L_{p}$ 範數）：

$L_{p}\left(x_{i}, x_{j}\right)=\left(\sum_{l=1}^{n}\left|x_{i}^{(l)}-x_{j}^{(l)}\right|^{p}\right)^{1 / p}$

當p=1時, 稱爲曼哈頓距離（ $L_{1}$ 範數）：

$L_{1}\left(x_{i}, x_{j}\right)=\sum_{l=1}^{n}\left|x_{i}^{(l)}-x_{j}^{(l)}\right|$

當p=2時，稱爲歐幾里得距離（ $L_{2}$ 範數），也就是最經常使用的距離：:

$L_{2}\left(x_{i}, x_{j}\right)=\left(\sum_{l=1}^{n}\left|x_{i}^{(l)}-x_{j}^{(l)}\right|^{2}\right)^{\frac{1}{2}}$

import math
from itertools import combinations

def L(x, y, p=2):
    # x1 = [1, 1], x2 = [5,1]
    if len(x) == len(y) and len(x) > 1:
        sum = 0
        for i in range(len(x)):
            sum += math.pow(abs(x[i] - y[i]), p)
        return math.pow(sum, 1 / p)
    else:
        return 0

下圖表示平面上A、B兩點之間不一樣的距離：

$L_{1}$ 只容許沿着座標軸方向前進, 就像曼哈頓街區的出租車走過的距離
$L_{2}$ 兩點之間直線最短, 經過勾股定理計算斜邊的距離
$L_{\infty}$ 只剩下一個維度, 即最大的份量方向上的距離

可見p取值越大，兩點之間的距離越短。

(2) 問題：爲何切比雪夫距離 $L_{\infty}\left(x_{i}, x_{j}\right)=\max _{l}\left|x_{i}^{(l)}-x_{j}^{(l)}\right| ?$

其實這個問題等價於：爲何 $\|x\|_{\infty}=\max \left|x^{(l)}\right|,$ 即 $R^{n}$ 空間中的向量 $x,$ 它的切比雪夫長度等於它的最大份量方向上的長度。

證實: 設 $x=\left(x^{(1)}, x^{(2)}, \ldots x^{(n)}\right)^{T} \in X \subseteq R^{n}$

不妨設 $\left|x^{(1)}\right|=\max \left\{\left|x^{(1)}\right|,\left|x^{(2)}, \ldots,\right| x^{(n)}||\right\},$ 即 $\left|x^{(l)}\right| \leq\left|x^{(1)}\right|$

注意：最大份量的長度惟一, 但最大份量可能並不惟一,$設有 $x^{(1)}, x^{(2)}, \ldots x^{(k)}$ 等K個份量的長度都等於 $\left|x^{(1)}\right|$

當 $\left|x^{(l)}\right|=\left|x^{(1)}\right|, \quad \lim _{p \rightarrow \infty}\left(\frac{\left|x^{(l)}\right|}{\left|x^{(1)}\right|}\right)^{p}=1,$ 即 $x^{(l)}$ 爲 $x^{(1)}, x^{(2)}, \ldots x^{(k)}$ 時

當 $\left|x^{(l)}\right|<\left|x^{(1)}\right|, \quad \lim _{p \rightarrow \infty}\left(\frac{\left|x^{(l)}\right|}{\left|x^{(1)}\right|}\right)^{p}=0,$ 即 $x^{(l)}$ 爲非最大長度份量時

計算 $x$ 的切比雪夫長度:

$\|x\|_{\infty}=\lim _{p \rightarrow \infty}\left(\sum_{l=1}^{n}\left|x^{(l)}\right|^{p}\right)^{\frac{1}{p}}=\lim _{p \rightarrow \infty}\left(\left|x^{(1)}\right|^{p} \sum_{l=1}^{n} \frac{\left|x^{(l)}\right|^{p}}{\left|x^{(1)}\right|^{p}}\right)^{\frac{1}{p}}$

因爲已知 $\lim _{p \rightarrow \infty}\left(\frac{\left|x^{(l)}\right|}{\left|x^{(1)}\right|}\right)^{p}$ 等於0或1，且有k個份量結果爲1, 因此

$\lim _{p \rightarrow \infty} \sum_{l=1}^{n}\left(\frac{\left|x^{(l)}\right|}{\left|x^{(1)}\right|}\right)^{p}=(1+1+\ldots+1+0+0+\ldots .0)=k$

所以
$\|x\|_{\infty}=\lim _{p \rightarrow \infty}\left(\left|x^{(1)}\right|^{p} \sum_{l=1}^{n} \frac{\left|x^{(l)}\right|^{p}}{\left|x^{(1)}\right|^{p}}\right)^{\frac{1}{p}}=\lim _{p \rightarrow \infty}\left(\left|x^{(1)}\right|^{p} k\right)^{\frac{1}{p}}=\lim _{p \rightarrow \infty}\left|x^{(1)}\right| k^{\frac{1}{p}}=\left|x^{(1)}\right|$
即 $\|x\|_{\infty}=\max \left|x^{(l)}\right|,$ 得證。

以上證實參考

(3) 平面上的圓

$L_{p}=1$ 在平面上的圖像:

若是圓形的定義是「到某一點距離相等的曲線圍成的圖形" ，那麼在不一樣的距離度量下，圓形的形狀能夠徹底不一樣。爲什麼 $L_{1}$ 正則化在平面上的等高線爲同心正方形, 不就很容易理解嗎?

k值選擇

李航老師書中引入了「近似偏差」和「估計偏差」兩個概念，但沒有給出具體定義。

這裏簡單總結一下：
$I\left[\hat{f}_{H, l}\right]-I\left[f_{0}\right]=\left(I\left[\hat{f}_{H, l}\right]-I\left[f_{H}\right]\right)+\left(I\left[f_{H}\right]-I\left[f_{0}\right]\right)$
右側兩項分別是「估計偏差」和「近似偏差」

估計偏差：訓練集與無限數據集獲得的模型的差距
近似偏差：限制假設空間與無限制假設空間獲得的模型的差距
- k值越小 $\Leftrightarrow$ 單個樣本影響越大 $\Leftrightarrow$ 模型越複雜 $\Leftrightarrow$ 假設空間越大 $\Rightarrow$ 近似偏差越小 (估計偏差越大），容易過擬合;
- k值越大 $\Leftrightarrow$ 單個樣本影響越小 $\Leftrightarrow$ 模型越簡單 $\Leftrightarrow$ 假設空間越小 $\Rightarrow$ 近似偏差越大（估計偏差越小），容易欠擬合。

通常經過交叉驗證來肯定最優的 k 值。

決策規則

k 近鄰的決策規則就是「多數表決" ，即少數服從多數, 用數學表達式表示爲

$\frac{1}{k} \sum_{x_{i} \in N_{k}(x)} I\left(y_{i} \neq c_{j}\right)=1-\frac{1}{k} \sum_{x_{i} \in N_{k}(x)} I\left(y_{i}=c_{j}\right)$

等號左側爲誤分類率，要是誤分類率最小，就要使得 $\sum_{x_{i} \in N_{k}(x)} I\left(y_{i}=c_{j}\right)$ 最大, 即選擇集合 $N_{k}(x)$ 中最多的一類。

3、kd樹

參見本篇文章：（必定看）
kd樹介紹（KNN算法引出）

4、習題

3.1

參照圖3.1，在二維空間中給出實例點，畫出kk爲1和2時的kk近鄰法構成的空間劃分，並對其進行比較，體會kk值選擇與模型複雜度及預測準確率的關係。

%matplotlib inline
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap 

data = np.array([[5, 12, 1],
                [6, 21, 0],
                [14, 5, 0],
                [16, 10, 0],
                [13, 19, 0],
                [13, 32, 1],
                [17, 27, 1],
                [18, 24, 1],
                [20, 20, 0],
                [23, 14, 1],
                [23, 25, 1],
                [23, 31, 1],
                [26, 8, 0],
                [30, 17, 1],
                [30, 26, 1],
                [34, 8, 0],
                [34, 19, 1],
                [37, 28, 1]])
X_train = data[:, 0:2]
y_train = data[:, 2]

models = (KNeighborsClassifier(n_neighbors=1, n_jobs=-1),
          KNeighborsClassifier(n_neighbors=2, n_jobs=-1))
models = (clf.fit(X_train, y_train) for clf in models)

titles = ('K Neighbors with k=1',
          'K Neighbors with k=2')

fig = plt.figure(figsize=(15, 5))
plt.subplots_adjust(wspace=0.4, hspace=0.4)

X0, X1 = X_train[:, 0], X_train[:, 1]

x_min, x_max = X0.min() - 1, X0.max() + 1
y_min, y_max = X1.min() - 1, X1.max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.2),
                         np.arange(y_min, y_max, 0.2))

for clf, title, ax in zip(models, titles, fig.subplots(1, 2).flatten()):    
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape) 
    colors = ('red', 'green', 'lightgreen', 'gray', 'cyan')  
    cmap = ListedColormap(colors[:len(np.unique(Z))])  
    ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.5)

    ax.scatter(X0, X1, c=y_train, s=50, edgecolors='k', cmap=cmap, alpha=0.5)
    ax.set_title(title)

plt.show()

3.2
利用例題3.2構造的kd樹求點x $=(3,4.5)^{\mathrm{T}}$ 的最近鄰點。

import numpy as np
from sklearn.neighbors import KDTree

train_data = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
tree = KDTree(train_data, leaf_size=2)
dist, ind = tree.query(np.array([(3, 4.5)]), k=1)
x1 = train_data[ind[0]][0][0]
x2 = train_data[ind[0]][0][1]

print("x點的最近鄰點是({0}, {1})".format(x1, x2))
#x點的最近鄰點是(2, 3)

3.3
解答：
算法：用kd樹的kk近鄰搜索
輸入：已構造的kd樹；目標點xx；
輸出：xx的最近鄰

在kdkd樹中找出包含目標點xx的葉結點：從根結點出發，遞歸地向下訪問樹。若目標點xx當前維的座標小於切分點的座標，則移動到左子結點，不然移動到右子結點，直到子結點爲葉結點爲止；
若是「當前kk近鄰點集」元素數量小於kk或者葉節點距離小於「當前kk近鄰點集」中最遠點距離，那麼將葉節點插入「當前k近鄰點集」；
遞歸地向上回退，在每一個結點進行如下操做：
(a)若是「當前kk近鄰點集」元素數量小於kk或者當前節點距離小於「當前kk近鄰點集」中最遠點距離，那麼將該節點插入「當前kk近鄰點集」。
(b)檢查另外一子結點對應的區域是否與以目標點爲球心、以目標點與於「當前kk近鄰點集」中最遠點間的距離爲半徑的超球體相交。若是相交，可能在另外一個子結點對應的區域內存在距目標點更近的點，移動到另外一個子結點，接着，遞歸地進行最近鄰搜索；若是不相交，向上回退；
當回退到根結點時，搜索結束，最後的「當前kk近鄰點集」即爲xx的最近鄰點。

# 構建kd樹，搜索待預測點所屬區域
from collections import namedtuple
import numpy as np


# 創建節點類
class Node(namedtuple("Node", "location left_child right_child")):
    def __repr__(self):
        return str(tuple(self))


# kd tree類
class KdTree():
    def __init__(self, k=1):
        self.k = k
        self.kdtree = None

    # 構建kd tree
    def _fit(self, X, depth=0):
        try:
            k = self.k
        except IndexError as e:
            return None
        # 這裏能夠展開，經過方差選擇axis
        axis = depth % k
        X = X[X[:, axis].argsort()]
        median = X.shape[0] // 2
        try:
            X[median]
        except IndexError:
            return None
        return Node(
            location=X[median],
            left_child=self._fit(X[:median], depth + 1),
            right_child=self._fit(X[median + 1:], depth + 1)
        )

    def _search(self, point, tree=None, depth=0, best=None):
        if tree is None:
            return best
        k = self.k
        # 更新 branch
        if point[0][depth % k] < tree.location[depth % k]:
            next_branch = tree.left_child
        else:
            next_branch = tree.right_child
        if not next_branch is None:
            best = next_branch.location
        return self._search(point, tree=next_branch, depth=depth + 1, best=best)

    def fit(self, X):
        self.kdtree = self._fit(X)
        return self.kdtree

    def predict(self, X):
        res = self._search(X, self.kdtree)
        return res

KNN = KdTree()
X_train = np.array([[2, 3],
                    [5, 4],
                    [9, 6],
                    [4, 7],
                    [8, 1],
                    [7, 2]])
KNN.fit(X_train)
X_new = np.array([[3, 4.5]])
res = KNN.predict(X_new)

x1 = res[0]
x2 = res[1]

print("x點的最近鄰點是({0}, {1})".format(x1, x2))
#x點的最近鄰點是(2, 3)