
Principle and implementation of the kNN algorithm

Python implementations of basic machine learning algorithms can be found at: zlxy9892/ml_code

1 Principle

kNN is one of the most basic algorithms in machine learning. It can solve both classification and regression problems, and if you are just getting started with machine learning, it is an excellent entry point: it is easy to understand and simple to implement. Let's begin with the principle behind the algorithm.

The basic assumption behind the kNN algorithm is that samples of the same category lie close together in the feature space.

As shown in the figure below, suppose red, green, and blue points are distributed in a two-dimensional space. They correspond to the training samples of a classification task with three categories and two features. If we now want to infer which category the hollow circle in the figure belongs to, the kNN algorithm computes the distance between this query point and every training sample, then picks out the k samples with the smallest distances (here k=4). The four points connected to the hollow circle in the figure serve as the reference for deciding its category. Since all four belong to the red category, the query point is predicted to be red.
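
In code, "pick out the k samples with the smallest distances" is just a sort over the distances. Here is a minimal numpy sketch; the data points are made up purely for illustration:

import numpy as np

query = np.array([2.0, 3.0])  # hypothetical point to classify
train = np.array([[1.0, 2.0], [8.0, 9.0], [2.5, 2.5], [2.0, 4.0], [1.5, 3.5]])
dists = np.sum((train - query) ** 2, axis=1)  # squared Euclidean distance to each training sample
top_k_index = np.argsort(dists)[:4]  # indices of the k=4 nearest samples
print(top_k_index)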

Now consider another situation: the query point lies somewhere in the middle (as shown in the figure below). Again the 4 nearest sample points are found, but this time they span all three categories (1 red, 1 blue, 2 green). In such a case, kNN usually decides by majority vote: the category that occurs most often among the k sample points wins, so the query point is predicted to be green.
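
The voting rule itself is a one-liner with collections.Counter. A minimal sketch, with the neighbor labels hard-coded to mirror the example above:

from collections import Counter

neighbor_labels = ['red', 'blue', 'green', 'green']  # labels of the k=4 nearest neighbors
winner = Counter(neighbor_labels).most_common(1)[0][0]
print(winner)  # -> 'green'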

The principle of kNN really is that simple. Some will then ask: compared with the more sophisticated machine learning algorithms that are popular today, such as neural networks and random forests, what value does such a simple algorithm have?

To answer this, recall the no-free-lunch theorem in machine learning: for any given problem (learning task), there is no single best model. It reminds us that for a specific learning task we should choose the learner best suited to that problem; in other words, analyze each problem on its own terms.

So kNN certainly has value. Suppose we have a learning task and need to choose a learner, but there is no prior work on the problem. Where do we start? Usually we do not need to reach for a neural network or a powerful ensemble model right away; instead, we can "explore" with a simple model first, and kNN is a good choice for this. What is the benefit? kNN is essentially the representative of lazy learning: it does not use the training data to fit a model at all, but completes the classification directly by a vote among the top-k nearest neighbor samples. If such a lazy model achieves high accuracy on the current problem, we can conclude that the learning task is relatively simple and that the sample points of different categories are clearly separated in the feature space, so no complex model is needed. Conversely, if kNN yields low accuracy, the message is that the task is somewhat complicated, often because the different categories are not clearly separated in the feature space and are usually not linearly separable, so a more powerful learner is called for. In this way, the simple kNN algorithm helps the modeler form a rough judgment of the problem's complexity and decide how to proceed: keep mining feature engineering, or switch to a more complex model. A sketch of this baseline workflow follows.
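
The sketch below reuses the KNN class implemented in section 2 and the x_train/x_test arrays prepared in train.py; the 0.95 and 0.80 accuracy thresholds are purely illustrative, not fixed rules:

from knn import KNN

baseline = KNN(k=5)
baseline.fit(x_train, y_train)
acc = baseline.score(y_test, baseline.predict(x_test))
if acc > 0.95:
    print('classes look well separated; a simple model may suffice')
elif acc < 0.80:
    print('task looks hard; invest in features or a stronger model')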

2 Algorithm implementation

Let's implement a kNN classifier ourselves in Python.

Code link: github.com/zlxy9892/ml_

First, create a file called knn.py to hold the class implementing kNN. The code is as follows.

# -*- coding: utf-8 -*-

import numpy as np
import operator

class KNN(object):

    def __init__(self, k=3):
        self.k = k

    def fit(self, x, y):
        # lazy learning: just store the training data, no model is fitted
        self.x = x
        self.y = y

    def _square_distance(self, v1, v2):
        # squared Euclidean distance; the square root is omitted since it
        # does not change the ordering of neighbors
        return np.sum(np.square(v1 - v2))

    def _vote(self, ys):
        # majority vote among the labels of the k nearest neighbors
        vote_dict = {}
        for y in ys:
            if y not in vote_dict:
                vote_dict[y] = 1
            else:
                vote_dict[y] += 1
        sorted_vote_dict = sorted(vote_dict.items(), key=operator.itemgetter(1), reverse=True)
        return sorted_vote_dict[0][0]

    def predict(self, x):
        y_pred = []
        for i in range(len(x)):
            # distance from the i-th query point to every training sample
            dist_arr = [self._square_distance(x[i], self.x[j]) for j in range(len(self.x))]
            # indices of the k nearest training samples
            sorted_index = np.argsort(dist_arr)
            top_k_index = sorted_index[:self.k]
            y_pred.append(self._vote(ys=self.y[top_k_index]))
        return np.array(y_pred)

    def score(self, y_true=None, y_pred=None):
        # with no arguments, report accuracy on the training data itself
        if y_true is None and y_pred is None:
            y_pred = self.predict(self.x)
            y_true = self.y
        score = 0.0
        for i in range(len(y_true)):
            if y_true[i] == y_pred[i]:
                score += 1
        score /= len(y_true)
        return score

Next, create a train.py file that generates random sample data, calls the KNN class to perform the classification task, and computes the prediction accuracy.

# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
from knn import KNN


# data generation
np.random.seed(314)
data_size_1 = 300
x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
y_1 = [0 for _ in range(data_size_1)]

data_size_2 = 400
x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
y_2 = [1 for _ in range(data_size_2)]

x1 = np.concatenate((x1_1, x1_2), axis=0)
x2 = np.concatenate((x2_1, x2_2), axis=0)
x = np.hstack((x1.reshape(-1,1), x2.reshape(-1,1)))
y = np.concatenate((y_1, y_2), axis=0)

data_size_all = data_size_1+data_size_2
shuffled_index = np.random.permutation(data_size_all)
x = x[shuffled_index]
y = y[shuffled_index]

split_index = int(data_size_all*0.7)
x_train = x[:split_index]
y_train = y[:split_index]
x_test = x[split_index:]
y_test = y[split_index:]

# visualize data
plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
plt.show()
plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
plt.show()

# data preprocessing: min-max scaling based on the training set statistics,
# so that train and test data are scaled consistently
x_min = np.min(x_train, axis=0)
x_max = np.max(x_train, axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_test = (x_test - x_min) / (x_max - x_min)

# knn classifier
clf = KNN(k=3)
clf.fit(x_train, y_train)

print('train accuracy: {:.3}'.format(clf.score()))

y_test_pred = clf.predict(x_test)
print('test accuracy: {:.3}'.format(clf.score(y_test, y_test_pred)))
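
As a cross-check, the same data can be fed to scikit-learn's KNeighborsClassifier (assuming scikit-learn is installed); the two classifiers should score similarly:

from sklearn.neighbors import KNeighborsClassifier

sk_clf = KNeighborsClassifier(n_neighbors=3)
sk_clf.fit(x_train, y_train)
print('sklearn test accuracy: {:.3}'.format(sk_clf.score(x_test, y_test)))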

With that, both the principle of the kNN algorithm and its code implementation have been covered.