K EN YAKIN KOMŞUNUN UYGULANMASI

Önkoşul: K en yakın komşular

giriiş

Diyelim ki bize her biri sayısal olarak değerli özelliklere (Boy Kilo Yaş vb.) sahip öğelerden oluşan bir veri seti verildi. Özellik sayısı ise N öğeleri bir nokta olarak temsil edebiliriz N boyutlu ızgara. Yeni bir öğe verildiğinde, öğenin setteki diğer öğelere olan mesafesini hesaplayabiliriz. Biz seçiyoruz k en yakın komşular ve bu komşuların çoğunun nerede sınıflandırıldığını görüyoruz. Yeni öğeyi orada sınıflandırıyoruz.
Böylece sorun ortaya çıkıyor öğeler arasındaki mesafeleri nasıl hesaplayabiliriz? Bunun çözümü veri setine bağlıdır. Değerler gerçekse genellikle Öklid mesafesini kullanırız. Değerler kategorik veya ikili ise genellikle Hamming mesafesini kullanırız.
Algoritma:

Given a new item: 1. Find distances between new item and all other items 2. Pick k shorter distances 3. Pick the most common class in these k distances 4. That class is where we will classify the new item

Veri Okuma

Java'da for döngüsü

Giriş dosyamız aşağıdaki formatta olsun:

Height Weight Age Class 1.70 65 20 Programmer 1.90 85 33 Builder 1.78 76 31 Builder 1.73 74 24 Programmer 1.81 75 35 Builder 1.73 70 75 Scientist 1.80 71 63 Scientist 1.75 69 25 Programmer

Her öğe bir satırdır ve 'Sınıf' altında öğenin nerede sınıflandırıldığını görüyoruz. Özellik adlarının altındaki değerler ('Yükseklik' vb.), öğenin o özelliğe ilişkin sahip olduğu değerdir. Tüm değerler ve özellikler virgülle ayrılmıştır.
Bu veri dosyalarını çalışma dizinine yerleştirin veri2 Ve veri . Birini seçin ve içeriğini olduğu gibi adlı bir metin dosyasına yapıştırın. veri .
Dosyayı ('data.txt' adlı) okuyacağız ve girdiyi satırlara böleceğiz:

c# öğretici

f = open('data.txt' 'r'); lines = f.read().splitlines(); f.close();

Dosyanın ilk satırı, sonunda 'Sınıf' anahtar kelimesi bulunan özellik adlarını içerir. Özellik adlarını bir listede saklamak istiyoruz:

# Split the first line by commas # remove the first element and # save the rest into a list. The # list now holds the feature # names of the data set. features = lines[0].split(' ')[:-1];

Daha sonra veri setinin kendisine geçiyoruz. Öğeleri adlı bir listeye kaydedeceğiz. öğeler öğeleri sözlük olan (her öğe için bir tane). Bu öğe sözlüklerinin anahtarları, özellik adları ve öğe sınıfını tutacak 'Sınıf'tır. Sonunda listedeki öğeleri karıştırmak istiyoruz (bu, öğelerin tuhaf bir sırada olması durumunda bir güvenlik önlemidir).

Python3

items = []; for i in range(1 len(lines)): line = lines[i].split(' '); itemFeatures = {'Class' : line[-1]}; # Iterate through the features for j in range(len(features)): # Get the feature at index j f = features[j]; # The first item in the line # is the class skip it v = float(line[j]); # Add feature to dict itemFeatures[f] = v; # Append temp dict to items items.append(itemFeatures); shuffle(items);

Verilerin sınıflandırılması

Saklanan verilerle öğeler şimdi sınıflandırıcımızı oluşturmaya başlıyoruz. Sınıflandırıcı için yeni bir fonksiyon yaratacağız Sınıflandırmak . Öğe listesini sınıflandırmak istediğimiz öğeyi girdi olarak alacaktır ve k en yakın komşuların sayısı.
Eğer k veri kümesinin uzunluğundan büyükse, veri kümesindeki toplam öğe sayısından daha fazla en yakın komşuya sahip olamayacağımız için sınıflandırmaya devam etmiyoruz. (alternatif olarak k'yi şu şekilde ayarlayabiliriz: öğeler hata mesajı döndürmek yerine uzunluk)

if(k > len(Items)): # k is larger than list # length abort return 'k larger than list length';

Sınıflandırılacak öğe ile eğitim setindeki tüm öğeler arasındaki mesafeyi sonunda hesaplamak istiyoruz. k en kısa mesafeler. Mevcut en yakın komşuları tutmak için adı verilen bir liste kullanırız. komşular . Her bir öğe, biri sınıflandırılacak öğeye olan uzaklık, diğeri ise komşunun bulunduğu sınıf için en az iki değer tutar. Uzaklığı genelleştirilmiş Öklid formülüyle hesaplayacağız (örneğin, N boyutlar). Daha sonra çoğu zaman görünen sınıfı seçeceğiz. komşular ve bu bizim seçimimiz olacak. Kodda:

Java'da karşılaştırılabilir arayüz

Python3

def Classify(nItem k Items): if(k > len(Items)): # k is larger than list # length abort return 'k larger than list length'; # Hold nearest neighbors. # First item is distance  # second class neighbors = []; for item in Items: # Find Euclidean Distance distance = EuclideanDistance(nItem item); # Update neighbors either adding # the current item in neighbors  # or not. neighbors = UpdateNeighbors(neighbors item distance k); # Count the number of each # class in neighbors count = CalculateNeighborsClass(neighbors k); # Find the max in count aka the # class with the most appearances. return FindMax(count);

Uygulamamız gereken harici işlevler şunlardır: Öklid Mesafesi GüncellemeKomşular HesaplaKomşu Sınıfı Ve FindMax .

Öklid Mesafesini Bulma

İki x ve y vektörü için genelleştirilmiş Öklid formülü şudur:

distance = sqrt{(x_{1}-y_{1})^2 + (x_{2}-y_{2})^2 + ... + (x_{n}-y_{n})^2}

Kodda:

Python3

def EuclideanDistance(x y): # The sum of the squared  # differences of the elements S = 0; for key in x.keys(): S += math.pow(x[key]-y[key] 2); # The square root of the sum return math.sqrt(S);

Komşular Güncelleniyor

Komşu listemiz var (en fazla uzunlukta olmalı) k ) ve listeye belirli bir mesafeye sahip bir öğe eklemek istiyoruz. İlk önce olup olmadığını kontrol edeceğiz komşular uzunluğuna sahip olmak k . Daha azsa, mesafeye bakılmaksızın öğeyi ona ekleriz (listeyi sonuna kadar doldurmamız gerektiğinden) k öğeleri reddetmeye başlamadan önce). Değilse, öğenin listedeki maksimum mesafeye sahip öğeden daha kısa bir mesafeye sahip olup olmadığını kontrol edeceğiz. Eğer bu doğruysa, maksimum mesafeye sahip öğeyi yeni öğeyle değiştireceğiz.
Maksimum mesafe öğesini daha hızlı bulmak için listeyi artan sırada sıralayacağız. Yani listedeki son öğe maksimum mesafeye sahip olacak. Yeni bir ürünle değiştireceğiz ve tekrar sıralayacağız.
Bu süreci hızlandırmak için, listenin tamamını sıralamak zorunda kalmadan listeye yeni öğeler eklediğimiz Ekleme Sıralaması uygulayabiliriz. Bunun kodu oldukça uzundur ve basit olmasına rağmen öğreticiyi çıkmaza sokacaktır.

Python3

def UpdateNeighbors(neighbors item distance k): if(len(neighbors) > distance): # If yes replace the last # element with new item neighbors[-1] = [distance item['Class']]; neighbors = sorted(neighbors); return neighbors;

HesaplaKomşu Sınıfı

Burada en sık görünen sınıfı hesaplayacağız. komşular . Bunun için adı verilen başka bir sözlüğü kullanacağız. saymak anahtarlar burada görünen sınıf adlarıdır komşular . Anahtar yoksa ekleyeceğiz, aksi halde değerini artıracağız.

alt dize java

Python3

def CalculateNeighborsClass(neighbors k): count = {}; for i in range(k): if(neighbors[i][1] not in count): # The class at the ith index # is not in the count dict. # Initialize it to 1. count[neighbors[i][1]] = 1; else: # Found another item of class  # c[i]. Increment its counter. count[neighbors[i][1]] += 1; return count;

FindMax

Bu fonksiyona sözlüğü gireceğiz saymak biz inşa ediyoruz HesaplaKomşu Sınıfı ve maksimum değerini döndüreceğiz.

Python3

def FindMax(countList): # Hold the max maximum = -1; # Hold the classification classification = ''; for key in countList.keys(): if(countList[key] > maximum): maximum = countList[key]; classification = key; return classification maximum;

Çözüm

Böylece bu kNN eğitimi tamamlandı.
Artık yeni öğeleri ayarlayarak sınıflandırabilirsiniz k uygun gördüğünüz gibi. Genellikle k için tek bir sayı kullanılır ancak bu gerekli değildir. Yeni bir öğeyi sınıflandırmak için, öğeyi karakterize eden özellik adlarını ve değerleri içeren anahtarlar içeren bir sözlük oluşturmanız gerekir. Bir sınıflandırma örneği:

'Aslanla kaplan arasındaki fark nedir'

newItem = {'Height' : 1.74 'Weight' : 67 'Age' : 22}; print Classify(newItem 3 items);

Yukarıdaki yaklaşımın tam kodu aşağıda verilmiştir: -

Python3

# Python Program to illustrate # KNN algorithm # For pow and sqrt import math from random import shuffle ###_Reading_### def ReadData(fileName): # Read the file splitting by lines f = open(fileName 'r') lines = f.read().splitlines() f.close() # Split the first line by commas  # remove the first element and save # the rest into a list. The list  # holds the feature names of the  # data set. features = lines[0].split(' ')[:-1] items = [] for i in range(1 len(lines)): line = lines[i].split(' ') itemFeatures = {'Class': line[-1]} for j in range(len(features)): # Get the feature at index j f = features[j] # Convert feature value to float v = float(line[j]) # Add feature value to dict itemFeatures[f] = v items.append(itemFeatures) shuffle(items) return items ###_Auxiliary Function_### def EuclideanDistance(x y): # The sum of the squared differences # of the elements S = 0 for key in x.keys(): S += math.pow(x[key] - y[key] 2) # The square root of the sum return math.sqrt(S) def CalculateNeighborsClass(neighbors k): count = {} for i in range(k): if neighbors[i][1] not in count: # The class at the ith index is # not in the count dict.  # Initialize it to 1. count[neighbors[i][1]] = 1 else: # Found another item of class  # c[i]. Increment its counter. count[neighbors[i][1]] += 1 return count def FindMax(Dict): # Find max in dictionary return  # max value and max index maximum = -1 classification = '' for key in Dict.keys(): if Dict[key] > maximum: maximum = Dict[key] classification = key return (classification maximum) ###_Core Functions_### def Classify(nItem k Items): # Hold nearest neighbours. First item # is distance second class neighbors = [] for item in Items: # Find Euclidean Distance distance = EuclideanDistance(nItem item) # Update neighbors either adding the # current item in neighbors or not. neighbors = UpdateNeighbors(neighbors item distance k) # Count the number of each class  # in neighbors count = CalculateNeighborsClass(neighbors k) # Find the max in count aka the # class with the most appearances return FindMax(count) def UpdateNeighbors(neighbors item distance k ): if len(neighbors) < k: # List is not full add  # new item and sort neighbors.append([distance item['Class']]) neighbors = sorted(neighbors) else: # List is full Check if new  # item should be entered if neighbors[-1][0] > distance: # If yes replace the  # last element with new item neighbors[-1] = [distance item['Class']] neighbors = sorted(neighbors) return neighbors ###_Evaluation Functions_### def K_FoldValidation(K k Items): if K > len(Items): return -1 # The number of correct classifications correct = 0 # The total number of classifications total = len(Items) * (K - 1) # The length of a fold l = int(len(Items) / K) for i in range(K): # Split data into training set # and test set trainingSet = Items[i * l:(i + 1) * l] testSet = Items[:i * l] + Items[(i + 1) * l:] for item in testSet: itemClass = item['Class'] itemFeatures = {} # Get feature values for key in item: if key != 'Class': # If key isn't 'Class' add  # it to itemFeatures itemFeatures[key] = item[key] # Categorize item based on # its feature values guess = Classify(itemFeatures k trainingSet)[0] if guess == itemClass: # Guessed correctly correct += 1 accuracy = correct / float(total) return accuracy def Evaluate(K k items iterations): # Run algorithm the number of # iterations pick average accuracy = 0 for i in range(iterations): shuffle(items) accuracy += K_FoldValidation(K k items) print accuracy / float(iterations) ###_Main_### def main(): items = ReadData('data.txt') Evaluate(5 5 items 100) if __name__ == '__main__': main()

Çıkış:

0.9375

Çıktı makineden makineye değişebilir. Kod bir Katlama Doğrulama işlevi içerir ancak algoritmayla ilgisi yoktur, algoritmanın doğruluğunu hesaplamak için oradadır.

Test Oluştur

TechCodeview

K En Yakın Komşunun Uygulanması