Natural Language Processing Lecture03

The First Book

Data generated from experience.

  • Advantage: comprehensive
  • Disadvantages: weak abstraction, large and redundant data, hard to modify and maintain

KNN (K-Nearest Neighbors)

If the majority of the k most similar samples to a given sample in feature space (i.e., its nearest neighbors) belong to a certain class, then the sample is assigned to that class as well.
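A minimal sketch of this rule with scikit-learn's KNeighborsClassifier; the toy points and labels below are made up for illustration:

from sklearn.neighbors import KNeighborsClassifier

# toy 2-D points and their classes (made-up data for illustration)
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# the majority of the 3 nearest neighbours of (2, 2) belong to class 0
print(knn.predict([[2, 2]]))  # [0]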

The Second Book

The Decision Tree model. It raises the level of abstraction of the problem and needs far less data than The First Book.

The key is to find suitable split conditions: the most significant feature is placed first and carries the largest weight.

Information Entropy

\[Entropy = -\sum_{i=1}^n Pr(x_i)\log\big(Pr(x_i)\big)\]

import numpy as np
from collections import Counter

def entropy(elements):
    """Measure how disordered a collection of elements is."""
    counter = Counter(elements)
    probs = [counter[c] / len(elements) for c in set(elements)]
    return -sum(p * np.log(p) for p in probs)

entropy([1, 1, 1, 1]) # 0.0
entropy([1, 1, 1, 0]) # 0.5623351446188083
entropy([1, 3, 5, 7]) # 1.3862943611198906

How does a decision tree decide which feature to split on?

import pandas as pd

mock_data = {
    'gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M'],
    'income': ['+10', '-10', '+10', '+10', '+10', '+10', '-10'],
    'family_number': [1, 1, 2, 1, 1, 1, 2],
    'bought': [1, 1, 1, 0, 0, 0, 1],
}

dataset = pd.DataFrame.from_dict(mock_data)

# split by gender:
entropy([1, 1, 1, 0]) + entropy([0, 0, 1])

# split by income:
entropy([1, 1, 0, 0, 0]) + entropy([1, 1])

# split by family_number:
entropy([1, 1, 0, 0, 0]) + entropy([1, 1])

# split by some ideal feature (for comparison):
entropy([1, 1, 1, 1]) + entropy([0, 0, 0])  # nice split

When choosing the decision process and the order of decisions, a decision tree actually follows the principle of minimizing the entropy of the data after splitting on the chosen feature.

from icecream import ic

def find_the_optimal_spilter(training_data: pd.DataFrame, target: str) -> str:
    x_fields = set(training_data.columns.tolist()) - {target}
    ic(x_fields)

    spliter = None
    min_entropy = float('inf')

    for f in x_fields:
        ic(f)
        values = set(training_data[f])
        ic(values)
        for v in values:
            ic(v)
            # split by the current feature and one of its values
            sub_spliter_1 = training_data[training_data[f] == v][target].tolist()
            ic(sub_spliter_1)

            entropy_1 = entropy(sub_spliter_1)
            ic(entropy_1)

            sub_spliter_2 = training_data[training_data[f] != v][target].tolist()
            ic(sub_spliter_2)

            entropy_2 = entropy(sub_spliter_2)
            ic(entropy_2)

            entropy_v = entropy_1 + entropy_2
            ic(entropy_v)

            if entropy_v <= min_entropy:
                min_entropy = entropy_v
                spliter = (f, v)

    print('spliter is: {}'.format(spliter))
    print('the min entropy is: {}'.format(min_entropy))

    return spliter

Train the model with scikit-learn.
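A minimal sketch of the same idea with scikit-learn's DecisionTreeClassifier, reusing the dataset DataFrame built above; the integer encoding of gender and income is an assumption added here because the classifier expects numeric features:

from sklearn.tree import DecisionTreeClassifier

# encode the two string columns as integers (assumed mapping, for illustration only)
X = dataset[['gender', 'income', 'family_number']].copy()
X['gender'] = X['gender'].map({'F': 0, 'M': 1})
X['income'] = X['income'].map({'-10': 0, '+10': 1})
y = dataset['bought']

tree = DecisionTreeClassifier(criterion='entropy')  # split by information entropy, as above
tree.fit(X, y)

print(tree.predict(X))  # predictions on the training data itself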

The Third Book

Naive Bayesian Classification (Naive Bayes)

  • Bayes' theorem
  • The chain rule of probability
  • The relations between the variables are forcibly treated as independent (the three points are combined below)
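Putting the three points together (a sketch of the standard derivation, not something spelled out in the lecture):

\[Pr(y\mid x_1,\dots,x_n)=\frac{Pr(y)\,Pr(x_1,\dots,x_n\mid y)}{Pr(x_1,\dots,x_n)}\]

\[Pr(x_1,\dots,x_n\mid y)=Pr(x_1\mid y)\,Pr(x_2\mid x_1,y)\cdots Pr(x_n\mid x_1,\dots,x_{n-1},y)\]

\[\text{assume } Pr(x_i\mid x_1,\dots,x_{i-1},y)=Pr(x_i\mid y)\ \Rightarrow\ Pr(y\mid x_1,\dots,x_n)\propto Pr(y)\prod_{i=1}^n Pr(x_i\mid y)\]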

The Fourth Book

Neural Network

  • Buying food in the canteen without asking the total price; instead ask: how much does each dish cost?
  • First meal: eggplant×2 + watermelon×1 + steamed bun×1: 7.8 yuan
  • Second meal: eggplant×1 + watermelon×1 + steamed bun×2: 6.5 yuan
  • Third meal: corn×1 + spinach×2 + steamed bun×1: 5.6 yuan
  • ……

\[y=\sum W_i\times X_i\]
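A minimal sketch of learning the per-dish prices \(W_i\) from meal totals by gradient descent; the dish set, the "true" prices, and the generated meal records are all made up for illustration:

import numpy as np

np.random.seed(0)

# assumed "true" per-dish prices in yuan (made up for illustration):
# eggplant, watermelon, steamed bun, corn, spinach
true_prices = np.array([2.2, 1.8, 0.8, 1.2, 1.0])

# each row is one meal: how many portions of each dish were bought
meals = np.random.randint(0, 3, size=(50, 5))
totals = meals @ true_prices          # total paid per meal, i.e. y = sum(W_i * X_i)

# learn the weights W_i by gradient descent on the squared error
W = np.zeros(5)
lr = 0.01
for _ in range(2000):
    pred = meals @ W
    grad = 2 * meals.T @ (pred - totals) / len(totals)
    W -= lr * grad

print(W.round(2))                     # should come out close to true_prices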

K-Means

Algorithm description:

Select K points as the initial centroids
repeat
    assign each point to its nearest centroid, forming K clusters
    recompute the centroid of each cluster
until the clusters no longer change or the maximum number of iterations is reached

Demo:

import random
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from collections import defaultdict

X1 = [random.randint(0, 100) for _ in range(100)]
X2 = [random.randint(0, 100) for _ in range(100)]

training_data = [[x1, x2] for x1, x2 in zip(X1, X2)]

cluster = KMeans(n_clusters=6, max_iter=500)

cluster.fit(training_data)

cluster.cluster_centers_  # the 6 centroids
cluster.labels_  # the cluster assigned to each point
"""
array([3, 0, 5, 3, 5, 2, 3, 1, 3, 3, 2, 4, 0, 4, 5, 0, 3, 4, 2, 3, 3, 2, 0, 1, 4, 3, 3, 0, 5, 4, 4, 3, 5, 3, 3, 0, 0, 1, 3, 3, 0, 0, 0, 2, 5, 3, 2, 4, 2, 1, 0, 1, 4, 1, 3, 1, 2, 0, 5, 2, 5, 1, 4, 2, 0, 5,
       3, 1, 0, 3, 1, 5, 3, 1, 0, 2, 5, 2, 4, 2, 1, 0, 3, 2, 1, 0, 2, 3, 5, 3, 2, 1, 2, 4, 4, 3, 2, 0, 3, 0])
"""
centers = defaultdict(list)
for label, location in zip(cluster.labels_, training_data):
    centers[label].append(location)

color = ['red', 'green', 'grey', 'black', 'yellow', 'orange']

for i, c in enumerate(centers):
    for location in centers[c]:
        plt.scatter(*location, c=color[i])

for center in cluster.cluster_centers_:
    plt.scatter(*center, s=100)

Computational complexity: \(O(I\times N\times K \times d)\), which is too high

  • I: number of iterations
  • N: number of points
  • K: number of clusters
  • d: dimensionality of each point (the cost of computing one distance)

How to evaluate a model

  • Accuracy
  • Precision
  • Recall
  • ROC / AUC
  • F1_score, F2_Score
  • Loss Function
  • MSE

\[Precision=\frac{t_p}{t_p+f_p}\]

\[Recall=\frac{t_p}{t_p+f_n}\]

Note: see Bayesian conditional probability.

\[F1=2\times (precision\times recall) / (precision+recall)\]

\[MSE=\frac 1n\sum_{i=1}^n(Y_i-\hat{Y_i})^2\]
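A minimal sketch computing precision, recall, F1 and MSE by hand from made-up binary labels and predictions (scikit-learn's metrics functions would give the same numbers):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # made-up ground truth
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # made-up predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mse = np.mean((y_true - y_pred) ** 2)

print(precision, recall, f1, mse)   # 0.75 0.75 0.75 0.25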

Overfitting vs Underfitting

  • Overfitting: high accuracy on already-seen data, poor accuracy on unseen data
    • Model is too complex
    • Too little data
    • Data distribution is not representative
    • Coefficients are too large
  • Underfitting: overall prediction accuracy is low
    • Model is too simple

Eager Learning vs Lazy Learning

  • Lazy Learning
    • The algorithm does not train a single overall model on the whole training set; for every new test point, the target function is predicted from that point together with the training data
    • "Passively" waits for new test data to arrive before starting to predict on it
    • Examples: KNN, LWR (locally weighted regression)
  • Eager Learning
    • The model is computed first and then used for prediction; new test data only needs to be plugged into the model to obtain a result
    • Examples: linear regression, logistic regression, SVM

Outliers

Abnormal records in a dataset that deviate from the usual values; possibly noise data.

In numpy, percentile (np.percentile) can be used to clean the data.
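A minimal sketch of percentile-based cleaning with np.percentile; the 1.5×IQR cutoff is a common convention, assumed here rather than taken from the notes:

import numpy as np

data = np.array([3, 5, 4, 6, 5, 4, 5, 100, 4, 6, -50, 5])  # made-up data with two outliers

q1, q3 = np.percentile(data, [25, 75])     # 25th and 75th percentiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = data[(data >= lower) & (data <= upper)]
print(cleaned)   # the 100 and -50 are dropped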

Bias vs Variance

  • Bias
    • an error from erroneous assumptions in the learning algorithm.
    • high bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)
  • Variance
    • an error from sensitivity to small fluctuations in the training set.
    • high variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting)

Train, validation, test

  • Train: training data, used to build the model
  • Validation: validation data, used to compare how well different models perform
  • Test: test data; the difference from the previous two is that train and validation come from the same subjects, while test uses data from different subjects to verify the model's stability
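A minimal sketch that produces the three sets by applying scikit-learn's train_test_split twice; the 60/20/20 ratio and the random data are assumptions for illustration (a truly cross-subject test set would additionally require grouping the split by subject, e.g. with GroupShuffleSplit):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)         # made-up features
y = np.random.randint(0, 2, 100)   # made-up labels

# carve off 20% as the test set, then split the rest 75/25 -> 60/20/20 overall
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20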
