Machine Learning in Action Chapter07 Improving classification with the AdaBoost meta-algorithm

Meta-algorithm: when making an important decision, people usually consider the opinions of multiple experts rather than relying on just one person.

A meta-algorithm is a way of combining other algorithms.
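
A minimal sketch of this idea, assuming decision stumps as the weak learner; the NumPy implementation and names below are illustrative, not the book's code:

```python
import numpy as np

def stump_predict(X, dim, thresh, ineq):
    """Predict +1/-1 by thresholding a single feature."""
    pred = np.ones(X.shape[0])
    mask = X[:, dim] <= thresh if ineq == 'lt' else X[:, dim] > thresh
    pred[mask] = -1.0
    return pred

def build_stump(X, y, w):
    """Pick the stump with the lowest weighted error."""
    best, best_err = None, np.inf
    for dim in range(X.shape[1]):
        for thresh in np.unique(X[:, dim]):
            for ineq in ('lt', 'gt'):
                pred = stump_predict(X, dim, thresh, ineq)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (dim, thresh, ineq)
    return best, best_err

def adaboost_train(X, y, num_rounds=10):
    """Combine weak stumps; reweight the data so later stumps focus
    on examples earlier stumps got wrong. y must be in {-1, +1}."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)              # start with uniform weights
    ensemble = []
    for _ in range(num_rounds):
        (dim, thresh, ineq), err = build_stump(X, y, w)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-16))
        pred = stump_predict(X, dim, thresh, ineq)
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified points
        w /= w.sum()
        ensemble.append((alpha, dim, thresh, ineq))
    return ensemble
```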

A general problem that all classifiers run into: the class-imbalance problem.

Machine Learning in Action Chapter05 Logistic regression

General process:

(1) Collect data: use any method.
(2) Prepare data: distance calculations are required, so the data must be numeric; a structured data format is best.
(3) Analyze data: use any method.
(4) Train the algorithm: most of the time is spent here; the goal of training is to find the best regression coefficients for classification.
(5) Test the algorithm: once training is complete, classification is fast.
(6) Use the algorithm: first, input some data and convert it into the corresponding structured numeric form; then, apply a simple regression calculation with the trained regression coefficients to determine which class the input belongs to; after that, further analysis can be done on the output class.
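
A minimal sketch of steps (4) and (6), assuming batch gradient ascent on the log-likelihood, in the spirit of this chapter; the names and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent(X, y, alpha=0.001, num_iters=500):
    """Step (4): find the regression coefficients by gradient ascent.

    X: (m, n) feature matrix (include a column of 1s for the intercept).
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    weights = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ weights)          # predicted probabilities
        error = y - h                     # gradient of the log-likelihood
        weights += alpha * (X.T @ error)  # move uphill
    return weights

def classify(x, weights):
    """Step (6): a simple regression calculation decides the class."""
    return 1 if sigmoid(np.dot(x, weights)) > 0.5 else 0
```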

Machine Learning in Action Chapter04 Classifying with probability theory: naive Bayes

A classification method based on Bayesian decision theory

Naive Bayes is part of Bayesian decision theory.

  • Pros: still effective with small amounts of data; can handle multi-class problems
  • Cons: rather sensitive to how the input data is prepared
  • Applicable data types: nominal data
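
A minimal sketch of the core probability estimates, assuming NumPy word-count feature vectors, two classes, and Laplace smoothing; the function names are illustrative:

```python
import numpy as np

def train_nb(X, y):
    """Estimate log class-conditional word probabilities.

    X: (m, n) matrix of word counts, y: (m,) labels in {0, 1}.
    Adding 1 (Laplace smoothing) avoids zero probabilities.
    """
    p_class1 = np.mean(y)
    counts0 = X[y == 0].sum(axis=0) + 1.0
    counts1 = X[y == 1].sum(axis=0) + 1.0
    log_p0 = np.log(counts0 / counts0.sum())
    log_p1 = np.log(counts1 / counts1.sum())
    return log_p0, log_p1, p_class1

def classify_nb(x, log_p0, log_p1, p_class1):
    """Pick the class with the larger log posterior."""
    score0 = x @ log_p0 + np.log(1.0 - p_class1)
    score1 = x @ log_p1 + np.log(p_class1)
    return int(score1 > score0)
```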

Machine Learning in Action Chapter03 Splitting datasets one feature at a time: decision trees

Constructing decision trees

A classification decision tree model is a tree structure that describes how instances are classified. A decision tree consists of nodes and directed edges. There are two kinds of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute (features); a leaf node represents a class (labels).

Classifying a test instance with a decision tree: starting from the root node, test one feature of the instance and, based on the result, route the instance to the corresponding child node; each child node corresponds to one value of that feature. Test and route the instance recursively in this way until a leaf node is reached, then assign the instance to the class of that leaf node.
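
A minimal sketch of that recursive procedure, assuming the tree is stored as nested dicts in the book's style, where a dict key names the feature tested at a node and a leaf is a plain label; the toy tree is illustrative:

```python
def classify(tree, feat_names, instance):
    """Walk the tree: test one feature per internal node until a leaf.

    tree: nested dict, e.g. {feature: {value: subtree_or_label}}
    feat_names: feature names aligned with the instance's values.
    """
    if not isinstance(tree, dict):      # leaf node: a class label
        return tree
    feature = next(iter(tree))          # feature tested at this node
    value = instance[feat_names.index(feature)]
    return classify(tree[feature][value], feat_names, instance)

# Usage: a toy tree in the style of the book's fish example
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(tree, ['no surfacing', 'flippers'], [1, 1]))  # -> 'yes'
```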

Machine Learning in Action Chapter02 Classifying with k-Nearest Neighbors

KNN overview

The KNN algorithm performs classification by measuring the distances between feature vectors.

  • Pros: high accuracy, insensitive to outliers, no assumptions about the input data
  • Cons: high computational complexity, high space complexity
  • Applicable data types: numeric and nominal values
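
A minimal sketch, assuming Euclidean distance and a majority vote among the k nearest training examples; the names are illustrative, not the book's classify0:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, labels, k=3):
    """Label x by majority vote among its k nearest neighbors."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean
    nearest = np.argsort(dists)[:k]                    # k closest rows
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Usage: tiny toy dataset
X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.1, 0.1]), X, y, k=3))   # -> 'B'
```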

Machine Learning Week11 Application Example Photo OCR

Photo OCR

Problem Description and Pipeline

What is the photo OCR problem?

  • Photo OCR = photo optical character recognition
    • With growth of digital photography, lots of digital pictures
    • One idea which has interested many people is getting computers to understand those photos
    • The photo OCR problem is getting computers to read text in an image
      • Possible applications for this would include
        • Make searching easier (e.g. searching for photos based on words in them)
        • Car navigation
  • OCR of documents is a comparatively easy problem
    • From photos it's really hard

Machine Learning Week10 Large Scale Machine Learning

Gradient Descent with Large Datasets

Learning With Large Datasets

Why large datasets?

  • One of the best ways to get high performance is to take a low-bias algorithm and train it on a lot of data
    • e.g. classification between confusable words
  • We saw that, as long as you feed an algorithm lots of data, the algorithms all perform pretty similarly
  • So it's good to learn with large datasets
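
Since batch gradient descent must scan the whole training set for every update, the usual fix at this scale is stochastic gradient descent; a minimal sketch for linear regression, with illustrative names and learning rate:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, num_epochs=10):
    """Stochastic gradient descent: update on one example at a time,
    so each step costs O(n) rather than O(m*n) as in batch descent."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        for i in rng.permutation(m):       # shuffle each epoch
            err = X[i] @ theta - y[i]
            theta -= alpha * err * X[i]    # gradient of one squared error
    return theta
```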

Machine Learning Week9 Anomaly Detection

Density Estimation

Problem Motivation

  • We have a dataset which contains normal data
    • How we ensure they're normal is up to us
    • In reality it's OK if there are a few which aren't actually normal
  • Using that dataset as a reference point we can see if other examples are anomalous
  • First, using our training dataset we build a model
    • We can access this model using p(x)
      • This asks, "What is the probability that example x is normal?"
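
A minimal sketch of that model, assuming p(x) is a product of independent per-feature Gaussians fit to the normal examples, with an example flagged as anomalous when p(x) drops below a threshold ε; the threshold and data here are illustrative:

```python
import numpy as np

def fit_gaussians(X):
    """Estimate a per-feature mean and variance from normal examples."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """p(x): product of independent per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    dens = coef * np.exp(-((x - mu) ** 2) / (2.0 * var))
    return np.prod(dens)

# Usage: flag x as anomalous when p(x) < epsilon
X_train = np.random.default_rng(0).normal(0.0, 1.0, size=(500, 2))
mu, var = fit_gaussians(X_train)
epsilon = 1e-4
x = np.array([4.0, -4.0])
print("anomaly" if p(x, mu, var) < epsilon else "normal")
```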

Machine Learning Week8 Unsupervised Learning

Clustering

Unsupervised Learning Introduction

  • What is clustering good for
    • Market segmentation - group customers into different market segments
    • Social network analysis - Facebook "smartlists"
    • Organizing computer clusters and data centers for network layout and location
    • Astronomical data analysis - Understanding galaxy formation

Machine Learning Week7 Support Vector Machines

Large Margin Classification

Optimization Objective

An alternative view of logistic regression

  • Begin with logistic regression, see how we can modify it to get the SVM
    • With hθ(x) close to 1, θᵀx must be much larger than 0
    • With hθ(x) close to 0, θᵀx must be much less than 0
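
A minimal sketch of that modification, assuming the course's cost1/cost0 naming: the SVM replaces the smooth logistic per-example costs with flat-then-linear surrogates hinged at z = ±1, where z = θᵀx:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression per-example costs as functions of z = theta^T x
def log_cost_y1(z):
    return -np.log(sigmoid(z))        # small when z >> 0

def log_cost_y0(z):
    return -np.log(1.0 - sigmoid(z))  # small when z << 0

# SVM surrogates: zero beyond the margin, linear inside it
def cost1(z):
    return np.maximum(0.0, 1.0 - z)   # zero once z >= 1

def cost0(z):
    return np.maximum(0.0, 1.0 + z)   # zero once z <= -1

z = np.linspace(-3, 3, 7)
print(np.round(cost1(z), 2))  # [4. 3. 2. 1. 0. 0. 0.]
```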