Machine Learning -- Course Notes
Machine Learning
Machine learning is about automatically finding a function.
- speech recognition
- image recognition
- playing go
- dialogue system
What kind of function are we looking for?
| Task | What it does | Example |
|---|---|---|
| regression | input past data, predict future values | PM2.5 prediction |
| binary classification | input content, output yes/no | RNN: input a sentence, output pos/neg |
| multi-class classification | input content, choose the correct option | CNN: input an image, output its class |
| generation | produce structured, complex output (sentences, images) | generating sentences, generating anime-style images |
How do we tell the machine what we want?
Training data
- supervised learning (labels) (e.g., the correct moves in Go)
- the loss of a function (how good or bad it is)
- reinforcement learning (AlphaGo) (learning on its own)
- unsupervised learning (no labels)
How does the machine find the desired function?
- Constrain the search space of functions: network architecture (RNN/CNN)
- Method for searching for the function: gradient descent (PyTorch, …)
Frontier research
explainable AI, adversarial attack, network compression, …
I. Regression
Stock Market Forecast, Self-driving Car, Recommendation
Linear Regression
1. Case Study
- machine learning: model (function set) -> evaluate each function's goodness -> pick the best function
Step 1: Model
Step 2: Goodness of Function
Step 3: Best Function
Use gradient descent to minimize L(f), i.e., to find the best function.
Consider loss function L(w) with one parameter w:
- Randomly pick an initial value $w^0$
- Compute $\left.\frac{dL}{dw}\right|_{w=w^0}$: if it is positive, decrease $w$; if negative, increase $w$. The step size is controlled by the learning rate $\eta$.
- Update: $w^1=w^0-\eta\left.\frac{dL}{dw}\right|_{w=w^0}$
- Repeat with $\left.\frac{dL}{dw}\right|_{w=w^1}$, and so on. A minimal sketch follows.
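A minimal sketch of this 1-D procedure. The toy loss $L(w)=(w-3)^2$, the starting point, and the learning rate are all illustrative choices, not part of the lecture:

```python
# 1-D gradient descent on a toy loss L(w) = (w - 3)^2 (illustrative choice).
def dL_dw(w):
    return 2 * (w - 3)  # derivative of (w - 3)^2

w = 0.0    # randomly chosen initial w^0
eta = 0.1  # learning rate
for t in range(100):
    w = w - eta * dL_dw(w)  # w^{t+1} = w^t - eta * dL/dw |_{w = w^t}

print(w)  # approaches the minimizer w = 3
```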
How about two parameters?
- Same idea, except ordinary derivatives become partial derivatives.
1. A superscript indexes a complete object. ↩
2. A subscript indexes a component of an object. ↩
Since the resulting error is still fairly large, we introduce higher-degree (polynomial) models.
# Selecting another Model
- $y=b+w_1 x_{cp}+w_2 (x_{cp})^2$
- $y=b+w_1 x_{cp}+w_2 (x_{cp})^2+w_3 (x_{cp})^3$
……
But the results on the testing data are different, so a more complex model is not always better. A made-up illustration follows.
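A hypothetical demonstration of this point; the data, noise level, and degrees are all invented for illustration. Training error keeps falling as the degree grows, while testing error can start rising:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
x_test = rng.uniform(-1, 1, 20)
true_f = lambda x: 1.5 * x + 0.5 * x ** 2  # assumed underlying function
y_train = true_f(x_train) + rng.normal(0, 0.2, 20)
y_test = true_f(x_test) + rng.normal(0, 0.2, 20)

for degree in (1, 2, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)
```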
# Overfitting
In practice, when there are many data points and influencing factors:
Back to step 1:
Back to step 2:
Regularization: redefine the loss function.
$L=\sum_n\left(\hat{y}^n-\left(b+\sum_i w_i x_i\right)\right)^2+\lambda\sum_i (w_i)^2$
Functions with smaller $w_i$ are better.
- Prefer smooth functions, but not too smooth.
-> Select the $\lambda$ that yields the best model (a sketch follows).
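A sketch of gradient descent on the regularized loss above. The function name, shapes, and hyperparameters are illustrative; note the $\lambda$ term does not touch the bias, matching the usual convention:

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, eta=0.001, steps=1000):
    """Gradient descent on L = sum_n (y^n - (b + w.x^n))^2 + lam * sum_i w_i^2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = y - (X @ w + b)                  # \hat{y}^n - (b + sum_i w_i x_i^n)
        grad_w = -2 * X.T @ err + 2 * lam * w  # dL/dw_i, including the regularizer
        grad_b = -2 * err.sum()                # the bias is not regularized
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```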
Conclusion
- There are probably other hidden factors
- Gradient descent
- Overfitting and Regularization
- Validation
2. Basic Concepts
Bias and Variance of Estimator
Ideally the estimator has no bias and also small variance.
Variance
- A simpler model is less influenced by the sampled data
Bias
Bias v.s. Variance
Diagnosis
underfitting
: If your model cannot even fit the training examples, you have large bias. Redesign your model:
- add more features as input
- use a more complex model
overfitting
: If you can fit the training data but the error on testing data is large, you probably have large variance. Remedies:
- more data: very effective, but not always practical
- regularization: add a weight-penalty term, though it may increase bias
# What you should not do
# What you should do
Cross Validation
N-fold Cross Validation
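A minimal sketch of N-fold cross validation. The `train` and `evaluate` callables are hypothetical stand-ins for your own training and error functions:

```python
import numpy as np

def n_fold_cv(X, y, train, evaluate, n_folds=3):
    """Average validation error over n folds; train/evaluate are user-supplied."""
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    errors = []
    for i in range(n_folds):
        val = folds[i]  # held-out fold
        trn = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = train(X[trn], y[trn])
        errors.append(evaluate(model, X[val], y[val]))
    return float(np.mean(errors))  # use this to compare candidate models
```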
3. Gradient Descent
- From step 3 earlier, the general procedure of gradient descent is as follows:
- Visualize
Tips
1. Tuning your learning rates
Set the learning rate $\eta$ carefully
- too large: it may overshoot the minimum
- too small: progress is too slow
Solution:
Popular & Simple Idea: Reduce the learning rate by some factor every few epochs.
- At the beginning, we are far from the destination, so we use a larger learning rate.
- After several epochs, we are close to the destination, so we reduce the learning rate.
- E.g. $\frac{1}{t}$ decay: $\eta^t=\frac{\eta}{\sqrt{t+1}}$
Learning rate cannot be one-size-fits-all
- Giving different parameters different learning rates
Adagrad
Divide the learning rate of each parameter by the root mean square of its previous derivatives
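Since the $\sqrt{t+1}$ factors in $\eta^t$ and $\sigma^t$ cancel, the update simplifies to $w^{t+1}=w^t-\frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\,g^t$. A sketch of one such step; the small `eps` for numerical safety is my addition:

```python
import numpy as np

def adagrad_step(w, grad, sum_sq_grad, eta=0.1, eps=1e-8):
    sum_sq_grad += grad ** 2                           # accumulate (g^i)^2
    w = w - eta * grad / (np.sqrt(sum_sq_grad) + eps)  # per-parameter scaled step
    return w, sum_sq_grad
```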
Summary
Intuitive Reason
A contrast effect: the update reflects how large the current gradient is relative to past gradients.
Comparison between different parameters…
- The best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$
2. Stochastic Gradient Descent
Making the training faster
A comparison: vanilla gradient descent updates once after seeing all examples, while SGD updates after each example (sketched below).
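A sketch of the contrast for linear regression; the learning rate and loop structure are illustrative:

```python
import numpy as np

def batch_gd_step(w, X, y, eta=0.01):
    # One update using the gradient of the total loss over ALL examples.
    grad = -2 * X.T @ (y - X @ w) / len(X)
    return w - eta * grad

def sgd_epoch(w, X, y, eta=0.01):
    # Many updates per epoch: one per example, using only that example's loss.
    for n in np.random.permutation(len(X)):
        grad = -2 * X[n] * (y[n] - X[n] @ w)
        w = w - eta * grad
    return w
```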
3. Feature Scaling
Make different features have the same scaling
How to do it?
For each dimension i :
- mean: $m_i$
- standard deviation: $\sigma_i$
- $x_i^r \leftarrow \frac{x_i^r-m_i}{\sigma_i}$
The means of all dimensions are 0, and the variances are all 1
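A sketch of this standardization, assuming `X` holds examples as rows (index $r$) and dimensions as columns (index $i$), with no constant features:

```python
import numpy as np

def feature_scale(X):
    m = X.mean(axis=0)      # m_i: per-dimension mean
    sigma = X.std(axis=0)   # sigma_i: per-dimension standard deviation
    return (X - m) / sigma  # zero mean, unit variance in every dimension
```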
Theory
- Taylor Series
- Multivariable Taylor Series
- More Limitation of Gradient Descent
II. Classification
- Credit Scoring
- Medical Diagnosis
- Handwritten character recognition
- Face recognition
……
1. Case Study
Description: feature vector -> function -> class
- Total
- HP
- Attack
- Defence
- SP Atk
- SP Def
- Speed
How to do Classification
- Training data for Classification
Doing it with regression:
- Ideal Alternatives
- Generative Model : $P(x)=P(x|C_1)P(C_1)+P(x|C_2)P(C_2)$
- Prior
Gaussian Distribution
- Maximum Likelihood
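The maximum-likelihood Gaussian has a closed form: the sample mean and the (biased, $1/N$) sample covariance. A sketch, assuming `X` holds one class's feature vectors as rows:

```python
import numpy as np

def gaussian_mle(X):
    mu = X.mean(axis=0)              # mu* = (1/N) sum_n x^n
    diff = X - mu
    sigma = diff.T @ diff / len(X)   # Sigma* = (1/N) sum_n (x^n - mu*)(x^n - mu*)^T
    return mu, sigma
```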
Results
The results are not good :(
The boundary is linear -> linear model
2. Probabilistic Models
Three Steps
- Function Set (Model):
Goodness of a function
- The mean $\mu$ and covariance $\Sigma$ that maximize the likelihood (the probability of generating the data)
Find the best function: easy
Probability Distribution
You can use whatever distribution you like.
Posterior Probability
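Expanding the posterior with Bayes' rule (the standard derivation) shows where the sigmoid comes from:

$$
P(C_1\mid x)=\frac{P(x\mid C_1)P(C_1)}{P(x\mid C_1)P(C_1)+P(x\mid C_2)P(C_2)}=\frac{1}{1+e^{-z}}=\sigma(z),\quad z=\ln\frac{P(x\mid C_1)P(C_1)}{P(x\mid C_2)P(C_2)}
$$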
III. Logistic Regression
Step 1
See the analysis above.
Step 2
Goodness of a Function
(A chunk of mathematical derivation is omitted here.)
- This yields a comparison of Logistic Regression vs. Linear Regression
Step 3
Find the best function
(Derivation omitted; the resulting gradient update has the same form as for linear regression, sketched below.)
-> The larger the difference, the larger the update.
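A sketch of cross-entropy training, where the update per weight is $w_i \leftarrow w_i + \eta\sum_n(\hat{y}^n - f_{w,b}(x^n))\,x_i^n$; shapes, step count, and the $1/N$ averaging are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gd(X, y, eta=0.1, steps=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        f = sigmoid(X @ w + b)         # f_{w,b}(x^n) for every example
        err = y - f                    # larger difference -> larger update
        w += eta * X.T @ err / len(X)
        b += eta * err.sum() / len(X)
    return w, b
```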
An alternative approach
Logistic Regression + Square Error
- Cross Entropy v.s. Square Error
Discriminative v.s. Generative
- Discriminative: the logistic regression approach
- Generative: using a Gaussian to describe the probability distribution
Their model & function set are the same.
The same model (function set), but a different function is selected by the same training data.
Some argue the discriminative approach is better.
That said:
- Benefit of generative model
- With the assumption of probability distribution, less training data is needed
- With the assumption of probability distribution, more robust to the noise
- Priors and class-dependent probabilities can be estimated from different sources
Multi-class Classification
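For $C$ classes, the standard recipe passes each class score $z_i$ through softmax, so the outputs are positive, sum to 1, and can be read as posteriors:

$$
y_i = P(C_i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
$$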
Limitation of Logistic Regression
How to avoid?
Feature Transformation
- Not always easy to find a good transformation
Cascading logistic regression models
Neural Network
-> Deep Learning!!!
IV. Deep Learning
Deep learning attracts lots of attention.
1. Ups and downs of Deep Learning
(Omitted.)
2. Three Steps for Deep Learning
- Step 1: define a set of function
- Step 2: goodness of function
- Step 3: pick the best function
Neural Network
Different connections lead to different network structures
Network parameters $\theta$: all the weights and biases in the “neurons”
The most common connection pattern:
- Fully Connect Feedforward Network
Given network structure, define a function set
Deep = Many hidden layers
Matrix Operation
- Using parallel computing techniques to speed up matrix operation
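A sketch of the forward pass written purely as matrix operations, where each layer computes $a' = \sigma(Wa + b)$; layer sizes and the choice of sigmoid are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, weights, biases):
    # weights/biases: one (W, b) pair per layer, ordered input -> output.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # one layer: a' = sigma(W a + b)
    return a
```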
Output Layer
You need to decide the network structure so that a good function is in your function set.
Q: How many layers?
A: Trial and error + intuition
Q: Can we design the network structure ourselves (e.g., connect neurons in arbitrary ways)?
A: Convolutional Neural Network(CNN)
Total Loss
- Find a function in function set that minimizes total loss L
- Find the network parameters $\theta^*$ that minimize the total loss L
total loss -> gradient descent (a minimal PyTorch sketch follows)
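A minimal PyTorch sketch of "total loss -> gradient descent"; the network shape, random data, and hyperparameters are placeholders, not the course's setup:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 16), torch.nn.Sigmoid(), torch.nn.Linear(16, 2)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

X = torch.randn(100, 2)          # placeholder inputs
y = torch.randint(0, 2, (100,))  # placeholder class labels

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # total loss L over the training set
    loss.backward()              # gradients of L w.r.t. theta
    optimizer.step()             # theta <- theta - eta * gradient
```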
Why deeper?
……