Machine Learning

Machine learning is about automatically finding a function.

  • speech recognition
  • image recognition
  • playing go
  • dialogue system

What kind of function are we looking for?

| Task | Description | Example |
| --- | --- | --- |
| Regression | Input past data, predict future values | PM2.5 forecasting |
| Binary classification | Input something, output yes/no | RNN: input a sentence, output pos/neg |
| Multi-class classification | Input something, pick the correct class | CNN: input an image, output its category |
| Generation | Produce structured, complex output (text, images) | Generating sentences; generating anime images |

How do we tell the machine what we want?

Training data

  • supervised learning (labels, e.g., the correct next move in Go)
  • the loss of a function (how good or bad it is)
  • reinforcement learning (AlphaGo; learning on its own)
  • unsupervised learning (no labels)

How does the machine find the desired function?

  1. Specify the search space of functions (network architecture: RNN/CNN)
  2. Specify the search method (gradient descent; toolkits such as PyTorch, ...)

Frontier research

explainable AI, adversarial attack, network compression, ...

I. Regression

Stock Market Forecast, Self-driving Car, Recommendation

Linear Regression

1. Case Learning

  • machine learning: model (function set) -> goodness of a function -> the best function

Step 1: Model

Step 2: Goodness of Function

Step 3: Best Function

Use gradient descent to find the minimum of L(f), i.e., the best function.

  • Consider loss function L(w) with one parameter w:

    • Randomly pick an initial value $w^0$
    • Compute $\frac{dL}{dw}\Big|_{w=w^0}$: if it is positive, move left; if negative, move right, toward the minimum (learning rate: $\eta$)
    • Compute $w^1=w^0-\eta\frac{dL}{dw}\Big|_{w=w^0}$
    • Repeat with $\frac{dL}{dw}\Big|_{w=w^1}$, and so on (see the sketch after this list)
  • How about two parameters?

    • Same idea, but the derivatives become partial derivatives

    *(figure: gradient descent with two parameters)*
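
A minimal sketch of the one-parameter update above; the loss $L(w)=(w-3)^2$, the learning rate, and the initial value are illustrative choices, not from the lecture:

```python
# One-parameter gradient descent; L(w) = (w - 3)^2 is an illustrative loss.
def dL_dw(w):
    return 2 * (w - 3)  # dL/dw

eta = 0.1   # learning rate
w = 0.0     # w^0: the randomly picked starting point
for _ in range(100):
    w = w - eta * dL_dw(w)  # w^{t+1} = w^t - eta * dL/dw |_{w = w^t}

print(w)  # approaches the minimum at w = 3
```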

*(figure: notation for the linear model)*

1. A superscript indexes a complete object (an example).
2. A subscript indexes a component of that object.

Since the resulting error is still large, higher-order terms are introduced.

# Selecting another Model

  • $y=b+w_1\cdot x_{cp}+w_2\cdot (x_{cp})^2$
  • $y=b+w_1\cdot x_{cp}+w_2\cdot (x_{cp})^2+w_3\cdot (x_{cp})^3$

……

*(figure: training results for models of increasing degree)*

However, the results on the testing data differ, so a more complex model is not always better (a sketch of this experiment follows).
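
A sketch of the effect: fitting polynomials of increasing degree makes the training error only go down. The data here is synthetic; the CP-value dataset from the lecture is not reproduced:

```python
import numpy as np

# Synthetic data standing in for the lecture's dataset.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 20)  # roughly linear, plus noise

for degree in (1, 2, 3, 5, 9):
    coeffs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_err)  # training error decreases as the degree grows
```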

# Overfitting

*(figure: overfitting: training vs. testing error)*

In practice there are many more data points and influencing factors:

Back to step1:

*(figure: a more complex model with more input features)*

Back to step 2:

Regularization: redefine the loss function.

  • $L=\sum_n\left(\hat{y}^n-\left(b+\sum_i w_i x_i\right)\right)^2+\lambda\sum_i (w_i)^2$

  • Functions with smaller $w_i$ are better

  • Prefer smooth functions, but not too smooth.

-> Select the $\lambda$ that gives the best model (a sketch follows).
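
A sketch of the regularized loss above; the data, weights, and candidate $\lambda$ values are placeholders:

```python
import numpy as np

def regularized_loss(w, b, X, y, lam):
    """L = sum_n (y_hat^n - (b + sum_i w_i x_i))^2 + lambda * sum_i w_i^2.
    The bias b is not regularized: it does not affect smoothness."""
    residuals = y - (X @ w + b)
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Model selection: try several lambdas and keep the one with the best
# validation error, e.g. for lam in (0, 1, 10, 100, 1000): ...
```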

Conclusion

  • There are probably other hidden factors
  • Gradient descent
  • Overfitting and Regularization
  • Validation

2. Basic Concept

Bias and Variance of Estimator

Ideally, an estimator has no bias and small variance.

Variance

*(figure: variance of simpler vs. more complex models)*

  • A simpler model is less influenced by the sampled data

Bias

*(figure: bias of simpler vs. more complex models)*

Bias v.s. Variance

*(figure: bias vs. variance as model complexity grows)*

Diagnosis

  • underfitting: if your model cannot even fit the training examples, then you have large bias

    • redesign your model:
      • add more features as input
      • use a more complex model
  • overfitting: if you can fit the training data but the error on the testing data is large, then you probably have large variance

    • more data: very effective, but not always practical
    • regularization: add a penalty term on the weights, but this may increase bias

# What you should not do

*(figure: choosing a model using the public testing set)*

# What you should do

  • Cross Validation

    *(figure: cross validation)*

  • N-fold Cross Validation

*(figure: N-fold cross validation)*
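
A sketch of N-fold cross validation; `fit` and `evaluate` are hypothetical callbacks for training a candidate model and measuring its error:

```python
import numpy as np

def n_fold_cv(X, y, n_folds, fit, evaluate):
    """Each fold serves once as the validation set; the rest is training data.
    Returns the average validation error, used to compare candidate models."""
    indices = np.random.permutation(len(X))
    folds = np.array_split(indices, n_folds)
    errors = []
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train], y[train])
        errors.append(evaluate(model, X[val], y[val]))
    return np.mean(errors)
```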

3. Gradient Descent

  • From Step 3 above, the overall procedure of gradient descent is as follows:

*(figure: gradient descent procedure)*

  • Visualization

*(figure: gradient descent visualized)*

Tips

1. Tuning your learning rates

Set the learning rate $\eta$ carefully

  • Too large: it may overshoot the minimum
  • Too small: progress is too slow

Solutions:

  • Popular & Simple Idea: Reduce the learning rate by some facotor every few epochs.

    • At the beining, we are far from the destination, so we use larger learning rate.
    • After several epochs, we are close to the destination, so we reduce the learning rate.
    • E.g. $\frac{1}{t}$ decay: $\eta^t=\frac{\eta}{\sqrt{t+1}}$
  • Learning rate cannot be one-size-fits-all

    • Give different parameters different learning rates

Adagrad

Divide the learning rate of each parameter by the root mean square of its previous derivatives

*(figures: the Adagrad update rule)*
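
A sketch of the Adagrad rule for one parameter, in the simplified form $w^{t+1}=w^t-\frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\,g^t$ (the $\sqrt{t+1}$ of the $1/t$ decay cancels against the root mean square); `grad_fn` and the constants are placeholders:

```python
import numpy as np

def adagrad(grad_fn, w0, eta=1.0, steps=100, eps=1e-8):
    w = w0
    sum_sq = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        sum_sq += g ** 2  # accumulate squared past derivatives
        w -= eta / (np.sqrt(sum_sq) + eps) * g  # larger past gradients -> smaller step
    return w

print(adagrad(lambda w: 2 * (w - 3), w0=0.0))  # same toy loss as before
```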

Summary

*(figure: Adagrad summary)*

Intuitive Reason

The intuition is a contrast effect: the update depends on how large the current gradient is relative to previous gradients.

Comparison between different parameters…

  • The best step size is $\frac{|\text{First derivative}|}{\text{Second derivative}}$

*(figure: best step size and the second derivative)*

2. Stochastic Gradient Descent

Making the training faster

A comparison (sketched in code below):

*(figure: gradient descent vs. stochastic gradient descent)*
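
A sketch contrasting the two for linear regression; `X`, `y`, and `eta` are placeholders. Batch gradient descent sums the gradient over all examples before each update; SGD updates after every single example:

```python
import numpy as np

def batch_gd_step(w, X, y, eta):
    # One update per pass: gradient of sum_n (y^n - w . x^n)^2 over ALL examples.
    grad = -2 * X.T @ (y - X @ w)
    return w - eta * grad

def sgd_epoch(w, X, y, eta):
    # One update per example: gradient of the loss of example n only.
    for n in np.random.permutation(len(X)):
        grad = -2 * (y[n] - X[n] @ w) * X[n]
        w = w - eta * grad
    return w
```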

3. Feature Scaling

Make different features have the same scaling

*(figure: feature scaling)*

How to do it?

  • For each dimension i:

    • mean: $m_i$
    • standard deviation: $\sigma_i$
    • $x_i^r \leftarrow \frac{x_i^r-m_i}{\sigma_i}$
  • The means of all dimensions are then 0, and the variances are all 1 (see the sketch below)
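
A sketch of the standardization above; `X` is a placeholder matrix with one example $x^r$ per row and one dimension $i$ per column:

```python
import numpy as np

def feature_scaling(X):
    m = X.mean(axis=0)       # m_i: mean of each dimension
    sigma = X.std(axis=0)    # sigma_i: standard deviation of each dimension
    return (X - m) / sigma   # every dimension now has mean 0 and variance 1
```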

Theory

  • Taylor Series

*(figure: Taylor series)*

  • Multivariable Taylor Series

*(figure: multivariable Taylor series)*
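
The reasoning behind gradient descent, sketched from the Taylor-series slides: near a point $(a,b)$,

$$L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$$

Minimizing this linear approximation inside a small circle around $(a,b)$ means moving exactly opposite to the gradient, which is the gradient descent update. The approximation, and hence the update, is only reliable when the learning rate (the radius of the circle) is small enough.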

  • More Limitations of Gradient Descent

II. Classification

  • Credit Scoring
  • Medical Diagnosis
  • Handwritten character recognition
  • Face recognition

……

1. Case Learning

Setup: input a feature vector (e.g., the attributes below) -> function -> output the class.

  • Total
  • HP
  • Attack
  • Defence
  • SP Atk
  • SP Def
  • Speed

How to do Classification

  • Training data for Classification

Doing it with regression:

*(figure: using regression for classification)*

  • Ideal Alternatives

*(figure: ideal alternatives)*

  • Generative Model : $P(x)=P(x|C_1)P(C_1)+P(x|C_2)P(C_2)$

*(figure: generative model)*

  • Prior

Gaussian Distribution

*(figure: Gaussian distribution)*

  • Maximum Likelihood

*(figure: maximum likelihood)*
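
A sketch of the maximum likelihood solution: for a Gaussian, the $\mu^*$ and $\Sigma^*$ that maximize the likelihood are the sample mean and sample covariance. `X` is a placeholder array holding the examples of one class:

```python
import numpy as np

def gaussian_mle(X):
    mu = X.mean(axis=0)                      # mu* = (1/N) sum_n x^n
    centered = X - mu
    sigma = centered.T @ centered / len(X)   # Sigma* = (1/N) sum_n (x^n - mu)(x^n - mu)^T
    return mu, sigma
```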

Results

*(figure: classification results)*

  • The results are not good :(

  • The boundary is linear -> a linear model

2. Probabilistic Model

Three Steps

  • Function Set (Model):

*(figure: function set of the probabilistic model)*

  • Goodness of a function

    • The mean $\mu$ and covariance $\Sigma$ that maximize the likelihood (the probability of generating the data)
  • Find the best function: easy

Probability Distribution

You can choose whatever distribution you like.

Posterior Probability

*(figure: posterior probability)*
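
The key identity on this slide: the posterior collapses into a sigmoid,

$$P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}=\frac{1}{1+e^{-z}}=\sigma(z),\qquad z=\ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$$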

III. Logistic Regression

*(figure: logistic regression)*

Step 1

See the analysis above: the function set is $f_{w,b}(x)=\sigma\left(\sum_i w_i x_i+b\right)$.

Step 2

Goodness of a Function

Skipping the detailed derivation: maximizing the likelihood turns out to be equivalent to minimizing the cross entropy between the targets and the outputs.

  • This yields a comparison between logistic regression and linear regression:

*(figure: logistic regression vs. linear regression)*

Step 3

Find the best function

Skipping the detailed derivation, the update has the same form as for linear regression:

-> The larger the difference, the larger the update (see the sketch below).

*(figure: logistic regression update rule)*
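
A sketch of one gradient step, $w_i \leftarrow w_i-\eta\sum_n\big(\sigma(w\cdot x^n+b)-\hat{y}^n\big)x_i^n$; `X`, labels `y` in {0, 1}, and `eta` are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(w, b, X, y, eta):
    error = sigmoid(X @ w + b) - y   # larger difference -> larger update
    w = w - eta * X.T @ error        # w_i -= eta * sum_n error^n * x_i^n
    b = b - eta * error.sum()
    return w, b
```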

An alternative approach

Logistic Regression + Square Error

  • Cross Entropy v.s. Square Error

*(figure: cross entropy vs. square error)*

  • Discriminative v.s. Generative

    • Discriminative: the logistic regression approach
    • Generative: use Gaussians to model how the data is generated
  • They have the same model & function set

The same model (function set), but a different function is selected by the same training data.

*(figure: discriminative vs. generative results)*

Some argue that the discriminative model usually performs better.

Even so:

  • Benefits of the generative model
    • With the assumption of a probability distribution, less training data is needed
    • With the assumption of a probability distribution, it is more robust to noise
    • Priors and class-dependent probabilities can be estimated from different sources

Multi-class Classification

*(figure: multi-class classification with softmax)*
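
A sketch of the softmax at the heart of multi-class classification: $y_i=\frac{e^{z_i}}{\sum_j e^{z_j}}$, so the outputs lie in $(0,1)$, sum to 1, and can be read as class probabilities:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # approximately [0.88, 0.12, 0.00]
```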

Limitations of Logistic Regression

How to avoid this?

  • Feature Transformation

    • Not always easy to find a good transformation
  • Cascading logistic regression models

Neural Network

*(figure: cascading logistic regression models form a neural network)*

->Deep Learning !!!

IV. Deep Learning

Deep learning attracts lots of attention.

1. Ups and Downs of Deep Learning

2. Three Steps for Deep Learning

  • Step 1: define a set of function
  • Step 2: goodness of function
  • Step 3: pick the best function

Neural Network

Different connections lead to different network structures

  • Network parameters $\theta$: all the weights and biases in the "neurons"

  • The most common connection pattern:

    • Fully Connected Feedforward Network

Given a network structure, we define a function set.

*(figure: a network structure defines a function set)*

Deep = Many hidden layers

Matrix Operation

*(figure: neural network as matrix operations)*

  • Use parallel computing techniques (e.g., GPUs) to speed up the matrix operations (a sketch follows)
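
A sketch of the forward pass as a sequence of matrix operations, $a^{(l)}=\sigma(W^{(l)}a^{(l-1)}+b^{(l)})$; the layer sizes and random weights are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # one matrix multiply, one add, one activation per layer
    return a                    # GPUs parallelize exactly these matrix products

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 2]  # input dim, two hidden layers, output dim
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
print(forward(np.ones(2), weights, biases))
```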

Output Layer

*(figure: output layer as a multi-class classifier)*

You need to decide the network structure so that a good function is in your function set.

Q: How many layers?

A: Trial and error + intuition

Q: Can we design the network structure ourselves (non-standard connection patterns)?

A: Convolutional Neural Network (CNN)

Total Loss

  • Find a function in the function set that minimizes the total loss L
  • Equivalently, find the network parameters $\theta^*$ that minimize the total loss L

Total loss -> gradient descent (a sketch follows)
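
A sketch of this loop in PyTorch (the toolkit named earlier); the network shape and random data are placeholders:

```python
import torch

# A tiny fully connected network; shapes are arbitrary for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 3), torch.nn.Sigmoid(),
    torch.nn.Linear(3, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(8, 2)            # placeholder inputs
y = torch.randint(0, 2, (8,))    # placeholder class labels

for epoch in range(100):
    loss = torch.nn.functional.cross_entropy(model(X), y)  # total loss L
    optimizer.zero_grad()
    loss.backward()   # gradients of L with respect to theta
    optimizer.step()  # theta <- theta - eta * gradient
```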

Why deeper?

……