12.31 Study Note

elzino 2018. 12. 31. 13:52

http://cs231n.github.io/linear-classify/


Why normalize the data -> to make training faster.
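
A minimal numpy sketch of the usual preprocessing this refers to (zero-centering and scaling each feature); the array X and its shape are my own illustration, not from the notes:

import numpy as np

# X: [N x D] array of N training examples with D features each (illustrative data)
X = np.random.randn(100, 3072).astype(np.float32)

X -= np.mean(X, axis=0)   # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)    # scale: divide by the per-feature standard deviation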


Why regularize

In the SVM, W is not unique.

-> we wish to encode some preference for a certain set of weights W over others to remove this ambiguity. 
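
Concretely, if some W classifies every example correctly with a margin larger than Δ, then any scaled-up copy of it (e.g. 2W) also gives zero data loss, so the data loss alone cannot pick one. The notes resolve this by adding an L2 regularization penalty R(W) to the objective:

R(W) = \sum_k \sum_l W_{k,l}^2, \qquad L = \underbrace{\frac{1}{N}\sum_i L_i}_{\text{data loss}} + \underbrace{\lambda R(W)}_{\text{regularization loss}}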


In addition to the motivation we provided above there are many desirable properties to include the regularization penalty. For example, it turns out that including the L2 penalty leads to the appealing max margin property in SVMs.


The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. 

Suppose the input is x = [1, 1, 1, 1] and we have two weight vectors w1 = [1, 0, 0, 0] and w2 = [0.25, 0.25, 0.25, 0.25]. Both give the same dot product (w1·x = w2·x = 1), but the L2 penalty of w1 is 1.0 while the L2 penalty of w2 is only 0.25, so the regularizer prefers w2.

The final classifier is encouraged to take into account all input dimensions to small amounts rather than a few input dimensions and very strongly.

This leads to less overfitting.
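
A quick numerical check of the example above (a sketch; the input x = [1, 1, 1, 1] is the one used in the notes):

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1 @ x, w2 @ x)                # identical scores: 1.0 1.0
print(np.sum(w1**2), np.sum(w2**2))  # L2 penalties: 1.0 vs 0.25 -> the regularizer prefers w2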


Note that biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights but not the biases b. However, in practice this often turns out to have a negligible effect.


SVM - delta vs lambda
It turns out that this hyperparameter can safely be set to Δ = 1.0 in all cases. The hyperparameters Δ and λ seem like two different hyperparameters, but in fact they both control the same tradeoff: the tradeoff between the data loss and the regularization loss in the objective.
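
For reference, the full Multiclass SVM loss in which both hyperparameters appear (as written in the linked notes). Since W can be scaled to stretch or shrink all score differences arbitrarily, the exact value of Δ is meaningless on its own; the real tradeoff is set by how large λ allows the weights to grow:

L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\left(0,\; f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_k \sum_l W_{k,l}^2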


Information theory view. The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:

H(p, q) = -\sum_x p(x) \log q(x)

The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( q = e^{f_{y_i}} / \sum_j e^{f_j} as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. p = [0, \ldots, 1, \ldots, 0] contains a single 1 at the y_i-th position). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as H(p, q) = H(p) + D_{KL}(p \| q), and the entropy of the delta function p is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
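
A minimal numpy sketch of this loss for a single example (the function name and the example scores are my own; shifting the scores by their max is the usual trick for numerical stability and does not change the softmax):

import numpy as np

def softmax_cross_entropy(f, y):
    # f: raw class scores for one example, y: index of the correct class
    f = f - np.max(f)                    # numerical stability; softmax is shift-invariant
    q = np.exp(f) / np.sum(np.exp(f))    # estimated distribution q
    return -np.log(q[y])                 # cross-entropy with a one-hot p reduces to -log q[y]

scores = np.array([3.2, 5.1, -1.7])
print(softmax_cross_entropy(scores, y=0))   # loss if class 0 is the correct class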


We are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE).
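
Written out, the Softmax classifier treats the normalized scores as class probabilities and minimizes the negative log probability of the correct class:

P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}, \qquad L_i = -\log P(y_i \mid x_i; W)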


The SVM cost function is an example of a convex function. There is a large amount of literature devoted to efficiently minimizing these types of functions, and you can also take a Stanford class on the topic (convex optimization).
Once we extend our score functions f to Neural Networks our objective functions will become non-convex, and the loss surfaces will not look like bowls but like complex, bumpy terrains.

The max operations in the SVM loss introduce kinks in the loss function; at these kinks the gradient is not defined. However, the subgradient still exists and is commonly used instead. In this class we will use the terms subgradient and gradient interchangeably.
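
For instance, for the hinge max(0, x) itself (a standard fact, not a quote from the notes):

\frac{\partial}{\partial x}\max(0, x) =
\begin{cases}
1 & x > 0 \\
0 & x < 0 \\
\text{any value in } [0, 1] & x = 0 \ \text{(valid subgradients at the kink)}
\end{cases}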

