Hanjoon’s blog

Algorithm1

2021-04-16T00:00:00+00:00

Solve problem 1309

PRINCIPAL COMPONENT ANALYSIS

2019-11-08T00:00:00+00:00

PCA(Principal Component Analysis)

PCA is known as a classical unsupervised learning method and one of the standard dimensionality reduction techniques.
PCA re-expresses feature axes that are correlated with each other as new uncorrelated axes (principal components). Going further, it keeps only the few PCs along which the data points have the largest variance, which is how the dimensionality reduction is done.
This post looks at part of the mathematical foundation of PCA.

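First, suppose the data points are distributed in $\mathbb{R}^n$. The variance of these data points can then be written as follows.
$\mathbb{V}[X] = \mathbb{E}[(X-\bar{X})(X-\bar{X})^\top] = S$ (when computed with the sample mean)
$\mathbb{V}[X] = \mathbb{E}[(X-\mu)(X-\mu)^\top] = \Sigma$
where each data point $\textbf{x}_i = (x_1,…,x_n)^\top$ is a realization of $X$.

Therefore the variance is an $n \times n$ matrix, and it is square and symmetric, because its $(i,j)$ entry $\mathbb{E}[(X_i-\mu_i)(X_j-\mu_j)]$ equals its $(j,i)$ entry for all $i,j$. Another important fact is that this covariance matrix is positive semidefinite, i.e. $a^\top \Sigma a \geq 0$ for all $a$, since $a^\top \Sigma a = \mathbb{V}[a^\top X]$. When no feature is an exact linear combination of the others it is positive definite, $a^\top \Sigma a > 0$ for all $a \neq 0$, which is what this post assumes.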

If a matrix is symmetric positive definite, it is known to have the following properties.

Diagonalizable
Positive Eigenvalues.

Since it is diagonalizable, an eigenvalue decomposition exists, which means $\Sigma$ can be written in the form $P^{-1}DP$. Here $P$ is a matrix built from an orthogonal basis.
An orthogonal matrix satisfies $P^{-1}P = I = P^{\top}P$, so $P^{-1} = P^\top$ and $\Sigma = P^\top D P$. $D$ is a diagonal matrix whose entries scale the corresponding basis vectors of $P$. In other words, the basis vectors in $P$ (its rows, in this convention) play the role of eigenvectors and the diagonal entries of $D$ play the role of eigenvalues. The goal of PCA is to find the basis vector whose scaling value in $D$ is the largest. The orthogonal basis is often restricted to an orthonormal basis.
An orthonormal basis has the same properties as the orthogonal basis above, with the additional requirement that every basis vector has norm 1. Here we assume that $P$ is an orthonormal matrix.

The objective function we need to solve can be written as follows.

$\underset{a_i}{\text{maximize}} \ a_{i}^{\top} \Sigma a_{i}$
$\text{subject to} \ a_{i}^{\top} a_{i} = 1 $

Here $a_{i}$ is a vector of the orthonormal basis obtained from the eigenvalue decomposition of $\Sigma$, and $\lambda_i$ is the eigenvalue of the $i$-th basis vector.

Then $a_{i}^{\top} \Sigma a_{i} = \ a_{i}^{\top} P^{\top} D P a_{i} = \ \lambda_{i}$, because $P a_i$ is the $i$-th standard basis vector.

As a result, the problem amounts to finding the orthonormal basis vector that maximizes $\lambda_i$, an entry of $D$.

To find the orthonormal basis vector that satisfies this objective, we convert the problem to its Lagrangian form, which looks as follows.

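$L(a_i, \lambda) = a_i^\top \Sigma a_i - \lambda (a_i^\top a_i - 1)$

Since $\Sigma$ is positive definite, $a_i^\top \Sigma a_i > 0$ for every $a_i \neq 0$, so the quadratic form is convex. Because we are maximizing it over the unit sphere, a vanishing gradient is only a necessary condition, so we collect the critical points and pick the one with the largest objective value.

Taking the partial derivative with respect to $a_i$ and setting it to zero gives the critical points.

$\nabla_{a_i} \left[ a_i^\top \Sigma a_i - \lambda(a_i^\top a_i - 1) \right] = 0$

$\rightarrow 2 \Sigma a_i - 2 \lambda a_i = 0 $

$\rightarrow \Sigma a_i = \lambda a_i$

$\rightarrow a_i^\top \Sigma a_i = \lambda a_i^\top a_i = \lambda$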

Substituting this result into the original optimization problem gives

$\underset{a_i}{\text{maximize}} \ a_{i}^{\top} \lambda_i a_{i}$
$\text{subject to} \ a_{i}^{\top} a_{i} = 1 $

$\underset{a_i}{\text{maximize}} \ \lambda_i$
$\text{subject to} \ a_{i}^{\top} a_{i} = 1 $

Therefore the orthonormal vector $a_i$ is the eigenvector that maximizes $\lambda_i$, and $\lambda_i$ is the largest of the eigenvalues obtained from the eigenvalue decomposition of $\Sigma$.

How do we find the second principal component?

First, the eigenvector $a_j$ of the second PC must be orthogonal to the eigenvector $a_i$ of the first PC (i.e. $a_i^\top a_j = 0$).

This leads to the following optimization problem.

$\underset{a_j}{\text{maximize}} \ a_j^{\top} \Sigma a_j$
$\text{subject to} \ a_j^{\top} a_j = 1 ,\ a_i^{\top} a_j = 0 $

Taking the Lagrangian form gives $a_j^{\top} \Sigma a_j - \lambda_j(a_j^{\top} a_j - 1) - \gamma(a_i^{\top} a_j)$

$\nabla_{a_j} \left[ a_j^{\top} \Sigma a_j - \lambda_j(a_j^{\top} a_j - 1) - \gamma(a_i^{\top} a_j) \right] = 0$
$\rightarrow 2\Sigma a_j - 2\lambda_j a_j - \gamma a_i = 0$

Multiply on the left by $a_i^\top$

$\rightarrow 2a_i^\top \Sigma a_j - 2\lambda_j a_i^\top a_j - \gamma a_i^\top a_i = 0$
$\rightarrow \gamma = 0$, because $a_i^\top \Sigma a_j = \lambda_i a_i^\top a_j = 0$, $a_i^\top a_j = 0$ and $a_i^\top a_i = 1$

Substituting this result back gives

$\rightarrow \Sigma a_j - \lambda_j a_j = 0 $

$\rightarrow \Sigma a_j = \lambda_j a_j$

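Therefore the second PC is the orthonormal vector (eigenvector) $a_j$ whose eigenvalue $\lambda_j$ is the next largest after $\lambda_i$.

The third PC can be found the same way by adding one more constraint, and likewise for the fourth.

I have written a lot of formulas, but I think the idea is simpler than it looks.

Given a covariance matrix $\Sigma$, just take its eigenvalue decomposition, go through the diagonal entries of $D$ from the largest down, and pick the corresponding orthonormal vectors from $P$. That should be all there is to it.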
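The recipe above can be sketched in a few lines. This is only a minimal illustration, not a full PCA implementation: instead of a complete eigenvalue decomposition it uses power iteration to find the leading eigenvector, then deflation ($\Sigma - \lambda_1 a_1 a_1^\top$) to find the next one. The function names, the starting vector and the toy matrix `sigma` are all assumptions made for this example.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Scale a vector to unit norm, enforcing the constraint a^T a = 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def leading_eigenpair(sigma, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    # Arbitrary start; assumed not to be orthogonal to the leading eigenvector.
    a = normalize([1.0 + 0.1 * i for i in range(len(sigma))])
    for _ in range(iterations):
        a = normalize(mat_vec(sigma, a))
    lam = sum(x * y for x, y in zip(a, mat_vec(sigma, a)))  # a^T Sigma a
    return lam, a


def deflate(sigma, lam, a):
    """Remove a found component: Sigma - lam * a a^T."""
    n = len(sigma)
    return [[sigma[i][j] - lam * a[i] * a[j] for j in range(n)] for i in range(n)]


# Toy covariance matrix (hypothetical); its eigenvalues are 3 and 1.
sigma = [[2.0, 1.0], [1.0, 2.0]]
lam1, a1 = leading_eigenpair(sigma)                        # first PC
lam2, a2 = leading_eigenpair(deflate(sigma, lam1, a1))     # second PC
```

Deflation plays the role of the extra constraint $a_i^\top a_j = 0$: once the first direction is removed, the largest remaining variance lies along the second eigenvector.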

WHITENING TRANSFORMATION

2019-09-12T00:00:00+00:00

Whitening Transformation

A whitening transformation, also called a sphering transformation, is a technique that transforms data so that its covariance matrix becomes the identity matrix $I$. In other words, $cov(x_i,x_j) = 1$ when $i=j$ and $0$ when $i \neq j$: every variable gets variance 1 and the covariance between different variables becomes 0 (uncorrelated).

First, define the feature matrix $X$ as follows.

$X = [x_1, x_2,…,x_n ] \in \mathbb{R}^{n\times n}$, where $x_i \in \mathbb{R}^{n\times 1}$. Here each $x_i$ is one feature vector holding $n$ samples.
And $\bar{X} = \mathbf{1}\cdot [\bar{x_1},\bar{x_2},…,\bar{x_n}]$, where $\mathbf{1} \in \mathbb{R}^{n\times 1}$ has $1$ in every entry and $\bar{x_i}$ is the mean of the $i$th column vector.
Hence $\bar{X} \in \mathbb{R}^{n\times n}$, and each column is filled with the mean of the corresponding feature vector.

Centering

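Define $\tilde{X}$ as $\tilde{X} = X - \bar{X}$. This is the matrix obtained by subtracting the column mean from every entry of each column vector, and the operation is called centering.

It has the following property.
$\mathbb{E}[\tilde{X}]=0$
because $\mathbb{E}[\tilde{X}] = \mathbb{E}[X - \bar{X}] = \mathbb{E}[X]-\mathbb{E}[\bar{X}] = \bar{X} - \bar{X} = 0$

The covariance matrix of $\tilde{X}$ is as follows.
$\mathrm{Cov}[\tilde{X}] = \mathbb{E}[\tilde{X}\tilde{X}^T]$
because $\mathbb{E}[\tilde{X}] = 0$ and therefore $\mathbb{E}[\tilde{X}\tilde{X}^T] = \mathbb{E}[(X - \bar{X})(X - \bar{X})^T]$. Inside the expectation, $\tilde{X}$ is read as one centered sample with its features stacked in a column, so that $\tilde{X}\tilde{X}^T$ is a feature-by-feature matrix.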

Eigenvalue Decomposition

Let $\Sigma$ denote $\mathrm{Cov}[\tilde{X}]$.

$\Sigma$ is positive semidefinite (PSD), meaning $a^\top \Sigma a \geq 0$ for all $a$.

A PSD matrix admits an eigenvalue decomposition and can be written as $\Sigma = VDV^T$. Here $V$ is an orthonormal matrix: its columns are mutually orthogonal and each has norm $\rvert\rvert \cdot \rvert\rvert$ equal to 1. $D$ is a diagonal matrix that holds the magnitude (the non-negative eigenvalue) associated with each orthonormal vector.

Showing that a PSD matrix admits an eigenvalue decomposition requires the spectral theorem first, so that goes into a separate post…

Orthogonal Transformation

Define $Y$ as follows.

$Y = V^{T}\tilde{X}$

This transforms $\tilde{X}$ into $Y$ by an orthonormal matrix.

$T : R^{n\times n} \rightarrow R^{n\times n}$ via $\tilde{X} \mapsto V^T\tilde{X}$

Here $T$ is called an orthonormal transformation and has the following property.

$\rvert\rvert x \rvert\rvert = \rvert\rvert Tx \rvert\rvert$

because $\rvert\rvert Tx \rvert\rvert^2 = <Tx,Tx> = <x,T^{\ast}Tx> = <x,x> = \rvert\rvert x \rvert\rvert^2$

The operator $T$ is the orthonormal matrix $V^T$, and $T^{\ast}$, the conjugate transpose of $T$, is $V$. Therefore $T^{\ast}T = VV^T = I$.

In summary, $V^{T}$ turns $\tilde{X}$ into $Y$ by re-expressing its coordinates in the orthonormal basis while preserving norms (distances).

Whitening/Sphering

Going back to the covariance matrix $\Sigma$: the decomposition $\Sigma = VDV^T$ can be rewritten as $VDV^T = VD^{1/2}D^{1/2}V^T$. This works because multiplying diagonal matrices is just elementwise multiplication of their diagonal entries, and the entries of $D$ are non-negative.

Looking at $VD^{1/2}$ alone, $V$ is, as noted, an orthonormal matrix, and $D^{1/2}$ acts as the scalars that set the magnitude of the orthonormal (column) vectors of $V$.

What happens if we multiply $Y = V^T\tilde{X}$ by the inverse of $D^{1/2}$ (which exists when all eigenvalues are strictly positive)? Keep in mind that $D$ was extracted from $\tilde{X}\tilde{X}^T$.

It is worth recalling what an eigenvalue decomposition means.
Saying that a matrix admits an eigenvalue decomposition means that it can be split into orthonormal basis vectors, with the magnitude along each basis vector stored in a diagonal matrix. $\tilde{X}\tilde{X}^T$ is the square of $\tilde{X}$, so the entries of $D$ can be seen as magnitudes multiplied twice. Therefore $D^{1/2}$ is the magnitude along each basis vector when $\tilde{X}$ is expressed in that orthonormal basis.

Therefore, $D^{-1/2} V^T \tilde{X} = D^{-1/2} Y$ is the result of the orthogonal transformation by $V^T$ followed by dividing each orthonormal coordinate by the corresponding magnitude in $D^{1/2}$.

Let $W = D^{-1/2} Y$ and look at what $\mathrm{Cov}[W]$ becomes.
$\mathrm{Cov}[W]$
$= \mathbb{E} [W W^T]$
$= \mathbb{E}[D^{-1/2}Y Y^T D^{-1/2}]$
$= D^{-1/2}\mathbb{E}[Y Y^T]D^{-1/2}$
$= D^{-1/2}\mathbb{E}[V^T\tilde{X} \tilde{X}^T V]D^{-1/2}$
$= D^{-1/2}V^T \mathbb{E}[\tilde{X} \tilde{X}^T]VD^{-1/2}$
$= D^{-1/2} V^T \Sigma V D^{-1/2}$
$= D^{-1/2} V^T V D V^T V D^{-1/2}$
$= D^{-1/2} D D^{-1/2}$
$= D^{-1/2} D^{1/2}D^{1/2} D^{-1/2}$
$= I$

Therefore $\mathrm{Cov}[W] = I$

In summary, multiplying the centered matrix $\tilde{X}$ by $D^{-1/2}V^T$ to form $W= D^{-1/2}V^T\tilde{X}$ yields data whose covariance matrix is $I$. The rotation $V^T$ preserves norms; the rescaling by $D^{-1/2}$ is what normalizes every variance to 1.
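To make the steps concrete, here is a small sketch of centering, eigenvalue decomposition and whitening for two-dimensional data. It is an illustration under stated assumptions: the sample points are made up, the covariance uses the $1/n$ normalization, each sample is one $(x, y)$ pair, and the $2 \times 2$ symmetric eigendecomposition is done in closed form via a rotation angle rather than with a linear algebra library.

```python
import math


def covariance_2d(points):
    """Center the points and return (centered points, (s_xx, s_xy, s_yy))."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    s_xx = sum(x * x for x, _ in centered) / n
    s_xy = sum(x * y for x, y in centered) / n
    s_yy = sum(y * y for _, y in centered) / n
    return centered, (s_xx, s_xy, s_yy)


def eig_sym_2x2(s_xx, s_xy, s_yy):
    """Closed-form eigenvalues and orthonormal eigenvectors of [[s_xx, s_xy], [s_xy, s_yy]]."""
    theta = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    v1 = (math.cos(theta), math.sin(theta))    # direction of the larger eigenvalue
    v2 = (-math.sin(theta), math.cos(theta))   # orthogonal direction
    center = 0.5 * (s_xx + s_yy)
    radius = math.hypot(0.5 * (s_xx - s_yy), s_xy)
    return (center + radius, center - radius), (v1, v2)


def whiten(points):
    """Apply W = D^{-1/2} V^T x_tilde to every sample (assumes both eigenvalues > 0)."""
    centered, (s_xx, s_xy, s_yy) = covariance_2d(points)
    (lam1, lam2), (v1, v2) = eig_sym_2x2(s_xx, s_xy, s_yy)
    whitened = []
    for x, y in centered:
        y1 = v1[0] * x + v1[1] * y   # rotate: Y = V^T x_tilde
        y2 = v2[0] * x + v2[1] * y
        whitened.append((y1 / math.sqrt(lam1), y2 / math.sqrt(lam2)))  # rescale by D^{-1/2}
    return whitened


# Hypothetical sample data, not taken from the post.
points = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.5), (6.0, 8.0), (8.0, 7.5)]
whitened = whiten(points)
_, whitened_cov = covariance_2d(whitened)  # approximately (1, 0, 1), i.e. the identity
```

Recomputing the covariance of the whitened points is a quick sanity check that the transformation does what the derivation says.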

reference: https://www.projectrhea.org/rhea/images/1/15/Slecture_ECE662_Whitening_and_Coloring_Transforms_S14_MH.pdf

SUPPORT VECTOR MACHINE

2019-08-25T00:00:00+00:00

SUPPORT VECTOR MACHINE

Linear SVM(support vector machine)

Hyperplane

A hyperplane is defined as follows.

$\lbrace \vec{x} \mid \vec{w}^{T}\vec{x}=b \rbrace$ where $\vec{x} \in \mathbb{R}^{n}$, $\vec{w} \in \mathbb{R}^{n}$ with $\vec{w} \neq 0$, and $b \in \mathbb{R}$

In other words, it is the set of $\vec{x}$ that satisfy the equation above.

For example, if $\vec{x} \in \mathbb{R}^{2}$,

the solution set $\lbrace (x_{1},x_{2}) \mid w_{1}x_{1}+w_{2}x_{2}=b \rbrace$ is a straight line that cuts $\mathbb{R}^{2}$ in two. (Likewise, in $\mathbb{R}^{3}$ the solution set is a plane.)

In general, for $\vec{x} \in \mathbb{R}^{n}$ the solution set $\lbrace \vec{x} \mid \vec{w}^{T}\vec{x}=b \rbrace$ is an $(n-1)$-dimensional affine subspace of $\mathbb{R}^{n}$.

Figure 1. Hyperplane

Figure 2. Linear SVM with margin

A linear SVM is an optimization technique that finds the $\vec{w}$ maximizing the margin (distance) between the data points distributed in $\mathbb{R}^{n}$ and the hyperplane that separates them.

Optimization problem(find large margin)

Suppose an arbitrary hyperplane $\vec{w}^{T}\vec{x}=b$ exists. Translating this hyperplane by $\pm 1$ gives

$\vec{w}^{T}\vec{x^+}=b+1$

$\vec{w}^{T}\vec{x^-}=b-1$.

Subtracting the two equations relates points on the two hyperplanes.

$\vec{w}^{T}\vec{x^+} - \vec{w}^{T}\vec{x^-} =2$

The lower and upper hyperplanes are equidistant from the middle one, so halving this and normalizing $\vec{w}$ gives

$\frac{1}{2}\frac{\vec{w}^{T}}{\parallel w \parallel}(\vec{x^+}-\vec{x^-}) = \frac{1}{\parallel w \parallel}$, where $\frac{1}{\parallel w \parallel}$ is the width (margin)

How far the hyperplane is translated in the $\pm$ direction does not really matter, because the value can always be changed by rescaling $\parallel w \parallel$.

A linear SVM maximizes this width, so $\parallel w \parallel$ must be minimized. This gives the following optimization problem.

$\underset{\vec{w},b}{\text{minimize}} \ \frac{1}{2}\parallel w \parallel^2$
$\text{subject to} \ y_{i}(\vec{w}^{T}\vec{x}_ {i}+b) \geq 1 , i=1,…,n$

$\frac{1}{2}\parallel w \parallel^2$ is a reformulation that makes the problem of minimizing $\parallel w \parallel$ easier to solve, and it has no effect on which $\vec{w}$ attains the minimum. In other words, minimizing $\parallel w \parallel$ is the same problem as minimizing $\frac{1}{2}\parallel w \parallel^2$.

On top of this, the constraint $y_{i}(\vec{w}^{T}\vec{x}_ {i}+b) \geq 1$ must hold at the same time. (The offset $b$ enters here with the opposite sign to the hyperplane equation $\vec{w}^{T}\vec{x}=b$ above, which only relabels $b$.)

$y_{i}$ is the class label of $\vec{x}_ {i}$; for data separated by the hyperplane it is given by

\[\ y_{i} = \begin{cases} 1 ,& \text{if } \ \vec{w}^{T}\vec{x}_ {i}+b \geq 0 \newline \newline -1, & \text{if } \ \vec{w}^{T}\vec{x}_ {i}+b < 0 \end{cases} \]

That is, $y_{i}$ takes the value 1 when the data point $\vec{x}_ {i}$ in $\mathbb{R}^{n}$ lies above or on the hyperplane $\vec{w}^{T}\vec{x}+b = 0$, and -1 when it lies below.

Therefore, for $y_{i}(\vec{w}^{T}\vec{x}_ {i}+b) \geq 1$ to hold, $y_{i}$ and $\vec{w}^{T}\vec{x}_ {i}+b$ must be both negative or both positive. Writing $c$ for the margin level ($c = 1$ here), this means the point must lie on or below $\vec{w}^{T}\vec{x}+b = -c$, or on or above $\vec{w}^{T}\vec{x}+b = c$.

If a point labeled $y_{i} = 1$ were on $\vec{w}^{T}\vec{x}+b = -c$, then $y_{i}(\vec{w}^{T}\vec{x}_ {i}+b) = -c$, which is negative. Conversely, if a point labeled $y_{i} = -1$ were on $\vec{w}^{T}\vec{x}+b = c$, then $y_{i}(\vec{w}^{T}\vec{x}_ {i}+b)=-c$, again negative. Both cases violate the constraint.
Consequently, $y_{i}(\vec{w}^{T}\vec{x}_ {i}+b)\geq 1$ implies that no data point lies strictly between $\vec{w}^{T}\vec{x}+b = -c$ and $\vec{w}^{T}\vec{x}+b = c$.

Dual problem

The optimization problem above can be converted into a dual problem via the Lagrange dual function and solved that way. (See the Lagrange dual problem post for details.)

The Lagrange dual function takes the following form.

$g(\lambda) = \inf\limits_{\vec{w},b}(\frac{1}{2}\parallel w \parallel^2 - \sum_{i=1}^{n}\lambda_{i}(y_{i}(\vec{w}^{T}\vec{x}_ {i}+b)-1))$

$= \inf\limits_{\vec{w},b}(\frac{1}{2}\parallel w \parallel^2 - \sum_{i=1}^{n}(\lambda_{i} y_{i}\vec{w}^{T}\vec{x}_ {i}+\lambda_{i} y_{i}b-\lambda_{i}))$

Again, the Lagrange dual function is a pointwise infimum of functions affine in $\lambda$, so it is concave, and any local maximum of it is a global maximum. For a fixed $\lambda \succeq 0$ the expression inside the infimum is convex in $(\vec{w}, b)$, so the infimum is attained where its gradient is 0.

Setting the gradient with respect to $w$ to 0,

$\nabla_{w}\frac{1}{2}\parallel w \parallel^2 - \sum_{i=1}^{n}(\lambda_{i} y_{i}\vec{w}^{T}\vec{x}_ {i}+\lambda_{i} y_{i}b-\lambda_{i}) = 0$

$ \iff w - \sum_{i=1}^{n} \lambda_{i} y_{i} \vec{x}_ {i} = 0$

$ \iff w = \sum_{i=1}^{n} \lambda_{i} y_{i} \vec{x}_ {i}$

Setting the gradient with respect to $b$ to 0,

$\nabla_{b}\frac{1}{2}\parallel w \parallel^2 - \sum_{i=1}^{n}(\lambda_{i} y_{i}\vec{w}^{T}\vec{x}_ {i}+\lambda_{i} y_{i}b-\lambda_{i}) = 0$

$\iff \sum_{i=1}^{n}\lambda_{i} y_{i} = 0$

Therefore,

substituting $ w = \sum_{i=1}^{n} \lambda_{i} y_{i} \vec{x}_ {i}$ and $\sum_{i=1}^{n}\lambda_{i} y_{i} = 0$ into $g(\lambda)$ gives

$\iff g(\lambda) = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_{i}\lambda_{j} y_{i}y_{j} {\vec{x}_ {i}}^{T}\vec{x}_ {j} - \sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_{i}\lambda_{j} y_{i}y_{j}{\vec{x}_ {i}}^{T}{\vec{x}_ {j}} - \sum_{i=1}^{n}\lambda_{i} y_{i}b + \sum_{i=1}^{n} \lambda_{i} $

$\iff g(\lambda) = \sum_{i=1}^{n} \lambda_{i} - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_{i}\lambda_{j} y_{i}y_{j} {\vec{x}_ {i}}^{T}\vec{x}_ {j} - \sum_{i=1}^{n}\lambda_{i} y_{i}b $

$\iff g(\lambda) = \sum_{i=1}^{n} \lambda_{i} - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_{i}\lambda_{j} y_{i}y_{j} {\vec{x}_ {i}}^{T}\vec{x}_ {j}$

Based on the result above, we obtain the following dual problem.

$\underset{\lambda}{\text{maximize}} \ \sum_{i=1}^{n} \lambda_{i} - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_{i}\lambda_{j} y_{i}y_{j} {\vec{x}_ {i}}^{T}\vec{x}_ {j}$
$\text{subject to} \ \sum_{i=1}^{n}\lambda_{i} y_{i} = 0, \ \lambda_{i} \geq 0 , i=1,…,n$
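As a sanity check of the dual, here is a tiny sketch on a made-up two-point data set. It assumes the convention $y_{i}(\vec{w}^{T}\vec{x}_ {i}+b) \geq 1$ with $\lambda_i \geq 0$ and $\sum_i \lambda_i y_i = 0$; with one point per class that constraint forces $\lambda_1 = \lambda_2 = t$, so the dual can simply be scanned over $t$. The helper names and the grid are assumptions for this example, not a general SVM solver.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(lams, ys, xs):
    """g(lambda) = sum_i lambda_i - 1/2 sum_ij lambda_i lambda_j y_i y_j x_i^T x_j."""
    n = len(lams)
    quad = sum(
        lams[i] * lams[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(lams) - 0.5 * quad


def recover_w(lams, ys, xs):
    """w = sum_i lambda_i y_i x_i, from the stationarity condition."""
    dim = len(xs[0])
    return [sum(l * y * x[d] for l, y, x in zip(lams, ys, xs)) for d in range(dim)]


# Hypothetical data: one point per class, symmetric about the origin.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]

# sum_i lambda_i y_i = 0 forces lambda_1 = lambda_2 = t; scan t on a grid in [0, 1].
best_t = max((k / 1000 for k in range(1001)), key=lambda t: dual_objective([t, t], ys, xs))
dual_value = dual_objective([best_t, best_t], ys, xs)
w = recover_w([best_t, best_t], ys, xs)
primal_value = 0.5 * dot(w, w)   # 1/2 ||w||^2 at the recovered w
```

For this toy problem the dual optimum equals the primal value $\frac{1}{2}\parallel w \parallel^2$, and the margin $\frac{1}{\parallel w \parallel}$ is the distance from either point to the line through the origin.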

Not the end… more to be written later. :)

BAYESIAN OPTIMIZATION : THOMPSON SAMPLING

2019-08-24T00:00:00+00:00

THOMPSON SAMPLING

Beta distribution

Thompson sampling is often explained with the beta-binomial distribution, because it makes the update of the sampling distribution easy to describe.

First, the probability density function of the beta distribution is as follows.

$Beta(\theta,\alpha,\beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha -1}(1-\theta)^{\beta -1}$

Figure 1. Beta distribution

As the example figure shows, the larger $\alpha$ is relative to $\beta$, the more the mass shifts toward 1 (negative skew), and the larger $\beta$ is relative to $\alpha$, the more it shifts toward 0 (positive skew).
Also, the larger $\alpha$ and $\beta$ are, the narrower the distribution and the higher its peak.

Gamma function

The gamma function that appears in the beta distribution has the following form. $\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha - 1} e^{-t} dt$

Applying integration by parts,

$ \int_{0}^{\infty} t^{\alpha - 1} e^{-t} dt $

$=-t^{\alpha - 1}e^{-t} \bigg\rvert_{t=0}^{\infty} + \int_{0}^{\infty} (\alpha-1)t^{\alpha -2}e^{-t} dt$

$= 0 + (\alpha-1)\int_{0}^{\infty} t^{\alpha -2}e^{-t} dt$

$=(\alpha-1)\Gamma(\alpha-1)$

Applying this repeatedly, together with $\Gamma(1) = 1$, shows that $ \Gamma (\alpha) = (\alpha-1)! $ for positive integer $\alpha$.
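The recurrence is easy to check numerically. The sketch below unrolls $\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)$ down to $\Gamma(1) = 1$ for a positive integer and compares it with the standard library's `math.gamma`; the helper name `gamma_by_recurrence` is made up for this example.

```python
import math


def gamma_by_recurrence(alpha):
    """Gamma(alpha) for a positive integer alpha via Gamma(a) = (a - 1) * Gamma(a - 1)."""
    value = 1.0  # Gamma(1) = 1
    for a in range(2, alpha + 1):
        value *= a - 1
    return value


# Gamma(n) agrees with (n - 1)! for small positive integers.
matches_factorial = all(
    math.isclose(gamma_by_recurrence(n), math.factorial(n - 1)) for n in range(1, 10)
)

# The recurrence itself also holds for non-integer arguments, e.g. alpha = 3.5.
recurrence_holds = math.isclose(math.gamma(3.5), 2.5 * math.gamma(2.5))
```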

Beta-binomial distribution

Based on the two facts above, the following equality holds for positive integers $\alpha$ and $\beta$.

\[{\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha -1}(1-\theta)^{\beta -1} = \frac{(\alpha+\beta -1)!}{(\alpha-1)!(\beta-1)!}\theta^{\alpha-1}(1-\theta)^{\beta-1}}\]

Next, let us see how this relates to the binomial distribution.

Bayesian update for binomial distribution

The binomial distribution generalizes the Bernoulli distribution (the case $n=1$), and its probability mass function is defined as follows.

$bin(n,k,\theta) = \binom{n}{k}\theta^{k}(1-\theta)^{n-k}$
, where $\theta$ = probability of success, $n$ = number of trials, $k$ = number of successes

The posterior for the binomial distribution is computed as follows.

$p(\theta \mid x) = \frac{p(x \mid \theta)p(\theta)}{p(x)}$, where we assume the uniform prior $p(\theta) = 1$ on $[0,1]$

$= \frac{\binom{n}{k}\theta^{k}(1-\theta)^{n-k}}{\binom{n}{k}\int_{0}^{1}\theta^{k}{(1-\theta)}^{n-k}d\theta}$

$= \frac{\theta^{k}(1-\theta)^{n-k}}{\frac{\Gamma(k+1)\Gamma(n-k+1)}{\Gamma(n+2)}}$

$= \frac{\Gamma(n+2)}{\Gamma(k+1)\Gamma(n-k+1)}\theta^{k}(1-\theta)^{n-k}$

$= Beta(\theta,k+1,n-k+1)$

Therefore the posterior of $bin(n,k)$ under a uniform prior is $Beta(k+1,n-k+1)$.
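The step that replaces the integral by gamma functions can be checked numerically. The sketch below approximates the evidence $\int_0^1 \theta^{k}(1-\theta)^{n-k}d\theta$ with a midpoint rule and compares the resulting posterior density with the $Beta(k+1, n-k+1)$ density. The function names, the step count and the values $n=10$, $k=3$ are assumptions chosen for illustration.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, written with gamma functions."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * theta ** (a - 1) * (1 - theta) ** (b - 1)


def posterior_uniform_prior(theta, n, k, steps=20000):
    """Posterior density under a uniform prior, normalized by a midpoint-rule integral."""
    h = 1.0 / steps
    evidence = h * sum(
        ((i + 0.5) * h) ** k * (1 - (i + 0.5) * h) ** (n - k) for i in range(steps)
    )
    # The binomial coefficient cancels between numerator and denominator.
    return theta ** k * (1 - theta) ** (n - k) / evidence


n, k = 10, 3  # hypothetical data: 3 successes in 10 trials
numeric = posterior_uniform_prior(0.4, n, k)
closed_form = beta_pdf(0.4, k + 1, n - k + 1)
```

The two values agree up to the integration error, which is what the closed-form derivation predicts.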

Bayesian update for beta-binomial distribution

What happens if we assume the prior $p(\theta)$ follows a beta distribution?

bayesian inference

In Bayesian inference the normalizing constant (here $p(x)$) is often omitted because it is costly to compute. This gives the following expression.

\[posterior \propto likelihood \times prior\]

So the posterior is obtained as follows.

$p(\theta \mid x) \propto p(x \mid \theta)p(\theta)$, where we assume that $p(\theta) = Beta(\alpha,\beta)$

$\propto \binom{n}{k}\theta^{k}(1-\theta)^{n-k}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha -1}(1-\theta)^{\beta -1}$

$\propto \binom{n}{k}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{k+\alpha -1}(1-\theta)^{n-k+\beta -1}$

$\propto Beta(k+\alpha, n-k+\beta)$

Thompson sampling for bernoulli bandit

Bandit problems are usually explained with a slot machine analogy. Put simply, there are several slot machines, and the question is which machine's lever is most likely to pay out when pulled.

The algorithm below finds the best choice in the bandit problem (which slot machine's lever has the highest probability of winning money).

Pseudo-code for bernoulli thompson sampling

Figure 2. Bernoulli greedy algorithm (left), Bernoulli thompson sampling algorithm (right)

The main difference between the two algorithms is that the former uses the mean when computing $\hat{\theta_{k}}$, whereas the latter draws it at random according to the probability density of the beta distribution.

Notes on Algorithm2
$t$ is the index of the trial and $k$ is the index of the choice.
At the $t$-th trial a value ${\theta_{k}}$ (in other words, a success probability) is drawn for each choice, and the largest of them is selected as $\hat{\theta_{k}}$.
Once the largest success probability $\hat{\theta_{k}}$ is selected, the corresponding choice is taken as $x_{t}$ and its reward $r_{t}$ is observed.
Applying Bayesian inference with the reward $r_{t} \in \lbrace 0,1 \rbrace$, the prior $Beta(\alpha,\beta)$ of the chosen arm is updated to the posterior $Beta(\alpha+r_{t},\beta+1-r_{t})$.

As Figure 1 showed for the beta distribution, the distribution becomes narrower as the two parameters ($\alpha$, $\beta$) grow.
And the two parameters increase as the number of trials grows.
That is, the variance of the sampling distribution of ${\theta_{k}}$ shrinks, so the randomly drawn ${\theta_{k}}$ values become more uniform.
This in turn makes the ${\theta_{k}}$ sampled from the distribution of a choice with a high success probability consistently high, so that choice stands out clearly.

In the $\epsilon$-greedy algorithm, $\epsilon$ leaves room for randomly making a choice other than the one with the highest estimated success probability, but this does not help when one choice is clearly better than the others.
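The loop described above can be sketched as follows. This is a minimal illustration rather than the pseudo-code from the figure: it assumes $Beta(1,1)$ (uniform) priors for every arm, rewards in $\lbrace 0,1 \rbrace$, and a simulated environment whose success probabilities `true_probs` are made up for the example.

```python
import random


def thompson_sampling(true_probs, rounds, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors on every arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k   # 1 + number of observed successes per arm
    beta = [1] * k    # 1 + number of observed failures per arm
    pulls = [0] * k
    for _ in range(rounds):
        # Draw one success probability per arm from its current Beta posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        # Simulated environment: reward 1 with the arm's true probability.
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate update: Beta(alpha, beta) -> Beta(alpha + r, beta + 1 - r).
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return alpha, beta, pulls


# Hypothetical slot machines with success probabilities 0.2, 0.5 and 0.8.
alpha, beta, pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
```

As the posteriors narrow, the draws for weaker arms rarely exceed those of the best arm, so most pulls end up on the arm with the highest success probability.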

reference : https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf

LAGRANGE DUAL PROBLEM

2019-08-20T00:00:00+00:00

LAGRANGE DUAL PROBLEM

Suppose we are given an optimization problem of the following form.

$\underset{x}{\text{minimize}} \ f_0(x)$
$\text{subject to} \ f_i(x) \leq 0 , i=1,…,m$
$\qquad\qquad\ g_i(x) = 0 , i=1,…,p$

Lagrange dual function

Then the Lagrange dual function is defined as follows.

$g(\lambda,\upsilon) = \inf\limits_{x \in \mathcal{D}}L(x,\lambda,\upsilon)= \inf\limits_{x \in \mathcal{D}}(f_{0}(x)+\sum_{i=1}^{m} \lambda_{i}f_{i}(x) + \sum_{i=1}^{p} \upsilon_{i}g_{i}(x) )$

Here $L(x,\lambda,\upsilon)$ is the Lagrangian function,

$L:\mathcal{D} \times \mathbb{R}^m \times \mathbb{R}^p \rightarrow \mathbb{R} \ \text{with} \ \mathcal{D} = \bigcap_{i=0}^m \text{dom} \ f_i \cap \bigcap_{i=1}^p \text{dom} \ g_i$, $(x,\lambda,\upsilon) \mapsto f_{0}(x)+\sum_{i=1}^{m} \lambda_{i}f_{i}(x) + \sum_{i=1}^{p} \upsilon_{i}g_{i}(x)$

The $\lambda_{i}$ and $\upsilon_{i}$ multiplying each constraint are called “dual variables” or “Lagrange multipliers”.

If the multipliers $\lambda_{i}$ of the inequality constraints are non-negative, then combined with the constraints the following inequality holds.

$g(\lambda, \upsilon) \leq L(\widetilde{x},\lambda,\upsilon) \leq f_{0}(\widetilde{x})$, where $\widetilde{x}$ is any feasible point in $\mathcal{D}$

Since this inequality always holds, $g(\lambda, \upsilon)$ is a lower bound on $f_{0}(x)$ over the feasible points, and hence on the optimal value.

Dual problem

$g(\lambda, \upsilon)$ is a pointwise infimum of functions affine in $(\lambda, \upsilon)$, so it is concave.

This means that any local maximum of $g(\lambda, \upsilon)$ is a global maximum.

With this in mind we can consider the following optimization problem.

$\underset{\lambda \geq 0}{\text{maximize}} \ g(\lambda, \upsilon)$

Let $d^{\ast} = \max g(\lambda, \upsilon)$ be the maximum of $g$ and $p^{\ast} = \min f_{0}(x)$ the minimum of $f_{0}$ over the feasible points. Because every value of $g$ is a lower bound on $p^{\ast}$, the following inequality (weak duality) holds.

\[\max g(\lambda, \upsilon) = d^{\ast} \leq p^{\ast} = \min f_{0}(x)\]

Example

Standard form LP

Suppose we are asked to optimize the following standard form LP (linear programming) problem.

$\underset{x}{\text{minimize}} \ c^{T}x$
$\text{subject to} \ Ax = b$
$\qquad\qquad\ x \succeq 0$

Its Lagrange dual function, with multiplier $\lambda$ for $Ax = b$ and $\upsilon \succeq 0$ for $x \succeq 0$ (written as $-x \preceq 0$), is

$g(\lambda,\upsilon) = \inf\limits_{x \in \mathcal{D}}(c^{T}x+\lambda^{T}(Ax-b) - \upsilon^{T}x )$
$\quad\quad\quad = \inf\limits_{x \in \mathcal{D}}((c^{T}+\lambda^{T}A-\upsilon^{T})x - \lambda^{T}b)$
$\quad\quad\quad = \inf\limits_{x \in \mathcal{D}}((c^{T}+\lambda^{T}A-\upsilon^{T})x) - \lambda^{T}b$

This is a pointwise infimum of functions that are linear (affine) in $(\lambda, \upsilon)$, so it is concave.

It can be turned into the following dual problem.

$\underset{\lambda,\ \upsilon \succeq 0}{\text{maximize}} \ g(\lambda,\upsilon)$

A linear function of $x$ is bounded below only when it is identically zero, so the infimum over $x$ is finite exactly when the coefficient of $x$ vanishes, i.e. when the gradient of the Lagrangian with respect to $x$ is zero.

\[\nabla_{x}L(x,\lambda,\upsilon) = c^{T}+\lambda^{T}A-\upsilon^{T} = 0\]

Therefore $g(\lambda,\upsilon)$ is finite only when $c^{T}+\lambda^{T}A-\upsilon^{T} = 0$, and its maximum is sought among such points.

\[\ g(\lambda, \upsilon)= \begin{cases} - \lambda^{T}b ,& \text{if } \ c^{T}+\lambda^{T}A-\upsilon^{T} = 0 \newline \newline -\infty, & \text{otherwise} \end{cases} \]
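A tiny numeric check of weak duality for a standard form LP may help here. The sketch assumes the sign convention in which the multiplier of $x \succeq 0$ is $\upsilon \succeq 0$ and enters the Lagrangian as $-\upsilon^{T}x$, so that $g(\lambda,\upsilon) = -\lambda^{T}b$ whenever $\upsilon = c + A^{T}\lambda \succeq 0$ and $-\infty$ otherwise. The toy problem (minimize $x_1 + 2x_2$ subject to $x_1 + x_2 = 1$, $x \succeq 0$) and the grids are made up for illustration.

```python
# Toy LP (hypothetical): minimize x1 + 2*x2  subject to  x1 + x2 = 1,  x >= 0.
c = (1.0, 2.0)
a_row = (1.0, 1.0)   # the single row of A
b = 1.0


def dual_function(lam):
    """g(lam) = -lam * b if upsilon = c + A^T lam is componentwise >= 0, else -inf."""
    if all(c_i + a_i * lam >= 0 for c_i, a_i in zip(c, a_row)):
        return -lam * b
    return float("-inf")


def primal_value(t):
    """Objective at the feasible point x = (t, 1 - t), valid for 0 <= t <= 1."""
    x = (t, 1.0 - t)
    return c[0] * x[0] + c[1] * x[1]


# Scan the dual variable and the feasible segment on coarse grids.
d_star = max(dual_function(k / 100) for k in range(-300, 301))
p_star = min(primal_value(k / 100) for k in range(0, 101))
```

Every finite value of `dual_function` is a lower bound on the primal optimum, and for this LP the best lower bound coincides with it.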

NAIVE BAYES CLASSIFIER

2019-08-19T00:00:00+00:00

NAIVE BAYES CLASSIFIER

Bayes Rule

$p(c \mid \theta) = \frac{p(\theta \mid c)p(c)}{p(\theta)} \iff p(c , \theta) = p(\theta \mid c)p(c) $

Given some clues (features), what is the probability of $c$ given those clues? The probability to estimate is written as follows.

$p(c \mid \theta_{1},\theta_{2},…,\theta_{n})$, where $c$ and $\theta$ are “choice” and “feature” respectively.

Using Bayes' rule, this expression can be written as follows.

$p(c \mid \theta_{1},\theta_{2},…,\theta_{n}) = \frac{p(\theta_{1},\theta_{2},…,\theta_{n} \mid c)p(c)}{p(\theta_{1},\theta_{2},…,\theta_{n})} = \frac{p(\theta_{1},\theta_{2},…,\theta_{n},c)}{p(\theta_{1},\theta_{2},…,\theta_{n})} $

Applying Bayes' rule repeatedly (the chain rule) to the numerator gives

$p(c,\theta_{1},\theta_{2},…,\theta_{n})$

$\ = p(\theta_{1},\theta_{2},…,\theta_{n} \mid c)p(c)$

$\ = p(\theta_{2},\theta_{3},…,\theta_{n} \mid \theta_{1},c)p(\theta_{1} \mid c)p(c)$

$\ = p(\theta_{3},\theta_{4},…,\theta_{n} \mid \theta_{1},\theta_{2},c)p(\theta_{2} \mid \theta_{1},c)p(\theta_{1} \mid c)p(c)$

$\ = p(\theta_{4},\theta_{5},…,\theta_{n} \mid \theta_{3},\theta_{2},\theta_{1},c)p(\theta_{3}\mid \theta_{2},\theta_{1},c)p(\theta_{2} \mid \theta_{1},c)p(\theta_{1} \mid c)p(c)$

$\ \ \ \ \ \ \ \vdots$

$\ = p(\theta_{n} \mid \theta_{n-1},\theta_{n-2},…,\theta_{1},c)p(\theta_{n-1} \mid \theta_{n-2},\theta_{n-3},…,\theta_{1},c)…p(\theta_{3} \mid \theta_{2},\theta_{1},c)p(\theta_{2} \mid \theta_{1},c)p(\theta_{1} \mid c)p(c)$

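In addition, the naive Bayes classifier assumes that the features are unrelated to each other. In other words, conditioning on other features does not change the probability. So, assuming the features are conditionally independent given $c$, the expression can be written as follows.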

$\ = p(\theta_{n} \mid \theta_{n-1},\theta_{n-2},…,\theta_{1},c)p(\theta_{n-1} \mid \theta_{n-2},\theta_{n-3},…,\theta_{1},c)…p(\theta_{3} \mid \theta_{1},\theta_{2},c)p(\theta_{2} \mid \theta_{1},c)p(\theta_{1}\mid c)p(c)$

$\ = p(\theta_{n} \mid c)p(\theta_{n-1} \mid c)…p(\theta_{3} \mid c)p(\theta_{2} \mid c)p(\theta_{1} \mid c)p(c)$

$\ = p(c){\displaystyle \prod_{i=1}^{n}p(\theta_{i} \mid c)}$

Gaussian Naive Bayes Classifier

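When the probabilities follow a gaussian (normal) distribution and classification is done with the NB (naive Bayes) method, the result is called a gaussian naive Bayes classifier.

The gaussian distribution is a symmetric, bell-shaped distribution with the following formula.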

$p(x \mid \mu,\sigma^2) = \frac{1}{\sigma \sqrt {2\pi}}e^\frac{-(x - \mu)^2}{2\sigma^2}, X \sim \mathcal{N}(\mu,\sigma^{2})$

The method is fairly simple.
Given a choice set $\mathcal{C} = \lbrace c_{1},c_{2},…,c_{k} \rbrace$, assume that each pair of $\theta_{i}$ and $c_{j}$ follows a normal distribution with its own $\mu_{i,j}$ and $\sigma_{i,j}^2$, and replace each probability with the normal pdf (probability density function).

$p(c_{j},\theta_{1},\theta_{2},…,\theta_{n}) = p(c_{j}){\displaystyle \prod_{i=1}^{n} \frac{1}{\sigma_{i,j} \sqrt {2\pi}}e^\frac{-(\theta_{i} - \mu_{i,j})^2}{2\sigma_{i,j}^2} }$, where $\mu_{i,j}$ and $\sigma_{i,j}^2$ are given by $c_{j}$

Therefore, for each $c_{j}$ we multiply the prior by the normal densities of the features, and the sample is classified into the choice with the largest product.

$c^{\ast} = \text{arg}\max\limits_{c_{j} \in \mathcal{C}}\ p(c_{j}, \theta_{1},\theta_{2},…,\theta_{n}) $

For readers who followed closely and wonder why we computed $p(c,\theta_{1},\theta_{2},…,\theta_{n})$ instead of $ p(c \mid \theta_{1},\theta_{2},…,\theta_{n}) $, here is the explanation:

$ p(c_{j} \mid \theta_{1},\theta_{2},…,\theta_{n}) = \frac{p(c_{j}, \theta_{1},\theta_{2},…,\theta_{n})}{p(\theta_{1},\theta_{2},…,\theta_{n})}$ , where $p(\theta_{1},\theta_{2},…,\theta_{n})$ is a common term for all $c_{j} \in \mathcal{C}$

Therefore, computing only $p(c_{j}, \theta_{1},\theta_{2},…,\theta_{n})$ is enough to see which choice has the highest probability.
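The whole procedure fits in a short sketch. The version below is an illustration under a few assumptions: per-class means and variances are estimated with the $1/n$ formula, the class prior is the class frequency, and the joint probability is evaluated in log space to avoid underflow, which does not change the arg max. The training data and the names `fit` and `classify` are made up for this example.

```python
import math


def log_gaussian(x, mu, var):
    """Log of the normal density N(mu, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def fit(samples):
    """samples: dict mapping class label -> list of feature tuples.

    Returns, per class, (prior p(c_j), list of mu_{i,j}, list of sigma_{i,j}^2).
    """
    total = sum(len(rows) for rows in samples.values())
    model = {}
    for label, rows in samples.items():
        n = len(rows)
        dims = range(len(rows[0]))
        mus = [sum(r[i] for r in rows) / n for i in dims]
        variances = [sum((r[i] - mus[i]) ** 2 for r in rows) / n for i in dims]
        model[label] = (n / total, mus, variances)
    return model


def classify(model, theta):
    """Return the class maximizing log p(c_j) + sum_i log p(theta_i | c_j)."""
    def log_joint(label):
        prior, mus, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(t, m, v) for t, m, v in zip(theta, mus, variances)
        )
    return max(model, key=log_joint)


# Hypothetical training data with two features and two classes.
training = {
    "a": [(1.0, 2.0), (1.5, 1.8), (0.8, 2.4)],
    "b": [(5.0, 6.0), (5.5, 6.2), (4.6, 5.7)],
}
model = fit(training)
prediction_near_a = classify(model, (1.2, 2.1))
prediction_near_b = classify(model, (5.1, 6.0))
```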

COMPACT SPACE

2019-08-15T00:00:00+00:00

COMPACT SPACE

Compactness is defined as follows.

Definition
A space $X$ is said to be compact if every open covering $\mathcal{A}$ of $X$ contains a finite subcollection that also covers $X$.

In other words, whenever a collection of open sets covers the space $X$ (an open covering $\mathcal{A}$), some finite number of those open sets can be chosen that still cover $X$; if this is always possible, $X$ is compact.

Lemma
Let $Y$ be a subspace of $X$. Then $Y$ is compact if and only if every covering of $Y$ by sets open in $X$ contains a finite subcollection covering $Y$.

Proof
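$(\Rightarrow)$
Suppose $Y$ is compact and let $\mathcal{A} = \lbrace A_{i} \rbrace$ be a covering of $Y$ by sets open in $X$. Then $\lbrace A^{\prime}_ {i} = A_{i} \cap Y \rbrace$ is a covering of $Y$ by sets open in $Y$, so it has a finite subcollection $\lbrace A^{\prime}_ {i} \rbrace_{i \in I}$, where $I$ is a finite index set, that covers $Y$. The corresponding $\lbrace A_{i}\rbrace_{i \in I}$ is a finite subcollection of $\mathcal{A}$ covering $Y$.
$(\Leftarrow)$
Conversely, any covering of $Y$ by sets open in $Y$ has the form $\lbrace A_{i} \cap Y \rbrace$ with each $A_{i}$ open in $X$. A finite subcollection of $\lbrace A_{i} \rbrace$ covering $Y$ gives a finite subcollection of $\lbrace A_{i} \cap Y \rbrace$ covering $Y$, so $Y$ is compact.

From the conclusion above, if $\lbrace A^{\prime}_ {i} \rbrace_{i \in I}$ is finite then $\lbrace A_ {i} \rbrace_{i \in I}$ is finite as well. So when $Y$ is compact, every covering of $Y$ by open sets of $X$ has a finite subcollection that covers $Y$.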

Theorem
Every closed subspace of a compact space is compact.

Proof
Let $X$ be a compact space and $Y$ a closed subspace of $X$. Given any covering $\mathcal{A}$ of $Y$ by sets open in $X$, the collection $\mathcal{A} \cup \lbrace X - Y \rbrace$ is an open covering of $X$, because $X - Y$ is open. Since $X$ is compact, some finite subcollection of it covers $X$. Discarding $X - Y$ from that subcollection, if it is present, leaves a finite subcollection of $\mathcal{A}$ that covers $Y$. Hence $Y$ is compact.

$\mathcal{A} \cup \lbrace X - Y \rbrace$ is an open covering of the compact space $X$, so it can be reduced to a finite open covering. Even after removing the open set $X - Y$, the subspace $Y$ is still covered by the finite part of $\mathcal{A}$ that remains, so $Y$ is compact.

Theorem
Every compact subspace of a Hausdorff space is closed.

Proof
Let $X$ be a Hausdorff space and $Y$ a compact subspace of $X$. Fix a point $x_{0}$ in $X-Y$. By the Hausdorff condition, for each $y$ in $Y$ there are disjoint neighborhoods $U_{y}$ of $x_{0}$ and $V_{y}$ of $y$. The sets $V_{y}$ cover $Y$, and since $Y$ is compact, finitely many of them cover $Y$. Let $\mathcal{V} = V_{y_{0}} \cup V_{y_{1}} \cup V_{y_{2}} … \cup V_{y_{n}}$ and $\mathcal{U} = U_{y_{0}} \cap U_{y_{1}} \cap U_{y_{2}} … \cap U_{y_{n}}$. Then $\mathcal{U}$ is an open neighborhood of $x_{0}$, and any $z$ in $\mathcal{U}$ lies outside every $V_{y_{i}}$ by construction, so $\mathcal{U}$ is disjoint from $\mathcal{V} \supseteq Y$. Hence $X - Y$ is open and $Y$ is closed.

The proof uses the given conditions as follows.

Any two distinct elements have open sets that do not overlap. (Hausdorff)
Since $Y$ is compact, a finite open covering $\mathcal{V}$ can be extracted. (compact)

For an arbitrary $x_{0}$ outside $Y$ (in $X-Y$), each $y$ in $Y$ has an open set $V_{y}$ that does not overlap an open set $U_{y}$ of $x_{0}$, and finitely many of these, the $V_{y_{i}}$, form a finite open covering $\mathcal{V}$ of $Y$. No element of $\mathcal{U}$ lies in $\mathcal{V}$, so $X - Y$ is open and $Y$ is closed.

For example, $\mathcal{U}$ always contains $x_{0}$ and no open set in $\mathcal{V}$ contains $x_{0}$, so $\mathcal{U}$ is disjoint from $\mathcal{V}$ and therefore $Y$ is closed. The separation shown here, a point and a set kept apart by disjoint open sets, is the kind of property that defines a regular space; more on that in a later post.

RIEMANN INTEGRAL

2019-08-15T00:00:00+00:00

This is a highlight test.

Normal block

alert('Hello World!');

print 'helloworld'

This post is used for testing tag plugins. See docs for more info.

Block Quote

Normal blockquote

Praesent diam elit, interdum ut pulvinar placerat, imperdiet at magna.

Code Block

Inline code block

This is an inline code block: python, print 'helloworld'.

Normal code block

alert('Hello World!');

print "Hello world"

Highlight code block

print "Hello world"

def foo
  puts 'foo'
end


Gist

MATHEMATICAL_PROOF

2019-08-12T00:00:00+00:00

PROOF BY CONTRADICTION

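Proof by contradiction is a method that negates the statement to be proved, shows that the result is a contradiction, and thereby proves that the statement is true.

I found this method hard to grasp intuitively when I first met it, but a little knowledge of conditional logic gives insight into how it works.

First, let us draw the truth table of the conditional statement if p, then q (or $p \rightarrow q$).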

$p$	$q$	$p \rightarrow q$
T	T	T
T	F	F
F	T	T
F	F	T

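What we can see here is that $p \rightarrow q$ is false only when $p$ is true and $q$ is false.

Put differently, if $q$ is false, then $p$ must be false for the conditional to be true.

In proof by contradiction we assume the negation of the statement and show that it leads to a contradiction:
since $\neg p \rightarrow (c \land \neg c)$ holds and its consequent is false, $\neg p$ is false, and therefore $p$ is true.
(Here $c$ is the result that should hold and $\neg c$ is the result derived from the negated statement; $c$ and $\neg c$ contradict each other and cannot both hold, so $c \land \neg c$ is false.)

In $\neg p \rightarrow (c \land \neg c)$, $(c \land \neg c)$ is false, so $\neg p$ must be false. Therefore, $\neg \neg p = p$ is true.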
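The argument can be checked mechanically by enumerating truth values. The short sketch below rebuilds the truth table of $p \rightarrow q$ and verifies that $(\neg p \rightarrow (c \land \neg c)) \rightarrow p$ is true under every assignment; the helper `implies` is just a name chosen for this example.

```python
from itertools import product


def implies(p, q):
    """Material conditional: p -> q is false only when p is true and q is false."""
    return (not p) or q


# Truth table of p -> q in the order (T,T), (T,F), (F,T), (F,F).
table = [implies(p, q) for p, q in product([True, False], repeat=2)]

# (not p -> (c and not c)) -> p holds for every assignment of p and c.
is_tautology = all(
    implies(implies(not p, c and not c), p)
    for p, c in product([True, False], repeat=2)
)
```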

PROOF BY INDUCTION

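The principle of this proof is as follows.

Suppose $P(1)$ is true, and $P(k+1)$ is true whenever $P(k)$ is true. Then $P(n)$ is true for every positive integer $n$.

In other words, we check that a proposition is true for the base case $n=1$, and then check that it is true for $n=k+1$ under the assumption that it is true for $n=k$.

Proof by induction rests on the well-ordering principle.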

Every non-empty set of positive integers(natural number) contains a least element.

Proof
Assume there is a non-empty set $A$ of positive integers with no least element, and let $B = \mathbb{N} \setminus A$. Then $1 \in B$, since otherwise $1$ would be the least element of $A$. If every integer up to $k$ is in $B$, then $k+1$ is in $B$ as well, since otherwise $k+1$ would be the least element of $A$. Hence $B = \mathbb{N}$ and $A$ is empty, a contradiction.

Let us check how the well-ordering principle justifies the proof method above.

Proof
Assume that $A = \lbrace k \mid P(k) \text{ is false} \rbrace$ is non-empty. By the well-ordering principle, $A$ has a smallest element $k'$. Since $P(1)$ is true, $k' > 1$, so $k'-1$ is a positive integer that is not in $A$, that is, $P(k'-1)$ is true. But then $P(k')$ is true by the inductive step, which contradicts $k' \in A$.

Here $k'$ cannot be 1, because $P(1)$ is true and so 1 does not belong to $A$. Therefore, $k'>1$.
The key point of the proof is that there is no case where $P(k')$ is false while $P(k'-1)$ is true. In other words, if $P(k'-1)$ is true then $P(k')$ is also true. (The set $A$ is empty.)