'분류 전체보기'에 해당되는 글 1620건

2023.08.29 D2L - 15.4. Pretraining word2vec
2023.08.29 D2L - 15.3. The Dataset for Pretraining Word Embeddings
2023.08.28 D2L- 15.2. Approximate Training
2023.08.25 D2L- 15.1. Word Embedding (word2vec)
2023.08.24 D2L - 15. Natural Language Processing: Pretraining
2023.08.21 D2L - 14.12. Neural Style Transfer 1
2023.08.21 D2L - 14.11. Fully Convolutional Networks
2023.08.21 D2L - 14.10. Transposed Convolution 1
2023.08.20 D2L - 14.9. Semantic Segmentation and the Dataset
2023.08.20 D2L - 14.8. Region-based CNNs (R-CNNs)

Dive into Deep Learning/D2L Natural language Processing

D2L - 15.4. Pretraining word2vec

2023. 8. 29. 12:23 | Posted by 솔웅

15.4. Pretraining word2vec — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

15.4. Pretraining word2vec — Dive into Deep Learning 1.0.3 documentation

d2l.ai

15.4. Pretraining word2vec

We go on to implement the skip-gram model defined in Section 15.1. Then we will pretrain word2vec using negative sampling on the PTB dataset. First of all, let’s obtain the data iterator and the vocabulary for this dataset by calling the d2l.load_data_ptb function, which was described in Section 15.3

우리는 계속해서 섹션 15.1에 정의된 스킵 그램 모델을 구현합니다. 그런 다음 PTB 데이터세트에 대해 네거티브 샘플링을 사용하여 word2vec을 사전 학습합니다. 먼저 15.3절에서 설명한 d2l.load_data_ptb 함수를 호출하여 이 데이터셋에 대한 데이터 반복자와 어휘를 구해보겠습니다.

import math
import torch
from torch import nn
from d2l import torch as d2l

batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)

이 코드는 스킵-그램 모델 학습에 필요한 라이브러리 및 데이터를 불러오고 설정하는 과정을 보여주고 있습니다.

import math, import torch, from torch import nn, from d2l import torch as d2l:
- math 모듈과 torch 라이브러리의 nn 모듈을 임포트합니다. 또한 d2l 패키지에서 torch 모듈을 가져와 별칭을 d2l로 설정합니다.
batch_size, max_window_size, num_noise_words = 512, 5, 5:
- 미니배치 크기(batch_size), 최대 윈도우 크기(max_window_size), 부정적 샘플링 개수(num_noise_words)를 각각 512, 5, 5로 설정합니다.
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size, num_noise_words):
- 앞서 설정한 파라미터를 이용하여 d2l.load_data_ptb 함수를 호출하여 PTB 데이터셋을 미니배치 형태로 로드합니다. data_iter는 데이터 로더를 나타내며, vocab은 단어 사전을 나타냅니다.

이 코드는 스킵-그램 모델 학습에 필요한 데이터를 불러오고 설정하는 과정을 보여주고 있습니다.

15.4.1. The Skip-Gram Model

We implement the skip-gram model by using embedding layers and batch matrix multiplications. First, let’s review how embedding layers work.

임베딩 레이어와 배치 행렬 곱셈을 사용하여 스킵 그램 모델을 구현합니다. 먼저 임베딩 레이어의 작동 방식을 살펴보겠습니다.

15.4.1.1. Embedding Layer

As described in Section 10.7, an embedding layer maps a token’s index to its feature vector. The weight of this layer is a matrix whose number of rows equals to the dictionary size (input_dim) and number of columns equals to the vector dimension for each token (output_dim). After a word embedding model is trained, this weight is what we need.

섹션 10.7에 설명된 대로 임베딩 레이어는 토큰의 인덱스를 해당 특징 벡터에 매핑합니다. 이 레이어의 가중치는 행 수가 사전 크기(input_dim)와 같고 열 수가 각 토큰의 벡터 차원(output_dim)과 동일한 행렬입니다. 단어 임베딩 모델이 훈련된 후에는 이 가중치가 우리에게 필요한 것입니다.

embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
print(f'Parameter embedding_weight ({embed.weight.shape}, '
      f'dtype={embed.weight.dtype})')

이 코드는 임베딩 층을 생성하고 해당 임베딩 층의 가중치(weight)의 형태와 데이터 타입을 출력하는 과정을 보여주고 있습니다.

embed = nn.Embedding(num_embeddings=20, embedding_dim=4):
- nn.Embedding 클래스를 이용하여 임베딩 층(embed)을 생성합니다. num_embeddings는 임베딩할 단어의 개수, embedding_dim은 임베딩된 벡터의 차원을 나타냅니다. 이 코드에서는 20개의 단어를 4차원 벡터로 임베딩하는 임베딩 층을 생성합니다.
print(f'Parameter embedding_weight ({embed.weight.shape}, dtype={embed.weight.dtype})'):
- 생성한 임베딩 층의 가중치(weight)의 형태와 데이터 타입을 출력합니다. embed.weight는 임베딩 층의 가중치를 나타내며, shape를 통해 가중치의 크기, dtype를 통해 데이터 타입을 확인할 수 있습니다. 이 정보는 임베딩 층의 설정과 가중치를 확인하기 위한 용도로 사용됩니다.

이 코드는 임베딩 층을 생성하고 해당 층의 가중치의 형태와 데이터 타입을 출력하는 과정을 보여주고 있습니다.

The input of an embedding layer is the index of a token (word). For any token index i, its vector representation can be obtained from the ith row of the weight matrix in the embedding layer. Since the vector dimension (output_dim) was set to 4, the embedding layer returns vectors with shape (2, 3, 4) for a minibatch of token indices with shape (2, 3).

임베딩 레이어의 입력은 토큰(단어)의 인덱스입니다. 토큰 인덱스 i의 경우 임베딩 레이어에 있는 가중치 행렬의 i번째 행에서 벡터 표현을 얻을 수 있습니다. 벡터 차원(output_dim)이 4로 설정되었으므로 임베딩 레이어는 모양이 (2, 3)인 토큰 인덱스의 미니 배치에 대해 모양이 (2, 3, 4)인 벡터를 반환합니다.

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
embed(x)

이 코드는 생성한 임베딩 층에 입력 데이터를 넣어 임베딩된 결과를 계산하는 과정을 보여주고 있습니다.

x = torch.tensor([[1, 2, 3], [4, 5, 6]]):
- 입력 데이터인 2개의 시퀀스를 텐서 형태로 정의합니다. 각 시퀀스는 길이가 3인 정수 시퀀스입니다.
embed(x):
- 앞서 생성한 임베딩 층 embed에 입력 데이터 x를 넣어서 임베딩된 결과를 계산합니다. 입력 시퀀스에 포함된 각 정수는 해당 정수에 대응하는 임베딩 벡터로 변환됩니다.

임베딩 층을 통해 정수 시퀀스를 임베딩된 벡터로 변환하는 과정을 나타내고 있습니다

15.4.1.2. Defining the Forward Propagation

In the forward propagation, the input of the skip-gram model includes the center word indices center of shape (batch size, 1) and the concatenated context and noise word indices contexts_and_negatives of shape (batch size, max_len), where max_len is defined in Section 15.3.5. These two variables are first transformed from the token indices into vectors via the embedding layer, then their batch matrix multiplication (described in Section 11.3.2.2) returns an output of shape (batch size, 1, max_len). Each element in the output is the dot product of a center word vector and a context or noise word vector.

순방향 전파에서 스킵 그램 모델의 입력에는 center word indices center of shape(배치 크기, 1)과 연결된 컨텍스트 및 노이즈 단어 인덱스 contexts_and_negatives shape(배치 크기, max_len)이 포함됩니다. 여기서 max_len은 다음에서 정의됩니다. 섹션 15.3.5. 이 두 변수는 먼저 임베딩 레이어를 통해 토큰 인덱스에서 벡터로 변환된 다음 배치 행렬 곱셈(섹션 11.3.2.2에 설명됨)이 모양(배치 크기, 1, max_len)의 출력을 반환합니다. 출력의 각 요소는 중심 단어 벡터와 문맥 또는 노이즈 단어 벡터의 내적입니다.

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    return pred

이 코드는 스킵-그램 모델의 예측값을 계산하는 함수를 정의하고 있습니다.

def skip_gram(center, contexts_and_negatives, embed_v, embed_u)::
- skip_gram 함수를 정의합니다. 이 함수는 중심 단어, 문맥 및 부정적 단어 조합, 그리고 중심 단어를 임베딩하는 embed_v와 문맥 및 부정적 단어들을 임베딩하는 embed_u를 인자로 받습니다.
v = embed_v(center):
- 주어진 중심 단어(center)를 임베딩 벡터로 변환합니다. embed_v를 이용하여 중심 단어를 임베딩합니다.
u = embed_u(contexts_and_negatives):
- 주어진 문맥 및 부정적 단어 조합(contexts_and_negatives)을 임베딩 벡터로 변환합니다. embed_u를 이용하여 문맥 및 부정적 단어들을 임베딩합니다.
pred = torch.bmm(v, u.permute(0, 2, 1)):
- 중심 단어 임베딩 벡터 v와 문맥 및 부정적 단어 임베딩 벡터 u 간의 행렬 곱을 계산하여 예측값(pred)을 생성합니다. 행렬 곱은 torch.bmm 함수를 이용하며, 중심 단어의 임베딩 벡터와 각 문맥 및 부정적 단어의 임베딩 벡터 간의 유사도를 나타냅니다.
return pred:
- 계산한 예측값을 반환합니다.

이 코드는 스킵-그램 모델의 예측값을 계산하는 함수를 정의하고 있습니다.

Let’s print the output shape of this skip_gram function for some example inputs.

몇 가지 예시 입력에 대해 이 Skip_gram 함수의 출력 형태를 인쇄해 보겠습니다.

skip_gram(torch.ones((2, 1), dtype=torch.long),
          torch.ones((2, 4), dtype=torch.long), embed, embed).shape

이 코드는 skip_gram 함수를 사용하여 스킵-그램 모델의 예측값을 계산하고, 계산된 예측값의 형태(shape)를 출력하는 과정을 보여주고 있습니다.

skip_gram(torch.ones((2, 1), dtype=torch.long), torch.ones((2, 4), dtype=torch.long), embed, embed):
- skip_gram 함수를 호출하여 중심 단어와 문맥 단어의 부정적 샘플들을 이용하여 예측값을 계산합니다. 여기서는 임의의 예시로 중심 단어를 1로, 문맥 및 부정적 단어를 모두 1로 설정하여 호출하였습니다. 이 때, embed 함수를 사용하여 단어들을 임베딩 벡터로 변환합니다.
.shape:
- 계산된 예측값의 형태(shape)를 확인하는 명령입니다.

이 코드는 skip_gram 함수를 호출하여 스킵-그램 모델의 예측값을 계산하고, 계산된 예측값의 형태(shape)를 출력하는 과정을 나타내고 있습니다

15.4.2. Training

Before training the skip-gram model with negative sampling, let’s first define its loss function.

네거티브 샘플링으로 스킵그램 모델을 훈련하기 전에 먼저 손실 함수를 정의해 보겠습니다.

15.4.2.1. Binary Cross-Entropy Loss

According to the definition of the loss function for negative sampling in Section 15.2.1, we will use the binary cross-entropy loss.

15.2.1절의 네거티브 샘플링에 대한 손실 함수 정의에 따라 이진 교차 엔트로피 손실을 사용합니다.

class SigmoidBCELoss(nn.Module):
    # Binary cross-entropy loss with masking
    def __init__(self):
        super().__init__()

    def forward(self, inputs, target, mask=None):
        out = nn.functional.binary_cross_entropy_with_logits(
            inputs, target, weight=mask, reduction="none")
        return out.mean(dim=1)

loss = SigmoidBCELoss()

이 코드는 마스킹을 적용한 이진 크로스 엔트로피 손실 함수를 정의하고 그 함수를 사용하는 과정을 보여주고 있습니다.

class SigmoidBCELoss(nn.Module)::
- SigmoidBCELoss 클래스를 정의합니다. 이 클래스는 PyTorch의 nn.Module을 상속하여 정의되었습니다.
def __init__(self)::
- SigmoidBCELoss 클래스의 초기화 메서드입니다. 별다른 초기화 작업은 없습니다.
def forward(self, inputs, target, mask=None)::
- forward 메서드는 손실의 계산을 수행합니다. inputs는 모델의 출력, target은 목표값을 나타내며, mask는 선택적으로 적용되는 마스크를 의미합니다.
out = nn.functional.binary_cross_entropy_with_logits(inputs, target, weight=mask, reduction="none"):
- nn.functional.binary_cross_entropy_with_logits 함수를 사용하여 이진 크로스 엔트로피 손실을 계산합니다. inputs는 모델의 출력, target은 목표값을 나타내며, mask는 선택적으로 적용되는 마스크를 의미합니다. reduction="none"으로 설정하여 요소별 손실 값을 계산합니다.
return out.mean(dim=1):
- 계산한 손실 값을 각 샘플에 대해 평균내어 반환합니다. dim=1은 각 샘플에 대한 평균을 구하는 축을 나타냅니다.
loss = SigmoidBCELoss():
- 정의한 SigmoidBCELoss 클래스의 인스턴스를 생성하여 loss 변수에 할당합니다.

이 코드는 마스킹을 적용한 이진 크로스 엔트로피 손실 함수를 정의하고 그 함수를 사용하는 과정을 나타내고 있습니다.

Recall our descriptions of the mask variable and the label variable in Section 15.3.5. The following calculates the binary cross-entropy loss for the given variables.

섹션 15.3.5의 마스크 변수와 라벨 변수에 대한 설명을 떠올려보세요. 다음은 주어진 변수에 대한 이진 교차 엔트로피 손실을 계산합니다.

pred = torch.tensor([[1.1, -2.2, 3.3, -4.4]] * 2)
label = torch.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)

Below shows how the above results are calculated (in a less efficient way) using the sigmoid activation function in the binary cross-entropy loss. We can consider the two outputs as two normalized losses that are averaged over non-masked predictions.

아래에서는 이진 교차 엔트로피 손실에서 시그모이드 활성화 함수를 사용하여 위 결과를 (덜 효율적인 방식으로) 계산하는 방법을 보여줍니다. 두 개의 출력을 마스크되지 않은 예측에 대해 평균을 낸 두 개의 정규화된 손실로 간주할 수 있습니다.

def sigmd(x):
    return -math.log(1 / (1 + math.exp(-x)))

print(f'{(sigmd(1.1) + sigmd(2.2) + sigmd(-3.3) + sigmd(4.4)) / 4:.4f}')
print(f'{(sigmd(-1.1) + sigmd(-2.2)) / 2:.4f}')

이 코드는 시그모이드 함수를 정의하고, 일부 값들에 대해 시그모이드 함수를 적용한 결과를 출력하는 과정을 보여주고 있습니다.

def sigmd(x)::
- sigmd 함수를 정의합니다. 이 함수는 시그모이드 함수를 구현한 것으로, 주어진 x에 대해 -math.log(1 / (1 + math.exp(-x))) 값을 반환합니다.
print(f'{(sigmd(1.1) + sigmd(2.2) + sigmd(-3.3) + sigmd(4.4)) / 4:.4f}'):
- 시그모이드 함수를 각각 1.1, 2.2, -3.3, 4.4에 대해 적용한 결과의 평균을 계산하여 소수점 4자리까지 출력합니다.
print(f'{(sigmd(-1.1) + sigmd(-2.2)) / 2:.4f}'):
- 시그모이드 함수를 각각 -1.1, -2.2에 대해 적용한 결과의 평균을 계산하여 소수점 4자리까지 출력합니다.

이 코드는 시그모이드 함수를 정의하고, 몇 가지 값들에 대해 이 함수를 적용하여 결과를 출력하는 과정을 나타내고 있습니다

Binary Cross-Entropy Loass란?

Binary Cross-Entropy Loss, often abbreviated as BCE Loss, is a commonly used loss function in machine learning and deep learning, particularly for binary classification tasks. It is used to measure the dissimilarity between the predicted probabilities and the true binary labels of a classification problem.

이진 교차 엔트로피 손실(Binary Cross-Entropy Loss), 줄여서 BCE 손실은 주로 이진 분류 작업에서 사용되는 흔히 쓰이는 손실 함수입니다. 이 함수는 예측된 확률과 실제 이진 레이블 간의 불일치를 측정하는 데 사용됩니다.

In a binary classification problem, each instance belongs to one of two classes, typically denoted as the positive class (1) and the negative class (0). The BCE Loss quantifies the difference between the predicted probabilities of belonging to the positive class and the actual binary labels. It's important to note that BCE Loss is specifically designed for binary classification and not suitable for multi-class classification problems.

이진 분류 작업에서 각 인스턴스는 일반적으로 긍정 클래스(1)와 부정 클래스(0) 중 하나에 속합니다. BCE 손실은 긍정 클래스에 속할 확률의 예측된 값과 실제 이진 레이블 간의 차이를 측정합니다. BCE 손실은 이진 분류에 특화된 것으로, 다중 클래스 분류 작업에는 적합하지 않습니다.

Mathematically, the BCE Loss for a single instance can be defined as follows:

수학적으로 하나의 인스턴스에 대한 BCE 손실은 다음과 같이 정의됩니다:

Where:

is the Binary Cross-Entropy Loss.
은 이진 교차 엔트로피 손실입니다.
is the true binary label (0 or 1) for the instance.
는 해당 인스턴스의 실제 이진 레이블(0 또는 1)입니다.
is the predicted probability of the positive class (i.e., the output of the model's sigmoid activation function).
는 긍정 클래스에 속할 확률의 예측된 값(즉, 모델의 시그모이드 활성화 함수의 출력)입니다.

The loss function is logarithmic in nature, and it penalizes the model more when the predicted probability () deviates from the true label (). The loss is symmetric in the sense that it treats errors of predicting the positive class and predicting the negative class equally.

이 손실 함수는 로그 형태를 띄며, 예측된 확률()이 실제 레이블()에서 얼마나 벗어나는지에 따라 모델을 처벌합니다. 이 손실은 긍정 클래스를 예측하거나 부정 클래스를 예측하는 오류를 동등하게 다루기 때문에 대칭적인 손실입니다.

During training, the goal is to minimize the BCE Loss across all instances in the training dataset. This is typically achieved using optimization algorithms like gradient descent or its variants.

훈련 중에 목표는 훈련 데이터 집합의 모든 인스턴스에 대해 BCE 손실을 최소화하는 것입니다. 이는 일반적으로 경사 하강법 또는 그 변형을 사용하여 달성됩니다.

In summary, Binary Cross-Entropy Loss is a widely used loss function for binary classification problems. It quantifies the difference between predicted probabilities and true binary labels, encouraging the model to improve its predictions and classify instances accurately.

요약하면, 이진 교차 엔트로피 손실은 이진 분류 작업에서 널리 사용되는 손실 함수입니다. 이는 예측된 확률과 실제 이진 레이블 간의 차이를 측정하여 모델이 예측을 향상시키고 인스턴스를 정확하게 분류하도록 유도합니다.

15.4.2.2. Initializing Model Parameters

We define two embedding layers for all the words in the vocabulary when they are used as center words and context words, respectively. The word vector dimension embed_size is set to 100.

우리는 어휘의 모든 단어가 각각 중심 단어와 문맥 단어로 사용될 때 두 개의 임베딩 레이어를 정의합니다. 단어 벡터 차원 embed_size는 100으로 설정됩니다.

embed_size = 100
net = nn.Sequential(nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size),
                    nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size))

이 코드는 임베딩 층을 포함하는 신경망 모델을 생성하는 과정을 나타내고 있습니다.

embed_size = 100:
- 임베딩 벡터의 차원 크기를 100으로 설정합니다.
net = nn.Sequential(...):
- nn.Sequential을 사용하여 순차적으로 레이어를 쌓는 모델을 정의합니다.
nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_size):
- 첫 번째 임베딩 층을 생성합니다. num_embeddings은 단어 사전(vocab)의 크기, embedding_dim은 임베딩 벡터의 차원 크기를 나타냅니다.
nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_size):
- 두 번째 임베딩 층을 생성합니다. 이 층도 첫 번째 임베딩 층과 동일한 설정을 가집니다.

위 코드는 두 개의 임베딩 층을 갖는 신경망 모델을 생성하는 과정을 보여줍니다. 이 모델은 단어를 임베딩 벡터로 변환하는 두 개의 임베딩 층을 포함하고 있습니다.

15.4.2.3. Defining the Training Loop

The training loop is defined below. Because of the existence of padding, the calculation of the loss function is slightly different compared to the previous training functions.

훈련 루프는 아래에 정의되어 있습니다. 패딩이 존재하기 때문에 손실 함수 계산은 이전 훈련 함수와 약간 다릅니다.

def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
    def init_weights(module):
        if type(module) == nn.Embedding:
            nn.init.xavier_uniform_(module.weight)
    net.apply(init_weights)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs])
    # Sum of normalized losses, no. of normalized losses
    metric = d2l.Accumulator(2)
    for epoch in range(num_epochs):
        timer, num_batches = d2l.Timer(), len(data_iter)
        for i, batch in enumerate(data_iter):
            optimizer.zero_grad()
            center, context_negative, mask, label = [
                data.to(device) for data in batch]

            pred = skip_gram(center, context_negative, net[0], net[1])
            l = (loss(pred.reshape(label.shape).float(), label.float(), mask)
                     / mask.sum(axis=1) * mask.shape[1])
            l.sum().backward()
            optimizer.step()
            metric.add(l.sum(), l.numel())
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')

이 코드는 스킵-그램 모델을 학습하는 함수를 정의하고 있습니다.

def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu())::
- train 함수를 정의합니다. 이 함수는 스킵-그램 모델을 학습하는데 필요한 여러 설정값들과 데이터를 받습니다.
def init_weights(module)::
- 초기화 함수 init_weights를 정의합니다. 이 함수는 네트워크 모델의 가중치 초기화를 수행합니다.
net.apply(init_weights):
- 네트워크 모델의 가중치를 초기화합니다.
net = net.to(device):
- 네트워크 모델을 지정한 디바이스로 이동합니다.
optimizer = torch.optim.Adam(net.parameters(), lr=lr):
- Adam 옵티마이저를 생성합니다.
animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, num_epochs]):
- 학습 과정을 애니메이션으로 표시하기 위한 Animator 객체를 생성합니다.
metric = d2l.Accumulator(2):
- 손실값을 누적하기 위한 Accumulator 객체를 생성합니다.
중첩된 반복문:
- 주어진 에폭 수 만큼 반복하면서 학습을 수행합니다. 내부 반복문은 데이터 배치마다 반복하며 학습을 진행합니다.
optimizer.zero_grad():
- 옵티마이저의 기울기를 초기화합니다.
데이터 전처리:
- 배치 내 데이터들을 지정한 디바이스로 이동합니다.
pred = skip_gram(center, context_negative, net[0], net[1]):
- skip_gram 함수를 사용하여 스킵-그램 모델의 예측값을 계산합니다.
손실 계산:
- 계산된 예측값과 실제 레이블을 이용하여 손실을 계산합니다.
역전파 및 가중치 업데이트:
- 손실을 이용하여 역전파를 수행하고 가중치를 업데이트합니다.
매 에폭 종료 후:
- 애니메이션에 손실값을 추가하여 학습 과정을 시각화합니다.
학습 완료 후:
- 학습이 완료된 후에는 최종 손실값과 학습 속도를 출력합니다.

이 코드는 스킵-그램 모델을 학습하는 함수를 정의하고 있습니다.

Now we can train a skip-gram model using negative sampling.

이제 네거티브 샘플링을 사용하여 스킵그램 모델을 훈련할 수 있습니다.

lr, num_epochs = 0.002, 5
train(net, data_iter, lr, num_epochs)

이 코드는 미리 정의된 train 함수를 사용하여 스킵-그램 모델을 학습하는 과정을 나타내고 있습니다.

lr, num_epochs = 0.002, 5:
- 학습률(lr)을 0.002로, 에폭 수(num_epochs)를 5로 설정합니다.
train(net, data_iter, lr, num_epochs):
- 정의된 train 함수를 호출하여 스킵-그램 모델을 학습합니다. net은 학습할 모델, data_iter는 학습 데이터를 제공하는 데이터 반복자, lr은 학습률, num_epochs은 학습 에폭 수를 의미합니다.

즉, 이 코드는 미리 정의된 학습 함수 train을 사용하여 주어진 학습 데이터와 설정값으로 스킵-그램 모델을 학습하는 과정을 나타내고 있습니다.

loss 0.410, 223485.0 tokens/sec on cuda:0

15.4.3. Applying Word Embeddings

After training the word2vec model, we can use the cosine similarity of word vectors from the trained model to find words from the dictionary that are most semantically similar to an input word.

word2vec 모델을 훈련한 후 훈련된 모델의 단어 벡터의 코사인 유사성을 사용하여 입력 단어와 의미상 가장 유사한 사전의 단어를 찾을 수 있습니다.

def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) *
                                      torch.sum(x * x) + 1e-9)
    topk = torch.topk(cos, k=k+1)[1].cpu().numpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print(f'cosine sim={float(cos[i]):.3f}: {vocab.to_tokens(i)}')

get_similar_tokens('chip', 3, net[0])

이 코드는 특정 단어에 대해 유사한 단어를 찾는 함수 get_similar_tokens을 정의하고, 이 함수를 사용하여 주어진 단어와 유사한 단어를 출력하는 과정을 나타내고 있습니다.

def get_similar_tokens(query_token, k, embed)::
- get_similar_tokens 함수를 정의합니다. 이 함수는 주어진 단어와 유사한 단어를 찾아 출력합니다.
W = embed.weight.data:
- 임베딩 층의 가중치 정보를 가져옵니다.
x = W[vocab[query_token]]:
- 주어진 단어의 임베딩 벡터 x를 가져옵니다.
cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) * torch.sum(x * x) + 1e-9):
- 코사인 유사도를 계산합니다. 임베딩 벡터들 간의 코사인 유사도를 계산하며, 수치적 안정성을 위해 1e-9를 더해줍니다.
topk = torch.topk(cos, k=k+1)[1].cpu().numpy().astype('int32'):
- 코사인 유사도에서 가장 높은 상위 k+1개의 값을 가져옵니다. topk에는 상위 값의 인덱스가 저장되어 있습니다.
for i in topk[1:]::
- 주어진 단어를 제외한 상위 유사한 단어들을 순회하면서 출력합니다.
print(f'cosine sim={float(cos[i]):.3f}: {vocab.to_tokens(i)}'):
- 각 유사한 단어의 코사인 유사도와 해당 단어를 출력합니다.
get_similar_tokens('chip', 3, net[0]):
- 'chip' 단어와 유사한 상위 3개의 단어를 찾아 출력합니다.

이 코드는 특정 단어와 유사한 단어를 찾아 출력하는 함수를 정의하고, 이 함수를 사용하여 'chip' 단어와 유사한 단어를 출력하는 과정을 보여주고 있습니다.

cosine sim=0.702: microprocessor
cosine sim=0.649: mips
cosine sim=0.643: intel

15.4.4. Summary

We can train a skip-gram model with negative sampling using embedding layers and the binary cross-entropy loss.
임베딩 레이어와 이진 교차 엔트로피 손실을 사용하여 네거티브 샘플링으로 스킵 그램 모델을 훈련할 수 있습니다.
Applications of word embeddings include finding semantically similar words for a given word based on the cosine similarity of word vectors.
단어 임베딩의 적용에는 단어 벡터의 코사인 유사성을 기반으로 특정 단어에 대해 의미상 유사한 단어를 찾는 것이 포함됩니다.

15.4.5. Exercises

Using the trained model, find semantically similar words for other input words. Can you improve the results by tuning hyperparameters?
When a training corpus is huge, we often sample context words and noise words for the center words in the current minibatch when updating model parameters. In other words, the same center word may have different context words or noise words in different training epochs. What are the benefits of this method? Try to implement this training method.

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

D2L - 15.10. Pretraining BERT (0)	2023.08.30
D2L - 15.9. The Dataset for Pretraining BERT (0)	2023.08.30
D2L - 15.8. Bidirectional Encoder Representations from Transformers (BERT) (0)	2023.08.30
D2L - 15.7. Word Similarity and Analogy (0)	2023.08.30
D2L - 15.6. Subword Embedding (0)	2023.08.30
D2L - 15.5. Word Embedding with Global Vectors (GloVe) (0)	2023.08.29
D2L - 15.3. The Dataset for Pretraining Word Embeddings (0)	2023.08.29
D2L- 15.2. Approximate Training (0)	2023.08.28
D2L- 15.1. Word Embedding (word2vec) (0)	2023.08.25
D2L - 15. Natural Language Processing: Pretraining (0)	2023.08.24

Dive into Deep Learning/D2L Natural language Processing

D2L - 15.3. The Dataset for Pretraining Word Embeddings

2023. 8. 29. 11:58 | Posted by 솔웅

15.3. The Dataset for Pretraining Word Embeddings — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

15.3. The Dataset for Pretraining Word Embeddings — Dive into Deep Learning 1.0.3 documentation

d2l.ai

15.3. The Dataset for Pretraining Word Embeddings

Now that we know the technical details of the word2vec models and approximate training methods, let’s walk through their implementations. Specifically, we will take the skip-gram model in Section 15.1 and negative sampling in Section 15.2 as an example. In this section, we begin with the dataset for pretraining the word embedding model: the original format of the data will be transformed into minibatches that can be iterated over during training.

이제 word2vec 모델의 기술적 세부 사항과 대략적인 훈련 방법을 알았으므로 구현을 살펴보겠습니다. 구체적으로, 섹션 15.1의 스킵 그램 모델과 섹션 15.2의 네거티브 샘플링을 예로 들어보겠습니다. 이 섹션에서는 단어 임베딩 모델을 사전 훈련하기 위한 데이터 세트부터 시작합니다. 데이터의 원래 형식은 훈련 중에 반복할 수 있는 미니 배치로 변환됩니다.

import collections
import math
import os
import random
import torch
from d2l import torch as d2l

이 코드는 PyTorch와 d2l(Dive into Deep Learning) 라이브러리의 일부 함수들을 사용하여 딥 러닝 모델을 구현하기 위한 환경을 설정하는 부분입니다.

import collections:
- collections 모듈을 가져옵니다. 이 모듈은 파이썬에서 컨테이너 데이터 타입을 보다 쉽게 다룰 수 있도록 도와주는 클래스들을 제공합니다.
import math:
- math 모듈을 가져옵니다. 이 모듈은 수학적인 연산을 수행하는 함수들을 제공합니다.
import os:
- os 모듈을 가져옵니다. 이 모듈은 운영 체제와 관련된 기능을 제공하여 파일 경로, 디렉토리 생성 등을 다루는 데 사용됩니다.
import random:
- random 모듈을 가져옵니다. 이 모듈은 난수 생성 및 관련된 함수를 제공합니다.
import torch:
- PyTorch 라이브러리를 가져옵니다. PyTorch는 딥 러닝 모델을 구현하고 훈련하는 데에 사용되는 강력한 라이브러리입니다.
from d2l import torch as d2l:
- d2l 라이브러리에서 PyTorch와 관련된 함수들을 가져와 d2l이라는 이름으로 사용하겠다는 의미입니다. 이 라이브러리는 "Dive into Deep Learning" 책의 코드와 예제를 포함하고 있는 라이브러리로, 딥 러닝 학습을 돕기 위해 만들어진 것입니다.

이 코드는 여러 모듈과 라이브러리를 가져와서 딥 러닝 모델을 구현하고 실행하기 위한 기반을 설정하는 것입니다.

15.3.1. Reading the Dataset

The dataset that we use here is Penn Tree Bank (PTB). This corpus is sampled from Wall Street Journal articles, split into training, validation, and test sets. In the original format, each line of the text file represents a sentence of words that are separated by spaces. Here we treat each word as a token.

여기서 사용하는 데이터세트는 PTB(Penn Tree Bank)입니다. 이 자료는 Wall Street Journal 기사에서 샘플링되었으며 훈련, 검증 및 테스트 세트로 구분됩니다. 원본 형식에서 텍스트 파일의 각 줄은 공백으로 구분된 단어 문장을 나타냅니다. 여기서는 각 단어를 토큰으로 처리합니다.

#@save
d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',
                       '319d85e578af0cdc590547f26231e4e31cdf1e42')

#@save
def read_ptb():
    """Load the PTB dataset into a list of text lines."""
    data_dir = d2l.download_extract('ptb')
    # Read the training set
    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:
        raw_text = f.read()
    return [line.split() for line in raw_text.split('\n')]

sentences = read_ptb()
f'# sentences: {len(sentences)}'

이 코드는 PTB 데이터셋을 로드하고 텍스트로 처리하는 부분을 보여주고 있습니다.

d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip', '319d85e578af0cdc590547f26231e4e31cdf1e42'):
- d2l.DATA_HUB 딕셔너리에 'ptb'라는 키와 해당 데이터셋의 URL과 해시 값을 저장합니다. 이렇게 함으로써 데이터를 다운로드하고 압축을 해제할 때 사용할 수 있습니다.
def read_ptb()::
- read_ptb 함수를 정의합니다. 이 함수는 PTB 데이터셋을 로드하고 텍스트를 줄 단위로 분리하여 리스트로 반환합니다.
data_dir = d2l.download_extract('ptb'):
- d2l.download_extract 함수를 사용하여 PTB 데이터셋을 다운로드하고 압축을 해제합니다. 데이터가 저장될 디렉토리 경로를 data_dir 변수에 저장합니다.
with open(os.path.join(data_dir, 'ptb.train.txt')) as f::
- PTB 데이터셋 내에 있는 'ptb.train.txt' 파일을 엽니다.
raw_text = f.read():
- 파일을 읽어서 raw_text 변수에 저장합니다.
return [line.split() for line in raw_text.split('\n')]:
- raw_text를 줄 단위로 분리한 후 각 줄을 공백으로 분리하여 단어 리스트로 만들어 반환합니다.
sentences = read_ptb():
- read_ptb 함수를 호출하여 PTB 데이터셋의 텍스트를 처리한 결과를 sentences 변수에 저장합니다.
f'# sentences: {len(sentences)}':
- 처리된 문장의 개수를 출력하는 문자열을 생성합니다. 이를 통해 문장의 수를 확인할 수 있습니다.

After reading the training set, we build a vocabulary for the corpus, where any word that appears less than 10 times is replaced by the “<unk>” token. Note that the original dataset also contains “<unk>” tokens that represent rare (unknown) words.

훈련 세트를 읽은 후 우리는 10번 미만으로 나타나는 모든 단어가 "<unk>" 토큰으로 대체되는 말뭉치에 대한 어휘를 구축합니다. 원본 데이터세트에는 희귀한(알 수 없는) 단어를 나타내는 “<unk>” 토큰도 포함되어 있습니다.

vocab = d2l.Vocab(sentences, min_freq=10)
f'vocab size: {len(vocab)}'

이 코드는 PTB 데이터셋에서 단어 사전을 생성하는 과정을 나타내고 있습니다.

vocab = d2l.Vocab(sentences, min_freq=10):
- d2l.Vocab 클래스를 사용하여 단어 사전(vocabulary)을 생성합니다. 이 때 sentences는 PTB 데이터셋에서 읽어온 문장들의 리스트이며, min_freq=10은 최소 빈도수가 10 이상인 단어만을 포함하도록 단어 사전을 구성하겠다는 설정입니다. 이렇게 함으로써 빈도가 낮은 희귀한 단어는 제외됩니다.
f'vocab size: {len(vocab)}':
- 생성된 단어 사전의 크기를 출력하는 문자열을 생성합니다. len(vocab)은 단어 사전에 포함된 단어의 수를 나타냅니다. 이를 통해 단어 사전의 크기를 확인할 수 있습니다.

이 코드는 PTB 데이터셋에서 단어 사전을 생성하고 그 크기를 출력하는 과정을 보여주고 있습니다.

15.3.2. Subsampling

Text data typically have high-frequency words such as “the”, “a”, and “in”: they may even occur billions of times in very large corpora. However, these words often co-occur with many different words in context windows, providing little useful signals. For instance, consider the word “chip” in a context window: intuitively its co-occurrence with a low-frequency word “intel” is more useful in training than the co-occurrence with a high-frequency word “a”. Moreover, training with vast amounts of (high-frequency) words is slow. Thus, when training word embedding models, high-frequency words can be subsampled (Mikolov et al., 2013). Specifically, each indexed word wi in the dataset will be discarded with probability where f(wi) is the ratio of the number of words wi to the total number of words in the dataset, and the constant t is a hyperparameter (10**−4 in the experiment). We can see that only when the relative frequency f(wi)>t can the (high-frequency) word wi be discarded, and the higher the relative frequency of the word, the greater the probability of being discarded.

텍스트 데이터에는 일반적으로 "the", "a" 및 "in"과 같은 빈도가 높은 단어가 있으며 매우 큰 말뭉치에서는 수십억 번 나타날 수도 있습니다. 그러나 이러한 단어는 컨텍스트 창에서 다양한 단어와 함께 나타나는 경우가 많아 유용한 신호를 거의 제공하지 않습니다. 예를 들어, 컨텍스트 창에서 "chip"이라는 단어를 생각해 보세요. 직관적으로 낮은 빈도의 단어 "intel"과의 동시 발생은 높은 빈도의 단어 "a"와의 동시 발생보다 훈련에 더 유용합니다. 더욱이, 방대한 양의 (빈도가 높은) 단어를 사용한 훈련은 느립니다. 따라서 단어 임베딩 모델을 훈련할 때 빈도가 높은 단어를 서브샘플링할 수 있습니다(Mikolov et al., 2013). 구체적으로, 데이터세트의 각 색인 단어 wi는 확률에 따라 삭제됩니다. 여기서 f(wi)는 데이터세트의 전체 단어 수에 대한 wi 단어 수의 비율이고 상수 t는 하이퍼파라미터(실험에서는 10**− 4). 상대도수 f(wi)>t가 되어야만 (고빈도) 단어 wi가 폐기될 수 있고, 단어의 상대도수가 높을수록 폐기될 확률이 높아지는 것을 알 수 있다.

#@save
def subsample(sentences, vocab):
    """Subsample high-frequency words."""
    # Exclude unknown tokens ('<unk>')
    sentences = [[token for token in line if vocab[token] != vocab.unk]
                 for line in sentences]
    counter = collections.Counter([
        token for line in sentences for token in line])
    num_tokens = sum(counter.values())

    # Return True if `token` is kept during subsampling
    def keep(token):
        return(random.uniform(0, 1) <
               math.sqrt(1e-4 / counter[token] * num_tokens))

    return ([[token for token in line if keep(token)] for line in sentences],
            counter)

subsampled, counter = subsample(sentences, vocab)

이 코드는 높은 빈도 단어를 하위 샘플링하는 과정을 나타내고 있습니다.

def subsample(sentences, vocab)::
- subsample 함수를 정의합니다. 이 함수는 입력으로 문장들의 리스트 sentences와 단어 사전 vocab을 받습니다.
sentences = [[token for token in line if vocab[token] != vocab.unk] for line in sentences]:
- sentences 내에서 미지의 토큰('<unk>')을 제외하고 단어들을 추출하여 각 문장을 구성합니다.
counter = collections.Counter([...]):
- 모든 문장에서 각 단어의 빈도수를 계산하여 counter에 저장합니다.
num_tokens = sum(counter.values()):
- counter에 저장된 모든 단어의 빈도수를 합하여 총 토큰 수를 계산합니다.
def keep(token)::
- 하위 샘플링 중에 특정 단어 token을 유지할지 여부를 결정하는 함수를 정의합니다. 이 함수는 단어의 빈도수와 총 토큰 수에 기반하여 확률적으로 결정됩니다.
return ([[token for token in line if keep(token)] for line in sentences], counter):
- keep 함수에 따라 하위 샘플링된 문장들과 counter를 반환합니다.
subsampled, counter = subsample(sentences, vocab):
- subsample 함수를 호출하여 높은 빈도 단어를 하위 샘플링한 결과인 subsampled와 빈도수 정보인 counter를 얻습니다.

이 코드는 높은 빈도 단어를 하위 샘플링하여 데이터셋을 줄이는 과정을 나타내고 있습니다.

SubSampling 이란?

Subsampling, in the context of natural language processing, refers to a technique used to reduce the frequency of high-frequency words in a text corpus. This technique is often applied to address the issue of words that occur very frequently and provide limited contextual information, such as common words like "the," "and," "is," etc.

서브샘플링(Subsampling)은 자연어 처리의 맥락에서 텍스트 말뭉치 내에서 높은 빈도를 가진 단어의 빈도를 줄이는 기술을 말합니다. 이 기술은 "the," "and," "is"와 같은 일반적인 단어와 같이 빈번하게 나타나지만 제한된 문맥 정보를 제공하는 단어들의 문제를 해결하기 위해 자주 활용됩니다.

The idea behind subsampling is to randomly discard some instances of high-frequency words while keeping the overall distribution of words in the text relatively unchanged. By doing so, the resulting text data can be more balanced and contain a better representation of less frequent but potentially more informative words.

서브샘플링의 아이디어는 높은 빈도 단어의 일부 인스턴스를 무작위로 제거하면서 텍스트의 단어 분포를 상대적으로 유지하는 것입니다. 이를 통해 생성된 텍스트 데이터는 보다 균형적이며 덜 빈번하지만 더 유용한 정보를 제공할 수 있는 단어의 표현을 포함하게 됩니다.

Subsampling is particularly useful in word embedding methods like Word2Vec. High-frequency words can dominate the learning process, affecting the quality of word embeddings and the models' ability to capture subtle semantic relationships. Subsampling helps mitigate this by reducing the influence of these words while retaining the essence of the data.

서브샘플링은 특히 Word2Vec과 같은 단어 임베딩 방법에서 유용하게 활용됩니다. 높은 빈도 단어는 학습 과정을 지배할 수 있으며, 단어 임베딩의 품질과 모델이 미묘한 의미적 관계를 포착하는 능력에 영향을 줄 수 있습니다. 서브샘플링은 이러한 영향을 줄이면서도 데이터의 본질을 유지함으로써 이러한 단어들의 영향을 줄이는 데 도움을 줍니다.

In subsampling, the decision of whether to keep or discard a word is often based on its frequency. Words that appear very frequently are more likely to be discarded, while less frequent words have a higher chance of being retained. The specific criteria for subsampling, such as the threshold frequency for discarding a word, may vary depending on the application and the dataset.

서브샘플링에서 어떤 단어를 유지하거나 버릴지의 결정은 주로 그 빈도에 기반합니다. 매우 빈번하게 나타나는 단어는 버려질 가능성이 높으며, 덜 빈번한 단어는 보다 높은 유지 확률을 갖습니다. 서브샘플링의 구체적인 기준(예: 단어를 버릴 빈도의 임계값)은 응용 및 데이터셋에 따라 다양할 수 있습니다.

In summary, subsampling is a technique used to reduce the influence of high-frequency words in text data, improving the quality of word embeddings and enhancing the ability of models to capture meaningful relationships between words.

요약하면, 서브샘플링은 텍스트 데이터에서 높은 빈도 단어의 영향을 줄이는 기술로, 단어 임베딩의 품질을 향상시키고 모델이 단어 간 의미적인 관계를 더 잘 포착할 수 있는 능력을 강화하는 데 사용됩니다.

The following code snippet plots the histogram of the number of tokens per sentence before and after subsampling. As expected, subsampling significantly shortens sentences by dropping high-frequency words, which will lead to training speedup.

다음 코드 조각은 서브샘플링 전후의 문장당 토큰 수에 대한 히스토그램을 표시합니다. 예상한 대로 서브샘플링은 빈도가 높은 단어를 삭제하여 문장을 크게 단축하여 훈련 속도를 향상시킵니다.

d2l.show_list_len_pair_hist(['origin', 'subsampled'], '# tokens per sentence',
                            'count', sentences, subsampled);

이 코드는 두 개의 데이터 리스트에 대한 히스토그램을 그리는 d2l(Dive into Deep Learning) 라이브러리의 함수를 호출하는 부분입니다.

d2l.show_list_len_pair_hist(['origin', 'subsampled'], '# tokens per sentence', 'count', sentences, subsampled);:
- show_list_len_pair_hist 함수를 호출합니다. 이 함수는 두 개의 데이터 리스트에 대한 길이(또는 개수)에 관한 히스토그램을 그립니다.
- 첫 번째 인자 ['origin', 'subsampled']는 두 개의 데이터 리스트를 나타내는 이름입니다. 'origin'은 원본 데이터 리스트를, 'subsampled'는 서브샘플링된 데이터 리스트를 나타냅니다.
- 두 번째 인자 '# tokens per sentence'는 x축의 레이블로서 "문장 당 토큰 수"를 나타냅니다.
- 세 번째 인자 'count'는 y축의 레이블로서 "개수"를 나타냅니다.
- 네 번째와 다섯 번째 인자 sentences와 subsampled는 각각 원본 데이터 리스트와 서브샘플링된 데이터 리스트를 나타냅니다.

이 코드는 원본 데이터와 서브샘플링된 데이터 간의 문장당 토큰 수에 대한 히스토그램을 그리는 기능을 수행합니다.

For individual tokens, the sampling rate of the high-frequency word “the” is less than 1/20.

개별 토큰의 경우 빈도가 높은 단어 “the”의 샘플링 비율은 1/20 미만입니다.

def compare_counts(token):
    return (f'# of "{token}": '
            f'before={sum([l.count(token) for l in sentences])}, '
            f'after={sum([l.count(token) for l in subsampled])}')

compare_counts('the')

이 코드는 특정 단어의 빈도수를 비교하는 함수를 정의하고 호출하는 부분을 나타내고 있습니다.

def compare_counts(token)::
- compare_counts 함수를 정의합니다. 이 함수는 특정 단어의 빈도수를 비교하여 문자열 형태로 반환합니다. 함수는 token이라는 인자를 받습니다.
return (f'# of "{token}": ' ...):
- 함수의 반환값으로 사용될 문자열을 생성합니다.
- f'# of "{token}": '는 token의 이름을 포함하는 문자열을 나타냅니다.
f'before={sum([l.count(token) for l in sentences])}, ':
- 원본 데이터 리스트 sentences에서 token의 빈도수를 계산하고 합산한 값을 나타냅니다. 이를 문자열로 생성합니다.
f'after={sum([l.count(token) for l in subsampled])}':
- 서브샘플링된 데이터 리스트 subsampled에서 token의 빈도수를 계산하고 합산한 값을 나타냅니다. 이를 문자열로 생성합니다.
compare_counts('the'):
- compare_counts 함수를 호출하여 'the'라는 단어의 빈도수를 비교한 결과를 얻습니다.

이 코드는 특정 단어의 원본 데이터와 서브샘플링된 데이터에서의 빈도수를 비교하는 함수를 호출하여 'the'라는 단어의 빈도수를 비교한 결과를 출력합니다.

In contrast, low-frequency words “join” are completely kept.

반면, 빈도가 낮은 단어인 "join"은 완전히 유지됩니다.

compare_counts('join')

After subsampling, we map tokens to their indices for the corpus.

서브샘플링 후 토큰을 코퍼스의 인덱스에 매핑합니다.

corpus = [vocab[line] for line in subsampled]
corpus[:3]

이 코드는 서브샘플링된 데이터 리스트를 단어 사전에 매핑하여 새로운 말뭉치(corpus)를 생성하고, 이를 확인하는 부분을 나타내고 있습니다.

corpus = [vocab[line] for line in subsampled]:
- subsampled에 있는 각 문장(line)을 단어 사전(vocab)에 매핑하여 말뭉치(corpus)를 생성합니다. 각 문장의 단어들이 해당하는 단어 사전의 인덱스로 변환됩니다.
corpus[:3]:
- 생성된 말뭉치 corpus에서 처음부터 3개의 원소를 슬라이싱하여 확인합니다. 이를 통해 새로운 말뭉치에서 처음 3개의 문장에 해당하는 단어 인덱스들을 확인할 수 있습니다.

이 코드는 서브샘플링된 데이터 리스트를 단어 사전에 매핑하여 말뭉치(corpus)를 생성하고, 그 말뭉치에서 처음 3개의 문장에 해당하는 단어 인덱스들을 확인하는 기능을 수행합니다.

15.3.3. Extracting Center Words and Context Words

The following get_centers_and_contexts function extracts all the center words and their context words from corpus. It uniformly samples an integer between 1 and max_window_size at random as the context window size. For any center word, those words whose distance from it does not exceed the sampled context window size are its context words.

다음 get_centers_and_contexts 함수는 말뭉치에서 모든 중심 단어와 해당 문맥 단어를 추출합니다. 1과 max_window_size 사이의 정수를 컨텍스트 창 크기로 무작위로 균일하게 샘플링합니다. 중심 단어의 경우, 그로부터의 거리가 샘플링된 컨텍스트 창 크기를 초과하지 않는 단어는 해당 단어입니다.

#@save
def get_centers_and_contexts(corpus, max_window_size):
    """Return center words and context words in skip-gram."""
    centers, contexts = [], []
    for line in corpus:
        # To form a "center word--context word" pair, each sentence needs to
        # have at least 2 words
        if len(line) < 2:
            continue
        centers += line
        for i in range(len(line)):  # Context window centered at `i`
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, i - window_size),
                                 min(len(line), i + 1 + window_size)))
            # Exclude the center word from the context words
            indices.remove(i)
            contexts.append([line[idx] for idx in indices])
    return centers, contexts

이 코드는 스킵-그램(Skip-Gram) 모델에서 중심 단어와 문맥 단어를 반환하는 함수를 정의하고 있습니다.

def get_centers_and_contexts(corpus, max_window_size)::
- get_centers_and_contexts 함수를 정의합니다. 이 함수는 말뭉치(corpus)와 최대 윈도우 크기(max_window_size)를 인자로 받습니다.
if len(line) < 2::
- 현재 처리 중인 문장(line)의 길이가 2 미만이면 건너뜁니다. 스킵-그램 모델에서는 하나의 중심 단어와 그 주변 문맥 단어를 처리해야 하므로, 적어도 2개의 단어가 필요합니다.
centers += line:
- 중심 단어 리스트에 현재 문장의 모든 단어를 추가합니다.
for i in range(len(line))::
- 각 문장의 인덱스 i를 기준으로 문맥 창을 생성합니다.
window_size = random.randint(1, max_window_size):
- 현재 중심 단어에 대한 윈도우 크기를 무작위로 선택합니다. 최대 윈도우 크기까지의 랜덤한 값으로 설정됩니다.
indices = list(range(max(0, i - window_size), min(len(line), i + 1 + window_size))):
- 현재 문맥 창을 위한 인덱스 범위를 생성합니다. 중심 단어의 좌우로 window_size 범위 내의 인덱스를 선택합니다.
indices.remove(i):
- 중심 단어의 인덱스 i를 문맥 단어 리스트에서 제거합니다. 중심 단어와 문맥 단어는 동일하면 안 되기 때문입니다.
contexts.append([line[idx] for idx in indices]):
- 문맥 단어 리스트에 현재 문맥 창의 단어들을 추가합니다. 이 때 중심 단어를 제외한 인덱스들에 해당하는 단어들이 포함됩니다.
return centers, contexts:
- 중심 단어 리스트와 문맥 단어 리스트를 반환합니다.

이 코드는 스킵-그램 모델을 위한 중심 단어와 문맥 단어를 생성하는 함수를 구현하고 있습니다.

Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively. Let the maximum context window size be 2 and print all the center words and their context words.

다음으로, 각각 7개 단어와 3개 단어로 구성된 두 문장을 포함하는 인공 데이터 세트를 만듭니다. 최대 컨텍스트 창 크기를 2로 설정하고 모든 중앙 단어와 해당 컨텍스트 단어를 인쇄합니다.

tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)

이 코드는 작은 데이터셋을 생성하고, 해당 데이터셋에서 중심 단어와 문맥 단어를 얻어 출력하는 부분을 보여주고 있습니다.

tiny_dataset = [list(range(7)), list(range(7, 10))]:
- 작은 데이터셋 tiny_dataset을 생성합니다. 첫 번째 리스트는 0부터 6까지의 숫자를 포함하고, 두 번째 리스트는 7부터 9까지의 숫자를 포함합니다.
print('dataset', tiny_dataset):
- 생성한 작은 데이터셋 tiny_dataset을 출력합니다.
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2))::
- get_centers_and_contexts 함수를 호출하여 중심 단어와 문맥 단어를 얻습니다. 이때 윈도우 크기는 최대 2로 설정합니다.
- zip(*...)를 사용하여 중심 단어와 문맥 단어를 반복문에서 동시에 순회합니다. 각 반복에서 center는 중심 단어, context는 해당 중심 단어의 문맥 단어들을 나타냅니다.
print('center', center, 'has contexts', context):
- 현재 중심 단어와 해당 중심 단어의 문맥 단어들을 출력합니다.

이 코드는 작은 데이터셋에서 생성된 중심 단어와 문맥 단어를 출력하는 기능을 수행합니다.

When training on the PTB dataset, we set the maximum context window size to 5. The following extracts all the center words and their context words in the dataset.

PTB 데이터 세트를 훈련할 때 최대 컨텍스트 창 크기를 5로 설정했습니다. 다음은 데이터 세트의 모든 중심 단어와 해당 컨텍스트 단어를 추출합니다.

all_centers, all_contexts = get_centers_and_contexts(corpus, 5)
f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}'

이 코드는 모든 중심 단어와 해당 중심 단어의 문맥 단어들을 생성하고, 생성된 중심-문맥 쌍의 총 개수를 출력하는 부분을 보여주고 있습니다.

all_centers, all_contexts = get_centers_and_contexts(corpus, 5):
- get_centers_and_contexts 함수를 호출하여 모든 중심 단어와 그에 해당하는 문맥 단어들을 생성합니다. 이때 윈도우 크기는 최대 5로 설정합니다.
- all_centers에는 중심 단어 리스트가 저장되고, all_contexts에는 문맥 단어 리스트들의 리스트가 저장됩니다.
f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}':
- 생성된 중심-문맥 쌍의 총 개수를 문자열 형태로 출력합니다.
- sum([len(contexts) for contexts in all_contexts])는 all_contexts에 저장된 각 문맥 단어 리스트의 길이를 모두 합하여 총 중심-문맥 쌍의 개수를 계산합니다.

이 코드는 모든 중심 단어와 그에 해당하는 문맥 단어들을 생성하고, 생성된 중심-문맥 쌍의 총 개수를 출력하는 기능을 수행합니다.

15.3.4. Negative Sampling

We use negative sampling for approximate training. To sample noise words according to a predefined distribution, we define the following RandomGenerator class, where the (possibly unnormalized) sampling distribution is passed via the argument sampling_weights.

우리는 대략적인 훈련을 위해 음성 샘플링을 사용합니다. 사전 정의된 분포에 따라 노이즈 단어를 샘플링하기 위해 다음 RandomGenerator 클래스를 정의합니다. 여기서 (정규화되지 않은) 샘플링 분포는 sampling_weights 인수를 통해 전달됩니다.

#@save
class RandomGenerator:
    """Randomly draw among {1, ..., n} according to n sampling weights."""
    def __init__(self, sampling_weights):
        # Exclude
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Cache `k` random sampling results
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]

이 코드는 n개의 샘플링 가중치에 따라 {1, ..., n} 중에서 무작위로 추출하는 클래스를 정의하고 있습니다.

class RandomGenerator::
- RandomGenerator 클래스를 정의합니다. 이 클래스는 n개의 샘플링 가중치에 따라 무작위로 추출하는 기능을 제공합니다.
def __init__(self, sampling_weights)::
- RandomGenerator 클래스의 초기화 메서드입니다. 샘플링 가중치를 인자로 받습니다.
- self.population은 1부터 샘플링 가중치 개수까지의 정수 리스트입니다.
- self.sampling_weights는 입력받은 샘플링 가중치를 저장합니다.
- self.candidates는 추출한 후보 값들을 저장하는 리스트입니다.
- self.i는 추출된 후보 값들 중 현재 사용 중인 값을 나타냅니다.
def draw(self)::
- 추출 결과를 반환하는 메서드입니다.
- self.i가 self.candidates의 길이와 같다면, 새로운 무작위 샘플링 결과를 10000번 추출하여 self.candidates에 저장합니다.
- self.i를 1 증가시키고, self.candidates[self.i - 1] 값을 반환합니다.

이 코드는 샘플링 가중치에 따라 무작위로 값을 추출하는 RandomGenerator 클래스를 정의하고 있습니다.

For example, we can draw 10 random variables X among indices 1, 2, and 3 with sampling probabilities P(X=1)=2/9,P(X=2)=3/9, and P(X=3)=4/9 as follows.

예를 들어, 샘플링 확률 P(X=1)=2/9,P(X=2)=3/9, 그리고 P(X=3)=4/9 을 사용하여 인덱스 1, 2, 3 중에서 10개의 확률 변수 X를 추출할 수 있습니다

generator = RandomGenerator([2, 3, 4])
[generator.draw() for _ in range(10)]

이 코드는 샘플링 가중치를 사용하여 무작위로 값을 추출하는 RandomGenerator 객체를 생성하고, 해당 객체를 이용해 값을 10번 추출하는 부분을 보여주고 있습니다.

generator = RandomGenerator([2, 3, 4]):
- RandomGenerator 클래스의 인스턴스인 generator를 생성합니다. 샘플링 가중치로 [2, 3, 4]를 사용합니다. 이 가중치에 따라 1, 2, 3이 선택될 확률이 각각 1/2, 1/3, 1/4가 됩니다.
[generator.draw() for _ in range(10)]:
- generator에서 draw 메서드를 호출하여 값을 10번 추출합니다. 추출된 값들은 리스트에 저장됩니다.
- _는 반복문에서 사용하지 않는 값에 대한 관용적인 표현입니다. 따라서 10번 반복되지만 추출된 값들은 사용되지 않습니다.

이 코드는 샘플링 가중치에 따라 RandomGenerator 객체에서 값을 10번 추출하고, 추출된 값을 리스트로 저장하는 기능을 수행합니다

For a pair of center word and context word, we randomly sample K (5 in the experiment) noise words. According to the suggestions in the word2vec paper, the sampling probability P(w) of a noise word w is set to its relative frequency in the dictionary raised to the power of 0.75 (Mikolov et al., 2013).

중심 단어와 문맥 단어 쌍에 대해 K개(실험에서는 5개)의 노이즈 단어를 무작위로 샘플링합니다. word2vec 논문의 제안에 따르면, 의미 없는 단어 w의 샘플링 확률 P(w)는 사전의 상대 빈도로 0.75승으로 설정됩니다(Mikolov et al., 2013).

#@save
def get_negatives(all_contexts, vocab, counter, K):
    """Return noise words in negative sampling."""
    # Sampling weights for words with indices 1, 2, ... (index 0 is the
    # excluded unknown token) in the vocabulary
    sampling_weights = [counter[vocab.to_tokens(i)]**0.75
                        for i in range(1, len(vocab))]
    all_negatives, generator = [], RandomGenerator(sampling_weights)
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            neg = generator.draw()
            # Noise words cannot be context words
            if neg not in contexts:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

all_negatives = get_negatives(all_contexts, vocab, counter, 5)

이 코드는 부정적 샘플링에 사용할 노이즈 단어를 생성하는 함수를 정의하고 있습니다.

def get_negatives(all_contexts, vocab, counter, K)::
- get_negatives 함수를 정의합니다. 이 함수는 부정적 샘플링에 사용할 노이즈 단어를 반환합니다. 인자로 문맥 단어들의 리스트(all_contexts), 단어 사전(vocab), 빈도수 카운터(counter), 부정적 샘플링의 개수(K)를 받습니다.
sampling_weights = [counter[vocab.to_tokens(i)]**0.75 ...]:
- 단어 사전의 각 단어에 대한 샘플링 가중치를 계산합니다. 가중치는 해당 단어의 빈도수를 0.75 제곱한 값으로 계산됩니다.
all_negatives, generator = [], RandomGenerator(sampling_weights):
- 부정적 샘플링에 사용할 노이즈 단어들을 저장할 리스트인 all_negatives를 생성하고, 랜덤 값을 생성하는 RandomGenerator 객체를 생성합니다. 이 객체는 위에서 계산한 샘플링 가중치를 기반으로 값을 무작위로 추출합니다.
for contexts in all_contexts::
- 모든 문맥 단어 리스트에 대해 반복합니다.
negatives = []:
- 현재 문맥 단어들에 대한 노이즈 단어들을 저장할 리스트인 negatives를 초기화합니다.
while len(negatives) < len(contexts) * K::
- 노이즈 단어의 개수가 문맥 단어 개수 * K보다 작을 때까지 반복합니다.
neg = generator.draw():
- 랜덤 생성기에서 값을 추출하여 neg에 저장합니다.
if neg not in contexts::
- 추출한 노이즈 단어 neg가 현재 문맥 단어들에 속하지 않으면 다음을 수행합니다.
negatives.append(neg):
- 현재 노이즈 단어를 negatives 리스트에 추가합니다.
all_negatives.append(negatives):
- 모든 노이즈 단어 리스트를 all_negatives 리스트에 추가합니다.
return all_negatives:
- 모든 노이즈 단어 리스트를 반환합니다.
all_negatives = get_negatives(all_contexts, vocab, counter, 5):
- get_negatives 함수를 호출하여 모든 문맥 단어들에 대한 노이즈 단어 리스트를 생성합니다. 부정적 샘플링 개수는 5로 설정됩니다.

이 코드는 부정적 샘플링에 사용할 노이즈 단어를 생성하는 함수를 구현하고 있습니다.

15.3.5. Loading Training Examples in Minibatches

After all the center words together with their context words and sampled noise words are extracted, they will be transformed into minibatches of examples that can be iteratively loaded during training.

모든 중심 단어와 해당 문맥 단어 및 샘플링된 의미 없는 단어가 추출된 후에는 훈련 중에 반복적으로 로드할 수 있는 예제의 미니 배치로 변환됩니다.

In a minibatch, the ith example includes a center word and its ni context words and mi noise words. Due to varying context window sizes, ni+mi varies for different i. Thus, for each example we concatenate its context words and noise words in the contexts_negatives variable, and pad zeros until the concatenation length reaches maxini+mi (max_len). To exclude paddings in the calculation of the loss, we define a mask variable masks. There is a one-to-one correspondence between elements in masks and elements in contexts_negatives, where zeros (otherwise ones) in masks correspond to paddings in contexts_negatives.

미니배치에서 i번째 예제에는 중심 단어와 해당 단어의 ni 문맥 단어 및 mi 노이즈 단어가 포함됩니다. 다양한 컨텍스트 창 크기로 인해 ni+mi는 i에 따라 다릅니다. 따라서 각 예에 대해 contexts_negatives 변수에서 해당 컨텍스트 단어와 노이즈 단어를 연결하고 연결 길이가 maxini+mi(max_len)에 도달할 때까지 0을 채웁니다. 손실 계산에서 패딩을 제외하기 위해 마스크 변수 마스크를 정의합니다. 마스크의 요소와 contexts_negatives의 요소 사이에는 일대일 대응이 있습니다. 여기서 마스크의 0(그렇지 않은 경우 1)은 contexts_negatives의 패딩에 해당합니다.

To distinguish between positive and negative examples, we separate context words from noise words in contexts_negatives via a labels variable. Similar to masks, there is also a one-to-one correspondence between elements in labels and elements in contexts_negatives, where ones (otherwise zeros) in labels correspond to context words (positive examples) in contexts_negatives.

긍정적인 예와 부정적인 예를 구별하기 위해 labels 변수를 통해 contexts_negatives의 의미 없는 단어와 컨텍스트 단어를 분리합니다. 마스크와 마찬가지로 레이블의 요소와 contexts_negatives의 요소 사이에는 일대일 대응도 있습니다. 여기서 레이블의 1(그렇지 않으면 0)은 contexts_negatives의 문맥 단어(긍정적 예)에 해당합니다.

The above idea is implemented in the following batchify function. Its input data is a list with length equal to the batch size, where each element is an example consisting of the center word center, its context words context, and its noise words negative. This function returns a minibatch that can be loaded for calculations during training, such as including the mask variable.

위의 아이디어는 다음 배치화 기능으로 구현됩니다. 입력 데이터는 배치 크기와 길이가 같은 목록입니다. 여기서 각 요소는 중심 단어 center, 문맥 단어 context 및 노이즈 단어 negative로 구성된 예입니다. 이 함수는 훈련 중에 마스크 변수를 포함하는 등의 계산을 위해 로드할 수 있는 미니배치를 반환합니다.

#@save
def batchify(data):
    """Return a minibatch of examples for skip-gram with negative sampling."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(
        contexts_negatives), torch.tensor(masks), torch.tensor(labels))

이 코드는 부정적 샘플링을 이용한 스킵-그램 모델을 위한 미니배치 데이터를 생성하는 함수를 정의하고 있습니다.

def batchify(data)::
- batchify 함수를 정의합니다. 이 함수는 부정적 샘플링을 이용한 스킵-그램 모델을 위한 미니배치 데이터를 생성합니다. 인자로 데이터(data)를 받습니다.
max_len = max(len(c) + len(n) for _, c, n in data):
- 모든 데이터에서 중심 단어, 문맥 단어, 부정적 단어를 합친 길이의 최댓값(max_len)을 계산합니다.
centers, contexts_negatives, masks, labels = [], [], [], []:
- 중심 단어, 문맥 및 부정적 단어 조합, 마스크, 레이블을 저장할 리스트들을 초기화합니다.
for center, context, negative in data::
- 모든 데이터에 대해 반복합니다.
cur_len = len(context) + len(negative):
- 현재 문맥 단어와 부정적 단어 조합의 길이(cur_len)를 계산합니다.
centers += [center], contexts_negatives += [context + negative + [0] * (max_len - cur_len)], masks += [[1] * cur_len + [0] * (max_len - cur_len)], labels += [[1] * len(context) + [0] * (max_len - len(context))]:
- 중심 단어, 문맥 및 부정적 단어 조합, 마스크, 레이블을 각각 해당 리스트에 추가합니다.
- 추가할 때에는 길이를 맞추기 위해 부족한 부분은 0으로 채워 넣습니다.
return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(contexts_negatives), torch.tensor(masks), torch.tensor(labels)):
- 생성된 데이터를 PyTorch 텐서로 변환하여 반환합니다. 중심 단어 텐서는 형태를 변환하여 열 벡터로 만듭니다. 이 때 중심 단어, 문맥 및 부정적 단어 조합, 마스크, 레이블이 순서대로 반환됩니다.

이 코드는 부정적 샘플링을 이용한 스킵-그램 모델을 위한 미니배치 데이터를 생성하는 함수를 구현하고 있습니다.

Let’s test this function using a minibatch of two examples.

두 가지 예제의 미니 배치를 사용하여 이 기능을 테스트해 보겠습니다.

x_1 = (1, [2, 2], [3, 3, 3, 3])
x_2 = (1, [2, 2, 2], [3, 3])
batch = batchify((x_1, x_2))

names = ['centers', 'contexts_negatives', 'masks', 'labels']
for name, data in zip(names, batch):
    print(name, '=', data)

이 코드는 두 개의 예시 데이터 (x_1, x_2)를 이용하여 스킵-그램 모델을 위한 미니배치 데이터를 생성하고, 생성된 데이터의 각 요소를 출력하는 부분을 보여주고 있습니다.

x_1 = (1, [2, 2], [3, 3, 3, 3]), x_2 = (1, [2, 2, 2], [3, 3]):
- x_1과 x_2는 각각 중심 단어, 문맥 단어, 부정적 단어들을 튜플 형태로 저장한 예시 데이터입니다.
batch = batchify((x_1, x_2)):
- (x_1, x_2)를 인자로하여 batchify 함수를 호출하여 미니배치 데이터를 생성합니다. 이 데이터는 batch에 저장됩니다.
names = ['centers', 'contexts_negatives', 'masks', 'labels']:
- 생성된 미니배치 데이터의 각 요소에 대한 이름을 나타내는 리스트 names을 생성합니다.
for name, data in zip(names, batch)::
- names 리스트와 batch의 데이터를 동시에 순회하면서 반복합니다.
print(name, '=', data):
- 각 요소의 이름과 해당 데이터를 출력합니다.

이 코드는 스킵-그램 모델을 위한 미니배치 데이터 생성 및 출력 과정을 보여주고 있습니다.

15.3.6. Putting It All Together

Last, we define the load_data_ptb function that reads the PTB dataset and returns the data iterator and the vocabulary.

마지막으로 PTB 데이터 세트를 읽고 데이터 반복자와 어휘를 반환하는 load_data_ptb 함수를 정의합니다.

#@save
def load_data_ptb(batch_size, max_window_size, num_noise_words):
    """Download the PTB dataset and then load it into memory."""
    num_workers = d2l.get_dataloader_workers()
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled, counter = subsample(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(
        all_contexts, vocab, counter, num_noise_words)

    class PTBDataset(torch.utils.data.Dataset):
        def __init__(self, centers, contexts, negatives):
            assert len(centers) == len(contexts) == len(negatives)
            self.centers = centers
            self.contexts = contexts
            self.negatives = negatives

        def __getitem__(self, index):
            return (self.centers[index], self.contexts[index],
                    self.negatives[index])

        def __len__(self):
            return len(self.centers)

    dataset = PTBDataset(all_centers, all_contexts, all_negatives)

    data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True,
                                      collate_fn=batchify,
                                      num_workers=num_workers)
    return data_iter, vocab

이 코드는 PTB 데이터셋을 다운로드하고 메모리로 로드하여 스킵-그램 모델을 위한 학습 데이터를 생성하는 함수를 정의하고 있습니다.

num_workers = d2l.get_dataloader_workers():
- 데이터 로더의 워커 수를 설정합니다. 병렬 처리를 위한 워커 수입니다.
sentences = read_ptb():
- PTB 데이터셋을 읽어와 문장들의 리스트로 저장합니다.
vocab = d2l.Vocab(sentences, min_freq=10):
- 문장 리스트를 바탕으로 단어 사전을 생성합니다. 최소 빈도수가 10인 단어들만 단어 사전에 포함됩니다.
subsampled, counter = subsample(sentences, vocab):
- 문장 리스트를 서브샘플링하여 새로운 문장 리스트와 빈도수 카운터를 생성합니다.
corpus = [vocab[line] for line in subsampled]:
- 서브샘플링된 문장 리스트를 단어 사전의 인덱스로 변환하여 말뭉치(corpus)를 생성합니다.
all_centers, all_contexts = get_centers_and_contexts(corpus, max_window_size):
- 말뭉치를 바탕으로 중심 단어와 문맥 단어를 생성합니다. 최대 윈도우 크기는 max_window_size로 설정됩니다.
all_negatives = get_negatives(all_contexts, vocab, counter, num_noise_words):
- 문맥 단어를 기반으로 부정적 샘플링을 통해 노이즈 단어들을 생성합니다. 부정적 샘플링의 개수는 num_noise_words로 설정됩니다.
class PTBDataset(torch.utils.data.Dataset)::
- PyTorch의 데이터셋 클래스를 상속하여 PTB 데이터셋을 위한 사용자 정의 데이터셋 클래스인 PTBDataset을 정의합니다.
def __init__(self, centers, contexts, negatives)::
- PTBDataset 클래스의 초기화 메서드입니다. 중심 단어, 문맥 단어, 부정적 단어들을 인자로 받습니다.
def __getitem__(self, index)::
- 해당 인덱스의 중심 단어, 문맥 단어, 부정적 단어들을 반환합니다.
def __len__(self)::
- 데이터셋의 총 데이터 수를 반환합니다.
dataset = PTBDataset(all_centers, all_contexts, all_negatives):
- PTBDataset 클래스의 인스턴스인 데이터셋 객체 dataset을 생성합니다.
data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True, collate_fn=batchify, num_workers=num_workers):
- 생성한 데이터셋을 이용하여 데이터 로더를 생성합니다. 미니배치 크기는 batch_size로 설정되며, 데이터를 섞어서 가져오고, batchify 함수를 이용하여 미니배치 데이터를 처리합니다.
return data_iter, vocab:
- 생성한 데이터 로더와 단어 사전을 반환합니다.

이 코드는 PTB 데이터셋을 다운로드하고 메모리로 로드하여 스킵-그램 모델을 위한 학습 데이터를 생성하는 함수를 구현하고 있습니다.

Let’s print the first minibatch of the data iterator.

데이터 반복자의 첫 번째 미니 배치를 인쇄해 보겠습니다.

data_iter, vocab = load_data_ptb(512, 5, 5)
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break

이 코드는 PTB 데이터셋을 미니배치 형태로 로드하여 미니배치 데이터의 형태(shape)를 출력하는 과정을 보여주고 있습니다.

data_iter, vocab = load_data_ptb(512, 5, 5):
- load_data_ptb 함수를 호출하여 PTB 데이터셋을 미니배치 형태로 로드합니다. 미니배치 크기는 512, 최대 윈도우 크기는 5, 부정적 샘플링 개수는 5로 설정됩니다. 반환되는 data_iter는 데이터 로더, vocab은 단어 사전을 나타냅니다.
for batch in data_iter::
- 데이터 로더를 통해 미니배치를 순회합니다.
for name, data in zip(names, batch)::
- names 리스트와 현재 미니배치(batch)의 데이터를 동시에 순회하면서 반복합니다.
print(name, 'shape:', data.shape):
- 각 데이터의 이름과 형태(shape)를 출력합니다.
break:
- 첫 번째 미니배치만 확인하기 위해 break를 사용하여 반복을 종료합니다.

이 코드는 PTB 데이터셋을 미니배치 형태로 로드하여 미니배치 데이터의 형태(shape)를 출력하는 과정을 보여주고 있습니다.

15.3.7. Summary

High-frequency words may not be so useful in training. We can subsample them for speedup in training.
빈도가 높은 단어는 훈련에 그다지 유용하지 않을 수 있습니다. 훈련 속도를 높이기 위해 서브샘플링을 할 수 있습니다.
For computational efficiency, we load examples in minibatches. We can define other variables to distinguish paddings from non-paddings, and positive examples from negative ones.
계산 효율성을 위해 예제를 미니배치로 로드합니다. 패딩과 비패딩을 구별하고 긍정적인 예와 부정적인 예를 구별하기 위해 다른 변수를 정의할 수 있습니다.

15.3.8. Exercises¶

How does the running time of code in this section changes if not using subsampling?
The RandomGenerator class caches k random sampling results. Set k to other values and see how it affects the data loading speed.
What other hyperparameters in the code of this section may affect the data loading speed?

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

D2L - 15.10. Pretraining BERT (0)	2023.08.30
D2L - 15.9. The Dataset for Pretraining BERT (0)	2023.08.30
D2L - 15.8. Bidirectional Encoder Representations from Transformers (BERT) (0)	2023.08.30
D2L - 15.7. Word Similarity and Analogy (0)	2023.08.30
D2L - 15.6. Subword Embedding (0)	2023.08.30
D2L - 15.5. Word Embedding with Global Vectors (GloVe) (0)	2023.08.29
D2L - 15.4. Pretraining word2vec (0)	2023.08.29
D2L- 15.2. Approximate Training (0)	2023.08.28
D2L- 15.1. Word Embedding (word2vec) (0)	2023.08.25
D2L - 15. Natural Language Processing: Pretraining (0)	2023.08.24

Dive into Deep Learning/D2L Natural language Processing

D2L- 15.2. Approximate Training

2023. 8. 28. 00:07 | Posted by 솔웅

15.2. Approximate Training — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

15.2. Approximate Training — Dive into Deep Learning 1.0.3 documentation

d2l.ai

Recall our discussions in Section 15.1. The main idea of the skip-gram model is using softmax operations to calculate the conditional probability of generating a context word wo based on the given center word wc in (15.1.4), whose corresponding logarithmic loss is given by the opposite of (15.1.7).

섹션 15.1에서 논의한 내용을 기억해 보십시오. 스킵 그램 모델의 주요 아이디어는 소프트맥스 연산을 사용하여 문맥 단어 wo를 생성하는 조건부 확률을 계산하는 것입니다. 이것은 15.1.4에서 주어진 중심 단어 wc에 기초하여, 해당 로그 손실은 (15.1.7)의 반대에 의해 제공 됩니다.

Due to the nature of the softmax operation, since a context word may be anyone in the dictionary V, the opposite of (15.1.7) contains the summation of items as many as the entire size of the vocabulary. Consequently, the gradient calculation for the skip-gram model in (15.1.8) and that for the continuous bag-of-words model in (15.1.15) both contain the summation. Unfortunately, the computational cost for such gradients that sum over a large dictionary (often with hundreds of thousands or millions of words) is huge!

소프트맥스 연산의 특성상 문맥 단어는 사전 V에 있는 누구라도 될 수 있으므로 (15.1.7)의 반대는 전체 어휘 크기만큼 항목의 합을 포함한다. 결과적으로 (15.1.8)의 스킵 그램 모델에 대한 기울기 계산과 (15.1.15)의 연속 단어 백 모델에 대한 기울기 계산에는 모두 합산이 포함됩니다. 불행하게도, 큰 사전(종종 수십만 또는 수백만 단어로 구성됨)에 걸쳐 합산되는 그러한 그라디언트의 계산 비용은 엄청납니다!

15.2.1. Negative Sampling

Negative sampling modifies the original objective function. Given the context window of a center word wc, the fact that any (context) word wo comes from this context window is considered as an event with the probability modeled by

음수 샘플링 Negative sampling 은 원래 목적 함수를 수정합니다. 중앙 단어 wc의 컨텍스트 창이 주어지면 이 컨텍스트 창에서 임의의 (컨텍스트) 단어 wo가 나온다는 사실은 다음과 같이 모델링된 확률을 갖는 이벤트로 간주됩니다.

where σ (sigma) uses the definition of the sigmoid activation function:

여기서 σ는 시그모이드 활성화 함수activation function의 정의를 사용합니다.

Let’s begin by maximizing the joint probability of all such events in text sequences to train word embeddings. Specifically, given a text sequence of length T, denote by w**(t) the word at time step t and let the context window size be m, consider maximizing the joint probability

단어 임베딩을 훈련하기 위해 텍스트 시퀀스에서 이러한 모든 이벤트의 결합 확률을 최대화하는 것부터 시작해 보겠습니다. 구체적으로, 길이 T의 텍스트 시퀀스가 주어지면 시간 단계 t의 단어를 w**(t)로 표시하고 컨텍스트 창 크기를 m으로 두고 결합 확률을 최대화하는 것을 고려하십시오.

However, (15.2.3) only considers those events that involve positive examples. As a result, the joint probability in (15.2.3) is maximized to 1 only if all the word vectors are equal to infinity. Of course, such results are meaningless. To make the objective function more meaningful, negative sampling adds negative examples sampled from a predefined distribution.

그러나 (15.2.3)은 긍정적인 예를 포함하는 이벤트만 고려합니다. 결과적으로 (15.2.3)의 결합 확률은 모든 단어 벡터가 무한대인 경우에만 1로 최대화됩니다. 물론 그런 결과는 의미가 없다. 목적 함수를 더욱 의미 있게 만들기 위해 음수 샘플링은 사전 정의된 분포에서 샘플링된 음수 예를 추가합니다.

Denote by S the event that a context word wo comes from the context window of a center word wc. For this event involving wo, from a predefined distribution P(w) sample K noise words that are not from this context window. Denote by Nk the event that a noise word wk (k=1,...,K) does not come from the context window of wc. Assume that these events involving both the positive example and negative examples S,N1,...,Nk are mutually independent. Negative sampling rewrites the joint probability (involving only positive examples) in (15.2.3) as

문맥 단어 wo가 중심 단어 wc의 문맥 창에서 나오는 이벤트를 S로 표시합니다. wo와 관련된 이 이벤트의 경우 사전 정의된 분포 P(w)에서 이 컨텍스트 창에 속하지 않는 K개의 노이즈 단어를 샘플링합니다. 의미 없는 단어 wk(k=1,...,K)가 wc의 컨텍스트 창에서 나오지 않는 이벤트를 Nk로 나타냅니다. 긍정적인 예와 부정적인 예 S,N1,...,Nk를 모두 포함하는 이러한 이벤트가 상호 독립적이라고 가정합니다. 음수 샘플링은 (15.2.3)의 결합 확률(양수 예만 포함)을 다음과 같이 다시 작성합니다.

where the conditional probability is approximated through events S,N1,...,Nk:

여기서 조건부 확률은 사건 S,N1,...,Nk를 통해 근사됩니다.

Denote by it and ℎk the indices of a word w**(t) at time step t of a text sequence and a noise word wk, respectively. The logarithmic loss with respect to the conditional probabilities in (15.2.5) is

텍스트 시퀀스의 시간 단계 t에서 단어 w**(t)와 의미 없는 단어 wk의 인덱스를 각각 it과 ℎk로 표시합니다. (15.2.5)의 조건부 확률에 대한 로그 손실은 다음과 같습니다.

We can see that now the computational cost for gradients at each training step has nothing to do with the dictionary size, but linearly depends on K. When setting the hyperparameter K to a smaller value, the computational cost for gradients at each training step with negative sampling is smaller.

이제 각 훈련 단계에서 그래디언트의 계산 비용은 사전 크기와 관련이 없고 K에 선형적으로 의존한다는 것을 알 수 있습니다. 하이퍼파라미터 K를 더 작은 값으로 설정하면 음수 샘플링을 사용하는 각 훈련 단계의 기울기에 대한 계산 비용이 더 작아집니다.

NLP에서 Negative Sampling이란?

In Natural Language Processing (NLP), "negative sampling" refers to a technique used in training word embeddings, particularly in the context of models like Word2Vec. Word embeddings are dense vector representations of words that capture semantic relationships between words in a continuous vector space. Negative sampling is employed to efficiently train these embeddings by focusing on both positive examples (words that co-occur) and negative examples (randomly selected words that do not co-occur).

자연어 처리(NLP)에서 "부정 샘플링"은 주로 Word2Vec과 같은 모델의 단어 임베딩을 훈련하는 데 사용되는 기술로, 특히 단어 임베딩을 효과적으로 훈련시키기 위해 사용됩니다. 단어 임베딩은 단어 간의 의미적 관계를 연속적인 벡터 공간에서 포착하는 밀집 벡터 표현입니다. 부정 샘플링은 이러한 임베딩을 훈련하기 위해 양성 예제(동시에 발생하는 단어)와 음성 예제(동시에 발생하지 않는 임의로 선택된 단어)에 초점을 맞추는 데 사용됩니다.

The main idea behind negative sampling is to simplify the training process by turning it into a binary classification task. Instead of considering all possible words as negative examples, negative sampling randomly selects a small subset of words that do not appear in the context of the given word. For each positive example (word pair that co-occurs), several negative examples are chosen.

부정 샘플링의 주요 아이디어는 훈련 과정을 이진 분류 작업으로 단순화시키는 것입니다. 모든 가능한 단어를 음성 예제로 고려하는 대신, 부정 샘플링은 주어진 단어와 동시에 나타나지 않는 작은 하위 집합의 단어를 임의로 선택합니다. 각 양성 예제(동시에 발생하는 단어 쌍)에 대해 여러 음성 예제가 선택됩니다.

The training objective is to predict whether a given word pair is positive (co-occurring) or negative (randomly selected). This approach makes the training process computationally more efficient and scalable, as it reduces the need to consider all possible negative examples in each training iteration.

훈련 목표는 주어진 단어 쌍이 양성(동시에 발생)인지 음성(임의로 선택된)인지 예측하는 것입니다. 이 접근 방식은 훈련 과정을 계산적으로 더 효율적이고 확장 가능하게 만들며, 각 훈련 반복에서 모든 가능한 부정 예제를 고려할 필요가 줄어듭니다.

In negative sampling, the goal is to optimize the word embeddings in such a way that words that often co-occur in similar contexts are represented closer in the embedding space, while words that rarely or never co-occur are separated. This helps the embeddings capture semantic relationships and contextual information between words.

부정 샘플링에서의 목표는 단어 임베딩을 최적화하여 비슷한 문맥에서 자주 공존하는 단어가 임베딩 공간에서 가깝게 표현되고, 거의 또는 결코 함께 발생하지 않는 단어는 분리되도록 하는 것입니다. 이를 통해 임베딩은 단어 간의 의미적 관계와 문맥 정보를 효과적으로 포착할 수 있습니다.

Negative sampling has been widely used to train word embeddings, improving the efficiency of training while still achieving meaningful and useful representations of words in vector space.

부정 샘플링은 단어 임베딩을 훈련하는 데 널리 사용되며, 훈련의 효율성을 향상시키면서도 단어를 벡터 공간에서 의미 있는 유용한 표현으로 얻을 수 있게 해주는 장점을 가지고 있습니다.

15.2.2. Hierarchical Softmax

As an alternative approximate training method, hierarchical softmax uses the binary tree, a data structure illustrated in Fig. 15.2.1, where each leaf node of the tree represents a word in dictionary V.

대체 근사 훈련 방법 alternative approximate training method으로 계층적 소프트맥스 hierarchical softmax는 그림 15.2.1에 표시된 데이터 구조인 이진 트리를 사용합니다. 여기서 트리의 각 리프 노드는 사전 V의 단어를 나타냅니다.

Fig. 15.2.1  Hierarchical softmax for approximate training, where each leaf node of the tree represents a word in the dictionary.

Denote by L(w) the number of nodes (including both ends) on the path from the root node to the leaf node representing word w in the binary tree. Let n(w,j) be the jth node on this path, with its context word vector being un(w,j). For example, L(w3)=4 in Fig. 15.2.1. Hierarchical softmax approximates the conditional probability in (15.1.4) as

이진 트리에서 단어 w를 나타내는 루트 노드에서 리프 노드까지의 경로에 있는 노드 수(양 끝 포함)를 L(w)로 표시합니다. n(w,j)가 이 경로의 j번째 노드이고 해당 컨텍스트 단어 벡터가 un(w,j)라고 가정합니다. 예를 들어, 그림 15.2.1에서 L(w3)=4이다. 계층적 소프트맥스는 (15.1.4)의 조건부 확률을 다음과 같이 근사화합니다.

where function σ is defined in (15.2.2), and leftChild(n) is the left child node of node n: if x is true, [[x]]=1; otherwise [[x]]=−1.

여기서 함수 σ는 (15.2.2)에 정의되어 있고 leftChild(n)은 노드 n의 왼쪽 자식 노드입니다. x가 참이면 [[x]]=1; 그렇지 않으면 [[x]]=−1입니다.

To illustrate, let’s calculate the conditional probability of generating word w3 given word wc in Fig. 15.2.1. This requires dot products between the word vector vc of wc and non-leaf node vectors on the path (the path in bold in Fig. 15.2.1) from the root to w3, which is traversed left, right, then left:

설명을 위해 그림 15.2.1의 단어 wc가 주어졌을 때 단어 w3을 생성할 조건부 확률을 계산해 보겠습니다. 이를 위해서는 루트에서 w3까지 왼쪽, 오른쪽, 왼쪽으로 순회하는 경로(그림 15.2.1에서 굵은 글씨로 표시된 경로)에서 wc의 단어 벡터 vc와 리프가 아닌 노드 벡터 간의 내적이 필요합니다.

Since σ(x)+σ(−x)=1, it holds that the conditional probabilities of generating all the words in dictionary V based on any word wc sum up to one:

σ(x)+σ(−x)=1이므로 임의의 단어 wc를 기반으로 사전 V의 모든 단어를 생성하는 조건부 확률의 합은 1이 됩니다.

Fortunately, since L(wo)−1 is on the order of O(log2|V|) due to the binary tree structure, when the dictionary size V is huge, the computational cost for each training step using hierarchical softmax is significantly reduced compared with that without approximate training.

다행스럽게도 이진 트리 구조로 인해 L(wo)−1은 O(log2|V|) 정도이므로 사전 크기 V가 클 경우 계층적 소프트맥스를 사용한 각 학습 단계의 계산 비용은 이에 비해 크게 줄어듭니다. 대략적인 훈련 없이도 말이죠.

Hierarchical Softmax 이란?

"Hierarchical Softmax" is a technique used in natural language processing (NLP) to improve the efficiency of training large vocabulary language models, such as neural network-based language models. The goal of hierarchical softmax is to address the computational complexity and memory requirements associated with traditional softmax when dealing with a large output vocabulary.

"계층 소프트맥스(Hierarchical Softmax)"는 자연어 처리(NLP)에서 대규모 어휘 언어 모델의 효율적인 훈련을 개선하기 위해 사용되는 기술로, 신경망 기반 언어 모델과 같은 모델에서 큰 출력 어휘를 다룰 때 발생하는 계산 복잡성과 메모리 요구 사항에 대응합니다. 계층 소프트맥스의 목표는 큰 어휘를 처리할 때 전통적인 소프트맥스의 계산 복잡성과 메모리 요구 사항을 해결하는 것입니다.

In traditional softmax, when calculating the probabilities of all possible words in the vocabulary for a given input, the model needs to compute an exponential number of probabilities, making it computationally expensive and memory-intensive, especially for large vocabularies. Hierarchical softmax offers a more efficient alternative by breaking down the vocabulary into a hierarchy or tree structure.

전통적인 소프트맥스에서는 주어진 입력에 대해 어휘 내 모든 가능한 단어의 확률을 계산할 때, 모델은 지수 개의 확률을 계산해야 하기 때문에 계산적으로 매우 비용이 많이 들며, 특히 큰 어휘에 대해 메모리를 많이 소비합니다. 계층 소프트맥스는 어휘를 계층 또는 트리 구조로 분해하여 효율적인 대안을 제공합니다.

The basic idea of hierarchical softmax is to represent the vocabulary as a binary tree or a hierarchical structure, where each word corresponds to a leaf node. This tree structure allows the model to traverse a specific path from the root to a leaf node, effectively reducing the number of probabilities that need to be computed. Each internal node in the tree represents a binary decision, determining whether to move to the left or right child node based on the probability distribution. The probabilities are computed incrementally along the chosen path, significantly reducing the computational burden.

계층 소프트맥스의 기본 아이디어는 어휘를 이진 트리 또는 계층 구조로 표현하는 것으로, 각 단어가 잎 노드에 해당합니다. 이 트리 구조를 통해 모델은 루트에서 잎 노드로 특정 경로를 따라 이동할 수 있으며, 계산해야 할 확률의 수를 효과적으로 줄일 수 있습니다. 트리 내의 각 내부 노드는 이진 결정을 나타내며, 선택된 경로를 따라 왼쪽 또는 오른쪽 자식 노드로 이동할지 여부를 확률 분포를 기반으로 결정합니다. 확률은 선택한 경로를 따라 점진적으로 계산되어 계산 부담을 크게 줄입니다.

By using hierarchical softmax, the computational complexity of calculating the probabilities is reduced from O(V) in the case of traditional softmax (where V is the size of the vocabulary) to O(log V), making it more feasible to handle large vocabularies. This technique is especially useful when training models on massive datasets with extensive vocabularies.

계층 소프트맥스를 사용하면 확률을 계산하는 계산 복잡성이 전통적인 소프트맥스의 경우(O(V), 여기서 V는 어휘 크기)와 비교하여 O(log V)로 줄어듭니다. 따라서 대규모 어휘를 처리하는 것이 더 가능해집니다. 이러한 기술은 대량의 데이터셋과 광범위한 어휘를 처리하려는 모델에서 특히 유용합니다.

Hierarchical softmax is a trade-off between computational efficiency and model accuracy. While it sacrifices some accuracy compared to the full softmax, it allows training larger models on larger datasets more efficiently. It is commonly used in language models like Word2Vec and FastText, which aim to learn high-quality word representations while efficiently handling large vocabularies.

계층 소프트맥스는 계산 효율성과 모델 정확성 사이의 절충안입니다. 전체 소프트맥스와 비교하여 일부 정확성을 희생하면서 더 큰 모델을 더 효율적으로 대규모 데이터셋에 훈련시키는 것을 가능하게 합니다. Word2Vec 및 FastText와 같은 언어 모델에서 널리 사용되며, 고품질 단어 표현을 학습하면서 큰 어휘를 효율적으로 처리하는 데 활용됩니다.

http://uponthesky.tistory.com/15

계층적 소프트맥스(Hierarchical Softmax, HS) in word2vec

계층적 소프트맥스(Hierarchical Softmax, HS)란? 기존 softmax의 계산량을 현격히 줄인, softmax에 근사시키는 방법론이다. Word2Vec에서 skip-gram방법으로 모델을 훈련시킬 때 네거티브 샘플링(negative sampling)

uponthesky.tistory.com

15.2.3. Summary

Negative sampling constructs the loss function by considering mutually independent events that involve both positive and negative examples. The computational cost for training is linearly dependent on the number of noise words at each step.
네거티브 샘플링은 긍정적인 사례와 부정적인 사례를 모두 포함하는 상호 독립적인 이벤트를 고려하여 손실 함수를 구성합니다. 훈련을 위한 계산 비용은 각 단계의 노이즈 단어 수에 선형적으로 의존합니다.
Hierarchical softmax constructs the loss function using the path from the root node to the leaf node in the binary tree. The computational cost for training is dependent on the logarithm of the dictionary size at each step.

계층적 소프트맥스는 이진 트리의 루트 노드에서 리프 노드까지의 경로를 사용하여 손실 함수를 구성합니다. 훈련을 위한 계산 비용은 각 단계에서 사전 크기의 로그에 따라 달라집니다.

15.2.4. Exercises

How can we sample noise words in negative sampling?
Verify that (15.2.9) holds.
How to train the continuous bag of words model using negative sampling and hierarchical softmax, respectively?

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

D2L - 15.10. Pretraining BERT (0)	2023.08.30
D2L - 15.9. The Dataset for Pretraining BERT (0)	2023.08.30
D2L - 15.8. Bidirectional Encoder Representations from Transformers (BERT) (0)	2023.08.30
D2L - 15.7. Word Similarity and Analogy (0)	2023.08.30
D2L - 15.6. Subword Embedding (0)	2023.08.30
D2L - 15.5. Word Embedding with Global Vectors (GloVe) (0)	2023.08.29
D2L - 15.4. Pretraining word2vec (0)	2023.08.29
D2L - 15.3. The Dataset for Pretraining Word Embeddings (0)	2023.08.29
D2L- 15.1. Word Embedding (word2vec) (0)	2023.08.25
D2L - 15. Natural Language Processing: Pretraining (0)	2023.08.24

Dive into Deep Learning/D2L Natural language Processing

D2L- 15.1. Word Embedding (word2vec)

2023. 8. 25. 19:22 | Posted by 솔웅

15.1. Word Embedding (word2vec) — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

15.1. Word Embedding (word2vec) — Dive into Deep Learning 1.0.3 documentation

d2l.ai

15.1. Word Embedding (word2vec)

Natural language is a complex system used to express meanings. In this system, words are the basic unit of the meaning. As the name implies, word vectors are vectors used to represent words, and can also be considered as feature vectors or representations of words. The technique of mapping words to real vectors is called word embedding. In recent years, word embedding has gradually become the basic knowledge of natural language processing.

자연어는 의미를 표현하는 데 사용되는 복잡한 시스템입니다. 이 시스템에서 단어는 의미의 기본 단위입니다. 이름에서 알 수 있듯이 단어 벡터는 단어를 나타내는 데 사용되는 벡터이며, 특징 벡터 또는 단어 표현으로도 간주될 수 있습니다. 단어를 실제 벡터에 매핑하는 기술을 단어 임베딩이라고 합니다. 최근 몇 년 동안 워드 임베딩은 점차 자연어 처리의 기본 지식이 되었습니다.

15.1.1. One-Hot Vectors Are a Bad Choice

We used one-hot vectors to represent words (characters are words) in Section 9.5. Suppose that the number of different words in the dictionary (the dictionary size) is N, and each word corresponds to a different integer (index) from 0 to N−1. To obtain the one-hot vector representation for any word with index i, we create a length-N vector with all 0s and set the element at position i to 1. In this way, each word is represented as a vector of length N, and it can be used directly by neural networks.

우리는 섹션 9.5에서 단어(문자는 단어)를 표현하기 위해 원-핫 벡터를 사용했습니다. 사전 dictionary 에 있는 서로 다른 단어의 수(사전 크기)가 N이고, 각 단어는 0부터 N-1까지의 서로 다른 정수(인덱스)에 해당한다고 가정합니다. 인덱스 i가 있는 단어에 대한 원-핫 벡터 표현을 얻으려면 모두 0인 길이 N 벡터를 만들고 위치 i의 요소를 1로 설정합니다. 이러한 방식으로 각 단어는 길이 N의 벡터로 표현됩니다. 신경망에서 직접 사용할 수 있습니다.

Although one-hot word vectors are easy to construct, they are usually not a good choice. A main reason is that one-hot word vectors cannot accurately express the similarity between different words, such as the cosine similarity that we often use. For vectors x,y∈ℝ**d, their cosine similarity is the cosine of the angle between them:

원-핫 단어 벡터는 구성하기 쉽지만 일반적으로 좋은 선택은 아닙니다. 주된 이유는 원-핫 단어 벡터가 우리가 자주 사용하는 코사인 유사성 cosine similarity과 같은 서로 다른 단어 간의 유사성을 정확하게 표현할 수 없기 때문입니다. 벡터 x,y∈ℝ**d의 경우 코사인 유사성 cosine similarity은 두 벡터 사이 각도의 코사인입니다.

Since the cosine similarity between one-hot vectors of any two different words is 0, one-hot vectors cannot encode similarities among words.

서로 다른 두 단어의 원-핫 벡터 간의 코사인 유사성은 0이므로 원-핫 벡터는 단어 간의 유사성을 인코딩할 수 없습니다.

코사인이란?

Cosine is a mathematical term that's used to measure the similarity between two things, like directions or vectors. Imagine you have two arrows pointing in certain directions. The cosine similarity tells you how much these arrows are aligned with each other.

코사인은 두 가지 사물, 예를 들어 방향이나 벡터 사이의 유사성을 측정하는 수학 용어입니다. 특정 방향을 가리키는 두 개의 화살표를 생각해보세요. 코사인 유사성은 이러한 화살표가 얼마나 정렬되어 있는지를 알려줍니다.

When the arrows are pointing in the exact same direction, the cosine similarity is 1. This means they are very similar. If they are perpendicular (90 degrees apart), the cosine similarity is 0, indicating they are not similar at all. And if they point in opposite directions, the cosine similarity is -1, showing that they are kind of opposite.

화살표가 정확히 같은 방향을 가리킬 때, 코사인 유사성은 1입니다. 이는 그들이 매우 유사하다는 것을 의미합니다. 만약 그들이 수직이라면 (90도 떨어져 있다면), 코사인 유사성은 0이 되어서 전혀 유사하지 않음을 나타냅니다. 그리고 만약 그들이 반대 방향을 가리키면, 코사인 유사성은 -1이 되어서 그들이 어느 정도 반대라는 것을 나타냅니다.

In simpler terms, cosine similarity helps us figure out how much two things are pointing in the same direction, like arrows or vectors. It's a way to measure how similar or different these things are based on their direction.

보다 간단한 용어로 설명하면, 코사인 유사성은 화살표나 벡터와 같이 두 가지 사물이 같은 방향을 가리키는지 얼마나 많이 가리키는지를 알려줍니다. 이것은 방향을 기반으로 이러한 사물이 얼마나 비슷하거나 다른지를 측정하는 방법입니다.

Word Embedding에서 Cosine Similarity란?

In Word Embedding within Natural Language Processing (NLP), "cosine similarity" is a measure used to quantify the similarity between two word vectors. Word embeddings are numerical representations of words in a continuous vector space. Cosine similarity is employed to assess how alike these word vectors are in terms of their direction within this space.

자연어 처리(NLP)의 단어 임베딩에서 '코사인 유사성'은 두 단어 벡터 사이의 유사성을 측정하는 데 사용되는 척도입니다. 단어 임베딩은 연속 벡터 공간에서 단어의 수치적 표현입니다. 코사인 유사성은 이 공간 내에서 이러한 단어 벡터가 방향 측면에서 얼마나 비슷한지를 평가하기 위해 사용됩니다.

Here's how it works:

작동 방식은 다음과 같습니다:

Each word is represented as a vector in a high-dimensional space. Words with similar meanings or contexts tend to have similar vectors.

각 단어는 고차원 공간에서 벡터로 표현됩니다. 비슷한 의미나 문맥을 가진 단어는 유사한 벡터를 가집니다.
To measure the similarity between two word vectors, cosine similarity calculates the cosine of the angle between them. This angle indicates how aligned the vectors are in this high-dimensional space.

두 단어 벡터 사이의 유사성을 측정하기 위해 코사인 유사성은 그들 사이의 각도의 코사인을 계산합니다. 이 각도는 이 고차원 공간에서 벡터들이 얼마나 정렬되어 있는지를 나타냅니다.
If the angle between the vectors is very small (close to 0 degrees), the cosine of that angle is close to 1, indicating high similarity. This means the words are similar in meaning or context.

벡터 사이의 각이 매우 작은 경우(0도에 가까운 경우), 그 각의 코사인은 1에 가까워져 높은 유사성을 나타냅니다. 이는 단어들이 의미나 문맥에서 유사하다는 것을 의미합니다.
If the angle is close to 90 degrees (perpendicular vectors), the cosine of the angle is close to 0, implying low similarity. The words are dissimilar in meaning or context.

각이 90도에 가까운 경우(수직 벡터), 각의 코사인은 0에 가까워져 낮은 유사성을 나타냅니다. 단어들은 의미나 문맥에서 다르다는 것을 의미합니다.
If the angle is close to 180 degrees (opposite directions), the cosine of the angle is close to -1. This means the words have opposite meanings or contexts.

각이 180도에 가까운 경우(반대 방향), 각의 코사인은 -1에 가까워집니다. 이는 단어들이 반대의 의미나 문맥을 가진다는 것을 의미합니다.

In NLP tasks, cosine similarity is used to compare word embeddings and identify words that are contextually or semantically similar. It helps in various applications such as finding synonyms, clustering similar words, and even understanding the relationships between words in the embedding space.

NLP 작업에서 코사인 유사성은 단어 임베딩을 비교하고 문맥적으로나 의미론적으로 유사한 단어를 식별하는 데 사용됩니다. 이는 동의어를 찾거나 유사한 단어를 클러스터링하고, 심지어 임베딩 공간에서 단어 간의 관계를 이해하는 데에도 도움이 됩니다.

15.1.2. Self-Supervised word2vec

The word2vec tool was proposed to address the above issue. It maps each word to a fixed-length vector, and these vectors can better express the similarity and analogy relationship among different words. The word2vec tool contains two models, namely skip-gram (Mikolov et al., 2013) and continuous bag of words (CBOW) (Mikolov et al., 2013). For semantically meaningful representations, their training relies on conditional probabilities that can be viewed as predicting some words using some of their surrounding words in corpora. Since supervision comes from the data without labels, both skip-gram and continuous bag of words are self-supervised models.

위의 문제를 해결하기 위해 word2vec 도구가 제안되었습니다. 각 단어를 고정 길이 벡터에 매핑하며, 이러한 벡터는 서로 다른 단어 간의 유사성과 유추 관계를 더 잘 표현할 수 있습니다. word2vec 도구에는 Skip-gram(Mikolov et al., 2013)과 Continuous Bag of Words(CBOW)(Mikolov et al., 2013)라는 두 가지 모델이 포함되어 있습니다. 의미상 의미 있는 표현 semantically meaningful representations 의 경우 훈련은 말뭉치의 주변 단어 중 일부를 사용하여 일부 단어를 예측하는 것으로 볼 수 있는 조건부 확률에 의존합니다. 감독 supervision 은 레이블이 없는 데이터에서 발생하므로 스킵 그램 Skip-gram 과 연속 단어 모음 Continuous Bag of Words은 모두 자체 감독 모델 self-supervised models입니다.

word2vek 이란?

Word2Vec is a popular technique in natural language processing (NLP) that is used to generate dense vector representations (embeddings) of words. These embeddings capture semantic relationships between words and enable machines to understand the meanings and contexts of words based on their positions in the vector space.

Word2Vec은 자연어 처리(NLP)에서 널리 사용되는 기술로, 단어의 밀집 벡터 표현(임베딩)을 생성하는 데에 활용됩니다. 이러한 임베딩은 단어 간의 의미적 관계를 포착하며, 단어들의 벡터 공간 내 위치에 기반하여 기계가 단어의 의미와 문맥을 이해할 수 있도록 합니다.

The term "word2vec" refers to a family of models that are trained to learn these word embeddings from large amounts of text data. There are two main architectures within the word2vec framework:

"word2vec"이라는 용어는 대용량 텍스트 데이터로부터 이러한 단어 임베딩을 학습하는 모델 패밀리를 가리킵니다. word2vec 프레임워크 내에서 두 가지 주요 아키텍처가 있습니다:

Skip-Gram: This architecture aims to predict context words (words nearby in a sentence) given a target word. It essentially "skips" over words to predict their context. Skip-Gram is effective for larger datasets and capturing word relationships.

Skip-Gram: 이 아키텍처는 목표 단어를 기반으로 문맥 단어(문장에서 가까운 위치에 있는 단어)를 예측하는 것을 목표로 합니다. 이는 실제로 그들의 contect를 예측하기 위해 단어를 "skips" 합니. Skip-Gram은 대규모 데이터셋과 단어 간의 관계를 잡아내는 데에 효과적입니다.
Continuous Bag of Words (CBOW): This architecture predicts a target word based on its surrounding context words. It aims to predict a "bag" of context words given the target word. CBOW is computationally efficient and works well for smaller datasets.

연속 단어 봉투 (CBOW): 이 아키텍처는 주변 문맥 단어를 기반으로 목표 단어를 예측합니다. 이는 목표 단어를 기반으로 "bag" 형태의 문맥 단어를 예측합니다. CBOW는 계산적으로 효율적이며 작은 데이터셋에 잘 작동합니다.

Here's how word2vec generally works:

Data Preparation: A large corpus of text is collected and preprocessed.

데이터 준비: 대용량의 텍스트 말뭉치가 수집되고 전처리됩니다.
Word Tokenization: The text is divided into words or subwords, creating a vocabulary.

단어 토큰화: 텍스트가 단어나 하위 단어로 분할되어 어휘가 생성됩니다.
Embedding Learning: The chosen word2vec model architecture (Skip-Gram or CBOW) is trained on the text data. The model's parameters (word vectors) are updated during training to minimize a loss function.

임베딩 학습: 선택한 word2vec 모델 아키텍처(Skip-Gram 또는 CBOW)가 텍스트 데이터로 훈련됩니다. 모델의 매개변수(단어 벡터)는 훈련 중에 손실 함수를 최소화하기 위해 업데이트됩니다.
Semantic Relationships: After training, the learned word embeddings capture semantic relationships. Similar words have similar vectors in the embedding space.

의미적 관계: 훈련 후에 학습된 단어 임베딩은 의미적 관계를 포착합니다. 비슷한 단어는 임베딩 공간에서 비슷한 벡터를 가지게 됩니다.
Applications: These word embeddings can be used in various NLP tasks, such as machine translation, sentiment analysis, recommendation systems, and more. They enable algorithms to understand the meanings and contexts of words, even if they haven't encountered them before.

응용: 이러한 단어 임베딩은 기계 번역, 감정 분석, 추천 시스템 등 다양한 NLP 작업에 활용될 수 있습니다. 이들은 알고리즘이 의미와 문맥을 이해할 수 있도록 하며, 이전에 접하지 못한 단어라도 이해할 수 있습니다.

Word2Vec's power lies in its ability to create compact, meaningful, and dense representations of words that are suitable for use in machine learning models. These embeddings facilitate the transfer of linguistic context to numerical vectors, enabling algorithms to operate on and understand textual data more effectively.

Word2Vec의 힘은 머신 러닝 모델에서 사용하기에 적합한 간결하고 의미 있는 밀집형 단어 표현을 만들 수 있는 능력에 있습니다. 이러한 임베딩은 언어적 문맥을 수치 벡터로 전달하여 알고리즘들이 텍스트 데이터를 보다 효과적으로 처리하고 이해하는 데 도움을 줍니다.

In the following, we will introduce these two models and their training methods.

다음에서는 이 두 모델과 그 훈련 방법을 소개하겠습니다.

15.1.3. The Skip-Gram Model

The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence. Take the text sequence “the”, “man”, “loves”, “his”, “son” as an example. Let’s choose “loves” as the center word and set the context window size to 2. As shown in Fig. 15.1.1, given the center word “loves”, the skip-gram model considers the conditional probability for generating the context words: “the”, “man”, “his”, and “son”, which are no more than 2 words away from the center word:

스킵그램 모델은 단어를 사용하여 텍스트 시퀀스에서 주변 단어를 생성할 수 있다고 가정합니다. 텍스트 시퀀스 "the", "man", "loves", "his", "son"을 예로 들어 보겠습니다. “loves”를 중앙 단어 center word로 선택하고 컨텍스트 창 크기를 2로 설정해 보겠습니다. 그림 15.1.1에 표시된 대로 중앙 단어 “loves”가 주어지면 스킵-그램 모델은 컨텍스트 단어를 생성하기 위한 조건부 확률을 고려합니다. "the", "man", "his" 및 "son"은 중심 단어에서 2단어 이상 떨어져 있지 않습니다.

Assume that the context words are independently generated given the center word (i.e., conditional independence). In this case, the above conditional probability can be rewritten as

중심 단어center word가 주어지면 문맥 단어context words가 독립적으로 생성된다고 가정합니다(즉, 조건부 독립). 이 경우 위의 조건부 확률은 다음과 같이 다시 쓸 수 있습니다.

Fig. 15.1.1  The skip-gram model considers the conditional probability of generating the surrounding context words given a center word.

In the skip-gram model, each word has two d-dimensional-vector representations for calculating conditional probabilities. More concretely, for any word with index i in the dictionary, denote by vi∈ℝ**d and ui∈ℝ**d its two vectors when used as a center word and a context word, respectively. The conditional probability of generating any context word wo (with index o in the dictionary) given the center word wc (with index c in the dictionary) can be modeled by a softmax operation on vector dot products:

스킵 그램 모델에서 각 단어에는 조건부 확률을 계산하기 위한 두 개의 d차원 벡터 표현이 있습니다. 보다 구체적으로 사전dictionary에서 인덱스 i가 있는 단어에 대해 각각 중심 단어와 문맥 단어로 사용될 때 해당 두 벡터를 vi∈ℝ**d 및 ui∈ℝ**d로 표시합니다. 중앙 단어 wc(사전의 인덱스 c 포함)가 주어지면 임의의 컨텍스트 단어 wo(사전의 인덱스 o 포함)를 생성할 조건부 확률은 벡터 내적 vector dot products에 대한 소프트맥스 연산 softmax operation으로 모델링할 수 있습니다.

where the vocabulary index set V={0,1,…,|V|−1}. Given a text sequence of length T, where the word at time step t is denoted as w**(t). Assume that context words are independently generated given any center word. For context window size m, the likelihood function of the skip-gram model is the probability of generating all context words given any center word:

여기서 어휘 색인vocabulary index 은 V={0,1,…,|V|−1}로 설정됩니다. 길이 T의 텍스트 시퀀스가 주어지면 시간 단계 time step t의 단어는 w**(t)로 표시됩니다. 중심 단어 center word가 주어지면 문맥 단어 context words가 독립적으로 생성된다고 가정합니다. 문맥 창 크기 context window size m의 경우 스킵 그램 모델의 우도 함수 likelihood function는 중심 단어 center word가 주어지면 모든 문맥 단어 context words를 생성할 확률입니다.

where any time step that is less than 1 or greater than T can be omitted.

여기서 1보다 작거나 T보다 큰 시간 단계는 생략할 수 있습니다.

Skip gram 모델이란?

The Skip-Gram model is a type of word embedding model used in natural language processing (NLP) to learn dense vector representations of words. It's particularly effective in capturing semantic relationships between words and is a fundamental component of many NLP applications.

스킵 그램 모델은 자연어 처리(NLP)에서 사용되는 단어 임베딩 모델의 한 유형으로, 단어의 밀집 벡터 표현을 학습하는 데에 사용됩니다. 특히 이 모델은 단어 간의 의미적 관계를 포착하는 데 효과적이며, 여러 NLP 응용 프로그램의 기본 구성 요소입니다.

The main idea behind the Skip-Gram model is to predict the context words (words nearby in a sentence) given a target word. This is done by training the model on a large corpus of text data, where the goal is to maximize the probability of predicting the context words based on the target word.

스킵 그램 모델의 주요 아이디어는 목표 단어를 기반으로 문맥 단어(문장에서 가까운 위치에 있는 단어)를 예측하는 것입니다. 이를 위해 모델을 대용량의 텍스트 데이터로 훈련하며, 목표 단어를 기반으로 문맥 단어를 예측하는 확률을 최대화하는 것이 목표입니다.

Here's how the Skip-Gram model works:

스킵 그램 모델의 작동 방식은 다음과 같습니다:

Creating Training Data: For each word in the training corpus, a window of surrounding words (context words) is selected. The target word is paired with these context words to create training examples.

훈련 데이터 생성: 훈련 말뭉치의 각 단어에 대해 주변 단어(문맥 단어)의 창이 선택됩니다. 목표 단어는 이러한 문맥 단어와 결합되어 훈련 예제가 생성됩니다.
Word Embedding Initialization: Each word is represented by two sets of vectors - one for the target word and one for the context words. These vectors are initialized randomly.

단어 임베딩 초기화: 각 단어는 두 개의 벡터 집합으로 표현됩니다 - 하나는 목표 단어를 위한 것이고, 다른 하나는 문맥 단어를 위한 것입니다. 이러한 벡터는 무작위로 초기화됩니다.
Training Objective: The model aims to maximize the probability of predicting context words given the target word. This is achieved by minimizing a loss function, which is typically a form of the negative log likelihood.

훈련 목표: 모델은 목표 단어를 기반으로 문맥 단어를 예측하는 확률을 최대화하는 것이 목표입니다. 이는 일반적으로 음의 로그 우도의 log likelihood 형태로 나타난 손실 함수를 최소화하는 것으로 달성됩니다.
Training Process: During training, the model updates the word vectors in a way that improves the prediction accuracy of context words for a given target word.

훈련 과정: 훈련 중에 모델은 단어 벡터를 업데이트하여 특정 목표 단어에 대한 문맥 단어의 예측 정확도를 개선합니다.
Semantic Relationships: The trained word vectors end up capturing semantic relationships between words. Words with similar meanings or contexts have similar vectors in the embedding space.

의미적 관계: 훈련된 단어 벡터는 단어 간의 의미적 관계를 포착합니다. 의미나 문맥이 비슷한 단어들은 임베딩 공간에서 비슷한 벡터를 가지게 됩니다.
Similarity: The cosine similarity between the learned word vectors can be used to measure the semantic similarity between words.

유사성: 학습된 단어 벡터 간의 코사인 유사도는 단어 간의 의미적 유사성을 측정하는 데 사용될 수 있습니다.
Applications: These word vectors can be used as features in various NLP tasks such as language modeling, sentiment analysis, machine translation, and more.

응용: 이러한 단어 벡터는 언어 모델링, 감정 분석, 기계 번역 등 다양한 NLP 작업에서 특성으로 사용될 수 있습니다.

Skip-Gram model's simplicity and effectiveness in capturing word relationships have made it a widely used technique in NLP. It's worth noting that the Skip-Gram model is often compared with another popular word embedding model called Continuous Bag of Words (CBOW), which predicts the target word given context words. Both models are trained using large amounts of text data and are part of the family of techniques referred to as word2vec.

스킵 그램 모델의 간결함과 단어 관계 포착 능력은 이를 NLP에서 널리 사용되는 기술로 만들었습니다. 스킵 그램 모델은 종종 Continuous Bag of Words (CBOW)라는 다른 인기있는 단어 임베딩 모델과 비교되며, CBOW는 문맥 단어를 기반으로 목표 단어를 예측합니다. 두 모델 모두 대량의 텍스트 데이터를 사용하여 훈련되며, word2vec이라는 기술 패밀리의 일부입니다.

15.1.3.1. Training

The skip-gram model parameters are the center word vector and context word vector for each word in the vocabulary. In training, we learn the model parameters by maximizing the likelihood function (i.e., maximum likelihood estimation). This is equivalent to minimizing the following loss function:

스킵그램 모델 매개변수는 어휘의 각 단어에 대한 중심 단어 벡터와 문맥 단어 벡터입니다. 훈련에서는 우도 likelihood 함수(즉, 최대 우도 likelihood 추정)를 최대화하여 모델 매개변수를 학습합니다. 이는 다음 손실 함수를 최소화하는 것과 동일합니다.

When using stochastic gradient descent to minimize the loss, in each iteration we can randomly sample a shorter subsequence to calculate the (stochastic) gradient for this subsequence to update the model parameters. To calculate this (stochastic) gradient, we need to obtain the gradients of the log conditional probability with respect to the center word vector and the context word vector. In general, according to (15.1.4) the log conditional probability involving any pair of the center word wc and the context word wo is

손실을 최소화하기 위해 확률적 경사하강법 stochastic gradient descent(SGD)을 사용할 때, 각 반복에서 더 짧은 하위 시퀀스를 무작위로 샘플링하여 이 하위 시퀀스에 대한 (확률적) 그라데이션을 계산하여 모델 매개변수를 업데이트할 수 있습니다. 이 (확률적) 기울기를 계산하려면 중앙 단어 벡터와 문맥 단어 벡터에 대한 로그 조건부 확률의 기울기를 구해야 합니다. 일반적으로 (15.1.4)에 따르면 중심 단어 wc와 문맥 단어 wo의 임의 쌍을 포함하는 로그 조건부 확률은 다음과 같습니다.

Stochastic Gradient Descent(SGD)란?

Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning and deep learning to train models by iteratively updating the model's parameters based on the gradient of the loss function with respect to the data points.

확률적 경사 하강법(Stochastic Gradient Descent, SGD)은 머신 러닝과 딥 러닝에서 사용되는 최적화 알고리즘으로, 모델의 파라미터를 반복적으로 업데이트하여 데이터 포인트와 관련된 손실 함수의 그래디언트에 따라 모델을 학습시킵니다.

Here's what it means:

다음과 같은 의미를 가집니다:

Gradient Descent: In machine learning, when we train a model, we often try to minimize a loss function. This loss function measures how well the model's predictions match the actual data. Gradient descent is an optimization technique that aims to find the minimum of this loss function by iteratively adjusting the model's parameters in the opposite direction of the gradient (slope) of the loss function.

경사 하강법: 머신 러닝에서 모델을 훈련할 때 종종 손실 함수를 최소화하려고 합니다. 이 손실 함수는 모델의 예측이 실제 데이터와 얼마나 잘 일치하는지를 측정합니다. 경사 하강법은 손실 함수의 최소값을 찾기 위해 모델의 파라미터를 손실 함수의 그래디언트(기울기)의 반대 방향으로 반복적으로 조정하는 최적화 기술입니다.
Stochastic: The term "stochastic" in SGD refers to randomness. Instead of computing the gradient of the loss function using the entire dataset (batch gradient descent), SGD uses a random subset of the data (mini-batch) to compute an estimate of the gradient. This introduces randomness in the optimization process.

확률적: SGD에서의 "확률적"이란 무작위성을 의미합니다. 전체 데이터셋을 사용하여 손실 함수의 그래디언트를 계산하는 대신, SGD는 무작위하게 선택된 데이터의 부분집합(미니 배치)을 사용하여 그래디언트의 추정치를 계산합니다. 이렇게 하면 최적화 과정에 무작위성이 도입됩니다.
Iterative Updates: SGD iteratively updates the model's parameters by taking small steps in the direction that reduces the loss. It processes one mini-batch of data at a time, computes the gradient for that batch, and updates the parameters accordingly.

반복적인 업데이트: SGD는 모델의 파라미터를 반복적으로 업데이트하여 손실을 줄이는 방향으로 작은 단계를 걸어갑니다. 한 번에 하나의 미니 배치 데이터를 처리하고 해당 배치에 대한 그래디언트를 계산하고 파라미터를 그에 따라 업데이트합니다.
Noise and Faster Convergence: Because SGD uses random subsets of data, it introduces a certain amount of noise into the optimization process. While this noise might seem counterintuitive, it can actually help the optimization process converge faster by escaping local minima and speeding up exploration of the loss landscape.

노이즈와 빠른 수렴: SGD는 무작위 데이터 부분집합을 사용하기 때문에 최적화 과정에 어느 정도의 노이즈가 도입됩니다. 이 노이즈는 직관적으로는 이상해보일 수 있지만, 실제로는 로컬 최소값을 벗어나고 손실 공간을 탐색을 빠르게 하여 최적화 과정을 더 빨리 수렴시킬 수 있습니다.
Learning Rate: SGD includes a parameter called the learning rate, which determines the step size for each update. A high learning rate can lead to overshooting the minimum, while a low learning rate can slow down convergence. Finding the right learning rate is important for successful training.

학습률: SGD에는 학습률이라는 매개변수가 포함되어 있으며, 각 업데이트에 대한 단계 크기를 결정합니다. 높은 학습률은 최소값을 지나칠 수 있지만 낮은 학습률은 수렴 속도를 늦출 수 있습니다. 올바른 학습률을 찾는 것은 성공적인 훈련에 중요합니다.

In summary, Stochastic Gradient Descent (SGD) is an optimization method that updates a model's parameters using randomly selected subsets of data. It iteratively adjusts the parameters to minimize the loss function and train machine learning models efficiently, often used in training neural networks and other large-scale models.

요약하면, 확률적 경사 하강법(SGD)은 무작위로 선택된 데이터 부분집합을 사용하여 모델의 파라미터를 업데이트하는 최적화 방법입니다. 이는 손실 함수를 최소화하고 머신 러닝 모델을 효율적으로 훈련시키는 데 사용되며, 주로 신경망 및 대규모 모델을 훈련시킬 때 사용됩니다.

Through differentiation, we can obtain its gradient with respect to the center word vector wc as

미분을 통해 중심 단어 벡터 wc에 대한 기울기를 다음과 같이 얻을 수 있습니다.

Note that the calculation in (15.1.8) requires the conditional probabilities of all words in the dictionary with wc as the center word. The gradients for the other word vectors can be obtained in the same way.

(15.1.8)의 계산에는 wc를 중심 단어로 하는 사전의 모든 단어에 대한 조건부 확률이 필요하다는 점에 유의하십시오. 다른 단어 벡터에 대한 기울기도 같은 방법으로 얻을 수 있습니다.

After training, for any word with index i in the dictionary, we obtain both word vectors vi (as the center word) and ui (as the context word). In natural language processing applications, the center word vectors of the skip-gram model are typically used as the word representations.

훈련 후 사전에 인덱스 i가 있는 단어에 대해 단어 벡터 vi(중앙 단어)와 ui(문맥 단어)를 모두 얻습니다. 자연어 처리 응용 프로그램에서는 스킵 그램 모델의 중심 단어 벡터가 일반적으로 단어 표현으로 사용됩니다.

15.1.4. The Continuous Bag of Words (CBOW) Model

The continuous bag of words (CBOW) model is similar to the skip-gram model. The major difference from the skip-gram model is that the continuous bag of words model assumes that a center word is generated based on its surrounding context words in the text sequence. For example, in the same text sequence “the”, “man”, “loves”, “his”, and “son”, with “loves” as the center word and the context window size being 2, the continuous bag of words model considers the conditional probability of generating the center word “loves” based on the context words “the”, “man”, “his” and “son” (as shown in Fig. 15.1.2), which is

CBOW(Continuous Bag of Words) 모델은 스킵그램 모델과 유사합니다. 스킵 그램 모델과의 주요 차이점은 연속 단어 모음 모델 continuous bag of words model은 텍스트 시퀀스의 주변 문맥 단어를 기반으로 중심 단어가 생성된다고 가정한다는 것입니다. 예를 들어, 동일한 텍스트 시퀀스 "the", "man", "loves", "his" 및 "son"에서 "loves"가 중심 단어이고 컨텍스트 창 크기가 2인 연속 단어 모음 모델은 문맥 단어 "the", "man", "his" 및 "son"을 기반으로 중앙 단어 "loves"를 생성할 조건부 확률을 고려합니다(그림 15.1.2 참조).

Fig. 15.1.2  The continuous bag of words model considers the conditional probability of generating the center word given its surrounding context words.

Since there are multiple context words in the continuous bag of words model, these context word vectors are averaged in the calculation of the conditional probability. Specifically, for any word with index i in the dictionary, denote by vi∈ℝ**d and ui∈ℝ**d its two vectors when used as a context word and a center word (meanings are switched in the skip-gram model), respectively. The conditional probability of generating any center word wc (with index c in the dictionary) given its surrounding context words wo1,...,wo2m (with index o1,...,o2m in the dictionary) can be modeled by

연속 단어 모음 모델 continuous bag of words model에는 여러 개의 문맥 단어가 있으므로 이러한 문맥 단어 벡터는 조건부 확률 계산에서 평균화됩니다. 구체적으로, 사전에 인덱스 i가 있는 단어의 경우 문맥 단어와 중심 단어로 사용될 때 vi∈ℝ**d 및 ui∈ℝ**d 두 벡터로 표시됩니다(의미는 스킵-그램 모델에서 전환됨). ) 각각. 주변 문맥 단어 wo1,...,wo2m(사전에서 인덱스 o1,...,o2m 포함)이 주어지면 임의의 중심 단어 wc(사전에서 인덱스 c 포함)를 생성할 조건부 확률은 다음과 같이 모델링할 수 있습니다.

Given a text sequence of length T, where the word at time step t is denoted as w**(t). For context window size m, the likelihood function of the continuous bag of words model is the probability of generating all center words given their context words:

길이 T의 텍스트 시퀀스가 주어지면 시간 단계 t의 단어는 w**(t)로 표시됩니다. 컨텍스트 창 크기가 m인 경우 연속 단어 가방 모델 continuous bag of words model의 우도 likelihood 함수는 해당 컨텍스트 단어가 주어지면 모든 중심 단어를 생성할 확률입니다.

Continuous bag of words 모델이란?

The Continuous Bag of Words (CBOW) model is another type of word embedding model used in natural language processing (NLP) to learn dense vector representations of words. Like the Skip-Gram model, CBOW is effective in capturing semantic relationships between words and is widely used in various NLP applications.

'연속 단어 봉투' 모델(CBOW)은 자연어 처리(NLP)에서 사용되는 다른 유형의 단어 임베딩 모델로, 단어의 밀집 벡터 표현을 학습하는 데에 사용됩니다. 스킵 그램 모델과 마찬가지로 CBOW 모델도 단어 간의 의미적 관계를 잡아내는 데 효과적이며, 다양한 NLP 응용 분야에서 널리 사용됩니다.

The key concept of the CBOW model is to predict a target word given its surrounding context words. In contrast to the Skip-Gram model, which predicts context words given a target word, CBOW predicts a target word based on its context. This approach can be particularly useful for situations where you want to predict a missing word in a sentence or paragraph.

CBOW 모델의 주요 개념은 주변 문맥 단어를 기반으로 목표 단어를 예측하는 것입니다. 스킵 그램 모델과는 달리, 스킵 그램이 목표 단어를 기반으로 문맥 단어를 예측하는 것과는 반대로 CBOW는 문맥을 기반으로 목표 단어를 예측합니다. 이 접근 방식은 문장이나 단락에서 누락된 단어를 예측하려는 상황에 특히 유용할 수 있습니다.

Here's how the Continuous Bag of Words (CBOW) model works:

다음은 연속 단어 봉투(CBOW) 모델의 작동 방식입니다:

Creating Training Data: For each word in the training corpus, a window of surrounding words (context words) is selected. The context words are used to predict the target word in the center.

훈련 데이터 생성: 훈련 말뭉치의 각 단어에 대해 주변 단어(문맥 단어)의 창이 선택됩니다. 문맥 단어는 중심에 있는 목표 단어를 예측하는 데 사용됩니다.
Word Embedding Initialization: Similar to Skip-Gram, each word is represented by two sets of vectors - one for the target word and one for the context words.

단어 임베딩 초기화: 스킵 그램과 유사하게 각 단어는 목표 단어를 위한 하나의 벡터 집합과 문맥 단어를 위한 다른 하나의 벡터 집합으로 표현됩니다.
Training Objective: The model aims to maximize the probability of predicting the target word based on the context words. This is achieved by minimizing a loss function, often a form of negative log likelihood.

훈련 목표: 모델은 문맥 단어를 기반으로 목표 단어를 예측하는 확률을 최대화하는 것이 목표입니다. 이는 일반적으로 음의 로그 우도의 형태로 나타난 손실 함수를 최소화하는 것으로 달성됩니다.
Training Process: During training, the model updates the word vectors to improve the accuracy of predicting target words from context.

훈련 과정: 훈련 중에 모델은 단어 벡터를 업데이트하여 문맥에서 목표 단어를 예측하는 정확성을 개선합니다.
Semantic Relationships: Just like Skip-Gram, the trained CBOW word vectors capture semantic relationships between words. Similar words have similar vectors in the embedding space.

의미적 관계: 스킵 그램과 마찬가지로 훈련된 CBOW 단어 벡터는 단어 간의 의미적 관계를 포착합니다. 유사한 단어는 임베딩 공간에서 유사한 벡터를 가집니다.
Applications: CBOW's word vectors can be used for various NLP tasks, including language modeling, sentiment analysis, machine translation, and more.

응용: CBOW의 단어 벡터는 언어 모델링, 감정 분석, 기계 번역 등 다양한 NLP 작업에 활용될 수 있습니다.
Efficiency: CBOW is often computationally efficient compared to Skip-Gram, making it useful for applications requiring quicker training.

효율성: CBOW는 종종 스킵 그램과 비교하여 계산적으로 효율적이며, 빠른 훈련이 필요한 응용 프로그램에 유용합니다.

In summary, the Continuous Bag of Words (CBOW) model focuses on predicting a target word from its context words, and its embeddings can be useful for understanding semantic relationships and enhancing the performance of various natural language processing tasks.

요약하면, 연속 단어 봉투(CBOW) 모델은 문맥 단어로부터 목표 단어를 예측하는 데 초점을 두며, 이러한 임베딩은 의미적 관계를 이해하고 다양한 자연어 처리 작업의 성능을 향상시키는 데에 유용합니다.

15.1.4.1. Training

Training continuous bag of words models is almost the same as training skip-gram models. The maximum likelihood estimation of the continuous bag of words model is equivalent to minimizing the following loss function:

연속 단어 모음 모델 continuous bag of words models 학습은 스킵 그램 모델 학습과 거의 동일합니다. 연속 단어 모음 모델의 최대 우도 추정은 다음 손실 함수를 최소화하는 것과 동일합니다.

Notice that

Through differentiation, we can obtain its gradient with respect to any context word vector voi(i=1,…,2m) as

미분을 통해 다음과 같이 모든 문맥 단어 벡터 voi(i=1,…,2m)에 대한 기울기를 얻을 수 있습니다.

The gradients for the other word vectors can be obtained in the same way. Unlike the skip-gram model, the continuous bag of words model typically uses context word vectors as the word representations.

다른 단어 벡터에 대한 기울기도 같은 방법으로 얻을 수 있습니다. 스킵 그램 모델 skip-gram model과 달리 연속 단어 가방 모델 continuous bag of words model 은 일반적으로 문맥 단어 벡터를 단어 표현으로 사용합니다.

5.1.5. Summary

Word vectors are vectors used to represent words, and can also be considered as feature vectors or representations of words. The technique of mapping words to real vectors is called word embedding.

단어 벡터는 단어를 표현하는 데 사용되는 벡터이며, 특징 벡터 또는 단어 표현으로도 간주될 수 있습니다. 단어를 실제 벡터에 매핑하는 기술을 단어 임베딩이라고 합니다.
The word2vec tool contains both the skip-gram and continuous bag of words models.

word2vec 도구에는 skip-gram과 continuous bag of words model이 모두 포함되어 있습니다.
The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence; while the continuous bag of words model assumes that a center word is generated based on its surrounding context words.

스킵그램 모델은 단어를 사용하여 텍스트 시퀀스에서 주변 단어를 생성할 수 있다고 가정합니다. 연속 단어 가방 모델은 중심 단어가 주변 문맥 단어를 기반으로 생성된다고 가정합니다.

15.1.6. Exercises

What is the computational complexity for calculating each gradient? What could be the issue if the dictionary size is huge?
Some fixed phrases in English consist of multiple words, such as “new york”. How to train their word vectors? Hint: see Section 4 in the word2vec paper (Mikolov et al., 2013).
Let’s reflect on the word2vec design by taking the skip-gram model as an example. What is the relationship between the dot product of two word vectors in the skip-gram model and the cosine similarity? For a pair of words with similar semantics, why may the cosine similarity of their word vectors (trained by the skip-gram model) be high?

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

D2L - 15.10. Pretraining BERT (0)	2023.08.30
D2L - 15.9. The Dataset for Pretraining BERT (0)	2023.08.30
D2L - 15.8. Bidirectional Encoder Representations from Transformers (BERT) (0)	2023.08.30
D2L - 15.7. Word Similarity and Analogy (0)	2023.08.30
D2L - 15.6. Subword Embedding (0)	2023.08.30
D2L - 15.5. Word Embedding with Global Vectors (GloVe) (0)	2023.08.29
D2L - 15.4. Pretraining word2vec (0)	2023.08.29
D2L - 15.3. The Dataset for Pretraining Word Embeddings (0)	2023.08.29
D2L- 15.2. Approximate Training (0)	2023.08.28
D2L - 15. Natural Language Processing: Pretraining (0)	2023.08.24

Dive into Deep Learning/D2L Natural language Processing

D2L - 15. Natural Language Processing: Pretraining

2023. 8. 24. 22:52 | Posted by 솔웅

15. Natural Language Processing: Pretraining — Dive into Deep Learning 1.0.3 documentation (d2l.ai)

15. Natural Language Processing: Pretraining — Dive into Deep Learning 1.0.3 documentation

d2l.ai

15. Natural Language Processing: Pretraining

Humans need to communicate. Out of this basic need of the human condition, a vast amount of written text has been generated on an everyday basis. Given rich text in social media, chat apps, emails, product reviews, news articles, research papers, and books, it becomes vital to enable computers to understand them to offer assistance or make decisions based on human languages.

인간은 의사소통을 해야 합니다. 인간 조건의 이러한 기본적 필요로 인해 매일 방대한 양의 문자가 생성되었습니다. 소셜 미디어, 채팅 앱, 이메일, 제품 리뷰, 뉴스 기사, 연구 논문, 서적에 풍부한 텍스트가 있으면 컴퓨터가 이를 이해하여 인간의 언어를 기반으로 지원을 제공하거나 결정을 내릴 수 있도록 하는 것이 중요합니다.

Natural language processing studies interactions between computers and humans using natural languages. In practice, it is very common to use natural language processing techniques to process and analyze text (human natural language) data, such as language models in Section 9.3 and machine translation models in Section 10.5.

자연어 처리는 자연어를 사용하여 컴퓨터와 인간 사이의 상호 작용을 연구합니다. 실제로 자연어 처리 기술을 사용하여 섹션 9.3의 언어 모델 및 섹션 10.5의 기계 번역 모델과 같은 텍스트(인간 자연어) 데이터를 처리하고 분석하는 것이 매우 일반적입니다.

To understand text, we can begin by learning its representations. Leveraging the existing text sequences from large corpora, self-supervised learning has been extensively used to pretrain text representations, such as by predicting some hidden part of the text using some other part of their surrounding text. In this way, models learn through supervision from massive text data without expensive labeling efforts!

텍스트를 이해하려면 텍스트의 표현을 배우는 것부터 시작할 수 있습니다. 대규모 말뭉치의 기존 텍스트 시퀀스를 활용하는 자기 지도 학습 self-supervised learning은 주변 텍스트의 다른 부분을 사용하여 텍스트의 숨겨진 부분을 예측하는 등 텍스트 표현을 사전 훈련하는 데 광범위하게 사용되었습니다. 이러한 방식으로 모델은 값비싼 라벨링 작업 없이 대규모 텍스트 데이터의 감독을 통해 학습합니다!

As we will see in this chapter, when treating each word or subword as an individual token, the representation of each token can be pretrained using word2vec, GloVe, or subword embedding models on large corpora. After pretraining, representation of each token can be a vector, however, it remains the same no matter what the context is. For instance, the vector representation of “bank” is the same in both “go to the bank to deposit some money” and “go to the bank to sit down”. Thus, many more recent pretraining models adapt representation of the same token to different contexts. Among them is BERT, a much deeper self-supervised model based on the Transformer encoder. In this chapter, we will focus on how to pretrain such representations for text, as highlighted in Fig. 15.1.

이 장에서 볼 수 있듯이 각 단어나 하위 단어를 개별 토큰으로 처리할 때 각 토큰의 표현 representation 은 word2vec, GloVe 또는 대규모 말뭉치의 하위 단어 임베딩 모델을 사용하여 사전 훈련될 수 있습니다. 사전 학습 후 각 토큰의 표현 representation 은 벡터 vector 가 될 수 있지만 컨텍스트가 무엇이든 동일하게 유지됩니다. 예를 들어, "bank"의 벡터 표현 vector representation 은 "go to the bank to deposit some money"와 "go to the bank to sit down" 모두 동일합니다. 따라서 최근의 많은 사전 훈련 모델은 동일한 토큰의 표현 representation 을 다른 상황에 맞게 조정합니다. 그중에는 Transformer 인코더를 기반으로 한 훨씬 더 심층적인 자체 감독 모델 self-supervised model 인 BERT가 있습니다. 이 장에서는 그림 15.1에 강조 표시된 대로 텍스트에 대한 표현 representations 을 사전 훈련하는 방법에 중점을 둘 것입니다.

Representation of Token 이란?

In the context of natural language processing (NLP), a "token" refers to a unit of text that has been segmented from a larger piece of text. This segmentation can be based on various criteria, such as words, subwords, characters, or even more complex linguistic units. The representation of a token refers to how that token is encoded in a numerical format that can be understood and processed by machine learning models.

자연어 처리(NLP)의 맥락에서 "토큰"은 큰 텍스트 조각에서 분할된 텍스트 단위를 나타냅니다. 이 분할은 단어, 하위단어, 문자 또는 더 복잡한 언어 단위를 기준으로 할 수 있습니다. 토큰의 표현은 해당 토큰이 기계 학습 모델이 이해하고 처리할 수 있는 수치 형식으로 인코딩되는 방식을 의미합니다.

In NLP, machine learning models, including deep learning models, work with numerical data. Therefore, text data (which is inherently non-numerical) needs to be transformed into a numerical format that these models can work with. This process of transforming text into numbers is called "token representation" or "text representation."

NLP에서 딥러닝 모델을 포함한 기계 학습 모델은 수치 데이터로 작동합니다. 따라서 기계 학습 모델이 처리할 수 있는 수치 형식으로 변환해야 하는 텍스트 데이터(본질적으로 비숫자적)를 변환해야 합니다. 이 텍스트를 숫자로 변환하는 과정을 "토큰 표현" 또는 "텍스트 표현"이라고 합니다.

There are various methods for representing tokens in NLP:

NLP에서 토큰을 나타내는 다양한 방법이 있습니다:

One-Hot Encoding: Each token is represented as a vector where only one element is "hot" (1) and the rest are "cold" (0). The position of the "hot" element corresponds to the index of the token in a predefined vocabulary.

원-핫 인코딩: 각 토큰은 하나의 요소만 "활성화"(1)되고 나머지는 "비활성화"(0)되는 벡터로 나타납니다. "활성화" 요소의 위치는 사전 정의된 어휘에서 토큰의 인덱스에 해당합니다.
Word Embeddings: Words are mapped to dense vectors in continuous vector spaces. Word embeddings capture semantic relationships between words based on their context and are often pre-trained using large text corpora.

워드 임베딩: 단어가 연속 벡터 공간에서 덴스 벡터로 매핑됩니다. 워드 임베딩은 단어 간 의미 관계를 그들의 문맥을 기반으로 포착하며 종종 큰 텍스트 말뭉치를 사용하여 사전 훈련됩니다.
Subword Embeddings: These are similar to word embeddings but work at a subword level, breaking down words into smaller units like characters or character n-grams. This is useful for handling out-of-vocabulary words and morphological variations.

하위단어 임베딩: 이것은 워드 임베딩과 유사하지만 하위단어 수준에서 작동하며 단어를 문자 또는 문자 n-gram과 같은 더 작은 단위로 분해합니다. 이는 어휘에 없는 단어와 형태학적 변형을 처리하는 데 유용합니다.
Contextualized Embeddings: These embeddings consider the context in which a token appears to generate its representation. Models like ELMo, GPT, and BERT fall into this category.

맥락화된 임베딩: 이러한 임베딩은 토큰이 나타나는 맥락을 고려하여 토큰의 표현을 생성합니다. ELMo, GPT 및 BERT와 같은 모델이 이 범주에 속합니다.
Positional Encodings: In models like transformers, which lack inherent positional information, positional encodings are added to the token embeddings to convey their position in a sequence.

위치 인코딩: 트랜스포머와 같은 모델에서 본질적인 위치 정보가 없는 경우 토큰 임베딩에 위치 인코딩을 추가하여 시퀀스에서의 위치를 전달합니다.
Image-Based Tokenization: In some cases, tokens might not be traditional linguistic units, but rather segments of images (e.g., in image captioning tasks), which need their own representation.

이미지 기반의 토큰화: 경우에 따라서 토큰이 전통적인 언어 단위가 아닌 이미지 세그먼트(예: 이미지 캡션 작업)일 수 있으며, 이들은 고유한 표현이 필요합니다.

The choice of token representation method depends on the task, dataset, and the architecture of the model you're using. Effective token representations are crucial for enabling machine learning models to understand and generate human language effectively.

토큰 표현 방법의 선택은 작업, 데이터셋 및 사용하는 모델의 아키텍처에 따라 달라집니다. 효과적인 토큰 표현은 기계 학습 모델이 인간의 언어를 효과적으로 이해하고 생성할 수 있도록 하는 데 중요합니다.

Fig. 15.1  Pretrained text representations can be fed to various deep learning architectures for different downstream natural language processing applications. This chapter focuses on the upstream text representation pretraining. 사전 훈련된 텍스트 표현은 다양한 다운스트림 자연어 처리 애플리케이션을 위한 다양한 딥 러닝 아키텍처에 제공될 수 있습니다. 이 장에서는 업스트림 텍스트 표현 사전 학습에 중점을 둡니다.

For sight of the big picture, Fig. 15.1 shows that the pretrained text representations can be fed to a variety of deep learning architectures for different downstream natural language processing applications. We will cover them in Section 16.

큰 그림을 보기 위해 그림 15.1은 사전 훈련된 텍스트 표현 text representations 이 다양한 다운스트림 자연어 처리 애플리케이션을 위한 다양한 딥 러닝 아키텍처에 공급될 수 있음을 보여줍니다. 이에 대해서는 섹션 16에서 다루겠습니다.

16. Natural Language Processing: Applications — Dive into Deep Learning 1.0.3 documentation

d2l.ai

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

D2L - 15.10. Pretraining BERT (0)	2023.08.30
D2L - 15.9. The Dataset for Pretraining BERT (0)	2023.08.30
D2L - 15.8. Bidirectional Encoder Representations from Transformers (BERT) (0)	2023.08.30
D2L - 15.7. Word Similarity and Analogy (0)	2023.08.30
D2L - 15.6. Subword Embedding (0)	2023.08.30
D2L - 15.5. Word Embedding with Global Vectors (GloVe) (0)	2023.08.29
D2L - 15.4. Pretraining word2vec (0)	2023.08.29
D2L - 15.3. The Dataset for Pretraining Word Embeddings (0)	2023.08.29
D2L- 15.2. Approximate Training (0)	2023.08.28
D2L- 15.1. Word Embedding (word2vec) (0)	2023.08.25

Dive into Deep Learning/D2L Computer Vision

D2L - 14.12. Neural Style Transfer

2023. 8. 21. 04:49 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/neural-style.html

14.12. Neural Style Transfer — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.12. Neural Style Transfer

If you are a photography enthusiast, you may be familiar with the filter. It can change the color style of photos so that landscape photos become sharper or portrait photos have whitened skins. However, one filter usually only changes one aspect of the photo. To apply an ideal style to a photo, you probably need to try many different filter combinations. This process is as complex as tuning the hyperparameters of a model.

사진 애호가라면 필터에 익숙할 것입니다. 사진의 색상 스타일을 변경하여 풍경 사진을 더 선명하게 하거나 인물 사진의 피부를 하얗게 만들 수 있습니다. 그러나 하나의 필터는 일반적으로 사진의 한 측면만 변경합니다. 사진에 이상적인 스타일을 적용하려면 아마도 다양한 필터 조합을 시도해야 할 것입니다. 이 프로세스는 모델의 하이퍼파라미터 조정만큼 복잡합니다.

In this section, we will leverage layerwise representations of a CNN to automatically apply the style of one image to another image, i.e., style transfer (Gatys et al., 2016). This task needs two input images: one is the content image and the other is the style image. We will use neural networks to modify the content image to make it close to the style image in style. For example, the content image in Fig. 14.12.1 is a landscape photo taken by us in Mount Rainier National Park in the suburbs of Seattle, while the style image is an oil painting with the theme of autumn oak trees. In the output synthesized image, the oil brush strokes of the style image are applied, leading to more vivid colors, while preserving the main shape of the objects in the content image.

이 섹션에서는 CNN의 레이어별 표현을 활용하여 한 이미지의 스타일을 다른 이미지에 자동으로 적용합니다(예: 스타일 전송)(Gatys et al., 2016). 이 작업에는 두 개의 입력 이미지가 필요합니다. 하나는 콘텐츠 이미지이고 다른 하나는 스타일 이미지입니다. 신경망을 사용하여 콘텐츠 이미지를 스타일에서 스타일 이미지에 가깝게 수정합니다. 예를 들어 그림 14.12.1의 콘텐츠 이미지는 시애틀 교외의 레이니어산 국립공원에서 저희가 촬영한 풍경 사진이고, 스타일 이미지는 가을 참나무를 주제로 한 유화입니다. 출력된 합성 이미지에서는 스타일 이미지의 오일 브러시 스트로크가 적용되어 콘텐츠 이미지에서 오브젝트의 주요 형태를 유지하면서 보다 선명한 색상을 연출합니다.

Fig. 14.12.1  Given content and style images, style transfer outputs a synthesized image.

14.12.1. Method

Fig. 14.12.2 illustrates the CNN-based style transfer method with a simplified example. First, we initialize the synthesized image, for example, into the content image. This synthesized image is the only variable that needs to be updated during the style transfer process, i.e., the model parameters to be updated during training. Then we choose a pretrained CNN to extract image features and freeze its model parameters during training. This deep CNN uses multiple layers to extract hierarchical features for images. We can choose the output of some of these layers as content features or style features. Take Fig. 14.12.2 as an example. The pretrained neural network here has 3 convolutional layers, where the second layer outputs the content features, and the first and third layers output the style features.

그림 14.12.2는 CNN 기반의 style transfer 방법을 단순화한 예와 함께 보여준다. 먼저, 예를 들어 합성 이미지를 콘텐츠 이미지로 초기화합니다. 이 합성 이미지는 스타일 전송 프로세스 중에 업데이트해야 하는 유일한 변수, 즉 훈련 중에 업데이트할 모델 매개변수입니다. 그런 다음 사전 훈련된 CNN을 선택하여 훈련 중에 이미지 특징을 추출하고 모델 매개변수를 동결합니다. 이 심층 CNN은 여러 계층을 사용하여 이미지의 계층적 특징을 추출합니다. 이러한 레이어 중 일부의 출력을 콘텐츠 기능 또는 스타일 기능으로 선택할 수 있습니다. 그림 14.12.2를 예로 들어 보겠습니다. 여기서 사전 훈련된 신경망에는 3개의 컨볼루션 레이어가 있습니다. 여기서 두 번째 레이어는 콘텐츠 기능을 출력하고 첫 번째 및 세 번째 레이어는 스타일 기능을 출력합니다.

Fig. 14.12.2  CNN-based style transfer process. Solid lines show the direction of forward propagation and dotted lines show backward propagation.

Next, we calculate the loss function of style transfer through forward propagation (direction of solid arrows), and update the model parameters (the synthesized image for output) through backpropagation (direction of dashed arrows). The loss function commonly used in style transfer consists of three parts: (i) content loss makes the synthesized image and the content image close in content features; (ii) style loss makes the synthesized image and style image close in style features; and (iii) total variation loss helps to reduce the noise in the synthesized image. Finally, when the model training is over, we output the model parameters of the style transfer to generate the final synthesized image.

다음으로 순방향 전파(실선 화살표 방향)를 통해 스타일 전송의 손실 함수를 계산하고 역전파(파선 화살표 방향)를 통해 모델 매개변수(출력용 합성 이미지)를 업데이트합니다. 스타일 전송에서 일반적으로 사용되는 손실 함수는 세 부분으로 구성됩니다. (i) 콘텐츠 손실은 합성 이미지와 콘텐츠 이미지를 콘텐츠 기능에 가깝게 만듭니다. (ii) 스타일 손실은 합성 이미지와 스타일 이미지를 스타일 특징에 가깝게 만듭니다. (iii) 전체 변형 손실은 합성 이미지의 노이즈를 줄이는 데 도움이 됩니다. 마지막으로 모델 학습이 끝나면 style transfer의 모델 파라미터를 출력하여 최종 합성 이미지를 생성합니다.

In the following, we will explain the technical details of style transfer via a concrete experiment.

다음에서는 구체적인 실험을 통해 스타일 이전의 기술적 세부 사항을 설명합니다.

14.12.2. Reading the Content and Style Images

First, we read the content and style images. From their printed coordinate axes, we can tell that these images have different sizes.

먼저 콘텐츠와 스타일 이미지를 읽습니다. 인쇄된 좌표축에서 이러한 이미지의 크기가 다르다는 것을 알 수 있습니다.

%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l

d2l.set_figsize()
content_img = d2l.Image.open('../img/rainier.jpg')
d2l.plt.imshow(content_img);

위 코드는 딥러닝 모델을 사용하여 이미지를 분석하고 시각화하는 예제입니다.

%matplotlib inline은 Jupyter Notebook에서 Matplotlib 그래프를 인라인으로 표시하는 명령입니다. 그래프를 보다 쉽게 시각화하기 위해 사용됩니다.
import torch는 파이토치 라이브러리를 임포트합니다.
import torchvision은 torchvision 라이브러리를 임포트합니다. torchvision은 컴퓨터 비전 관련 작업을 위한 데이터셋, 모델 등을 제공합니다.
from torch import nn은 파이토치의 nn 모듈을 가져옵니다. 이 모듈은 신경망 구성 요소를 정의하는 데 사용됩니다.
from d2l import torch as d2l은 D2L 라이브러리에서 torch 모듈을 가져와 d2l이란 이름으로 사용한다는 것을 의미합니다.
d2l.set_figsize()는 Matplotlib 그래프의 크기를 설정하는 함수입니다.
content_img = d2l.Image.open('../img/rainier.jpg')는 'rainier.jpg' 이미지 파일을 열어 content_img에 저장합니다. d2l.Image.open은 D2L 라이브러리의 이미지 처리 함수로, 이미지 파일을 열고 처리하는 기능을 제공합니다.
d2l.plt.imshow(content_img)는 content_img를 Matplotlib을 사용하여 이미지로 표시합니다. 이를 통해 'rainier.jpg' 이미지가 출력됩니다.

즉, 이 코드는 'rainier.jpg' 이미지를 열어서 Matplotlib을 사용하여 화면에 표시하는 예제입니다.

style_img = d2l.Image.open('../img/autumn-oak.jpg')
d2l.plt.imshow(style_img);

위 코드는 스타일 이미지를 열고 화면에 표시하는 예제입니다.

style_img = d2l.Image.open('../img/autumn-oak.jpg')는 'autumn-oak.jpg' 이미지 파일을 열어 style_img에 저장합니다. d2l.Image.open은 D2L 라이브러리의 이미지 처리 함수로, 이미지 파일을 열고 처리하는 기능을 제공합니다.
d2l.plt.imshow(style_img)는 style_img를 Matplotlib을 사용하여 이미지로 표시합니다. 이를 통해 'autumn-oak.jpg' 이미지가 출력됩니다.

즉, 이 코드는 'autumn-oak.jpg' 이미지를 열어서 Matplotlib을 사용하여 화면에 표시하는 예제입니다.

14.12.3. Preprocessing and Postprocessing

Below, we define two functions for preprocessing and postprocessing images. The preprocess function standardizes each of the three RGB channels of the input image and transforms the results into the CNN input format. The postprocess function restores the pixel values in the output image to their original values before standardization. Since the image printing function requires that each pixel has a floating point value from 0 to 1, we replace any value smaller than 0 or greater than 1 with 0 or 1, respectively.

아래에서 이미지 전처리 및 후처리를 위한 두 가지 함수를 정의합니다. 전처리 함수는 입력 이미지의 3개 RGB 채널 각각을 표준화하고 결과를 CNN 입력 형식으로 변환합니다. 후처리 기능은 출력 이미지의 픽셀 값을 표준화 이전의 원래 값으로 복원합니다. 이미지 인쇄 기능은 각 픽셀이 0에서 1까지의 부동 소수점 값을 가져야 하므로 0보다 작은 값 또는 1보다 큰 값을 각각 0 또는 1로 바꿉니다.

rgb_mean = torch.tensor([0.485, 0.456, 0.406])
rgb_std = torch.tensor([0.229, 0.224, 0.225])

def preprocess(img, image_shape):
    transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(image_shape),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=rgb_mean, std=rgb_std)])
    return transforms(img).unsqueeze(0)

def postprocess(img):
    img = img[0].to(rgb_std.device)
    img = torch.clamp(img.permute(1, 2, 0) * rgb_std + rgb_mean, 0, 1)
    return torchvision.transforms.ToPILImage()(img.permute(2, 0, 1))

위 코드는 이미지 전처리와 후처리를 수행하는 함수들을 정의하는 예제입니다.

rgb_mean과 rgb_std는 이미지 전처리 및 후처리에 사용될 평균과 표준편차를 정의한 텐서입니다. 이 값들은 일반적으로 이미지를 정규화(normalize)하는 데 사용됩니다.
preprocess(img, image_shape) 함수는 이미지를 주어진 image_shape로 리사이징하고, ToTensor 변환과 Normalize 변환을 순차적으로 적용하여 이미지를 전처리합니다. 마지막에 .unsqueeze(0)를 사용하여 배치 차원을 추가합니다. 이 함수는 이미지를 모델의 입력으로 사용하기 위해 변환하는데 사용됩니다.
postprocess(img) 함수는 모델의 출력 이미지를 원본 이미지 형식으로 복원하는데 사용됩니다. 먼저, 이미지 텐서에서 배치 차원을 제거한 다음, 평균과 표준편차를 사용하여 역정규화(normalize)를 수행합니다. 마지막으로 ToPILImage 변환을 사용하여 텐서를 PIL 이미지로 변환합니다. 이 함수는 모델의 출력을 실제 이미지로 변환하여 시각화하는데 사용됩니다.

요약하면, 이 코드는 이미지를 모델 입력으로 전처리하거나, 모델의 출력을 실제 이미지로 후처리하는 함수들을 정의하는 예제입니다.

14.12.4. Extracting Features

We use the VGG-19 model pretrained on the ImageNet dataset to extract image features (Gatys et al., 2016).

우리는 ImageNet 데이터 세트에서 사전 훈련된 VGG-19 모델을 사용하여 이미지 특징을 추출합니다(Gatys et al., 2016).

pretrained_net = torchvision.models.vgg19(pretrained=True)

위 코드는 torchvision 라이브러리를 사용하여 미리 학습된 VGG-19 모델을 가져오는 예제입니다.

torchvision.models.vgg19(pretrained=True)은 VGG-19 모델을 가져오는 함수입니다. 이 함수는 미리 학습된 가중치를 가진 VGG-19 모델을 반환합니다. pretrained=True로 설정하면 사전 학습된 가중치가 자동으로 로드됩니다.

VGG-19은 컨볼루션 신경망 아키텍처로, 주로 이미지 분류 작업에 사용됩니다. 사전 학습된 모델을 사용하면 이미지 특징 추출 및 전이 학습에 유용한 특성을 제공할 수 있습니다.

Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /home/ci/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100%|██████████| 548M/548M [00:02<00:00, 213MB/s]

In order to extract the content features and style features of the image, we can select the output of certain layers in the VGG network. Generally speaking, the closer to the input layer, the easier to extract details of the image, and vice versa, the easier to extract the global information of the image. In order to avoid excessively retaining the details of the content image in the synthesized image, we choose a VGG layer that is closer to the output as the content layer to output the content features of the image. We also select the output of different VGG layers for extracting local and global style features. These layers are also called style layers. As mentioned in Section 8.2, the VGG network uses 5 convolutional blocks. In the experiment, we choose the last convolutional layer of the fourth convolutional block as the content layer, and the first convolutional layer of each convolutional block as the style layer. The indices of these layers can be obtained by printing the pretrained_net instance.

이미지의 콘텐츠 특징과 스타일 특징을 추출하기 위해 VGG 네트워크에서 특정 레이어의 출력을 선택할 수 있습니다. 일반적으로 입력 레이어에 가까울수록 이미지의 세부 정보를 추출하기 쉽고 그 반대의 경우도 이미지의 전체 정보를 추출하기 쉽습니다. 합성된 이미지에서 콘텐츠 이미지의 세부 사항이 과도하게 유지되는 것을 피하기 위해 출력에 가까운 VGG 레이어를 콘텐츠 레이어로 선택하여 이미지의 콘텐츠 특징을 출력합니다. 또한 로컬 및 글로벌 스타일 기능을 추출하기 위해 다양한 VGG 레이어의 출력을 선택합니다. 이러한 레이어를 스타일 레이어라고도 합니다. 섹션 8.2에서 언급했듯이 VGG 네트워크는 5개의 컨볼루션 블록을 사용합니다. 실험에서는 네 번째 컨볼루션 블록의 마지막 컨볼루션 레이어를 콘텐츠 레이어로 선택하고 각 컨볼루션 블록의 첫 번째 컨볼루션 레이어를 스타일 레이어로 선택합니다. 이러한 레이어의 인덱스는 pretrained_net 인스턴스를 인쇄하여 얻을 수 있습니다.

style_layers, content_layers = [0, 5, 10, 19, 28], [25]

위 코드는 스타일 전이 네트워크를 구성하는 데 사용되는 레이어의 인덱스를 지정하는 예제입니다.

style_layers: 스타일 이미지의 스타일 특징을 포착하는 데 사용되는 레이어의 인덱스 리스트입니다. VGG-19 모델에서는 이들 인덱스에 해당하는 레이어는 스타일 정보를 추출하는데 적절합니다. 스타일 이미지의 다양한 특징을 반영하기 위해 여러 개의 레이어를 선택합니다.
content_layers: 내용 이미지의 내용 특징을 포착하는 데 사용되는 레이어의 인덱스 리스트입니다. VGG-19 모델에서는 이들 인덱스에 해당하는 레이어가 내용 정보를 추출하는데 적절합니다. 일반적으로 스타일 이미지와는 달리 내용 이미지의 특징은 더 깊은 레이어에서 추출됩니다.

이렇게 선택한 스타일 레이어와 내용 레이어의 인덱스는 스타일 전이 네트워크에서 각각의 이미지로부터 추출한 스타일과 내용의 특징을 활용하여 새로운 이미지를 생성하는데 사용됩니다.

When extracting features using VGG layers, we only need to use all those from the input layer to the content layer or style layer that is closest to the output layer. Let’s construct a new network instance net, which only retains all the VGG layers to be used for feature extraction.

VGG 레이어를 사용하여 피처를 추출할 때 입력 레이어에서 콘텐츠 레이어 또는 출력 레이어에 가장 가까운 스타일 레이어까지 모두 사용하면 됩니다. 기능 추출에 사용할 모든 VGG 레이어만 유지하는 새로운 네트워크 인스턴스 네트워크를 구성해 보겠습니다.

net = nn.Sequential(*[pretrained_net.features[i] for i in
                      range(max(content_layers + style_layers) + 1)])

위 코드는 스타일 전이 네트워크를 구성하기 위해 미리 학습된 VGG-19 모델에서 특정 레이어들을 추출하여 Sequential 모듈로 묶는 과정을 수행합니다.

pretrained_net.features: VGG-19 모델의 features 부분은 컨볼루션 레이어들을 포함하는 Sequential 모듈입니다. 이 부분이 입력 이미지를 컨볼루션 레이어들을 통해 특성으로 변환하는 부분입니다.
content_layers와 style_layers: 위에서 설명한 것처럼 내용과 스타일을 추출할 레이어들의 인덱스 리스트입니다.
range(max(content_layers + style_layers) + 1): 선택한 내용 레이어와 스타일 레이어의 최대 인덱스까지의 범위를 생성합니다. 이 범위는 VGG-19 모델의 features 모듈에서 추출할 레이어들의 인덱스 범위를 의미합니다.

nn.Sequential을 사용하여 pretrained_net의 features 모듈에서 선택한 인덱스 범위에 해당하는 레이어들을 추출하여 순차적으로 연결한 net을 생성합니다. 이렇게 생성된 net은 스타일 전이 네트워크에서 입력 이미지를 특성으로 변환하는 역할을 수행합니다.

Given the input X, if we simply invoke the forward propagation net(X), we can only get the output of the last layer. Since we also need the outputs of intermediate layers, we need to perform layer-by-layer computation and keep the content and style layer outputs.

입력 X가 주어지면 순방향 전파 net(X)를 호출하기만 하면 마지막 레이어의 출력만 얻을 수 있습니다. 중간 레이어의 출력도 필요하므로 레이어별 계산을 수행하고 콘텐츠 및 스타일 레이어 출력을 유지해야 합니다.

def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles

위 코드는 입력 이미지를 스타일 전이 네트워크를 통해 특성으로 추출하는 함수를 정의하는 부분입니다.

X: 입력 이미지를 나타내는 텐서입니다.
content_layers: 내용을 추출할 레이어들의 인덱스 리스트입니다.
style_layers: 스타일을 추출할 레이어들의 인덱스 리스트입니다.

함수는 입력 이미지 X를 net에 통과시켜 가면 각 레이어에서의 출력을 계산합니다. 만약 현재 레이어의 인덱스가 style_layers에 속한다면, 해당 레이어의 출력을 스타일을 추출하는 데 사용될 스타일 리스트인 styles에 추가합니다. 마찬가지로 현재 레이어의 인덱스가 content_layers에 속한다면, 해당 레이어의 출력을 내용을 추출하는 데 사용될 내용 리스트인 contents에 추가합니다.

따라서 함수의 반환값은 내용과 스타일을 추출한 특성을 나타내는 리스트인 contents와 styles입니다.

Two functions are defined below: the get_contents function extracts content features from the content image, and the get_styles function extracts style features from the style image. Since there is no need to update the model parameters of the pretrained VGG during training, we can extract the content and the style features even before the training starts. Since the synthesized image is a set of model parameters to be updated for style transfer, we can only extract the content and style features of the synthesized image by calling the extract_features function during training.

아래에 두 가지 함수가 정의되어 있습니다. get_contents 함수는 콘텐츠 이미지에서 콘텐츠 특징을 추출하고 get_styles 함수는 스타일 이미지에서 스타일 특징을 추출합니다. 훈련 중에 사전 훈련된 VGG의 모델 매개변수를 업데이트할 필요가 없기 때문에 훈련이 시작되기 전에 콘텐츠와 스타일 특징을 추출할 수 있습니다. 합성 이미지는 스타일 전달을 위해 업데이트할 모델 매개변수의 집합이므로 훈련 중에 extract_features 함수를 호출해야만 합성 이미지의 콘텐츠 및 스타일 특징을 추출할 수 있습니다.

def get_contents(image_shape, device):
    content_X = preprocess(content_img, image_shape).to(device)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y

def get_styles(image_shape, device):
    style_X = preprocess(style_img, image_shape).to(device)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y

위 코드는 입력 이미지에서 내용과 스타일을 추출하는 함수들을 정의하는 부분입니다.

get_contents(image_shape, device): 이 함수는 내용 이미지로부터 내용을 추출하는 역할을 합니다.
- image_shape: 이미지의 크기를 지정합니다.
- device: 사용할 디바이스를 지정합니다.
함수는 content_img를 preprocess 함수를 사용하여 전처리하고 content_layers와 style_layers를 사용하여 extract_features 함수를 통해 내용 특성을 추출합니다. 반환값으로는 전처리된 내용 이미지 content_X와 추출된 내용 특성 리스트 contents_Y를 반환합니다.
get_styles(image_shape, device): 이 함수는 스타일 이미지로부터 스타일을 추출하는 역할을 합니다.
- image_shape: 이미지의 크기를 지정합니다.
- device: 사용할 디바이스를 지정합니다.
함수는 style_img를 preprocess 함수를 사용하여 전처리하고 content_layers와 style_layers를 사용하여 extract_features 함수를 통해 스타일 특성을 추출합니다. 반환값으로는 전처리된 스타일 이미지 style_X와 추출된 스타일 특성 리스트 styles_Y를 반환합니다.

14.12.5. Defining the Loss Function

Now we will describe the loss function for style transfer. The loss function consists of the content loss, style loss, and total variation loss.

이제 스타일 전송을 위한 손실 함수에 대해 설명하겠습니다. 손실 함수는 콘텐츠 손실, 스타일 손실 및 총 변형 손실로 구성됩니다.

14.12.5.1. Content Loss

Similar to the loss function in linear regression, the content loss measures the difference in content features between the synthesized image and the content image via the squared loss function. The two inputs of the squared loss function are both outputs of the content layer computed by the extract_features function.

선형 회귀의 손실 함수와 유사하게 콘텐츠 손실은 제곱 손실 함수를 통해 합성 이미지와 콘텐츠 이미지 간의 콘텐츠 특징 차이를 측정합니다. 제곱 손실 함수의 두 입력은 모두 extract_features 함수로 계산된 콘텐츠 계층의 출력입니다.

def content_loss(Y_hat, Y):
    # We detach the target content from the tree used to dynamically compute
    # the gradient: this is a stated value, not a variable. Otherwise the loss
    # will throw an error.
    return torch.square(Y_hat - Y.detach()).mean()

위 코드는 내용 손실 함수를 정의하는 부분입니다.

content_loss(Y_hat, Y) 함수는 추출된 내용 특성 Y_hat와 목표 내용 특성 Y 사이의 손실을 계산합니다. Y를 detach() 함수를 사용하여 연결된 계산 그래프에서 분리하여 계산하므로, 이것은 변수가 아닌 값으로 취급되어 경사도 계산 시 에러를 방지할 수 있습니다. 함수 내에서는 torch.square() 함수를 사용하여 제곱을 계산하고, 그 후에 평균을 계산하여 내용 손실을 반환합니다.

14.12.5.2. Style Loss

Style loss, similar to content loss, also uses the squared loss function to measure the difference in style between the synthesized image and the style image. To express the style output of any style layer, we first use the extract_features function to compute the style layer output. Suppose that the output has 1 example, c channels, height ℎ, and width w, we can transform this output into matrix X with c rows and ℎw columns. This matrix can be thought of as the concatenation of c vectors x1,…,xo, each of which has a length of ℎw. Here, vector xi represents the style feature of channel i.

스타일 손실은 콘텐츠 손실과 마찬가지로 제곱 손실 함수를 사용하여 합성 이미지와 스타일 이미지 간의 스타일 차이를 측정합니다. 스타일 레이어의 스타일 출력을 표현하기 위해 먼저 extract_features 함수를 사용하여 스타일 레이어 출력을 계산합니다. 출력에 1개의 채널, 높이 ℎ, 너비 w가 있다고 가정하면 이 출력을 c 행과 ℎw 열이 있는 행렬 X로 변환할 수 있습니다. 이 행렬은 각각 길이가 ℎw인 c 벡터 x1,…,xo의 연결로 생각할 수 있습니다. 여기서 벡터 xi는 채널 i의 스타일 특징을 나타냅니다.

In the Gram matrix of these vectors XX**⊤∈ℝ**c×c, element xij in row i and column j is the dot product of vectors xi and xj. It represents the correlation of the style features of channels i and j. We use this Gram matrix to represent the style output of any style layer. Note that when the value of ℎw is larger, it likely leads to larger values in the Gram matrix. Note also that the height and width of the Gram matrix are both the number of channels c. To allow style loss not to be affected by these values, the gram function below divides the Gram matrix by the number of its elements, i.e., cℎw.

이러한 벡터 XX**⊤∈ℝ**c×c의 그람 행렬에서 i행과 j열의 요소 xij는 벡터 xi와 xj의 내적입니다. 채널 i와 j의 스타일 특징의 상관관계를 나타냅니다. 이 그램 매트릭스를 사용하여 모든 스타일 레이어의 스타일 출력을 나타냅니다. ℎw의 값이 클수록 그람 행렬에서 더 큰 값이 될 가능성이 높습니다. 또한 그램 행렬의 높이와 너비는 모두 채널 c의 수입니다. 스타일 손실이 이러한 값의 영향을 받지 않도록 하기 위해 아래의 gram 함수는 Gram 행렬을 해당 요소의 수, 즉 cℎw로 나눕니다.

def gram(X):
    num_channels, n = X.shape[1], X.numel() // X.shape[1]
    X = X.reshape((num_channels, n))
    return torch.matmul(X, X.T) / (num_channels * n)

위 코드는 그람 행렬 함수를 정의하는 부분입니다.

gram(X) 함수는 입력으로 들어온 특성 맵 X의 그람 행렬을 계산합니다. 그람 행렬은 특성 맵의 채널 간 상관 관계를 캡처하는데 사용되며, 스타일 손실을 계산하는 데 활용됩니다.

함수 내에서는 먼저 X의 크기를 조정하여 num_channels을 채널 수로, n을 각 채널에서의 픽셀 수로 변환합니다. 그런 다음 torch.matmul() 함수를 사용하여 X와 X.T (전치된 X)의 행렬 곱을 계산하고, 이를 (num_channels * n)로 나누어 그람 행렬을 구합니다. 이렇게 구한 그람 행렬은 스타일 손실 계산에 사용됩니다.

Obviously, the two Gram matrix inputs of the squared loss function for style loss are based on the style layer outputs for the synthesized image and the style image. It is assumed here that the Gram matrix gram_Y based on the style image has been precomputed.

분명히 스타일 손실에 대한 제곱 손실 함수의 두 그램 매트릭스 입력은 합성 이미지와 스타일 이미지에 대한 스타일 레이어 출력을 기반으로 합니다. 여기서는 스타일 이미지를 기반으로 하는 그램 행렬 gram_Y가 미리 계산되었다고 가정합니다.

def style_loss(Y_hat, gram_Y):
    return torch.square(gram(Y_hat) - gram_Y.detach()).mean()

위 코드는 스타일 손실 함수를 정의하는 부분입니다.

style_loss(Y_hat, gram_Y) 함수는 스타일 이미지의 그람 행렬 gram_Y와 생성된 이미지의 특성 맵 Y_hat 간의 스타일 손실을 계산합니다. 스타일 손실은 생성된 이미지의 특성 맵이 스타일 이미지의 스타일을 얼마나 잘 따라가는지를 나타내는 지표입니다.

함수 내에서는 Y_hat의 그람 행렬을 계산하기 위해 gram(Y_hat) 함수를 호출하고, 이를 gram_Y와 비교하여 차이의 제곱을 계산한 후 평균을 취합니다. 이로써 스타일 손실을 구하게 됩니다. gram_Y를 detach()하여 그람 행렬의 계산 그래프를 분리하여 오류가 발생하지 않도록 합니다.

Gram Matrix 란?

A Gram Matrix, also known as a Gramian Matrix or a Gramian, is a mathematical construct used in linear algebra and signal processing. It is derived from the inner products of a set of vectors and provides insights into the relationships between these vectors. The Gram Matrix is particularly useful in various applications, including image processing, machine learning, and the analysis of functions.

Gram 행렬은 선형 대수 및 신호 처리에서 사용되는 수학적인 개념으로, 벡터들의 내적(inner product)에서 유도되며 이러한 벡터들 간의 관계에 대한 통찰력을 제공합니다. Gram 행렬은 이미지 처리, 기계 학습, 함수 분석과 같은 다양한 응용 분야에서 특히 유용하게 활용됩니다.

Given a set of vectors in an -dimensional space, the Gram Matrix is defined as follows:

이 -차원 공간에서 주어진 벡터들의 집합이라고 할 때, Gram 행렬 은 다음과 같이 정의됩니다:

x_1^T x_1 & x_1^T x_2 & \ldots & x_1^T x_n \\

x_2^T x_1 & x_2^T x_2 & \ldots & x_2^T x_n \\

\vdots & \vdots & \ddots & \vdots \\

x_n^T x_1 & x_n^T x_2 & \ldots & x_n^T x_n \\

\end{bmatrix}\]

Here, \(x_i^T\) represents the transpose of vector \(x_i\), and \(x_i^T x_j\) represents the inner product (dot product) between vectors \(x_i\) and \(x_j\).

여기서 \(x_i^T\)는 벡터 \(x_i\)의 전치(transpose)를 나타내며, \(x_i^T x_j\)는 벡터 \(x_i\)와 \(x_j\)의 내적(점곱)을 나타냅니다.

The Gram Matrix is symmetric and positive semi-definite, meaning its eigenvalues are non-negative. It captures the relationships between the vectors in terms of their similarities or correlations. In image processing, for example, the Gram Matrix of a set of image feature vectors can be used to analyze texture or style information, which is the basis for techniques like neural style transfer in deep learning.

Gram 행렬은 대칭적이고 양의 준정부호(positive semi-definite)를 가집니다. 즉, 고유값은 비음수입니다. 이는 벡터들 간의 유사성 또는 상관 관계를 포착합니다. 예를 들어 이미지 처리에서, 이미지 특징 벡터들의 집합의 Gram 행렬은 텍스처나 스타일 정보를 분석하는 데 사용될 수 있으며, 이는 딥 러닝에서의 신경망 스타일 전이(neural style transfer)와 같은 기법의 기초입니다.

In machine learning, the Gram Matrix is used in kernel methods for support vector machines (SVMs) and other algorithms that require pairwise similarities between data points. It allows these algorithms to operate in a higher-dimensional space without explicitly computing the transformed feature vectors.

기계 학습에서는 Gram 행렬은 서포트 벡터 머신(SVM)의 커널 방법 및 데이터 포인트 간의 쌍별 유사성을 필요로하는 다른 알고리즘에서 사용됩니다. 이는 이러한 알고리즘들이 변환된 특징 벡터를 명시적으로 계산하지 않고도 더 높은 차원 공간에서 작동할 수 있게 합니다.

Overall, the Gram Matrix is a versatile mathematical tool with applications ranging from signal processing to machine learning, offering insights into the relationships and similarities between vectors in various contexts.

종합적으로, Gram 행렬은 다양한 의미 맥락에서 벡터 간의 관계와 유사성에 대한 통찰력을 제공하는 다재다능한 수학적 도구입니다. 이는 신호 처리부터 기계 학습까지 다양한 분야에서 응용 가능한 중요한 개념입니다.

14.12.5.3. Total Variation Loss

Sometimes, the learned synthesized image has a lot of high-frequency noise, i.e., particularly bright or dark pixels. One common noise reduction method is total variation denoising. Denote by xi,j the pixel value at coordinate (i,j). Reducing total variation loss makes values of neighboring pixels on the synthesized image closer.

때로는 학습된 합성 이미지에 많은 고주파 노이즈, 즉 특히 밝거나 어두운 픽셀이 있습니다. 일반적인 노이즈 감소 방법 중 하나는 전체 변형 노이즈 제거입니다. 좌표(i,j)의 픽셀 값을 xi,j로 표시합니다. 총 변형 손실을 줄이면 합성 이미지의 인접 픽셀 값이 더 가까워집니다.

def tv_loss(Y_hat):
    return 0.5 * (torch.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  torch.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())

위 코드는 전체 변위 손실 함수를 정의하는 부분입니다.

tv_loss(Y_hat) 함수는 생성된 이미지의 특성 맵 Y_hat에 대한 전체 변위 손실을 계산합니다. 전체 변위 손실은 생성된 이미지의 픽셀 간의 변화 정도를 나타내며, 이미지의 부드러움과 노이즈 제거에 기여합니다.

함수 내에서는 Y_hat의 특성 맵에 대해 수직 방향 및 수평 방향으로 픽셀 간의 차이를 계산하고, 절댓값을 취하여 평균을 계산합니다. 이렇게 계산된 두 차이의 평균을 더한 후 0.5를 곱하여 최종적인 전체 변위 손실을 계산합니다.

14.12.5.4. Loss Function

The loss function of style transfer is the weighted sum of content loss, style loss, and total variation loss. By adjusting these weight hyperparameters, we can balance among content retention, style transfer, and noise reduction on the synthesized image.

스타일 전송의 손실 함수는 콘텐츠 손실, 스타일 손실 및 총 변형 손실의 가중 합입니다. 이러한 가중치 하이퍼파라미터를 조정하여 합성 이미지의 콘텐츠 유지, 스타일 전송 및 노이즈 감소 간의 균형을 맞출 수 있습니다.

content_weight, style_weight, tv_weight = 1, 1e4, 10

def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Calculate the content, style, and total variance losses respectively
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Add up all the losses
    l = sum(styles_l + contents_l + [tv_l])
    return contents_l, styles_l, tv_l, l

위 코드는 이미지 생성을 위한 전체 손실을 계산하는 함수를 정의하는 부분입니다.

compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram) 함수는 생성된 이미지 X와 이에 대한 콘텐츠 및 스타일 특성 맵의 예측 contents_Y_hat 및 styles_Y_hat, 그리고 정답 콘텐츠 및 스타일 그람 매트릭스 contents_Y와 styles_Y_gram을 사용하여 전체 손실을 계산합니다.

함수 내에서는 콘텐츠 손실, 스타일 손실, 그리고 전체 변위 손실을 계산합니다. 각각의 손실에는 content_weight, style_weight, tv_weight 가중치가 곱해진 후 합산되어 최종 손실 l이 계산됩니다.

이렇게 계산된 각 손실들을 반환하며, 생성된 이미지의 품질을 조정하는 데 사용됩니다.

14.12.6. Initializing the Synthesized Image

In style transfer, the synthesized image is the only variable that needs to be updated during training. Thus, we can define a simple model, SynthesizedImage, and treat the synthesized image as the model parameters. In this model, forward propagation just returns the model parameters.

스타일 전송에서 합성 이미지는 학습 중에 업데이트해야 하는 유일한 변수입니다. 따라서 간단한 모델인 SynthesizedImage를 정의하고 합성된 이미지를 모델 매개변수로 처리할 수 있습니다. 이 모델에서 정방향 전파는 모델 매개변수만 반환합니다.

class SynthesizedImage(nn.Module):
    def __init__(self, img_shape, **kwargs):
        super(SynthesizedImage, self).__init__(**kwargs)
        self.weight = nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight

위 코드는 신경망 모델을 정의하는 클래스인 SynthesizedImage를 나타냅니다.

SynthesizedImage 클래스는 nn.Module을 상속하여 정의되며, 생성된 이미지를 나타내는 모델입니다. 생성된 이미지는 nn.Parameter를 통해 모델의 학습 가능한 매개변수로서 정의됩니다. 생성된 이미지의 크기는 img_shape로 주어지며, 이는 self.weight 매개변수의 형태가 됩니다.

forward 메서드는 해당 모델의 순전파(forward) 연산을 수행하는 부분으로, 단순히 self.weight 값을 반환합니다. 이렇게 정의된 클래스는 이미지 생성 과정에서 모델을 사용하여 이미지를 조작하고 업데이트하는 데 활용될 수 있습니다.

Next, we define the get_inits function. This function creates a synthesized image model instance and initializes it to the image X. Gram matrices for the style image at various style layers, styles_Y_gram, are computed prior to training.

다음으로 get_inits 함수를 정의합니다. 이 함수는 합성된 이미지 모델 인스턴스를 생성하고 이미지 X로 초기화합니다. 다양한 스타일 레이어에서 스타일 이미지에 대한 그램 매트릭스(style_Y_gram)는 훈련 전에 계산됩니다.

def get_inits(X, device, lr, styles_Y):
    gen_img = SynthesizedImage(X.shape).to(device)
    gen_img.weight.data.copy_(X.data)
    trainer = torch.optim.Adam(gen_img.parameters(), lr=lr)
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer

위 코드는 생성된 이미지의 초기화 및 학습에 필요한 초기화 요소를 얻는 함수 get_inits를 나타냅니다.

이 함수는 다음과 같은 역할을 수행합니다:

X: 원본 이미지
device: 사용할 디바이스 (CPU 또는 GPU)
lr: 학습률
styles_Y: 스타일 이미지의 Gram 행렬들

위 매개변수들을 이용하여 다음 작업을 수행합니다:

SynthesizedImage 모델의 인스턴스 gen_img를 생성하고, 해당 모델의 가중치를 원본 이미지 X의 데이터로 초기화합니다.
styles_Y의 각 스타일 이미지에 대한 Gram 행렬을 계산하여 styles_Y_gram에 저장합니다.
gen_img, styles_Y_gram, 그리고 Adam 옵티마이저를 생성하여 반환합니다.

이렇게 생성된 이미지와 초기화에 필요한 정보는 실제 이미지를 생성하고 업데이트하는 학습 과정에서 활용될 것입니다.

14.12.7. Training

When training the model for style transfer, we continuously extract content features and style features of the synthesized image, and calculate the loss function. Below defines the training loop.

스타일 전달을 위해 모델을 훈련할 때 합성 이미지의 콘텐츠 특징과 스타일 특징을 지속적으로 추출하고 손실 함수를 계산합니다. 아래는 훈련 루프를 정의합니다.

def train(X, contents_Y, styles_Y, device, lr, num_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, device, lr, styles_Y)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_decay_epoch, 0.8)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs],
                            legend=['content', 'style', 'TV'],
                            ncols=2, figsize=(7, 2.5))
    for epoch in range(num_epochs):
        trainer.zero_grad()
        contents_Y_hat, styles_Y_hat = extract_features(
            X, content_layers, style_layers)
        contents_l, styles_l, tv_l, l = compute_loss(
            X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            animator.axes[1].imshow(postprocess(X))
            animator.add(epoch + 1, [float(sum(contents_l)),
                                     float(sum(styles_l)), float(tv_l)])
    return X

위 코드는 스타일 전이 네트워크를 학습하는 함수 train을 나타냅니다.

이 함수는 다음과 같은 역할을 수행합니다:

X: 원본 이미지
contents_Y: 내용 이미지의 특징
styles_Y: 스타일 이미지의 특징
device: 사용할 디바이스 (CPU 또는 GPU)
lr: 학습률
num_epochs: 에포크 수
lr_decay_epoch: 학습률 감소를 적용할 에포크 간격

위 매개변수들을 이용하여 다음 작업을 수행합니다:

생성된 이미지 X, 스타일 이미지의 Gram 행렬들 styles_Y_gram, 그리고 Adam 옵티마이저를 초기화하고 얻습니다.
StepLR 스케줄러를 생성하여 학습률을 조정하는데 사용합니다.
애니메이션을 위한 Animator를 초기화합니다.
주어진 에포크 수만큼 학습을 반복하면서 다음을 수행합니다:
- 옵티마이저의 그래디언트를 초기화합니다.
- 생성된 이미지 X를 스타일 전이 네트워크에 통과시켜 내용과 스타일 특징을 추출합니다.
- 내용 손실, 스타일 손실, 총 변동 손실을 계산합니다.
- 총 손실에 대한 역전파를 수행하고 옵티마이저를 업데이트합니다.
- 스케줄러를 업데이트하여 학습률을 조정합니다.
- 일정 에포크마다 생성된 이미지를 시각화하고, 내용 손실, 스타일 손실, 변동 손실을 기록합니다.
최종적으로 생성된 이미지 X를 반환합니다.

Now we start to train the model. We rescale the height and width of the content and style images to 300 by 450 pixels. We use the content image to initialize the synthesized image.

이제 모델 학습을 시작합니다. 콘텐츠 및 스타일 이미지의 높이와 너비를 300 x 450픽셀로 재조정합니다. 콘텐츠 이미지를 사용하여 합성 이미지를 초기화합니다.

device, image_shape = d2l.try_gpu(), (300, 450)  # PIL Image (h, w)
net = net.to(device)
content_X, contents_Y = get_contents(image_shape, device)
_, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.3, 500, 50)

위 코드는 주어진 내용 이미지와 스타일 이미지를 사용하여 스타일 전이 네트워크를 학습하고, 생성된 이미지를 반환하는 작업을 수행합니다.

device와 image_shape를 설정합니다. device는 학습에 사용할 디바이스(CPU 또는 GPU)를 결정하고, image_shape는 생성될 이미지의 크기를 결정합니다.
net을 지정된 디바이스로 옮깁니다. 이는 네트워크를 선택한 디바이스에서 실행하도록 하는 역할을 합니다.
get_contents 함수를 사용하여 내용 이미지의 특징을 추출합니다. 추출한 내용 이미지와 특징을 content_X와 contents_Y로 저장합니다.
get_styles 함수를 사용하여 스타일 이미지의 특징을 추출합니다. 추출한 스타일 이미지와 특징을 styles_X와 styles_Y로 저장합니다.
train 함수를 호출하여 스타일 전이 네트워크를 학습합니다. content_X, contents_Y, styles_Y, device, 학습률(0.3), 에포크 수(500), 학습률 감소 에포크(50)를 매개변수로 전달합니다. 학습된 결과로 생성된 이미지를 output에 저장합니다.

즉, 위 코드는 주어진 내용 이미지와 스타일 이미지를 이용하여 스타일 전이 네트워크를 학습하고, 학습된 결과로 생성된 이미지를 반환합니다.

We can see that the synthesized image retains the scenery and objects of the content image, and transfers the color of the style image at the same time. For example, the synthesized image has blocks of color like those in the style image. Some of these blocks even have the subtle texture of brush strokes.

합성된 이미지는 콘텐츠 이미지의 풍경과 사물을 그대로 유지함과 동시에 스타일 이미지의 색상을 전달함을 알 수 있다. 예를 들어 합성 이미지에는 스타일 이미지와 같은 색상 블록이 있습니다. 이러한 블록 중 일부는 브러시 스트로크의 미묘한 질감을 가지고 있습니다.

14.12.8. Summary

The loss function commonly used in style transfer consists of three parts: (i) content loss makes the synthesized image and the content image close in content features; (ii) style loss makes the synthesized image and style image close in style features; and (iii) total variation loss helps to reduce the noise in the synthesized image.
스타일 전송에서 일반적으로 사용되는 손실 함수는 세 부분으로 구성됩니다. (i) 콘텐츠 손실은 합성 이미지와 콘텐츠 이미지를 콘텐츠 기능에 가깝게 만듭니다. (ii) 스타일 손실은 합성 이미지와 스타일 이미지를 스타일 특징에 가깝게 만듭니다. (iii) 전체 변형 손실은 합성 이미지의 노이즈를 줄이는 데 도움이 됩니다.
We can use a pretrained CNN to extract image features and minimize the loss function to continuously update the synthesized image as model parameters during training.
사전 훈련된 CNN을 사용하여 이미지 특징을 추출하고 손실 함수를 최소화하여 훈련 중에 합성된 이미지를 모델 매개변수로 지속적으로 업데이트할 수 있습니다.
We use Gram matrices to represent the style outputs from the style layers.
스타일 레이어의 스타일 출력을 나타내기 위해 그램 매트릭스를 사용합니다.

14.12.9. Exercises

How does the output change when you select different content and style layers?
Adjust the weight hyperparameters in the loss function. Does the output retain more content or have less noise?
Use different content and style images. Can you create more interesting synthesized images?
Can we apply style transfer for text? Hint: you may refer to the survey paper by Hu et al. (2022).

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.11. Fully Convolutional Networks

2023. 8. 21. 04:10 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/fcn.html

14.11. Fully Convolutional Networks — Dive into Deep Learning 1.0.3 documentation

d2l.ai

As discussed in Section 14.9, semantic segmentation classifies images in pixel level. A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels to pixel classes (Long et al., 2015). Unlike the CNNs that we encountered earlier for image classification or object detection, a fully convolutional network transforms the height and width of intermediate feature maps back to those of the input image: this is achieved by the transposed convolutional layer introduced in Section 14.10. As a result, the classification output and the input image have a one-to-one correspondence in pixel level: the channel dimension at any output pixel holds the classification results for the input pixel at the same spatial position.

섹션 14.9에서 논의한 바와 같이 의미론적 분할 semantic segmentation은 픽셀 수준에서 이미지를 분류합니다. 완전 컨볼루션 네트워크 fully convolutional network (FCN)는 컨볼루션 신경망을 사용하여 이미지 픽셀을 픽셀 클래스로 변환합니다(Long et al., 2015). 이전에 이미지 분류 또는 객체 감지를 위해 접한 CNN과 달리 완전 컨볼루션 네트워크는 중간 피처 맵의 높이와 너비를 입력 이미지의 높이와 너비로 다시 변환합니다. 결과적으로 분류 출력과 입력 이미지는 픽셀 수준에서 일대일 대응을 갖습니다. 모든 출력 픽셀의 채널 차원은 동일한 공간 위치에서 입력 픽셀에 대한 분류 결과를 유지합니다.

Fully Convolutional Networks 란?

"Fully Convolutional Networks"는 완전 합성곱 신경망이라고도 불리며, 주로 이미지 분할과 같은 컴퓨터 비전 작업에 사용되는 딥러닝 아키텍처를 의미합니다.

기존의 전통적인 합성곱 신경망(Convolutional Neural Networks, CNN)은 주로 분류 작업을 위해 설계되었으며, 입력 이미지를 각기 다른 레이어로 전달하여 점진적으로 특징을 추출한 후, 마지막에 소프트맥스 레이어로 클래스 확률을 출력하는 구조를 가지고 있습니다.

반면에 Fully Convolutional Networks (FCN)는 입력 이미지를 픽셀 단위로 분류하는 이미지 분할 작업에 특화된 구조입니다. FCN은 일반적인 CNN의 특징 추출 부분을 유지하면서, 마지막 레이어를 완전 합성곱 레이어로 변경하여 원본 이미지의 공간적인 정보를 보존할 수 있도록 합니다. 이를 통해 각 픽셀에 대한 클래스 레이블을 예측하게 됩니다.

FCN은 픽셀 단위 분류 작업에서 효과적이며, 이미지 분할, 객체 탐지, 시맨틱 세그멘테이션 등의 컴퓨터 비전 작업에서 널리 사용되는 아키텍처입니다. FCN은 입력 이미지의 크기에 따라 유연하게 대응할 수 있으며, 특정 객체나 물체의 경계를 정확하게 분할할 수 있는 장점을 가지고 있습니다.

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

이 코드는 주피터 노트북(Jupyter Notebook)에서 실행되는 경우에 사용되는 매직 명령 %matplotlib inline을 시작으로, 필요한 모듈과 라이브러리를 가져오는 부분입니다.

%matplotlib inline: 주피터 노트북 환경에서 Matplotlib 그래프를 인라인으로 표시하도록 설정하는 매직 명령입니다.
import torch: 파이토치 라이브러리를 가져옵니다.
import torchvision: 파이토치의 컴퓨터 비전 관련 기능을 확장한 torchvision 라이브러리를 가져옵니다.
from torch import nn: 파이토치의 nn (neural network) 모듈에서 neural network 관련 기능들을 가져옵니다.
from torch.nn import functional as F: 파이토치의 nn 모듈에서 함수형 인터페이스를 가져와 F로 별칭을 지정합니다. 이것은 주로 활성화 함수나 손실 함수와 같은 함수적 기능을 가리키기 위해 사용됩니다.
from d2l import torch as d2l: "Dive into Deep Learning" 책의 도구 함수를 사용하기 위해 d2l 라이브러리를 가져옵니다. torch 별칭을 d2l로 변경하여 사용합니다.

이 코드 조각은 파이토치와 관련된 라이브러리와 함수를 가져옴으로써 딥러닝 모델을 구축하고 훈련하기 위한 기본적인 환경을 설정합니다.

14.11.1. The Model

Here we describe the basic design of the fully convolutional network model. As shown in Fig. 14.11.1, this model first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a 1×1 convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution introduced in Section 14.10. As a result, the model output has the same height and width as the input image, where the output channel contains the predicted classes for the input pixel at the same spatial position.

여기서 우리는 완전 컨벌루션 네트워크 모델의 기본 설계를 설명합니다. 그림 14.11.1과 같이 이 모델은 먼저 CNN을 사용하여 이미지 특징을 추출한 다음 1×1 컨벌루션 레이어를 통해 채널 수를 클래스 수로 변환하고 마지막으로 특징 맵의 높이와 너비를 변환합니다. 섹션 14.10에서 소개한 전치 컨벌루션을 통해 입력 이미지의 이미지로 변환합니다. 결과적으로 모델 출력은 입력 이미지와 동일한 높이와 너비를 가지며 출력 채널에는 동일한 공간 위치에서 입력 픽셀에 대해 예측된 클래스가 포함됩니다.

Fig. 14.11.1  Fully convolutional network.

Below, we use a ResNet-18 model pretrained on the ImageNet dataset to extract image features and denote the model instance as pretrained_net. The last few layers of this model include a global average pooling layer and a fully connected layer: they are not needed in the fully convolutional network.

아래에서는 ImageNet 데이터 세트에서 사전 훈련된 ResNet-18 모델을 사용하여 이미지 기능을 추출하고 모델 인스턴스를 pretrained_net으로 표시합니다. 이 모델의 마지막 몇 계층에는 전역 평균 풀링 계층과 완전 연결 계층이 포함됩니다. 완전 컨벌루션 네트워크에서는 필요하지 않습니다.

pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]

이 코드는 사전 훈련된 ResNet-18 모델을 불러오고, 해당 모델의 마지막 세 개의 레이어를 리스트로 반환하는 과정을 나타냅니다.

pretrained_net: 사전 훈련된 ResNet-18 모델을 불러옵니다.
torchvision.models.resnet18(pretrained=True): 사전 훈련된 ResNet-18 모델을 불러옵니다. pretrained=True는 사전 훈련된 가중치를 사용하겠다는 의미입니다.
list(pretrained_net.children()): 불러온 ResNet-18 모델의 모든 하위 모듈을 리스트로 반환합니다. 이는 네트워크의 각 레이어를 순차적으로 표현한 것입니다.
[-3:]: 리스트의 마지막 세 개의 원소를 선택합니다. 이는 불러온 ResNet-18 모델의 마지막 세 개의 레이어를 의미합니다.

따라서, 위 코드는 사전 훈련된 ResNet-18 모델을 불러온 뒤, 해당 모델의 마지막 세 개의 레이어를 리스트로 반환하는 과정을 나타냅니다.

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/ci/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 56.3MB/s]

[Sequential(
   (0): BasicBlock(
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (downsample): Sequential(
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     )
   )
   (1): BasicBlock(
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 ),
 AdaptiveAvgPool2d(output_size=(1, 1)),
 Linear(in_features=512, out_features=1000, bias=True)]

위 출력은 ResNet-18 모델의 구조를 설명합니다. 아래에서부터 위로 각각의 레이어와 그 내용을 설명하겠습니다.

Linear 레이어:
- 입력 특성 개수: 512
- 출력 특성 개수: 1000
- 활성화 함수: 없음
- 주어진 입력 특성에 대한 선형 변환을 수행하는 fully connected 레이어입니다. 이것은 마지막 분류 레이어로, 모델의 출력이 클래스 수에 대한 확률 분포를 나타냅니다.
AdaptiveAvgPool2d 레이어:
- 출력 크기: (1, 1)
- 입력으로 들어오는 특성 맵을 주어진 출력 크기에 맞게 자동으로 평균 풀링하는 레이어입니다. 입력 이미지의 크기에 관계 없이 일정한 크기의 출력을 생성합니다.
Sequential 레이어 (2개의 BasicBlock 레이어 포함):
- 첫 번째 BasicBlock:
  - conv1: 입력 특성 개수 256, 출력 특성 개수 512의 3x3 컨볼루션 레이어
  - bn1: 배치 정규화 레이어
  - relu: ReLU 활성화 함수
  - conv2: 입력 특성 개수 512, 출력 특성 개수 512의 3x3 컨볼루션 레이어
  - bn2: 배치 정규화 레이어
  - downsample: 다운샘플링을 위한 Sequential 레이어
- 두 번째 BasicBlock:
  - 위와 같은 구조를 가진 BasicBlock 레이어
BasicBlock은 ResNet의 기본 구성 요소입니다. 입력 특성을 컨볼루션과 배치 정규화를 거쳐 변환한 뒤, 활성화 함수 ReLU를 적용하고 다시 컨볼루션과 배치 정규화를 거칩니다. 이러한 구성으로 인해 ResNet은 깊게 쌓여도 그레이디언트 소실 문제를 해결하며 성능을 높일 수 있습니다.

따라서 위의 출력은 ResNet-18 모델의 구조를 나타내며, 각 레이어의 구성 요소를 설명합니다.

Next, we create the fully convolutional network instance net. It copies all the pretrained layers in the ResNet-18 except for the final global average pooling layer and the fully connected layer that are closest to the output.

다음으로 완전 컨볼루션 네트워크 인스턴스 네트워크를 만듭니다. 최종 전역 평균 풀링 계층과 출력에 가장 가까운 완전 연결 계층을 제외하고 ResNet-18의 모든 사전 훈련된 계층을 복사합니다.

net = nn.Sequential(*list(pretrained_net.children())[:-2])

위 코드는 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 모든 레이어를 새로운 신경망 모델로 가져오는 과정을 나타냅니다.

pretrained_net은 미리 학습된 ResNet-18 모델을 나타냅니다.
pretrained_net.children()는 모델의 모든 하위 레이어를 가져오는 메소드입니다.
list(pretrained_net.children())[:-2]는 마지막 두 레이어를 제외한 모든 레이어를 리스트로 가져옵니다.
nn.Sequential(*...)는 주어진 레이어들을 순차적으로 연결한 새로운 신경망 모델을 생성합니다.

따라서 net은 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 모든 레이어를 가지는 새로운 신경망 모델입니다. 이렇게 하면 기존의 미리 학습된 모델에서 일부 레이어를 재활용하고, 이에 새로운 레이어를 추가하여 원하는 작업에 맞는 모델을 빠르게 구축할 수 있습니다.

Given an input with height and width of 320 and 480 respectively, the forward propagation of net reduces the input height and width to 1/32 of the original, namely 10 and 15.

높이와 너비가 각각 320과 480인 입력이 주어지면 net의 정방향 전파는 입력 높이와 너비를 원본의 1/32, 즉 10과 15로 줄입니다.

X = torch.rand(size=(1, 3, 320, 480))
net(X).shape

위 코드는 미리 학습된 ResNet-18 모델의 마지막 두 레이어를 제외한 나머지 레이어로 입력 데이터를 전달하여 출력의 형태를 확인하는 과정을 나타냅니다.

torch.rand(size=(1, 3, 320, 480))은 크기가 (1, 3, 320, 480)인 랜덤한 값을 가지는 입력 데이터를 생성합니다. 이는 배치 크기가 1이고 채널 수가 3이며, 높이가 320이고 너비가 480인 이미지 데이터를 나타냅니다.
net(X)는 위에서 생성한 입력 데이터 X를 미리 학습된 ResNet-18 모델인 net에 통과시켜 출력을 계산하는 과정입니다.
.shape는 텐서의 크기(shape)를 반환하는 속성입니다. 따라서 net(X).shape는 네트워크의 출력 형태를 나타내게 됩니다.

이 코드를 실행하면 입력 데이터 X를 네트워크에 전달한 결과의 출력 형태를 확인할 수 있습니다.

torch.Size([1, 512, 10, 15])

Next, we use a 1×1 convolutional layer to transform the number of output channels into the number of classes (21) of the Pascal VOC2012 dataset. Finally, we need to increase the height and width of the feature maps by 32 times to change them back to the height and width of the input image. Recall how to calculate the output shape of a convolutional layer in Section 7.3. Since (320−64+16×2+32)/32=10 and (480−64+16×2+32)/32=15, we construct a transposed convolutional layer with stride of 32, setting the height and width of the kernel to 64, the padding to 16. In general, we can see that for stride s, padding s/2 (assuming s/2 is an integer), and the height and width of the kernel 2s, the transposed convolution will increase the height and width of the input by s times.

다음으로 1×1 컨벌루션 레이어를 사용하여 출력 채널 수를 Pascal VOC2012 데이터 세트의 클래스 수(21)로 변환합니다. 마지막으로 피처 맵의 높이와 너비를 32배 늘려 입력 이미지의 높이와 너비로 다시 변경해야 합니다. 섹션 7.3에서 컨볼루션 레이어의 출력 형태를 계산하는 방법을 기억하십시오. (320-64+16×2+32)/32=10이고 (480-64+16×2+32)/32=15이므로 보폭이 32인 전치 컨볼루션 레이어를 구성하고 커널은 64로, 패딩은 16으로 높이와 너비를 설정합니다. 일반적으로 보폭 s, 패딩 s/2(s/2가 정수라고 가정) 및 커널 2s의 높이와 너비에 대해 전치 컨볼루션이 입력의 높이와 너비를 s배로 증가시키는 것을 볼 수 있습니다. .

num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=64, padding=16, stride=32))

위 코드는 기존의 신경망 모델 net에 두 개의 추가 레이어를 추가하는 과정을 나타냅니다.

num_classes = 21: 클래스의 개수를 나타내는 변수로, 이 경우에는 21개의 클래스를 분류하는 문제를 가정하고 있습니다.
nn.Conv2d(512, num_classes, kernel_size=1): final_conv라는 이름의 1x1 컨볼루션 레이어를 생성하고, 입력 채널 수가 512이며 출력 채널 수가 num_classes인 컨볼루션 레이어를 추가합니다. 이 레이어는 마지막 컨볼루션 레이어를 대체하는 역할을 합니다.
nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, padding=16, stride=32): transpose_conv라는 이름의 전치 컨볼루션 레이어를 생성하고, 입력 채널 수와 출력 채널 수가 모두 num_classes인 전치 컨볼루션 레이어를 추가합니다. 이 레이어는 특성 맵의 크기를 늘려주는 데 사용되며, 이 경우에는 크기를 32배 확대하고 패딩과 스트라이드를 조절하여 원하는 출력 크기를 얻습니다.

따라서 위 코드는 미리 학습된 신경망 net의 마지막 레이어를 대체하고, 그 뒤에 전치 컨볼루션 레이어를 추가하여 원하는 출력을 생성할 수 있도록 수정하는 과정을 나타냅니다. 이는 주로 세그멘테이션 등의 문제에서 사용되며, 네트워크의 출력 크기를 조정하고 클래스별 특성 맵을 얻을 때 유용합니다.

14.11.2. Initializing Transposed Convolutional Layers

We already know that transposed convolutional layers can increase the height and width of feature maps. In image processing, we may need to scale up an image, i.e., upsampling. Bilinear interpolation is one of the commonly used upsampling techniques. It is also often used for initializing transposed convolutional layers.

우리는 transposed convolutional layer가 feature map의 height와 width를 증가시킬 수 있다는 것을 이미 알고 있습니다. 이미지 처리에서 이미지를 확장해야 할 수도 있습니다. 즉, 업샘플링이 필요할 수 있습니다. 쌍선형 보간 Bilinear interpolation 은 일반적으로 사용되는 업샘플링 기술 중 하나입니다. 또한 전치된 컨볼루션 레이어를 초기화하는 데 자주 사용됩니다.

To explain bilinear interpolation, say that given an input image we want to calculate each pixel of the upsampled output image. In order to calculate the pixel of the output image at coordinate (x,y), first map (x,y) to coordinate (x′,y′) on the input image, for example, according to the ratio of the input size to the output size. Note that the mapped x′ and y′ are real numbers. Then, find the four pixels closest to coordinate (x′,y′) on the input image. Finally, the pixel of the output image at coordinate (x,y) is calculated based on these four closest pixels on the input image and their relative distance from (x′,y′).

쌍선형 보간법 bilinear interpolation을 설명하기 위해 입력 이미지가 주어졌을 때 업샘플링된 출력 이미지의 각 픽셀을 계산하려고 한다고 가정해 보겠습니다. 좌표 (x,y)에서 출력 이미지의 픽셀을 계산하기 위해 먼저 입력 크기의 비율에 따라 (x,y)를 입력 이미지의 좌표 (x′,y′)에 매핑합니다. 출력 크기에. 매핑된 x′ 및 y′는 실수입니다. 그런 다음 입력 이미지에서 좌표(x′,y′)에 가장 가까운 4개의 픽셀을 찾습니다. 마지막으로 좌표 (x,y)에 있는 출력 이미지의 픽셀은 입력 이미지에서 가장 가까운 4개의 픽셀과 (x',y')로부터의 상대적인 거리를 기반으로 계산됩니다.

bilinear interpolation이란?

'Bilinear Interpolation'은 이미지나 그래픽의 크기나 해상도를 변경하는 과정에서 사용되는 보간 기법 중 하나입니다. 보간(interpolation)은 주어진 데이터 포인트 사이에 새로운 값을 추정하는 기술을 의미하며, 이를 통해 부드럽게 변환된 이미지나 그래픽을 생성할 수 있습니다.

이 중 'Bilinear Interpolation'은 2차원 평면에서 사용되며, 두 개의 인접한 데이터 포인트 사이에서 새로운 값을 추정합니다. 간단히 말해, 2차원 좌표 상에서 (x, y) 위치에 존재하는 값은 주변 4개의 데이터 포인트를 사용하여 계산됩니다. 이렇게 인접한 4개의 데이터 포인트를 가중치를 적용하여 선형 보간한 결과를 사용하여 새로운 값을 추정합니다.

이 기법은 이미지나 그래픽의 크기 변경, 회전, 이동 등에 사용되며, 특히 이미지 리사이징이나 세그멘테이션과 같은 작업에서 주로 활용됩니다. 'Bilinear'이라는 이름은 가중치를 계산하는 데에 선형 보간을 사용한다는 특성을 나타내며, 두 개의 축에 대한 보간이 동시에 이루어지므로 'Bilinear'이라는 이름이 붙여졌습니다.

Upsampling of bilinear interpolation can be implemented by the transposed convolutional layer with the kernel constructed by the following bilinear_kernel function. Due to space limitations, we only provide the implementation of the bilinear_kernel function below without discussions on its algorithm design.

쌍선형 보간법 bilinear interpolation 의 업샘플링은 다음 bilinear_kernel 함수로 구성된 커널을 사용하여 전치된 컨벌루션 계층 transposed convolutional layer에 의해 구현될 수 있습니다. 공간 제한으로 인해 알고리즘 설계에 대한 논의 없이 bilinear_kernel 함수의 구현만 제공합니다.

def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight

위 코드는 주어진 입력 채널 수와 출력 채널 수, 그리고 커널 크기에 따라 이차원의 'Bilinear Interpolation' 커널을 생성하는 함수를 정의하고 있습니다. 이 커널은 주로 합성곱 신경망(Convolutional Neural Network, CNN)에서 전치 합성곱 연산에 활용되며, 주로 세그멘테이션(segmentation)과 같은 작업에서 이미지를 업샘플링하는 데 사용됩니다.

여기서 in_channels은 입력 채널 수, out_channels는 출력 채널 수, kernel_size는 커널의 크기(가로와 세로)입니다. 이 커널은 입력 이미지의 각 채널별로 다른 가중치를 가지며, 이를 통해 입력 이미지의 특성을 보존하면서 업샘플링됩니다.

커널을 생성하는 과정은 다음과 같습니다:

factor 변수는 커널 크기의 중앙을 계산합니다.
center 변수는 중앙 위치를 결정합니다.
og는 커널 크기에 따른 좌표 그리드를 생성합니다.
filt는 각 좌표 위치에 대한 커널 가중치를 계산합니다. 이것은 'Bilinear Interpolation'에 따라 계산된 값으로, 가운데에 가까운 값일수록 높은 가중치를 가지며, 커널 크기를 기준으로 대칭적인 값을 가집니다.
weight 텐서는 이렇게 계산된 filt 값을 이용하여 입력 채널 수와 출력 채널 수에 따른 4차원 텐서를 생성합니다. 이 텐서의 각 채널에 커널 가중치 값을 할당합니다.

이러한 커널은 전치 합성곱 연산에서 업샘플링을 수행할 때 사용되며, 주로 모델의 특성 맵 크기를 늘리는 데 활용됩니다.

Let’s experiment with upsampling of bilinear interpolation that is implemented by a transposed convolutional layer. We construct a transposed convolutional layer that doubles the height and weight, and initialize its kernel with the bilinear_kernel function.

전치된 컨볼루션 레이어에 의해 구현되는 쌍선형 보간의 업샘플링을 실험해 봅시다. 높이와 무게를 두 배로 늘리는 전치 컨볼루션 레이어를 구성하고 bilinear_kernel 함수로 커널을 초기화합니다.

conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
                                bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));

위 코드는 입력 채널 수와 출력 채널 수가 모두 3인 커스텀 'Bilinear Interpolation' 커널을 생성하여 이를 사용하여 전치 합성곱 연산을 수행하는 부분입니다.

해석을 해보겠습니다:

conv_trans는 전치 합성곱 레이어를 생성하는데 사용됩니다. nn.ConvTranspose2d는 전치 합성곱을 수행하는 클래스입니다. 여기서 입력 채널 수와 출력 채널 수 모두 3이며, 커널 크기는 4x4이고, 패딩은 1, 스트라이드는 2로 설정됩니다.
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4))는 위에서 정의한 bilinear_kernel 함수를 사용하여 커스텀 'Bilinear Interpolation' 커널을 생성하고, 이를 conv_trans 레이어의 가중치 텐서로 복사하는 역할을 합니다. 이를 통해 'Bilinear Interpolation'의 특성을 가지는 커널이 전치 합성곱 레이어에 적용됩니다.

따라서 위 코드는 입력 이미지의 특성 맵을 업샘플링하여 크기를 확장하는 역할을 수행합니다. 이를 통해 세그멘테이션과 같은 작업에서 높은 해상도의 예측을 얻을 수 있습니다.

Read the image X and assign the upsampling output to Y. In order to print the image, we need to adjust the position of the channel dimension.

이미지 X를 읽고 업샘플링 출력을 Y에 할당합니다. 이미지를 인쇄하려면 채널 차원의 위치를 조정해야 합니다.

img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()

위 코드는 주어진 이미지에 대해 Bilinear Interpolation을 사용하여 업샘플링을 수행하는 과정을 보여주는 부분입니다.

해석을 해보겠습니다:

img는 d2l.Image.open('../img/catdog.jpg')를 통해 로드된 이미지입니다. 이 이미지를 텐서 형식으로 변환하기 위해 torchvision.transforms.ToTensor()를 사용하여 텐서로 변환합니다.
X는 img를 모델에 입력으로 사용하기 위해 차원을 확장한 것입니다. 일반적으로 모델에 입력할 때 배치 차원을 추가하기 위해 사용됩니다.
conv_trans(X)는 이전에 정의한 conv_trans 레이어를 사용하여 입력 이미지 X에 대한 업샘플링을 수행한 결과를 얻습니다. 이렇게 얻은 결과인 Y는 업샘플링된 이미지입니다.
out_img는 Y 텐서에서 0번째 배치 차원을 선택하여 이미지를 추출합니다. 그리고 이를 .permute(1, 2, 0)를 통해 채널 순서를 변경하여 이미지의 형식을 HWC (높이, 너비, 채널)로 변경합니다. .detach()는 연산을 추적하지 않도록 하기 위해 사용되며, 역전파에 영향을 주지 않습니다.

결과적으로 out_img는 업샘플링된 이미지를 나타내며, 해당 이미지를 시각화하거나 분석하는 데 사용할 수 있습니다.

As we can see, the transposed convolutional layer increases both the height and width of the image by a factor of two. Except for the different scales in coordinates, the image scaled up by bilinear interpolation and the original image printed in Section 14.3 look the same.

보시다시피 transposed convolutional layer는 이미지의 높이와 너비를 2배로 늘립니다. 좌표의 스케일이 다른 것을 제외하면 쌍선형 보간에 의해 확대된 이미지와 14.3절에서 인쇄된 원본 이미지는 동일하게 보입니다.

d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);

위 코드는 입력 이미지와 Bilinear Interpolation을 통해 얻은 업샘플링된 출력 이미지를 시각화하는 부분입니다.

해석을 해보겠습니다:

d2l.set_figsize()는 시각화할 그림의 크기를 설정하는 함수입니다. 보통 d2l 라이브러리에서 제공하는 시각화 함수를 사용할 때 함께 사용됩니다.
print('input image shape:', img.permute(1, 2, 0).shape)는 입력 이미지의 형태를 출력합니다. 이미지의 채널 순서를 변경하여 (높이, 너비, 채널) 형식으로 변경하고, .shape를 사용하여 해당 형태를 출력합니다.
d2l.plt.imshow(img.permute(1, 2, 0))는 d2l 라이브러리의 시각화 함수를 사용하여 입력 이미지를 표시합니다. img.permute(1, 2, 0)를 사용하여 채널 순서를 변경하고 이미지를 시각화합니다.
print('output image shape:', out_img.shape)는 업샘플링된 출력 이미지의 형태를 출력합니다.
d2l.plt.imshow(out_img)는 업샘플링된 출력 이미지를 시각화합니다.

결과적으로 코드는 입력 이미지와 업샘플링된 출력 이미지를 시각화하여 보여주는 역할을 합니다.

input image shape: torch.Size([561, 728, 3])
output image shape: torch.Size([1122, 1456, 3])

In a fully convolutional network, we initialize the transposed convolutional layer with upsampling of bilinear interpolation. For the 1×1 convolutional layer, we use Xavier initialization.

완전 컨벌루션 네트워크에서 쌍선형 보간법의 업샘플링으로 전치된 컨벌루션 레이어를 초기화합니다. 1×1 컨볼루션 레이어의 경우 Xavier 초기화를 사용합니다.

W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);

위 코드는 미리 정의된 네트워크 net의 ConvTranspose2d 레이어의 가중치를 Bilinear Interpolation에 사용되는 가중치로 설정하는 부분입니다.

해석을 해보겠습니다:

W = bilinear_kernel(num_classes, num_classes, 64)은 bilinear_kernel 함수를 사용하여 Bilinear Interpolation에 사용되는 가중치 행렬 W를 생성합니다. 이 함수는 입력 채널 수와 출력 채널 수가 같으며, 커널 크기는 64x64입니다. 이렇게 생성된 W는 Bilinear Interpolation을 수행하는 필터 역할을 합니다.
net.transpose_conv.weight.data.copy_(W)은 미리 정의된 net의 ConvTranspose2d 레이어의 가중치를 Bilinear Interpolation에 사용되는 W로 설정합니다. net.transpose_conv.weight.data는 레이어의 가중치 데이터를 의미하며, .copy_(W)를 사용하여 W의 값을 이 가중치에 복사합니다.

이렇게 설정된 가중치로 인해 ConvTranspose2d 레이어는 입력 이미지를 업샘플링하고 Bilinear Interpolation을 수행하여 출력 이미지를 생성하게 됩니다.

14.11.3. Reading the Dataset

We read the semantic segmentation dataset as introduced in Section 14.9. The output image shape of random cropping is specified as 320×480: both the height and width are divisible by 32.

섹션 14.9에서 소개한 시맨틱 분할 데이터 세트를 읽습니다. 임의 자르기의 출력 이미지 모양은 320×480으로 지정됩니다. 높이와 너비는 모두 32로 나눌 수 있습니다.

batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)

위 코드는 Pascal VOC 데이터셋을 사용하여 학습과 테스트 데이터로 나누어 불러오는 부분입니다.

해석을 해보겠습니다:

batch_size는 미니배치의 크기를 나타내며, 한 번의 iteration에서 처리되는 데이터 샘플의 수입니다.
crop_size는 이미지를 잘라내는 크기입니다. VOC 데이터셋의 이미지를 모두 동일한 크기로 자르기 위해 사용됩니다.
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)는 d2l.load_data_voc 함수를 호출하여 Pascal VOC 데이터셋을 불러오고 학습용 데이터와 테스트용 데이터로 나눈 미니배치 반복자 train_iter와 test_iter를 생성합니다. 이 반복자들은 학습 및 테스트에 사용됩니다.

즉, 위 코드는 Pascal VOC 데이터셋을 불러와서 학습용 데이터와 테스트용 데이터로 나누고, 미니배치 반복자를 생성하는 과정을 나타냅니다.

read 1114 examples
read 1078 examples

14.11.4. Training

Now we can train our constructed fully convolutional network. The loss function and accuracy calculation here are not essentially different from those in image classification of earlier chapters. Because we use the output channel of the transposed convolutional layer to predict the class for each pixel, the channel dimension is specified in the loss calculation. In addition, the accuracy is calculated based on correctness of the predicted class for all the pixels.

이제 우리는 구축한 완전 컨벌루션 네트워크를 훈련할 수 있습니다. 여기서 손실 함수와 정확도 계산은 이전 장의 이미지 분류와 본질적으로 다르지 않습니다. 각 픽셀의 클래스를 예측하기 위해 전치된 컨벌루션 레이어의 출력 채널을 사용하기 때문에 손실 계산에서 채널 차원이 지정됩니다. 또한 정확도는 모든 픽셀에 대해 예측된 클래스의 정확도를 기반으로 계산됩니다.

def loss(inputs, targets):
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

위 코드는 주어진 모델을 학습시키는 과정을 나타내고 있습니다.

def loss(inputs, targets) 함수는 입력 데이터와 실제 타겟 데이터를 받아와서 Cross Entropy 손실을 계산하는 함수입니다. 여기서 reduction='none'으로 설정하여 각 데이터 포인트의 손실 값을 계산하고, 이를 평균하여 최종 손실을 구합니다.
num_epochs는 에포크 수를 나타냅니다. 에포크는 학습 데이터를 몇 번 반복해서 사용하여 모델을 업데이트하는 단위입니다.
lr은 학습률(learning rate)로, 모델의 가중치를 업데이트할 때 얼마나 큰 스텝으로 업데이트할지를 결정합니다.
wd는 가중치 감쇠(weight decay)로, 가중치의 크기를 제어하여 오버피팅을 방지하고 모델의 일반화 성능을 높이는데 사용됩니다.
devices는 사용 가능한 GPU 장치를 찾는 함수인 d2l.try_all_gpus()를 호출하여 GPU 장치 목록을 가져옵니다.
trainer는 옵티마이저로, torch.optim.SGD를 사용하여 네트워크의 파라미터를 업데이트합니다.
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)는 모델을 학습시키는 함수를 호출합니다. 이 함수는 네트워크 모델, 학습용 데이터 반복자, 테스트용 데이터 반복자, 손실 함수, 옵티마이저, 에포크 수, GPU 장치 정보를 인자로 받아서 모델을 학습시키는 프로세스를 실행합니다.

따라서 위 코드는 주어진 모델을 주어진 데이터로 학습시키는 과정을 수행하는 코드입니다.

loss 0.449, train acc 0.861, test acc 0.852
226.7 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]

14.11.5. Prediction

When predicting, we need to standardize the input image in each channel and transform the image into the four-dimensional input format required by the CNN.

예측할 때 각 채널의 입력 이미지를 표준화하고 이미지를 CNN에서 요구하는 4차원 입력 형식으로 변환해야 합니다.

def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])

위 코드는 주어진 이미지에 대한 Semantic Segmentation 예측을 수행하는 함수를 정의한 것입니다.

def predict(img) 함수는 이미지를 입력으로 받아와서 Semantic Segmentation을 예측합니다.
X = test_iter.dataset.normalize_image(img).unsqueeze(0)는 입력 이미지를 네트워크의 입력 형식에 맞게 전처리하는 과정입니다. 입력 이미지를 모델의 입력에 맞게 정규화(Normalization)하고, 배치 차원을 추가하여 모델에 한 번에 처리할 수 있도록 준비합니다.
pred = net(X.to(devices[0])).argmax(dim=1)는 전처리된 입력 이미지를 네트워크에 전달하고, 네트워크의 출력을 가져와서 가장 높은 값의 클래스 인덱스를 선택합니다. 이를 통해 각 픽셀에 대한 예측 클래스 인덱스를 얻습니다.
return pred.reshape(pred.shape[1], pred.shape[2])는 예측된 클래스 인덱스를 원본 이미지 크기에 맞게 재조정하여 반환합니다. 반환된 클래스 인덱스 맵은 이미지의 각 픽셀에 대한 클래스 정보를 포함하게 됩니다.

즉, predict 함수는 주어진 이미지에 대해 Semantic Segmentation 모델로 예측을 수행하고, 예측된 클래스 인덱스 맵을 반환하는 역할을 합니다.

To visualize the predicted class of each pixel, we map the predicted class back to its label color in the dataset.

각 픽셀의 예측된 클래스를 시각화하기 위해 예측된 클래스를 데이터 세트의 레이블 색상에 다시 매핑합니다.

def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]

위 코드는 예측된 클래스 인덱스 맵을 컬러 이미지로 변환하는 함수를 정의한 것입니다.

def label2image(pred) 함수는 예측된 클래스 인덱스 맵을 입력으로 받아서 컬러 이미지로 변환합니다.
colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])는 VOC 데이터셋에 사용되는 클래스의 컬러 맵을 Torch Tensor로 변환하여 준비합니다. 각 클래스마다 대응하는 RGB 컬러 값을 가지고 있습니다.
X = pred.long()는 예측된 클래스 인덱스 맵을 정수 형태로 변환합니다. 일반적으로 클래스 인덱스는 정수 값으로 표현됩니다.
return colormap[X, :]는 클래스 인덱스 맵에 대응하는 컬러 값을 colormap에서 찾아서 반환합니다. 따라서 각 픽셀에 대한 클래스 인덱스에 해당하는 컬러로 구성된 컬러 이미지가 반환됩니다.

이 함수를 사용하면 Semantic Segmentation 모델이 예측한 클래스 인덱스 맵을 실제 컬러 이미지로 변환하여 시각화할 수 있습니다.

Images in the test dataset vary in size and shape. Since the model uses a transposed convolutional layer with stride of 32, when the height or width of an input image is indivisible by 32, the output height or width of the transposed convolutional layer will deviate from the shape of the input image. In order to address this issue, we can crop multiple rectangular areas with height and width that are integer multiples of 32 in the image, and perform forward propagation on the pixels in these areas separately. Note that the union of these rectangular areas needs to completely cover the input image. When a pixel is covered by multiple rectangular areas, the average of the transposed convolution outputs in separate areas for this same pixel can be input to the softmax operation to predict the class.

테스트 데이터 세트의 이미지는 크기와 모양이 다양합니다. 이 모델은 보폭이 32인 전치 컨볼루션 레이어를 사용하기 때문에 입력 이미지의 높이 또는 너비가 32로 나눌 수 없는 경우 전치된 컨볼루션 레이어의 출력 높이 또는 너비가 입력 이미지의 모양에서 벗어날 것입니다. 이 문제를 해결하기 위해 이미지에서 높이와 너비가 32의 정수배인 여러 직사각형 영역을 자르고 이 영역의 픽셀에 대해 개별적으로 순방향 전파를 수행할 수 있습니다. 이러한 직사각형 영역의 합집합은 입력 이미지를 완전히 덮어야 합니다. 픽셀이 여러 직사각형 영역으로 덮여 있는 경우 동일한 픽셀에 대해 별도의 영역에서 전치된 컨볼루션 출력의 평균을 소프트맥스 연산에 입력하여 클래스를 예측할 수 있습니다.

For simplicity, we only read a few larger test images, and crop a 320×480 area for prediction starting from the upper-left corner of an image. For these test images, we print their cropped areas, prediction results, and ground-truth row by row.

간단하게 하기 위해 몇 개의 더 큰 테스트 이미지만 읽고 이미지의 왼쪽 상단 모서리에서 시작하여 예측을 위해 320×480 영역을 자릅니다. 이러한 테스트 이미지의 경우 잘린 영역, 예측 결과 및 실측 정보를 행별로 인쇄합니다.

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

위 코드는 VOC 데이터셋의 테스트 이미지와 해당 이미지에 대한 예측 결과, 그리고 실제 레이블 이미지를 시각화하는 예제입니다.

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')는 VOC 데이터셋을 다운로드하고 압축을 해제하여 voc_dir에 저장합니다.
test_images, test_labels = d2l.read_voc_images(voc_dir, False)는 VOC 데이터셋의 테스트 이미지와 레이블 이미지를 읽어옵니다.
n, imgs = 4, []는 시각화할 이미지 개수 n과 결과 이미지를 저장할 리스트 imgs를 초기화합니다.
for i in range(n):는 n 개의 이미지에 대해 반복합니다.
crop_rect = (0, 0, 320, 480)는 이미지를 잘라낼 영역을 정의합니다.
X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)는 테스트 이미지 i를 잘라낸 후 X에 저장합니다.
pred = label2image(predict(X))는 잘라낸 이미지 X에 대한 예측을 수행한 뒤, 예측 결과를 컬러 이미지로 변환하여 pred에 저장합니다.
imgs += [X.permute(1,2,0), pred.cpu(), ...]는 잘라낸 원본 이미지, 예측 결과 이미지, 실제 레이블 이미지를 리스트에 추가합니다. .permute(1, 2, 0)는 이미지의 차원 순서를 변경하여 HWC 형식으로 변환합니다.
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2)는 원본 이미지, 예측 결과 이미지, 실제 레이블 이미지를 한 줄에 3개씩 표시하고, 이를 n 행으로 나열하여 시각화합니다. scale=2는 이미지 크기를 확대하여 표시합니다.

이 코드를 통해 VOC 데이터셋의 테스트 이미지에 대한 Semantic Segmentation 모델의 예측 결과와 실제 레이블 이미지를 시각화할 수 있습니다.

14.11.6. Summary

The fully convolutional network first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a 1×1 convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution.

완전 컨벌루션 네트워크는 먼저 CNN을 사용하여 이미지 특징을 추출합니다. 그런 다음 1×1 컨벌루션 레이어를 통해 채널 수를 클래스 수로 변환합니다. 마지막으로 transposed convolution을 통해 feature map의 높이와 너비를 입력 이미지의 높이와 너비로 변환합니다.
In a fully convolutional network, we can use upsampling of bilinear interpolation to initialize the transposed convolutional layer.

완전 컨벌루션 네트워크에서는 쌍선형 보간의 업샘플링을 사용하여 전치된 컨볼루션 레이어를 초기화할 수 있습니다.

14.11.7. Exercises

If we use Xavier initialization for the transposed convolutional layer in the experiment, how does the result change?
Can you further improve the accuracy of the model by tuning the hyperparameters?
Predict the classes of all pixels in test images.
The original fully convolutional network paper also uses outputs of some intermediate CNN layers (Long et al., 2015). Try to implement this idea.

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.10. Transposed Convolution

2023. 8. 21. 01:26 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/transposed-conv.html

14.10. Transposed Convolution — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.10. Transposed Convolution

The CNN layers we have seen so far, such as convolutional layers (Section 7.2) and pooling layers (Section 7.5), typically reduce (downsample) the spatial dimensions (height and width) of the input, or keep them unchanged. In semantic segmentation that classifies at pixel-level, it will be convenient if the spatial dimensions of the input and output are the same. For example, the channel dimension at one output pixel can hold the classification results for the input pixel at the same spatial position.

우리가 지금까지 본 컨볼루션 레이어(섹션 7.2) 및 풀링 레이어(섹션 7.5)와 같은 CNN 레이어는 일반적으로 입력의 공간 차원 spatial dimensions (높이 및 너비)을 줄이거나 변경하지 않고 유지합니다. 픽셀 단위로 분류하는 시맨틱 분할semantic segmentation에서는 입력과 출력의 공간적 차원이 같으면 편리할 것이다. 예를 들어, 한 출력 픽셀의 채널 차원은 동일한 공간 위치에서 입력 픽셀에 대한 분류 결과를 보유할 수 있습니다.

To achieve this, especially after the spatial dimensions are reduced by CNN layers, we can use another type of CNN layers that can increase (upsample) the spatial dimensions of intermediate feature maps. In this section, we will introduce transposed convolution, which is also called fractionally-strided convolution (Dumoulin and Visin, 2016), for reversing downsampling operations by the convolution.

이를 달성하기 위해 특히 공간 차원이 CNN 레이어에 의해 축소된 후 중간 피쳐 맵의 공간 차원을 증가(업샘플링)할 수 있는 다른 유형의 CNN 레이어를 사용할 수 있습니다. 이 섹션에서는 convolution에 의한 downsampling 연산을 반전시키기 위해 fractionally-strided convolution(Dumoulin and Visin, 2016)이라고도 하는 transposed convolution을 소개합니다.

import torch
from torch import nn
from d2l import torch as d2l

Transposed Convolution이란?

Transposed convolution, also known as fractionally strided convolution or deconvolution, is an operation in convolutional neural networks (CNNs) that performs an upscaling operation on an input feature map. Unlike regular convolution, which reduces the spatial resolution of the feature map, transposed convolution increases the spatial resolution of the feature map.

전치 합성곱(Transposed Convolution), 또는 분수 합성곱 또는 역 합성곱(deconvolution)으로도 알려져 있으며, 컨볼루션 신경망(CNN)에서 입력 특성 맵에 대한 업스케일링 작업을 수행하는 연산입니다. 일반적인 컨볼루션과는 달리, 전치 합성곱은 특성 맵의 공간 해상도를 증가시킵니다.

The operation involves using learnable parameters (also known as convolutional kernels) to map each input pixel to multiple output pixels. This process effectively enlarges the feature map by inserting zeros or duplicating values in between the original input values. Transposed convolution is commonly used for tasks like image segmentation, image generation, and increasing the spatial resolution of feature maps.

이 연산은 학습 가능한 매개변수(컨볼루션 커널이라고도 함)를 사용하여 각 입력 픽셀을 여러 출력 픽셀로 매핑합니다. 이 과정은 원래 입력 값 사이에 0을 삽입하거나 값을 복제하여 특성 맵을 확장합니다. 전치 합성곱은 이미지 세그멘테이션, 이미지 생성, 특성 맵의 공간 해상도 증가와 같은 작업에 널리 사용됩니다.

The main steps of the transposed convolution operation are as follows:

전치 합성곱 연산의 주요 단계는 다음과 같습니다:

Input: The input feature map has a certain height and width, and it is commonly represented as a 3D tensor with dimensions (channels, height, width).

입력: 입력 특성 맵은 특정한 높이와 너비를 갖고 있으며, 보통 (채널, 높이, 너비) 차원의 3D 텐서로 나타납니다.
Convolution: A set of learnable convolutional kernels (filters) is applied to the input feature map. Each kernel produces a response map by convolving with the input.

컨볼루션: 학습 가능한 컨볼루션 커널(필터) 집합이 입력 특성 맵에 적용됩니다. 각 커널은 입력과 컨볼루션하여 응답 맵을 생성합니다.
Upsampling: The response maps are then upsampled to increase the spatial resolution. This is achieved by inserting zeros or duplicating values in between the original input values, effectively increasing the size of the feature map.

업스케일링: 응답 맵은 업스케일링되어 공간 해상도를 증가시킵니다. 이를 위해 원래 입력 값 사이에 0을 삽입하거나 값을 복제하여 특성 맵의 크기를 키웁니다.
Output: The final output is an upsampled feature map with higher spatial resolution, which can be used for subsequent tasks like segmentation or generating higher-resolution images.

출력: 최종 출력은 공간 해상도가 높아진 업스케일링된 특성 맵으로, 세그멘테이션이나 더 높은 해상도의 이미지 생성과 같은 후속 작업에 사용할 수 있습니다.

Transposed convolution is commonly used in various architectures, such as in generative adversarial networks (GANs) for image generation and in decoder portions of autoencoders for feature reconstruction. It provides a way to recover spatial information that might have been lost during earlier convolutional operations, enabling the network to generate fine-grained details and high-resolution representations.

전치 합성곱은 이미지 생성을 위한 생성적 적대 신경망(GAN)이나 특성 재구성을 위한 오토인코더의 디코더 부분과 같은 다양한 아키텍처에서 자주 사용됩니다. 이를 통해 네트워크는 이전 컨볼루션 연산 중에 손실될 수 있는 공간 정보를 복구하여 미세한 세부 정보와 고해상도 표현을 생성할 수 있는 방법을 제공 합니다.

GAN이란 ?

GAN stands for "Generative Adversarial Network." It is a class of machine learning models that consists of two neural networks, known as the generator and the discriminator, that are trained simultaneously through a competitive process.

The key idea behind GANs is to have two networks compete against each other in a game-like scenario:

GAN은 "생성적 적대 신경망"을 의미합니다. 이는 생성자(generator)와 판별자(discriminator) 두 개의 신경망으로 구성되어 있으며, 이들이 경쟁적인 과정을 통해 동시에 훈련되는 머신 러닝 모델의 한 유형입니다.

GAN의 핵심 아이디어는 두 개의 신경망을 게임과 유사한 상황에서 서로 경쟁하게 하는 것입니다:

Generator: The generator network takes random noise as input and generates data, such as images, audio, or text. It learns to create data that is similar to real data from a given dataset.

생성자: 생성자 신경망은 무작위한 잡음을 입력으로 받아 이미지, 음성 또는 텍스트와 같은 데이터를 생성합니다. 이는 주어진 데이터셋의 실제 데이터와 유사한 데이터를 생성하는 방법을 학습합니다.
Discriminator: The discriminator network, also called the critic, takes both real data from the training dataset and generated data from the generator as input. It learns to distinguish between real and generated data.

판별자: 판별자 신경망, 또는 비평자(critic)라고도 하는 이 네트워크는 훈련 데이터셋의 실제 데이터와 생성자로부터 생성된 데이터를 모두 입력으로 받습니다. 판별자는 실제와 생성된 데이터를 구분하는 방법을 학습합니다.

During training, the generator aims to produce data that is so realistic that it can fool the discriminator into believing that it is real data. On the other hand, the discriminator tries to correctly classify whether the input data is real or generated. This process creates a feedback loop where the generator improves its ability to create realistic data, while the discriminator becomes better at distinguishing real from generated data.

훈련 중에 생성자는 판별자를 속여 실제 데이터처럼 보이는 데이터를 생성하는 것을 목표로 합니다. 반면, 판별자는 입력 데이터가 실제 데이터인지 생성된 데이터인지 올바르게 분류하려고 합니다. 이 과정은 생성자가 실제 데이터와 거의 구별되지 않는 실제 같은 데이터를 생성하도록 개선되는 반면, 판별자는 실제와 생성된 데이터를 구별하는 능력을 향상시킵니다.

The training of a GAN involves an adversarial process, where the generator and discriminator are iteratively updated to improve their performance. As training progresses, the generator becomes more adept at creating data that is increasingly difficult for the discriminator to differentiate from real data. The ultimate goal is for the generator to produce data that is indistinguishable from real data, effectively learning the underlying distribution of the training dataset.

GAN의 훈련은 생성자와 판별자를 반복적으로 업데이트하여 성능을 개선하는 적대적인 과정을 포함합니다. 훈련이 진행됨에 따라 생성자는 판별자가 실제 데이터와 구별하기 어려운 데이터를 더 잘 생성하게 되며, 최종 목표는 생성자가 실제 데이터와 구별할 수 없을 정도로 실제와 똑같은 데이터를 생성하도록 하는 것입니다. 이로써 생성자는 훈련 데이터셋의 기반이 되는 분포를 학습합니다.

GANs have demonstrated remarkable success in various applications, including image generation, style transfer, data augmentation, super-resolution, and more. They have also led to the development of more advanced GAN variants, such as Conditional GANs, CycleGANs, and Progressive GANs, each with their own specific applications and improvements

GAN은 이미지 생성, 스타일 변환, 데이터 증강, 초해상도 등 다양한 응용 분야에서 놀라운 성과를 보여주었습니다. 더 나아가 조건부 GAN, CycleGAN, Progressive GAN 등과 같은 고급 GAN 변형들이 개발되어 각각 특정 응용 및 개선에 활용되고 있습니다.

14.10.1. Basic Operation

Ignoring channels for now, let’s begin with the basic transposed convolution operation with stride of 1 and no padding. Suppose that we are given a nℎ×nw input tensor and a kℎ×kw kernel. Sliding the kernel window with stride of 1 for nw times in each row and nℎ times in each column yields a total of nℎnw intermediate results. Each intermediate result is a (nℎ+kℎ−1)×(nw+kw−1) tensor that are initialized as zeros. To compute each intermediate tensor, each element in the input tensor is multiplied by the kernel so that the resulting kℎ×kw tensor replaces a portion in each intermediate tensor. Note that the position of the replaced portion in each intermediate tensor corresponds to the position of the element in the input tensor used for the computation. In the end, all the intermediate results are summed over to produce the output.

지금은 채널을 무시하고 보폭이 1이고 패딩이 없는 기본 전치 컨볼루션 작업 transposed convolution operation 부터 시작하겠습니다. nℎ×nw 입력 텐서와 kℎ×kw 커널이 주어졌다고 가정합니다. 커널 창을 각 행에서 nw번, 각 열에서 nℎ번 stride 1로 슬라이딩하면 총 nℎnw 중간 결과가 생성됩니다. 각 중간 결과는 0으로 초기화되는 (nℎ+kℎ−1)×(nw+kw−1) 텐서입니다. 각 중간 텐서를 계산하기 위해 입력 텐서의 각 요소에 커널을 곱하여 결과 kℎ×kw 텐서가 각 중간 텐서의 일부를 대체합니다. 각 중간 텐서에서 대체된 부분의 위치는 계산에 사용된 입력 텐서의 요소 위치에 해당합니다. 결국 모든 중간 결과가 합산되어 출력이 생성됩니다.

As an example, Fig. 14.10.1 illustrates how transposed convolution with a 2×2 kernel is computed for a 2×2 input tensor.

예를 들어, 그림 14.10.1은 2×2 입력 텐서에 대해 2×2 커널을 사용한 전치 컨볼루션이 계산되는 방법을 보여줍니다.

Fig. 14.10.1  Transposed convolution with a 2×2 kernel. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.

We can implement this basic transposed convolution operation trans_conv for a input matrix X and a kernel matrix K.

입력 행렬 X와 커널 행렬 K에 대해 이 기본 전치 합성곱 연산transposed convolution operation trans_conv를 구현할 수 있습니다.

def trans_conv(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y

이 코드는 2D 행렬 X와 커널 행렬 K 간의 2D 전치 합성곱(Transposed Convolution) 연산을 수행하는 함수를 정의한 내용입니다.

def trans_conv(X, K):: trans_conv 함수를 정의합니다. 이 함수는 2D 전치 합성곱 연산을 수행합니다.
- h, w = K.shape: 커널 K의 높이와 너비를 가져옵니다.
- Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1)): 출력 결과를 저장할 2D 행렬 Y를 초기화합니다. 크기는 입력 X의 높이와 너비에 커널 K의 크기를 더한 값으로 설정됩니다.
- for i in range(X.shape[0]):: 입력 X의 행 인덱스에 대해 반복합니다.
  - for j in range(X.shape[1]):: 입력 X의 열 인덱스에 대해 반복합니다.
    - Y[i: i + h, j: j + w] += X[i, j] * K: 입력 X의 (i, j) 위치에서 커널 K를 가중치로 사용하여 출력 Y의 해당 영역에 누적합니다.
- return Y: 전치 합성곱 연산 결과인 출력 행렬 Y를 반환합니다.

이 함수는 입력과 커널로 주어진 데이터에 대해 2D 전치 합성곱 연산을 수행하여 출력 결과를 생성합니다.

In contrast to the regular convolution (in Section 7.2) that reduces input elements via the kernel, the transposed convolution broadcasts input elements via the kernel, thereby producing an output that is larger than the input. We can construct the input tensor X and the kernel tensor K from Fig. 14.10.1 to validate the output of the above implementation of the basic two-dimensional transposed convolution operation.

커널을 통해 입력 요소를 줄이는 일반 컨볼루션(섹션 7.2)과 달리 전치 컨볼루션은 커널을 통해 입력 요소를 브로드캐스트하여 입력보다 큰 출력을 생성합니다. 그림 14.10.1에서 입력 텐서 X와 커널 텐서 K를 구성하여 기본 2차원 전치 컨볼루션 연산의 위 구현 결과를 검증할 수 있습니다.

X = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
trans_conv(X, K)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 2D 전치 합성곱 연산을 수행하는 과정을 나타냅니다.

X = torch.tensor([[0.0, 1.0], [2.0, 3.0]]): 2x2 크기의 입력 행렬 X를 생성합니다.
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]): 2x2 크기의 커널 K를 생성합니다.
trans_conv(X, K): 앞에서 정의한 trans_conv 함수에 입력 행렬 X와 커널 K를 전달하여 전치 합성곱 연산을 수행합니다.

이 코드는 입력 행렬과 커널을 사용하여 전치 합성곱 연산을 실행한 결과를 반환합니다. 결과적으로 2D 전치 합성곱 연산의 결과를 확인할 수 있습니다.

tensor([[ 0.,  0.,  1.],
        [ 0.,  4.,  6.],
        [ 4., 12.,  9.]])

Alternatively, when the input X and kernel K are both four-dimensional tensors, we can use high-level APIs to obtain the same results.

또는 입력 X와 커널 K가 모두 4차원 텐서인 경우 고급 API를 사용하여 동일한 결과를 얻을 수 있습니다.

X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False)
tconv.weight.data = K
tconv(X)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 PyTorch의 합성곱 전치(Convolutional Transpose) 연산을 수행하는 과정을 나타냅니다.

X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2): 입력 행렬 X와 커널 K를 4차원 텐서로 변환합니다. 이를 위해 reshape 함수를 사용하여 각각 (1, 1, 2, 2) 크기의 4차원 텐서로 변환합니다. 이는 PyTorch 합성곱 연산의 입력 형식에 맞추기 위함입니다.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False): PyTorch의 nn.ConvTranspose2d 클래스를 사용하여 합성곱 전치 연산을 정의합니다. 입력 채널과 출력 채널을 각각 1로 설정하고, 커널 크기를 2로 설정하며, 편향을 사용하지 않도록 bias=False로 설정합니다.
tconv.weight.data = K: 합성곱 전치 연산의 가중치(weight)를 커널 K로 설정합니다.
tconv(X): 정의한 합성곱 전치 연산 tconv를 입력 텐서 X에 적용하여 결과를 계산합니다. 이는 입력 X를 커널 K를 사용하여 전치 합성곱 연산을 수행한 결과입니다.

이 코드는 PyTorch의 합성곱 전치 연산을 활용하여 주어진 입력 행렬과 커널에 대한 결과를 계산하고 반환하는 과정을 나타냅니다.

tensor([[[[ 0.,  0.,  1.],
          [ 0.,  4.,  6.],
          [ 4., 12.,  9.]]]], grad_fn=<ConvolutionBackward0>)

14.10.2. Padding, Strides, and Multiple Channels

Different from in the regular convolution where padding is applied to input, it is applied to output in the transposed convolution. For example, when specifying the padding number on either side of the height and width as 1, the first and last rows and columns will be removed from the transposed convolution output.

패딩이 입력에 적용되는 일반 컨볼루션과 달리 전치 컨볼루션에서는 출력에 패딩이 적용됩니다. 예를 들어 높이와 너비 양쪽의 패딩 수를 1로 지정하면 Transposed Convolution 출력에서 첫 번째와 마지막 행과 열이 제거됩니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False)
tconv.weight.data = K
tconv(X)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 PyTorch의 패딩이 포함된 합성곱 전치(Convolutional Transpose) 연산을 수행하는 과정을 나타냅니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False): PyTorch의 nn.ConvTranspose2d 클래스를 사용하여 패딩이 포함된 합성곱 전치 연산을 정의합니다. 입력 채널과 출력 채널을 각각 1로 설정하고, 커널 크기를 2로 설정하며, 패딩을 1로 설정하고, 편향을 사용하지 않도록 bias=False로 설정합니다. 이렇게 함으로써 입력 이미지에 패딩이 적용된 합성곱 전치 연산을 수행하게 됩니다.
tconv.weight.data = K: 합성곱 전치 연산의 가중치(weight)를 커널 K로 설정합니다.
tconv(X): 정의한 패딩이 포함된 합성곱 전치 연산 tconv를 입력 텐서 X에 적용하여 결과를 계산합니다. 이는 입력 X에 패딩이 적용된 상태에서 커널 K를 사용하여 전치 합성곱 연산을 수행한 결과입니다.

이 코드는 PyTorch의 패딩이 포함된 합성곱 전치 연산을 활용하여 주어진 입력 행렬과 커널에 대한 결과를 계산하고 반환하는 과정을 나타냅니다.

tensor([[[[4.]]]], grad_fn=<ConvolutionBackward0>)

In the transposed convolution, strides are specified for intermediate results (thus output), not for input. Using the same input and kernel tensors from Fig. 14.10.1, changing the stride from 1 to 2 increases both the height and weight of intermediate tensors, hence the output tensor in Fig. 14.10.2.

Transposed Convolution에서 보폭은 입력이 아닌 중간 결과(따라서 출력)에 지정됩니다. 그림 14.10.1의 동일한 입력 및 커널 텐서를 사용하여 stride를 1에서 2로 변경하면 중간 텐서의 높이와 무게가 모두 증가하므로 그림 14.10.2의 출력 텐서가 증가합니다.

Fig. 14.10.2  Transposed convolution with a 2×2 kernel with stride of 2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.

The following code snippet can validate the transposed convolution output for stride of 2 in Fig. 14.10.2.

다음 코드 스니펫은 그림 14.10.2에서 보폭이 2인 전치 컨볼루션 출력을 검증할 수 있습니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
tconv.weight.data = K
tconv(X)

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 PyTorch의 스트라이드(stride)가 적용된 합성곱 전치(Convolutional Transpose) 연산을 수행하는 과정을 나타냅니다.

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False): PyTorch의 nn.ConvTranspose2d 클래스를 사용하여 스트라이드가 적용된 합성곱 전치 연산을 정의합니다. 입력 채널과 출력 채널을 각각 1로 설정하고, 커널 크기를 2로 설정하며, 스트라이드를 2로 설정하고, 편향을 사용하지 않도록 bias=False로 설정합니다. 이렇게 함으로써 입력 이미지에 스트라이드가 적용된 합성곱 전치 연산을 수행하게 됩니다.
tconv.weight.data = K: 합성곱 전치 연산의 가중치(weight)를 커널 K로 설정합니다.
tconv(X): 정의한 스트라이드가 적용된 합성곱 전치 연산 tconv를 입력 텐서 X에 적용하여 결과를 계산합니다. 이는 입력 X에 스트라이드가 적용된 상태에서 커널 K를 사용하여 전치 합성곱 연산을 수행한 결과입니다.

이 코드는 PyTorch의 스트라이드가 적용된 합성곱 전치 연산을 활용하여 주어진 입력 행렬과 커널에 대한 결과를 계산하고 반환하는 과정을 나타냅니다.

tensor([[[[0., 0., 0., 1.],
          [0., 0., 2., 3.],
          [0., 2., 0., 3.],
          [4., 6., 6., 9.]]]], grad_fn=<ConvolutionBackward0>)

For multiple input and output channels, the transposed convolution works in the same way as the regular convolution. Suppose that the input has ci channels, and that the transposed convolution assigns a kℎ×kw kernel tensor to each input channel. When multiple output channels are specified, we will have a ci×kℎ×kw kernel for each output channel.

다중 입력 및 출력 채널의 경우 전치 컨볼루션은 일반 컨볼루션과 동일한 방식으로 작동합니다. 입력에 ci 채널이 있고 전치 컨벌루션이 각 입력 채널에 kℎ×kw 커널 텐서를 할당한다고 가정합니다. 여러 출력 채널이 지정된 경우 각 출력 채널에 대해 ci×kℎ×kw 커널이 있습니다.

As in all, if we feed X into a convolutional layer f to output Y=f(X) and create a transposed convolutional layer g with the same hyperparameters as f except for the number of output channels being the number of channels in X, then g(Y) will have the same shape as X. This can be illustrated in the following example.

대체로 X를 컨볼루션 레이어 f에 공급하여 Y=f(X)를 출력하고 출력 채널 수가 X의 채널 수인 것을 제외하고 f와 동일한 하이퍼파라미터로 전치된 컨볼루션 레이어 g를 생성하면 g(Y)는 X와 동일한 모양을 갖습니다. 이는 다음 예에서 설명할 수 있습니다.

X = torch.rand(size=(1, 10, 16, 16))
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3)
tconv(conv(X)).shape == X.shape

이 코드는 주어진 입력 텐서 X에 대해 합성곱(Convolution)과 합성곱 전치(Convolutional Transpose) 연산을 차례로 적용하고, 최종 결과의 형태를 확인하는 과정을 나타냅니다.

X = torch.rand(size=(1, 10, 16, 16)): 크기가 (1, 10, 16, 16)인 랜덤한 값을 가지는 4차원 텐서 X를 생성합니다. 이는 1개의 채널, 10개의 필터, 16x16 크기의 이미지를 의미합니다.
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3): 입력 채널이 10이고 출력 채널이 20인 합성곱 연산(nn.Conv2d)을 정의합니다. 커널 크기를 5로 설정하고, 패딩을 2로 설정하여 출력 크기를 입력과 같게 만들어주고, 스트라이드를 3으로 설정하여 스트라이드가 적용된 합성곱을 정의합니다.
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3): 입력 채널이 20이고 출력 채널이 10인 합성곱 전치 연산(nn.ConvTranspose2d)을 정의합니다. 커널 크기를 5로 설정하고, 패딩을 2로 설정하여 출력 크기를 입력과 같게 만들어주고, 스트라이드를 3으로 설정하여 스트라이드가 적용된 합성곱 전치를 정의합니다.
tconv(conv(X)).shape == X.shape: 먼저 입력 X에 합성곱 conv를 적용한 결과에 합성곱 전치 tconv를 적용한 결과의 형태가 입력 X의 형태와 같은지 비교합니다. 이 비교 결과는 True 또는 False가 됩니다. 이 비교는 합성곱과 합성곱 전치 연산을 차례로 적용했을 때 입력과 동일한 형태의 결과가 나오는지 확인하는 것입니다.

이 코드는 PyTorch의 합성곱과 합성곱 전치 연산을 적용하고 그 결과의 형태를 확인하는 과정을 나타내며, 해당 비교를 통해 합성곱과 합성곱 전치 연산의 효과를 확인할 수 있습니다.

True

14.10.3. Connection to Matrix Transposition

The transposed convolution is named after the matrix transposition. To explain, let’s first see how to implement convolutions using matrix multiplications. In the example below, we define a 3×3 input X and a 2×2 convolution kernel K, and then use the corr2d function to compute the convolution output Y.

전치 컨볼루션은 행렬 전치의 이름을 따서 명명되었습니다. 설명을 위해 먼저 행렬 곱셈을 사용하여 컨볼루션을 구현하는 방법을 살펴보겠습니다. 아래 예에서는 3×3 입력 X와 2×2 컨볼루션 커널 K를 정의한 다음 corr2d 함수를 사용하여 컨볼루션 출력 Y를 계산합니다.

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y = d2l.corr2d(X, K)
Y

이 코드는 주어진 입력 행렬 X와 커널 K에 대해 2D 상관(Correlation) 연산을 수행하는 과정을 나타냅니다.

X = torch.arange(9.0).reshape(3, 3): 0부터 8까지의 값을 가지는 1차원 텐서를 3x3 크기의 2차원 행렬 X로 재구성합니다. 이는 입력 행렬입니다.
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]]): 주어진 값을 가지는 2x2 크기의 커널 행렬 K를 생성합니다. 이는 2D 상관 연산에 사용될 커널입니다.
Y = d2l.corr2d(X, K): 2D 상관 연산 함수 d2l.corr2d를 사용하여 입력 행렬 X와 커널 K에 대해 상관 연산을 수행한 결과인 출력 행렬 Y를 계산합니다.

이 코드는 주어진 입력 행렬과 커널에 대해 2D 상관 연산을 수행하고 그 결과인 출력 행렬 Y를 반환하는 과정을 나타냅니다.

tensor([[27., 37.],
        [57., 67.]])

Next, we rewrite the convolution kernel K as a sparse weight matrix W containing a lot of zeros. The shape of the weight matrix is (4, 9), where the non-zero elements come from the convolution kernel K.

다음으로 컨볼루션 커널 K를 많은 0을 포함하는 희소 가중치 행렬 W로 다시 작성합니다. 가중치 행렬의 모양은 (4, 9)이며, 여기서 0이 아닌 요소는 컨볼루션 커널 K에서 나옵니다.

def kernel2matrix(K):
    k, W = torch.zeros(5), torch.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

W = kernel2matrix(K)
W

이 코드는 주어진 2x2 크기의 커널 K를 행렬 형태로 변환하는 과정을 나타냅니다.

def kernel2matrix(K): 커널 K를 입력으로 받아서 행렬 형태로 변환하는 함수 kernel2matrix를 정의합니다.
k, W = torch.zeros(5), torch.zeros((4, 9)): 크기가 5인 1차원 텐서 k와 크기가 (4, 9)인 2차원 텐서 W를 생성합니다. k는 행렬 W를 구성하기 위한 임시 벡터입니다.
k[:2], k[3:5] = K[0, :], K[1, :]: K의 첫 번째 행과 두 번째 행의 값을 k의 첫 번째 원소와 세 번째, 네 번째 원소에 대입합니다. 이를 통해 k 벡터에 커널 K의 값들이 할당됩니다.
W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k: 행렬 W의 각 행에 k의 값들을 할당하여 행렬 형태로 변환합니다. 각 행은 서로 다른 위치에서 시작하는 값을 가지도록 할당됩니다.
return W: 변환된 행렬 W를 반환합니다.
W = kernel2matrix(K): 주어진 커널 K에 대해 kernel2matrix 함수를 호출하여 커널을 행렬 형태로 변환한 결과인 행렬 W를 얻습니다.

이 코드는 주어진 2x2 크기의 커널을 행렬 형태로 변환하여 행렬 W로 반환하는 과정을 나타냅니다.

tensor([[1., 2., 0., 3., 4., 0., 0., 0., 0.],
        [0., 1., 2., 0., 3., 4., 0., 0., 0.],
        [0., 0., 0., 1., 2., 0., 3., 4., 0.],
        [0., 0., 0., 0., 1., 2., 0., 3., 4.]])

Concatenate the input X row by row to get a vector of length 9. Then the matrix multiplication of W and the vectorized X gives a vector of length 4. After reshaping it, we can obtain the same result Y from the original convolution operation above: we just implemented convolutions using matrix multiplications.

입력 X를 행별로 연결하여 길이가 9인 벡터를 얻습니다. 그런 다음 W와 벡터화된 X의 행렬 곱셈은 길이가 4인 벡터를 제공합니다. 이를 재구성한 후 위의 원래 컨볼루션 연산에서 동일한 결과 Y를 얻을 수 있습니다. 방금 행렬 곱셈을 사용하여 컨볼루션을 구현했습니다.

Y == torch.matmul(W, X.reshape(-1)).reshape(2, 2)

이 코드는 주어진 입력 행렬 X와 변환된 행렬 W를 사용하여 행렬-벡터 곱셈을 수행한 결과인 출력 행렬 Y가 정확히 일치하는지를 확인하는 조건을 나타냅니다.

Y: 위 코드에서 계산한 행렬 Y입니다.
torch.matmul(W, X.reshape(-1)): 행렬 W와 입력 행렬 X를 벡터로 펼친 뒤, 행렬-벡터 곱셈을 수행한 결과입니다.
.reshape(2, 2): 이전 계산 결과를 2x2 크기의 행렬로 재구성합니다.
==: 두 행렬이 원소별로 일치하는지를 비교하는 연산자입니다.

따라서, 위 코드는 변환된 행렬 W와 입력 행렬 X를 사용하여 계산한 출력 행렬 Y가 정확히 일치하는지 여부를 확인하고 그 결과를 반환하는 조건을 나타냅니다.

tensor([[True, True],
        [True, True]])

Likewise, we can implement transposed convolutions using matrix multiplications. In the following example, we take the 2×2 output Y from the above regular convolution as input to the transposed convolution. To implement this operation by multiplying matrices, we only need to transpose the weight matrix W with the new shape (9,4).

마찬가지로 행렬 곱셈을 사용하여 전치 컨볼루션을 구현할 수 있습니다. 다음 예제에서는 위의 일반 컨볼루션에서 2×2 출력 Y를 전치 컨볼루션의 입력으로 사용합니다. 행렬을 곱하여 이 연산을 구현하려면 가중치 행렬 W를 새 모양(9,4)으로 바꾸면 됩니다.

Z = trans_conv(Y, K)
Z == torch.matmul(W.T, Y.reshape(-1)).reshape(3, 3)

이 코드는 주어진 출력 행렬 Y와 변환된 행렬 W를 사용하여 역변환 행렬-벡터 곱셈을 수행한 결과인 행렬 Z가 정확히 일치하는지를 확인하는 조건을 나타냅니다.

Z: trans_conv 함수를 사용하여 계산한 행렬 Z입니다.
trans_conv(Y, K): 출력 행렬 Y와 주어진 커널 K를 사용하여 역변환 행렬-벡터 곱셈을 수행한 결과입니다.
torch.matmul(W.T, Y.reshape(-1)): 행렬 W의 전치 행렬과 Y를 벡터로 펼친 뒤, 역변환 행렬-벡터 곱셈을 수행한 결과입니다.
.reshape(3, 3): 이전 계산 결과를 3x3 크기의 행렬로 재구성합니다.
==: 두 행렬이 원소별로 일치하는지를 비교하는 연산자입니다.

따라서, 위 코드는 주어진 출력 행렬 Y와 변환된 행렬 W를 사용하여 역변환 행렬-벡터 곱셈을 수행한 결과인 행렬 Z가 정확히 일치하는지 여부를 확인하고 그 결과를 반환하는 조건을 나타냅니다.

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

Consider implementing the convolution by multiplying matrices. Given an input vector x and a weight matrix W, the forward propagation function of the convolution can be implemented by multiplying its input with the weight matrix and outputting a vector y=Wx. Since backpropagation follows the chain rule and ∇xy=W⊤, the backpropagation function of the convolution can be implemented by multiplying its input with the transposed weight matrix W⊤. Therefore, the transposed convolutional layer can just exchange the forward propagation function and the backpropagation function of the convolutional layer: its forward propagation and backpropagation functions multiply their input vector with W⊤ and W, respectively.

행렬을 곱하여 컨벌루션을 구현하는 것을 고려하십시오. 입력 벡터 x와 가중치 행렬 W가 주어지면 컨볼루션의 순방향 전파 함수는 입력에 가중치 행렬을 곱하고 벡터 y=Wx를 출력하여 구현할 수 있습니다. 역전파는 체인 규칙과 ∇xy=W⊤를 따르기 때문에 컨볼루션의 역전파 함수는 전치된 가중치 행렬 W⊤를 입력에 곱하여 구현할 수 있습니다. 따라서 전치된 컨볼루션 레이어는 컨볼루션 레이어의 순방향 전파 함수와 역전파 함수를 교환할 수 있습니다. 순방향 전파 및 역전파 함수는 각각 입력 벡터에 W⊤ 및 W를 곱합니다.

14.10.4. Summary

In contrast to the regular convolution that reduces input elements via the kernel, the transposed convolution broadcasts input elements via the kernel, thereby producing an output that is larger than the input.

커널을 통해 입력 요소를 줄이는 일반 컨볼루션과 달리 전치 컨볼루션은 커널을 통해 입력 요소를 브로드캐스팅하여 입력보다 큰 출력을 생성합니다.
If we feed X into a convolutional layer f to output Y=f(X) and create a transposed convolutional layer g with the same hyperparameters as f except for the number of output channels being the number of channels in X, then g(Y) will have the same shape as X.

X를 컨볼루션 레이어 f에 공급하여 Y=f(X)를 출력하고 출력 채널 수가 X의 채널 수인 것을 제외하고 f와 동일한 하이퍼파라미터로 전치된 컨볼루션 레이어 g를 생성하면 g(Y) X와 같은 모양을 갖게 됩니다.
We can implement convolutions using matrix multiplications. The transposed convolutional layer can just exchange the forward propagation function and the backpropagation function of the convolutional layer.

행렬 곱셈을 사용하여 컨볼루션을 구현할 수 있습니다. Transposed convolutional layer는 convolutional layer의 forward propagation function과 backpropagation function을 교환할 수 있습니다.

14.10.5. Exercises

In Section 14.10.3, the convolution input X and the transposed convolution output Z have the same shape. Do they have the same value? Why?
Is it efficient to use matrix multiplications to implement convolutions? Why?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.9. Semantic Segmentation and the Dataset

2023. 8. 20. 22:40 | Posted by 솔웅

https://d2l.ai/chapter_computer-vision/semantic-segmentation-and-dataset.html

14.9. Semantic Segmentation and the Dataset — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.9. Semantic Segmentation and the Dataset

When discussing object detection tasks in Section 14.3–Section 14.8, rectangular bounding boxes are used to label and predict objects in images. This section will discuss the problem of semantic segmentation, which focuses on how to divide an image into regions belonging to different semantic classes. Different from object detection, semantic segmentation recognizes and understands what are in images in pixel level: its labeling and prediction of semantic regions are in pixel level. Fig. 14.9.1 shows the labels of the dog, cat, and background of the image in semantic segmentation. Compared with in object detection, the pixel-level borders labeled in semantic segmentation are obviously more fine-grained.

섹션 14.3–섹션 14.8에서 개체 감지 작업을 논의할 때 사각형 경계 상자는 이미지의 개체에 레이블을 지정하고 예측하는 데 사용됩니다. 이 섹션에서는 이미지를 다른 의미semantic 클래스에 속하는 영역으로 나누는 방법에 중점을 둔 의미 분할 semantic segmentation 문제에 대해 설명합니다. 객체 감지와 달리 semantic segmentation은 픽셀 수준에서 이미지에 있는 것을 인식하고 이해합니다. 의미 영역 semantic regions 의 라벨링 및 예측은 픽셀 수준에서 이루어집니다. 그림 14.9.1은 시맨틱 세그멘테이션에서 이미지의 개, 고양이, 배경의 라벨을 보여준다. 개체 감지와 비교할 때 시맨틱 분할semantic segmentation에서 레이블이 지정된 픽셀 수준 경계는 분명히 더 세분화됩니다.

Fig. 14.9.1  Labels of the dog, cat, and background of the image in semantic segmentation.

14.9.1. Image Segmentation and Instance Segmentation

There are also two important tasks in the field of computer vision that are similar to semantic segmentation, namely image segmentation and instance segmentation. We will briefly distinguish them from semantic segmentation as follows.

컴퓨터 비전 분야에는 시맨틱 분할semantic segmentation과 유사한 두 가지 중요한 작업, 즉 이미지 분할image segmentation과 인스턴스 분할instance segmentation이 있습니다. 다음과 같이 시맨틱 분할과 간략하게 구분합니다.

Image segmentation divides an image into several constituent regions. The methods for this type of problem usually make use of the correlation between pixels in the image. It does not need label information about image pixels during training, and it cannot guarantee that the segmented regions will have the semantics that we hope to obtain during prediction. Taking the image in Fig. 14.9.1 as input, image segmentation may divide the dog into two regions: one covers the mouth and eyes which are mainly black, and the other covers the rest of the body which is mainly yellow.

이미지 분할은 이미지를 여러 구성 영역으로 나눕니다. 이러한 유형의 문제에 대한 방법은 일반적으로 이미지의 픽셀 간의 상관 관계를 사용합니다. 훈련하는 동안 이미지 픽셀에 대한 레이블 정보가 필요하지 않으며 분할된 영역이 예측 중에 얻고자 하는 의미 체계를 가질 것이라고 보장할 수 없습니다. 그림 14.9.1의 이미지를 입력으로 사용하면 이미지 분할은 개를 두 영역으로 나눌 수 있습니다. 하나는 주로 검은색인 입과 눈을 덮고 다른 하나는 주로 노란색인 몸의 나머지 부분을 덮습니다.
Instance segmentation is also called simultaneous detection and segmentation. It studies how to recognize the pixel-level regions of each object instance in an image. Different from semantic segmentation, instance segmentation needs to distinguish not only semantics, but also different object instances. For example, if there are two dogs in the image, instance segmentation needs to distinguish which of the two dogs a pixel belongs to.

인스턴스 분할은 동시 감지 및 분할이라고도 합니다. 이미지에서 각 개체 인스턴스의 픽셀 수준 영역을 인식하는 방법을 연구합니다. 시맨틱 분할과 달리 인스턴스 분할은 시맨틱뿐만 아니라 다른 객체 인스턴스도 구분해야 합니다. 예를 들어, 이미지에 두 마리의 개가 있는 경우 인스턴스 분할은 픽셀이 속한 두 개 중 어느 것을 구별해야 합니다.

14.9.2. The Pascal VOC2012 Semantic Segmentation Dataset

One of the most important semantic segmentation dataset is Pascal VOC2012. In the following, we will take a look at this dataset.

가장 중요한 시맨틱 분할 데이터 세트는 Pascal VOC2012입니다. 다음에서는 이 데이터 세트를 살펴보겠습니다.

%matplotlib inline
import os
import torch
import torchvision
from d2l import torch as d2l

The tar file of the dataset is about 2 GB, so it may take a while to download the file. The extracted dataset is located at ../data/VOCdevkit/VOC2012.

데이터셋의 tar 파일은 약 2GB이므로 파일을 다운로드하는데 다소 시간이 걸릴 수 있습니다. 추출된 데이터 세트는 ../data/VOCdevkit/VOC2012에 있습니다.

#@save
d2l.DATA_HUB['voc2012'] = (d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar',
                           '4e443f8a2eca6b1dac8a6c57641b67dd40621a49')

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
d2l.DATA_HUB['voc2012']: d2l.DATA_HUB 딕셔너리에 'voc2012'라는 키를 가진 항목을 추가합니다. 이 항목은 데이터 다운로드를 위한 URL 및 체크섬 정보를 포함하는 튜플입니다.
d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar': d2l.DATA_URL과 문자열을 합쳐서 VOC2012 데이터의 URL을 생성합니다.
'4e443f8a2eca6b1dac8a6c57641b67dd40621a49': 다운로드한 파일의 체크섬(해시 값)입니다.
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012'): 'voc2012' 데이터를 다운로드하고 압축을 해제하여 voc_dir 변수에 저장합니다. download_extract 함수는 주어진 키에 해당하는 데이터를 다운로드하고 압축을 해제합니다. 여기서 'voc2012' 데이터는 VOC2012 데이터셋을 의미하며, VOCdevkit/VOC2012 경로에 압축을 해제합니다.

이 코드는 'voc2012' 데이터를 다운로드하고 압축을 해제하여 데이터가 저장될 디렉토리 경로인 voc_dir을 설정하는 내용을 나타냅니다. VOC2012 데이터셋은 객체 검출과 이미지 분류 등 다양한 컴퓨터 비전 작업에 사용되는 데이터셋입니다.

Downloading ../data/VOCtrainval_11-May-2012.tar from http://d2l-data.s3-accelerate.amazonaws.com/VOCtrainval_11-May-2012.tar...

After entering the path ../data/VOCdevkit/VOC2012, we can see the different components of the dataset. The ImageSets/Segmentation path contains text files that specify training and test samples, while the JPEGImages and SegmentationClass paths store the input image and label for each example, respectively. The label here is also in the image format, with the same size as its labeled input image. Besides, pixels with the same color in any label image belong to the same semantic class. The following defines the read_voc_images function to read all the input images and labels into the memory.

../data/VOCdevkit/VOC2012 경로를 입력하면 데이터 세트의 다양한 구성 요소를 볼 수 있습니다. ImageSets/Segmentation 경로에는 교육 및 테스트 샘플을 지정하는 텍스트 파일이 포함되어 있으며 JPEGImages 및 SegmentationClass 경로는 각각 각 예제에 대한 입력 이미지와 레이블을 저장합니다. 여기의 레이블도 레이블이 지정된 입력 이미지와 크기가 같은 이미지 형식입니다. 게다가 어떤 레이블 이미지에서 같은 색상을 가진 픽셀은 같은 의미 클래스에 속합니다. 다음은 모든 입력 이미지와 레이블을 메모리로 읽어들이는 read_voc_images 함수를 정의합니다.

#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    mode = torchvision.io.image.ImageReadMode.RGB
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'SegmentationClass' ,f'{fname}.png'), mode))
    return features, labels

train_features, train_labels = read_voc_images(voc_dir, True)

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
def read_voc_images(voc_dir, is_train=True):: read_voc_images라는 함수를 정의합니다. 이 함수는 VOC 데이터셋의 특징 이미지와 라벨 이미지를 읽어오는 역할을 합니다.
- voc_dir: VOC 데이터셋이 저장된 디렉토리 경로입니다.
- is_train=True: 훈련 데이터인지 여부를 나타내는 변수입니다.
- txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation', 'train.txt' if is_train else 'val.txt'): 훈련 데이터 또는 검증 데이터 이미지 파일 이름이 나열된 텍스트 파일의 경로를 생성합니다.
- mode = torchvision.io.image.ImageReadMode.RGB: 읽어온 이미지를 RGB 형식으로 읽어오도록 설정합니다.
- with open(txt_fname, 'r') as f:: 텍스트 파일을 열고 f라는 이름으로 사용합니다.
  - images = f.read().split(): 파일에서 읽은 이미지 파일 이름들을 공백으로 분리하여 리스트로 저장합니다.
- features, labels = [], []: 특징 이미지와 라벨 이미지를 저장할 빈 리스트 features와 labels를 생성합니다.
- for i, fname in enumerate(images):: 이미지 파일 이름들을 반복하면서 처리합니다.
  - features.append(torchvision.io.read_image(os.path.join(voc_dir, 'JPEGImages', f'{fname}.jpg'))): 이미지 파일을 읽어와 특징 이미지 리스트에 추가합니다.
  - labels.append(torchvision.io.read_image(os.path.join(voc_dir, 'SegmentationClass' ,f'{fname}.png'), mode)): 라벨 이미지 파일을 읽어와 라벨 이미지 리스트에 추가합니다.
- return features, labels: 읽어온 특징 이미지와 라벨 이미지 리스트를 반환합니다.
train_features, train_labels = read_voc_images(voc_dir, True): VOC 데이터셋에서 훈련 데이터의 특징 이미지와 라벨 이미지를 읽어와 각각 train_features와 train_labels 변수에 할당합니다.

이 코드는 VOC 데이터셋에서 훈련 데이터의 특징 이미지와 라벨 이미지를 읽어오는 함수 read_voc_images를 정의하고 이를 호출하여 train_features와 train_labels를 생성하는 내용을 나타냅니다.

We draw the first five input images and their labels. In the label images, white and black represent borders and background, respectively, while the other colors correspond to different classes.

처음 5개의 입력 이미지와 레이블을 그립니다. 레이블 이미지에서 흰색과 검은색은 각각 테두리와 배경을 나타내고 나머지 색상은 다른 클래스에 해당합니다.

n = 5
imgs = train_features[:n] + train_labels[:n]
imgs = [img.permute(1,2,0) for img in imgs]
d2l.show_images(imgs, 2, n);

n = 5: 변수 n에 5를 할당합니다. 이는 표시할 이미지의 개수를 의미합니다.
imgs = train_features[:n] + train_labels[:n]: train_features 리스트에서 처음부터 n개의 이미지와 train_labels 리스트에서 처음부터 n개의 이미지를 선택하여 리스트 imgs에 결합합니다. 이렇게 하면 특징 이미지와 해당하는 라벨 이미지가 번갈아가며 포함된 리스트가 생성됩니다.
imgs = [img.permute(1,2,0) for img in imgs]: imgs 리스트 내의 이미지들의 차원을 변경합니다. 각 이미지의 차원을 (채널, 높이, 너비)에서 (높이, 너비, 채널)로 변경하면 이미지가 보다 일반적인 형태로 표현됩니다.
d2l.show_images(imgs, 2, n);: imgs 리스트에 포함된 이미지들을 2개의 열로 나누어서 n개씩 그립니다. d2l.show_images 함수는 이미지를 그리는 함수로서 여러 이미지를 그룹으로 나누어서 표시합니다.

이 코드는 훈련 데이터셋에서 특징 이미지와 라벨 이미지를 일정 개수만큼 선택하고, 차원을 변경한 후에 이를 그룹으로 나누어서 표시하는 내용을 나타냅니다.

Next, we enumerate the RGB color values and class names for all the labels in this dataset.

다음으로 이 데이터 세트의 모든 레이블에 대한 RGB 색상 값과 클래스 이름을 열거합니다.

#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

#@save
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
VOC_COLORMAP: VOC 데이터셋에서 클래스 인덱스에 따라 색상을 부여하는 컬러맵을 정의한 변수입니다. 각 클래스에 대해 RGB 값으로 색상을 지정하여 클래스를 시각화할 때 사용됩니다.
VOC_CLASSES: VOC 데이터셋에서 정의된 클래스 이름들을 나열한 변수입니다. 이 변수에는 배경과 다양한 객체 클래스들의 이름이 포함되어 있습니다.

이 코드는 VOC 데이터셋에서 클래스들의 컬러맵과 클래스 이름들을 정의하는 내용을 나타냅니다. VOC 데이터셋은 객체 검출, 분할 등의 작업에 사용되는 데이터셋으로, 이러한 정보들은 시각화 및 레이블 관련 작업에서 사용됩니다.

With the two constants defined above, we can conveniently find the class index for each pixel in a label. We define the voc_colormap2label function to build the mapping from the above RGB color values to class indices, and the voc_label_indices function to map any RGB values to their class indices in this Pascal VOC2012 dataset.

위에서 정의한 두 상수를 사용하면 레이블의 각 픽셀에 대한 클래스 인덱스를 편리하게 찾을 수 있습니다. 우리는 voc_colormap2label 함수를 정의하여 위의 RGB 색상 값에서 클래스 인덱스로의 매핑을 구축하고 voc_label_indices 함수를 정의하여 RGB 값을 이 Pascal VOC2012 데이터 세트의 클래스 인덱스로 매핑합니다.

#@save
def voc_colormap2label():
    """Build the mapping from RGB to class indices for VOC labels."""
    colormap2label = torch.zeros(256 ** 3, dtype=torch.long)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[
            (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i
    return colormap2label

#@save
def voc_label_indices(colormap, colormap2label):
    """Map any RGB values in VOC labels to their class indices."""
    colormap = colormap.permute(1, 2, 0).numpy().astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
voc_colormap2label(): VOC 데이터셋의 클래스 컬러맵을 이용하여 RGB 값에서 클래스 인덱스로 매핑하는 함수입니다. VOC 데이터셋의 클래스 컬러맵을 이용하여 RGB 값의 조합을 클래스 인덱스로 변환하는 매핑을 생성하고 반환합니다.
- colormap2label = torch.zeros(256 ** 3, dtype=torch.long): 256^3 크기의 모든 조합에 대한 클래스 인덱스를 담는 텐서를 생성합니다.
- for i, colormap in enumerate(VOC_COLORMAP):: VOC 컬러맵의 각 색상에 대해 반복합니다.
  - (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]: RGB 값을 클래스 인덱스로 변환하여 텐서의 해당 위치에 클래스 인덱스를 저장합니다.
voc_label_indices(colormap, colormap2label): 주어진 VOC 라벨 컬러맵을 이용하여 RGB 값을 클래스 인덱스로 변환하는 함수입니다.
- colormap.permute(1, 2, 0).numpy().astype('int32'): 라벨 컬러맵을 높이-너비-채널 형태에서 너비-채널-높이 형태로 변경하고, 넘파이 배열로 변환하며 데이터 타입을 'int32'로 변환합니다.
- ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256 + colormap[:, :, 2]): RGB 값을 클래스 인덱스로 변환하는 계산을 수행하여 클래스 인덱스 텐서 idx를 생성합니다.
- return colormap2label[idx]: 생성한 클래스 인덱스 텐서 idx를 이용하여 RGB 값의 클래스 인덱스로 변환한 결과를 반환합니다.

이 코드는 VOC 데이터셋의 라벨 컬러맵과 RGB 값들을 클래스 인덱스로 변환하기 위한 함수들을 정의하는 내용을 나타냅니다. 이러한 함수들은 데이터셋 시각화 및 레이블 처리 작업에서 사용됩니다.

For example, in the first example image, the class index for the front part of the airplane is 1, while the background index is 0.

예를 들어, 첫 번째 예제 이미지에서 비행기 앞부분의 클래스 인덱스는 1이고 배경 인덱스는 0입니다.

y = voc_label_indices(train_labels[0], voc_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]

y = voc_label_indices(train_labels[0], voc_colormap2label()): train_labels 리스트에서 첫 번째 라벨 이미지를 이용하여 voc_colormap2label() 함수를 통해 라벨 컬러맵을 클래스 인덱스로 변환한 결과를 변수 y에 할당합니다.
y[105:115, 130:140], VOC_CLASSES[1]: y 변수의 특정 범위를 선택하여 그 값을 출력하고, VOC_CLASSES 리스트의 두 번째 클래스 이름을 출력합니다.

이 코드는 VOC 데이터셋의 첫 번째 라벨 이미지에 대해 라벨 컬러맵을 클래스 인덱스로 변환한 결과를 변수 y에 저장하고, 이 변수에서 특정 영역의 값을 선택하여 출력하는 내용을 나타냅니다. 또한, VOC_CLASSES 리스트에서 두 번째 클래스의 이름을 출력합니다. 이를 통해 라벨 이미지의 레이블 정보를 확인할 수 있습니다.

(tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
         [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]),
 'aeroplane')

14.9.2.1. Data Preprocessing

In previous experiments such as in Section 8.1–Section 8.4, images are rescaled to fit the model’s required input shape. However, in semantic segmentation, doing so requires rescaling the predicted pixel classes back to the original shape of the input image. Such rescaling may be inaccurate, especially for segmented regions with different classes. To avoid this issue, we crop the image to a fixed shape instead of rescaling. Specifically, using random cropping from image augmentation, we crop the same area of the input image and the label.

섹션 8.1–섹션 8.4와 같은 이전 실험에서 이미지는 모델의 필수 입력 모양에 맞게 크기가 조정됩니다. 그러나 의미론적 분할에서 그렇게 하려면 예측된 픽셀 클래스를 입력 이미지의 원래 모양으로 다시 스케일링해야 합니다. 이러한 크기 조정은 특히 클래스가 다른 분할 영역의 경우 부정확할 수 있습니다. 이 문제를 방지하기 위해 크기를 다시 조정하는 대신 이미지를 고정된 모양으로 자릅니다. 특히, 이미지 확대에서 임의 자르기를 사용하여 입력 이미지와 레이블의 동일한 영역을 자릅니다.

#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both feature and label images."""
    rect = torchvision.transforms.RandomCrop.get_params(
        feature, (height, width))
    feature = torchvision.transforms.functional.crop(feature, *rect)
    label = torchvision.transforms.functional.crop(label, *rect)
    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)

imgs = [img.permute(1, 2, 0) for img in imgs]
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
voc_rand_crop(feature, label, height, width): 특징 이미지와 라벨 이미지를 주어진 높이와 너비에 맞춰 무작위로 잘라내는 함수입니다.
- rect = torchvision.transforms.RandomCrop.get_params(feature, (height, width)): 무작위로 자르기 위한 좌표 정보를 얻습니다. 이 때, get_params 함수는 무작위 자르기를 위해 필요한 좌표 정보를 반환합니다.
- feature = torchvision.transforms.functional.crop(feature, *rect): 주어진 좌표 정보를 이용하여 특징 이미지를 무작위로 잘라냅니다.
- label = torchvision.transforms.functional.crop(label, *rect): 주어진 좌표 정보를 이용하여 라벨 이미지도 무작위로 잘라냅니다.
imgs = []: 빈 리스트 imgs를 생성합니다.
for _ in range(n):: n번 반복하는 루프를 실행합니다.
- imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300): voc_rand_crop 함수를 사용하여 특징 이미지와 라벨 이미지를 200x300 크기로 무작위로 자르고, imgs 리스트에 추가합니다.
imgs = [img.permute(1, 2, 0) for img in imgs]: imgs 리스트 내의 이미지들의 차원을 변경합니다. 각 이미지의 차원을 (채널, 높이, 너비)에서 (높이, 너비, 채널)로 변경하면 이미지가 보다 일반적인 형태로 표현됩니다.
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);: imgs 리스트에 포함된 이미지들을 두 개의 열로 나누어서 그룹으로 나눈 뒤에 표시합니다. 이 때, 첫 번째 그룹은 imgs의 짝수 인덱스 이미지들로, 두 번째 그룹은 홀수 인덱스 이미지들로 구성됩니다.

이 코드는 VOC 데이터셋의 첫 번째 특징 이미지와 라벨 이미지를 무작위로 자르고, 이를 시각화하여 표시하는 내용을 나타냅니다. 이를 통해 데이터 증강과 이미지 처리 과정을 이해할 수 있습니다.

14.9.2.2. Custom Semantic Segmentation Dataset Class

We define a custom semantic segmentation dataset class VOCSegDataset by inheriting the Dataset class provided by high-level APIs. By implementing the __getitem__ function, we can arbitrarily access the input image indexed as idx in the dataset and the class index of each pixel in this image. Since some images in the dataset have a smaller size than the output size of random cropping, these examples are filtered out by a custom filter function. In addition, we also define the normalize_image function to standardize the values of the three RGB channels of input images.

상위 수준 API에서 제공하는 Dataset 클래스를 상속하여 사용자 지정 시맨틱 세분화 데이터 세트 클래스 VOCSegDataset을 정의합니다. __getitem__ 함수를 구현하면 데이터셋에서 idx로 인덱싱된 입력 이미지와 이 이미지의 각 픽셀의 클래스 인덱스에 임의로 액세스할 수 있습니다. 데이터 세트의 일부 이미지는 임의 자르기의 출력 크기보다 크기가 작기 때문에 이러한 예는 사용자 정의 필터 기능으로 필터링됩니다. 또한 normalize_image 함수를 정의하여 입력 이미지의 세 RGB 채널 값을 표준화합니다.

#@save
class VOCSegDataset(torch.utils.data.Dataset):
    """A customized dataset to load the VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.transform = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return self.transform(img.float() / 255)

    def filter(self, imgs):
        return [img for img in imgs if (
            img.shape[1] >= self.crop_size[0] and
            img.shape[2] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature, voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
class VOCSegDataset(torch.utils.data.Dataset):: torch.utils.data.Dataset 클래스를 상속하여 커스텀 데이터셋 클래스 VOCSegDataset을 정의합니다. 이 클래스는 VOC 데이터셋을 불러오고 처리하기 위한 기능을 제공합니다.
- self.transform: VOC 데이터셋의 이미지를 정규화하는 변환을 수행하는 객체를 생성합니다.
- self.crop_size: 임의로 자를 크기를 저장하는 변수입니다.
- features, labels = read_voc_images(voc_dir, is_train=is_train): read_voc_images 함수를 이용하여 특징 이미지와 라벨 이미지를 불러옵니다.
- self.features = [self.normalize_image(feature) for feature in self.filter(features)]: 특징 이미지들을 필터링하고 정규화합니다.
- self.labels = self.filter(labels): 라벨 이미지들도 필터링합니다.
- self.colormap2label = voc_colormap2label(): 라벨 컬러맵을 클래스 인덱스로 매핑하는 매핑을 생성합니다.
- def normalize_image(self, img):: 이미지를 정규화하는 함수입니다.
- def filter(self, imgs):: 이미지들을 필터링하는 함수입니다.
- def __getitem__(self, idx):: 데이터셋에서 특정 인덱스의 아이템을 가져오는 메서드입니다.
- def __len__(self):: 데이터셋의 길이를 반환하는 메서드입니다.

이 코드는 VOC 데이터셋을 커스텀 데이터셋 클래스로 정의하는 내용을 나타냅니다. 이 클래스는 데이터셋을 불러오고 처리하는 기능을 제공하여 모델 학습 등에 활용될 수 있습니다.

14.9.2.3. Reading the Dataset

We use the custom VOCSegDataset class to create instances of the training set and test set, respectively. Suppose that we specify that the output shape of randomly cropped images is 320×480. Below we can view the number of examples that are retained in the training set and test set.

맞춤형 VOCSegDataset 클래스를 사용하여 훈련 세트와 테스트 세트의 인스턴스를 각각 생성합니다. 임의로 자른 이미지의 출력 모양이 320×480이라고 지정한다고 가정합니다. 아래에서 훈련 세트와 테스트 세트에 보관된 예의 수를 볼 수 있습니다.

crop_size = (320, 480)
voc_train = VOCSegDataset(True, crop_size, voc_dir)
voc_test = VOCSegDataset(False, crop_size, voc_dir)

crop_size = (320, 480): 임의로 자를 크기를 (320, 480)로 지정합니다.
voc_train = VOCSegDataset(True, crop_size, voc_dir): VOCSegDataset 클래스를 이용하여 VOC 데이터셋의 학습용 데이터셋 voc_train을 생성합니다. True는 학습 데이터셋을 의미하며, crop_size와 voc_dir은 해당 클래스의 생성자에 필요한 인자입니다.
voc_test = VOCSegDataset(False, crop_size, voc_dir): VOCSegDataset 클래스를 이용하여 VOC 데이터셋의 테스트용 데이터셋 voc_test를 생성합니다. False는 테스트 데이터셋을 의미하며, crop_size와 voc_dir은 해당 클래스의 생성자에 필요한 인자입니다.

이 코드는 학습용과 테스트용 VOC 데이터셋을 생성하는 내용을 나타냅니다. 이를 통해 데이터셋을 생성하여 모델 학습 및 평가에 활용할 수 있습니다.

read 1114 examples
read 1078 examples

Setting the batch size to 64, we define the data iterator for the training set. Let’s print the shape of the first minibatch. Different from in image classification or object detection, labels here are three-dimensional tensors.

배치 크기를 64로 설정하고 훈련 세트에 대한 데이터 반복자를 정의합니다. 첫 번째 미니 배치의 모양을 인쇄해 봅시다. 이미지 분류 또는 객체 감지와 달리 여기서 레이블은 3차원 텐서입니다.

batch_size = 64
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True,
                                    drop_last=True,
                                    num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break

batch_size = 64: 배치 크기를 64로 지정합니다.
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True, drop_last=True, num_workers=d2l.get_dataloader_workers()): torch.utils.data.DataLoader 클래스를 사용하여 학습용 데이터셋 voc_train을 불러오는 데이터 로더 train_iter를 생성합니다. 이 데이터 로더는 배치 크기를 batch_size로 설정하고, 데이터를 섞어서 가져오며(shuffle=True), 마지막 배치를 무시합니다(drop_last=True), 병렬로 데이터를 로딩할 때 사용할 워커의 수를 d2l.get_dataloader_workers()로 설정합니다.
for X, Y in train_iter:: train_iter에서 배치 단위로 데이터를 가져옵니다.
- print(X.shape): 현재 배치의 특징 이미지 X의 모양을 출력합니다.
- print(Y.shape): 현재 배치의 라벨 이미지 Y의 모양을 출력합니다.
break: 첫 번째 배치만 확인하고 반복을 종료합니다.

이 코드는 데이터 로더를 사용하여 학습용 데이터셋을 배치 단위로 불러와서 모양을 확인하는 내용을 나타냅니다. 이를 통해 데이터 로딩 및 배치 처리 과정을 이해할 수 있습니다.

torch.Size([64, 3, 320, 480])
torch.Size([64, 320, 480])

14.9.2.4. Putting It All Together

Finally, we define the following load_data_voc function to download and read the Pascal VOC2012 semantic segmentation dataset. It returns data iterators for both the training and test datasets.

마지막으로 Pascal VOC2012 시맨틱 분할 데이터 세트를 다운로드하고 읽기 위해 다음 load_data_voc 함수를 정의합니다. 교육 및 테스트 데이터 세트 모두에 대한 데이터 반복자를 반환합니다.

#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = torch.utils.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, drop_last=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        drop_last=True, num_workers=num_workers)
    return train_iter, test_iter

#@save: 주석으로, 코드 조각을 문서로 저장할 때 사용되는 주석입니다.
def load_data_voc(batch_size, crop_size):: load_data_voc 함수를 정의합니다. 이 함수는 VOC 시맨틱 세그멘테이션 데이터셋을 불러오는 역할을 합니다.
- voc_dir = d2l.download_extract('voc2012', os.path.join('VOCdevkit', 'VOC2012')): VOC 데이터셋을 다운로드하고 압축을 해제한 경로를 voc_dir에 저장합니다.
- num_workers = d2l.get_dataloader_workers(): 데이터 로더가 사용할 워커의 수를 가져옵니다.
- train_iter = torch.utils.data.DataLoader(VOCSegDataset(True, crop_size, voc_dir), batch_size, shuffle=True, drop_last=True, num_workers=num_workers): 학습 데이터셋을 위한 데이터 로더 train_iter를 생성합니다. True는 학습 데이터셋을 의미하며, VOCSegDataset 클래스를 사용하여 데이터셋을 불러옵니다.
- test_iter = torch.utils.data.DataLoader(VOCSegDataset(False, crop_size, voc_dir), batch_size, drop_last=True, num_workers=num_workers): 테스트 데이터셋을 위한 데이터 로더 test_iter를 생성합니다. False는 테스트 데이터셋을 의미하며, VOCSegDataset 클래스를 사용하여 데이터셋을 불러옵니다.
- return train_iter, test_iter: 학습 데이터셋과 테스트 데이터셋의 데이터 로더를 반환합니다.

이 코드는 VOC 시맨틱 세그멘테이션 데이터셋을 불러오기 위한 함수를 정의한 내용을 나타냅니다. 이 함수를 통해 모델 학습 및 평가에 활용할 수 있는 데이터 로더를 얻을 수 있습니다.

14.9.3. Summary

Semantic segmentation recognizes and understands what are in an image in pixel level by dividing the image into regions belonging to different semantic classes.

시맨틱 분할은 이미지를 서로 다른 시맨틱 클래스에 속하는 영역으로 나누어 이미지에 있는 것을 픽셀 수준으로 인식하고 이해합니다.
One of the most important semantic segmentation dataset is Pascal VOC2012.

가장 중요한 시맨틱 분할 데이터 세트 중 하나는 Pascal VOC2012입니다.
In semantic segmentation, since the input image and label correspond one-to-one on the pixel, the input image is randomly cropped to a fixed shape rather than rescaled.

시맨틱 분할에서 입력 이미지와 레이블은 픽셀에서 일대일로 대응하므로 입력 이미지는 크기를 다시 조정하지 않고 고정된 모양으로 무작위로 자릅니다.

14.9.4. Exercises

How can semantic segmentation be applied in autonomous vehicles and medical image diagnostics? Can you think of other applications?
Recall the descriptions of data augmentation in Section 14.1. Which of the image augmentation methods used in image classification would be infeasible to be applied in semantic segmentation?

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.8. Region-based CNNs (R-CNNs) (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

Dive into Deep Learning/D2L Computer Vision

D2L - 14.8. Region-based CNNs (R-CNNs)

2023. 8. 20. 21:58 | Posted by 솔웅

import torch
import torchvision

X = torch.arange(16.).reshape(1, 1, 4, 4)
X

https://d2l.ai/chapter_computer-vision/rcnn.html

14.8. Region-based CNNs (R-CNNs) — Dive into Deep Learning 1.0.3 documentation

d2l.ai

14.8. Region-based CNNs (R-CNNs)

Besides single shot multibox detection described in Section 14.7, region-based CNNs or regions with CNN features (R-CNNs) are also among many pioneering approaches of applying deep learning to object detection (Girshick et al., 2014). In this section, we will introduce the R-CNN and its series of improvements: the fast R-CNN (Girshick, 2015), the faster R-CNN (Ren et al., 2015), and the mask R-CNN (He et al., 2017). Due to limited space, we will only focus on the design of these models.

섹션 14.7에서 설명한 단일 샷 멀티박스 감지 외에도 지역 기반 CNN 또는 CNN 기능이 있는 지역(R-CNN)도 객체 감지에 딥 러닝을 적용하는 많은 선구적인 접근 방식 중 하나입니다(Girshick et al., 2014). 이 섹션에서는 R-CNN과 일련의 개선 사항인 fast R-CNN(Girshick, 2015), faster R-CNN(Ren et al., 2015) 및 mask R-CNN(He 외, 2017). 제한된 공간으로 인해 이러한 모델의 디자인에만 집중할 것입니다.

14.8.1. R-CNNs

The R-CNN first extracts many (e.g., 2000) region proposals from the input image (e.g., anchor boxes can also be considered as region proposals), labeling their classes and bounding boxes (e.g., offsets). (Girshick et al., 2014)

R-CNN은 먼저 입력 이미지에서 많은(예: 2000개) region proposals을 추출하고(예: 앵커 상자도 region proposals으로 간주할 수 있음) 해당 클래스 및 경계 상자(예: 오프셋)에 레이블을 지정합니다.

Then a CNN is used to perform forward propagation on each region proposal to extract its features. Next, features of each region proposal are used for predicting the class and bounding box of this region proposal.

그런 다음 CNN을 사용하여 각 region proposal 에 대한 순방향 전파를 수행하여 해당 features을 추출합니다. 다음으로, 각 region proposal 의 features은 이 region proposal 의 클래스와 경계 상자를 예측하는 데 사용됩니다.

Fig. 14.8.1 shows the R-CNN model. More concretely, the R-CNN consists of the following four steps:

그림 14.8.1은 R-CNN 모델을 보여준다. 보다 구체적으로 R-CNN은 다음 네 단계로 구성됩니다.

Perform selective search to extract multiple high-quality region proposals on the input image (Uijlings et al., 2013). These proposed regions are usually selected at multiple scales with different shapes and sizes. Each region proposal will be labeled with a class and a ground-truth bounding box.

입력 이미지에서 여러 고품질 region proposals을 추출하기 위해 선택적 검색을 수행합니다(Uijlings et al., 2013). 이러한 제안된 영역은 일반적으로 모양과 크기가 다른 여러 척도에서 선택됩니다. 각 region proposals은 클래스와 ground-truth 경계 상자로 레이블이 지정됩니다.
Choose a pretrained CNN and truncate it before the output layer. Resize each region proposal to the input size required by the network, and output the extracted features for the region proposal through forward propagation.

사전 훈련된 CNN을 선택하고 출력 계층 앞에서 자릅니다. 각 region proposal의 크기를 네트워크에서 요구하는 입력 크기로 조정하고 순방향 전파를 통해 추출된 region proposal의 -features- 특징을 출력합니다.
Take the extracted features and labeled class of each region proposal as an example. Train multiple support vector machines to classify objects, where each support vector machine individually determines whether the example contains a specific class.

각 지역 제안 region proposal 의 추출된 기능 features 과 레이블이 지정된 클래스를 예로 들어 보겠습니다. 여러 서포트 벡터 머신을 훈련시켜 개체를 분류합니다. 여기서 각 서포트 벡터 머신은 예제에 특정 클래스가 포함되어 있는지 여부를 개별적으로 결정합니다.
Take the extracted features and labeled bounding box of each region proposal as an example. Train a linear regression model to predict the ground-truth bounding box.

각 지역 제안 region proposal의 추출된 특징과 레이블이 지정된 경계 상자를 예로 들어 보겠습니다. ground-truth 경계 상자를 예측하도록 선형 회귀 모델을 훈련합니다.

Although the R-CNN model uses pretrained CNNs to effectively extract image features, it is slow. Imagine that we select thousands of region proposals from a single input image: this requires thousands of CNN forward propagations to perform object detection. This massive computing load makes it infeasible to widely use R-CNNs in real-world applications.

R-CNN 모델은 사전 훈련된 CNN을 사용하여 이미지 특징을 효과적으로 추출하지만 속도가 느립니다. 단일 입력 이미지에서 수천 개의 지역 제안 region proposals을 선택한다고 상상해보십시오. 개체 감지를 수행하려면 수천 개의 CNN 정방향 전파가 필요합니다. 이 막대한 컴퓨팅 부하로 인해 실제 응용 프로그램에서 R-CNN을 널리 사용하는 것이 불가능합니다.

Region Proposal 이란?

A "Region Proposal" refers to a process in object detection tasks where potential regions or bounding boxes containing objects of interest are proposed within an image. In object detection, the goal is to identify and localize objects within an image, often by drawing bounding boxes around them. However, directly applying object detection algorithms to all possible image regions can be computationally expensive and inefficient.

"Region Proposal(영역 제안)"은 객체 검출 작업에서 이미지 내에 관심 대상 객체가 포함될 가능성이 있는 영역 또는 경계 상자를 제안하는 과정을 의미합니다. 객체 검출에서의 목표는 이미지 내에서 객체를 식별하고 위치를 파악하는 것인데, 모든 가능한 이미지 영역에 대해 직접 객체 검출 알고리즘을 적용하는 것은 계산적으로 부담스럽고 비효율적일 수 있습니다.

To address this, region proposal methods are used to narrow down the search space for object detection. These methods aim to generate a relatively small set of potential regions or bounding boxes that are likely to contain objects. These proposals serve as the input to the subsequent object detection algorithm, significantly reducing the computational burden.

이러한 문제를 해결하기 위해 영역 제안 방법을 사용하여 객체 검출을 위한 검색 공간을 좁힙니다. 이러한 방법은 객체가 포함될 가능성이 높은 상대적으로 작은 영역이나 경계 상자 세트를 생성하는 것을 목표로 합니다. 이러한 제안은 후속 객체 검출 알고리즘의 입력으로 사용되며, 계산 부담을 크게 줄여줍니다.

There are various techniques for generating region proposals. Some common methods include:

영역 제안을 생성하는 다양한 기법이 있습니다. 일반적인 방법은 다음과 같습니다:

Sliding Window Approach: A small window of fixed size is moved across the image at different scales and positions. Bounding boxes are generated for each window location. This approach can be computationally intensive due to the need to slide the window over the entire image.

슬라이딩 윈도우 접근법: 고정 크기의 작은 창을 이미지 상에서 다양한 스케일과 위치로 이동시키면서 각 창 위치에 대한 경계 상자를 생성합니다. 이 접근법은 전체 이미지에 대해 창을 이동시켜야 하기 때문에 계산적으로 비싼 경우가 있을 수 있습니다.
Selective Search: This method generates region proposals by grouping pixels based on their similarity in color, texture, and intensity. The algorithm combines regions of varying sizes and shapes to create a diverse set of proposals.

Selective Search(선택적 검색): 이 방법은 색상, 질감, 강도의 유사성을 기반으로 픽셀을 그룹화하여 영역 제안을 생성합니다. 알고리즘은 다양한 크기와 모양의 영역을 결합하여 다양한 제안 세트를 만듭니다.
EdgeBoxes: EdgeBoxes uses the edges in an image to generate object proposals. It identifies object-like regions by examining the number of bounding box edges that overlap with the object's edge.

EdgeBoxes: EdgeBoxes는 이미지 내의 가장자리를 활용하여 객체 제안을 생성합니다. 객체의 가장자리와 겹치는 경계 상자 가장자리의 수를 조사하여 객체와 유사한 영역을 식별합니다.
Region Proposal Networks (RPNs): RPNs are part of modern object detection frameworks like Faster R-CNN. They use convolutional neural networks to generate region proposals directly from the input image. RPNs learn to propose regions that are likely to contain objects.

Region Proposal Networks (RPNs): RPN은 Faster R-CNN과 같은 현대적인 객체 검출 프레임워크의 일부입니다. RPN은 컨볼루션 신경망을 사용하여 입력 이미지로부터 직접적으로 영역 제안을 생성합니다. RPN은 객체가 포함될 가능성이 높은 영역을 제안하도록 학습됩니다.

After generating region proposals, these proposals are typically ranked or scored based on their likelihood of containing objects. High-scoring proposals are then passed to the subsequent object classification and localization steps of the object detection pipeline.

영역 제안을 생성한 후, 이러한 제안은 일반적으로 객체 포함 가능성에 따라 순위 또는 점수를 매깁니다. 높은 점수를 받은 제안은 후속 객체 분류 및 위치 파악 단계로 전달됩니다.

In summary, "Region Proposal" refers to the process of generating potential bounding boxes or regions in an image that are likely to contain objects. This process helps reduce the search space for object detection tasks and improves the efficiency of object detection algorithms.

요약하면, "Region Proposal(영역 제안)"은 이미지 내에 있는 객체가 포함될 가능성이 높은 경계 상자나 영역을 생성하는 과정을 의미합니다. 이 과정을 통해 객체 검출 작업의 검색 공간이 축소되며, 객체 검출 알고리즘의 효율성이 향상됩니다.

14.8.2. Fast R-CNN

The main performance bottleneck of an R-CNN lies in the independent CNN forward propagation for each region proposal, without sharing computation. Since these regions usually have overlaps, independent feature extractions lead to much repeated computation. One of the major improvements of the fast R-CNN from the R-CNN is that the CNN forward propagation is only performed on the entire image (Girshick, 2015).

R-CNN의 주요 성능 병목 현상은 계산을 공유하지 않고 각 region proposal에 대한 독립적인 CNN 순방향 전파에 있습니다. 이러한 영역은 일반적으로 겹치기 때문에 독립적인 특징 추출은 많은 반복 계산으로 이어집니다. R-CNN에서 빠른 R-CNN의 주요 개선 사항 중 하나는 CNN 순방향 전파가 전체 이미지에서만 수행된다는 것입니다(Girshick, 2015).

Fig. 14.8.2 describes the fast R-CNN model. Its major computations are as follows:

그림 14.8.2는 빠른 R-CNN 모델을 설명합니다. 주요 계산은 다음과 같습니다.

Compared with the R-CNN, in the fast R-CNN the input of the CNN for feature extraction is the entire image, rather than individual region proposals. Moreover, this CNN is trainable. Given an input image, let the shape of the CNN output be 1×c×ℎ1×w1.

R-CNN과 비교할 때 빠른 R-CNN에서 특징 feature 추출을 위한 CNN의 입력은 개별 영역 제안region proposals 이 아닌 전체 이미지입니다. 게다가, 이 CNN은 학습이 가능합니다. 입력 이미지가 주어지면 CNN 출력의 모양을 1×c×ℎ1×w1이라고 합니다.
Suppose that selective search generates n region proposals. These region proposals (of different shapes) mark regions of interest (of different shapes) on the CNN output. Then these regions of interest further extract features of the same shape (say height ℎ2 and width w2 are specified) in order to be easily concatenated. To achieve this, the fast R-CNN introduces the region of interest (RoI) pooling layer: the CNN output and region proposals are input into this layer, outputting concatenated features of shape n×c×ℎ2×w2 that are further extracted for all the region proposals.

선택적 검색이 n개의 지역 제안 region proposals을 생성한다고 가정합니다. 이러한 영역 제안 region proposals(다른 모양)은 CNN 출력에서 관심 영역 regions of interest (다른 모양)을 표시합니다. 그런 다음 이러한 관심 영역은 쉽게 연결하기 위해 동일한 모양(높이 ℎ2 및 너비 w2가 지정됨)의 특징을 추가로 추출합니다. 이를 달성하기 위해 빠른 R-CNN은 관심 영역(regions of interest -RoI) 풀링 레이어를 도입합니다. CNN 출력 및 영역 제안 region proposa 이 이 레이어에 입력됩니다. 모든 영역 제안 region proposa에 대해 추가로 추출되는 shape n×c×ℎ2×w2의 연결된 특징 features 을 출력합니다.
Using a fully connected layer, transform the concatenated features into an output of shape n×d, where d depends on the model design.

완전 연결 레이어를 사용하여 연결된 피처를 모양 shape n×d의 출력으로 변환합니다. 여기서 d는 모델 디자인에 따라 다릅니다.
Predict the class and bounding box for each of the n region proposals. More concretely, in class and bounding box prediction, transform the fully connected layer output into an output of shape n×q (q is the number of classes) and an output of shape n×4, respectively. The class prediction uses softmax regression.

n개의 영역 제안 region proposals 각각에 대한 클래스 및 경계 상자를 예측합니다. 보다 구체적으로 클래스 및 경계 상자 예측에서 완전 연결 계층 출력을 각각 모양 n×q(q는 클래스 수) 및 모양 n×4의 출력으로 변환합니다. 클래스 예측은 softmax 회귀를 사용합니다.

The region of interest pooling layer proposed in the fast R-CNN is different from the pooling layer introduced in Section 7.5. In the pooling layer, we indirectly control the output shape by specifying sizes of the pooling window, padding, and stride. In contrast, we can directly specify the output shape in the region of interest pooling layer.

fast R-CNN에서 제안하는 region of interest pooling layer은 섹션 7.5에서 소개한 풀링 계층과 다릅니다. 풀링 레이어에서 풀링 창, 패딩 및 보폭의 크기를 지정하여 출력 모양을 간접적으로 제어합니다. 반대로 관심 영역 풀링 레이어에서 출력 모양을 직접 지정할 수 있습니다.

Region of Interest Pooling Layer (RoI Layer) 란?

The Region of Interest (RoI) pooling layer is a component commonly used in convolutional neural networks (CNNs) for object detection and instance segmentation tasks. Its primary purpose is to extract fixed-size feature representations from variable-sized regions of an input feature map. The RoI pooling layer plays a critical role in handling objects of different sizes and positions within an image and is crucial for enabling CNNs to perform accurate object localization and recognition.

관심 영역 풀링 레이어(Region of Interest pooling layer)는 객체 탐지 및 인스턴스 분할 작업에 널리 사용되는 컨볼루션 신경망(CNN) 구성 요소입니다. 주요 목적은 입력 특성 맵의 가변 크기 영역에서 고정 크기의 특성 표현을 추출하는 것입니다. 관심 영역 풀링 레이어는 이미지 내 다양한 크기와 위치의 객체를 처리하는 중요한 역할을 하며, 정확한 객체 지역화와 인식을 위한 CNN의 핵심 역할을 합니다.

In object detection tasks, a CNN generates a set of candidate bounding box predictions for potential objects within an image. These bounding boxes can vary in size and aspect ratio based on the object's location and scale. The RoI pooling layer is applied to these predicted bounding boxes to obtain a consistent spatial resolution of feature representations, which can then be fed into subsequent layers for classification and regression.

객체 탐지 작업에서 CNN은 이미지 내 잠재적인 객체에 대한 후보 바운딩 박스 예측 집합을 생성합니다. 이러한 바운딩 박스는 객체의 위치와 크기에 따라 크기와 종횡비가 다를 수 있습니다. 관심 영역 풀링 레이어는 이러한 예측된 바운딩 박스에 적용되어 고정된 공간 해상도(예: 7x7)의 특성 표현을 얻습니다. 이로써 후속 레이어로 분류 및 회귀 작업에 입력될 수 있습니다.

The RoI pooling layer operates as follows:

관심 영역 풀링 레이어는 다음과 같이 작동합니다:

Input: The input consists of a feature map generated by the preceding layers of the CNN and a set of predicted bounding boxes (RoIs).

입력: 입력은 CNN의 이전 레이어에서 생성된 특성 맵과 예측된 바운딩 박스(RoIs) 집합으로 구성됩니다.
Bounding Box Adaptation: Each RoI is adapted and aligned to a fixed spatial resolution (e.g., 7x7) using RoI pooling. This involves dividing the RoI into a grid of equal cells and selecting the maximum value from each cell. This pooling operation ensures that each RoI is transformed into a consistent feature map with the same spatial dimensions.

바운딩 박스 적: 각 RoI는 관심 영역 풀링을 사용하여 고정된 공간 해상도(예: 7x7)에 맞게 조정됩니다. 이는 RoI를 동일한 크기의 셀 그리드로 나누고 각 셀에서 최대 값을 선택하는 것을 포함합니다. 이 풀링 연산을 통해 각 RoI가 동일한 공간 차원을 가진 일관된 특성 맵으로 변환됩니다.
Feature Extraction: The RoI pooling operation generates fixed-size feature maps for each RoI. These feature maps can be directly fed into fully connected layers for classification or regression tasks.

특성 추출: 관심 영역 풀링 작업은 각 RoI에 대해 고정된 크기의 특성 맵을 생성합니다. 이러한 특성 맵은 분류나 회귀 작업을 위해 직접 완전 연결 레이어에 입력될 수 있습니다.

By using the RoI pooling layer, CNNs can effectively handle objects of different sizes, enabling the network to focus on relevant object details and reducing the impact of variations in object scale and position. This layer is a key component in modern object detection architectures like Faster R-CNN and Mask R-CNN.

관심 영역 풀링 레이어를 사용함으로써 CNN은 다양한 크기의 객체를 효과적(효율)으로 처리할 수 있으며, 네트워크는 관련성 있는 객체 세부 정보에 집중하고 객체 크기와 위치의 변동을 줄일 수 있습니다. 이 레이어는 Faster R-CNN 및 Mask R-CNN과 같은 현대적인 객체 탐지 아키텍처에서 핵심 구성 요소입니다.

Overall, the RoI pooling layer plays a pivotal role in enabling CNNs to process variable-sized regions of an input feature map, thereby facilitating accurate object detection, localization, and segmentation in complex scenes.

전반적으로, 관심 영역 풀링 레이어는 입력 특성 맵의 가변 크기 영역을 처리하는 데 중요한 역할을하여 복잡한 장면에서 정확한 객체 탐지, 지역화 및 분할을 용이하게 하는 데 중요한 역할을 합니다.

For example, let’s specify the output height and width for each region as ℎ2 and w2, respectively. For any region of interest window of shape ℎ×w, this window is divided into a ℎ2×w2 grid of subwindows, where the shape of each subwindow is approximately (ℎ/ℎ2)×(w/w2). In practice, the height and width of any subwindow shall be rounded up, and the largest element shall be used as output of the subwindow. Therefore, the region of interest pooling layer can extract features of the same shape even when regions of interest have different shapes.

예를 들어 각 영역의 출력 높이와 너비를 각각 ℎ2와 w2로 지정해 보겠습니다. ℎ×w 모양의 관심 영역 창에 대해 이 창은 ℎ2×w2 그리드의 하위 창으로 나뉘며 각 하위 창의 모양은 대략 (ℎ/ℎ2)×(w/w2)입니다. 실제로 모든 하위 창의 높이와 너비는 반올림되고 가장 큰 요소가 하위 창의 출력으로 사용됩니다. 따라서 관심 영역 풀링 계층은 관심 영역의 모양이 다른 경우에도 동일한 모양의 특징을 추출할 수 있습니다.

As an illustrative example, in Fig. 14.8.3, the upper-left 3×3 region of interest is selected on a 4×4 input. For this region of interest, we use a 2×2 region of interest pooling layer to obtain a 2×2 output. Note that each of the four divided subwindows contains elements 0, 1, 4, and 5 (5 is the maximum); 2 and 6 (6 is the maximum); 8 and 9 (9 is the maximum); and 10.

예를 들어, 그림 14.8.3에서 왼쪽 상단의 3×3 관심 영역이 4×4 입력에서 선택됩니다. 이 관심 영역에 대해 2×2 관심 영역 풀링 계층을 사용하여 2×2 출력을 얻습니다. 4개의 분할된 하위 창 각각에는 요소 0, 1, 4 및 5(5가 최대값)가 포함되어 있습니다. 2 및 6(6이 최대값); 8 및 9(9가 최대); 및 10.

Fig. 14.8.3  A 2×2 region of interest pooling layer.

Below we demonstrate the computation of the region of interest pooling layer. Suppose that the height and width of the CNN-extracted features X are both 4, and there is only a single channel.

아래에서는 관심 영역 풀링 계층의 계산을 보여줍니다. CNN에서 추출한 특징 X의 높이와 너비가 모두 4이고 채널이 하나만 있다고 가정합니다.

import torch
import torchvision

X = torch.arange(16.).reshape(1, 1, 4, 4)
X

import torch: 파이토치 라이브러리를 가져옵니다.
import torchvision: 파이토치 비전 라이브러리를 가져옵니다. 이미지 관련 작업을 위한 함수와 모델을 포함하고 있습니다.
X = torch.arange(16.).reshape(1, 1, 4, 4): 0부터 15까지의 숫자로 이루어진 배열을 생성합니다. .reshape(1, 1, 4, 4)를 사용하여 배열을 4x4 크기의 4차원 텐서로 변환합니다. 이렇게 변환하면 (batch_size, channels, height, width) 형태로 텐서를 표현하게 됩니다. 여기서 batch_size와 channels는 1로 설정되었습니다.
X: 생성된 텐서 X를 출력합니다.

이 코드는 4x4 크기의 1채널 이미지를 표현하는 4차원 텐서 X를 생성하고 출력하는 내용을 나타냅니다.

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

Let’s further suppose that the height and width of the input image are both 40 pixels and that selective search generates two region proposals on this image. Each region proposal is expressed as five elements: its object class followed by the (x,y)-coordinates of its upper-left and lower-right corners.

또한 입력 이미지의 높이와 너비가 모두 40픽셀이고 선택적 검색이 이 이미지에 대해 두 개의 영역 제안region proposals을 생성한다고 가정해 보겠습니다. 각 영역 제안은 5개의 요소로 표현됩니다. 개체 클래스와 왼쪽 상단 및 오른쪽 하단 모서리의 (x,y) 좌표가 뒤따릅니다.

rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])

rois: rois라는 변수에 텐서 값을 할당합니다.
torch.Tensor: 파이토치에서 텐서를 생성하는 클래스입니다. 행렬이나 다차원 배열과 유사한 데이터 구조를 나타냅니다.
[[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]]: 2개의 리스트를 가진 리스트로 이루어진 텐서를 생성합니다. 각 리스트는 5개의 숫자로 이루어져 있으며, 각각의 리스트는 ROI(Region of Interest) 정보를 나타냅니다. 각 ROI 정보는 [batch index, class index, x1, y1, x2, y2] 형태로 표현됩니다. 여기서 batch index는 배치 내에서 몇 번째 이미지인지, class index는 객체 클래스를 의미하며, (x1, y1)은 좌상단 좌표, (x2, y2)는 우하단 좌표를 나타냅니다.
==> 요소가 5개인데 설명은 6개로 돼 있음. 교재 내용을 보면 object class,x1,y1,x2,y2 가 맞는 것 같

이 코드는 두 개의 ROI 정보를 담은 텐서 rois를 생성하는 내용을 나타냅니다.

Because the height and width of X are 1/10 of the height and width of the input image, the coordinates of the two region proposals are multiplied by 0.1 according to the specified spatial_scale argument. Then the two regions of interest are marked on X as X[:, :, 0:3, 0:3] and X[:, :, 1:4, 0:4], respectively. Finally in the 2×2 region of interest pooling, each region of interest is divided into a grid of sub-windows to further extract features of the same shape 2×2.

X의 높이와 너비는 입력 이미지의 높이와 너비의 1/10이기 때문에 두 영역 제안의 좌표는 지정된 spatial_scale 인수에 따라 0.1이 곱해집니다. 그런 다음 두 개의 관심 영역이 X에 각각 X[:, :, 0:3, 0:3] 및 X[:, :, 1:4, 0:4]로 표시됩니다. 마지막으로 2×2 관심 영역 풀링에서는 각 관심 영역을 하위 창의 그리드로 나누어 동일한 모양의 2×2 특징을 더 추출합니다.

torchvision.ops.roi_pool(X, rois, output_size=(2, 2), spatial_scale=0.1)

torchvision.ops.roi_pool: torchvision 패키지 내의 ops 모듈에서 roi_pool 함수를 호출합니다. 이 함수는 RoI (Region of Interest) Pooling 연산을 수행합니다.
X: 입력 데이터로서, RoI Pooling 연산을 수행할 대상 텐서입니다.
rois: RoI 정보를 담은 텐서입니다. 각 RoI 정보는 [batch index, class index, x1, y1, x2, y2] 형태로 구성되어 있습니다.
output_size=(2, 2): RoI Pooling 연산의 출력 크기를 지정합니다. 여기서는 2x2 크기의 출력을 생성하도록 지정되었습니다.
spatial_scale=0.1: RoI Pooling 연산에서 입력과 출력의 크기 비율을 나타내는 값입니다. 이 값은 입력 RoI의 좌표를 출력 RoI 좌표로 변환하는 데 사용됩니다.

이 코드는 입력 데이터 X와 RoI 정보를 사용하여 RoI Pooling 연산을 수행하는 내용을 나타냅니다. RoI Pooling은 주어진 RoI 영역을 정해진 크기로 출력하는 연산으로, 객체 검출과 같은 작업에서 주로 사용됩니다.

tensor([[[[ 5.,  6.],
          [ 9., 10.]]],


        [[[ 9., 11.],
          [13., 15.]]]])

14.8.3. Faster R-CNN

To be more accurate in object detection, the fast R-CNN model usually has to generate a lot of region proposals in selective search. To reduce region proposals without loss of accuracy, the faster R-CNN proposes to replace selective search with a region proposal network (Ren et al., 2015).

개체 감지에서 보다 정확하려면 fast R-CNN 모델은 일반적으로 선택적 검색에서 많은 영역 제안을 생성해야 합니다. 정확도 손실 없이 지역 제안 region proposals을 줄이기 위해 Faster R-CNN은 선택적 검색을 지역 제안 region proposals 네트워크로 대체할 것을 제안합니다(Ren et al., 2015).

Fig. 14.8.4 shows the faster R-CNN model. Compared with the fast R-CNN, the faster R-CNN only changes the region proposal method from selective search to a region proposal network. The rest of the model remain unchanged. The region proposal network works in the following steps:

그림 14.8.4는 더 빠른 R-CNN 모델을 보여줍니다. 빠른 R-CNN과 비교할 때 더 빠른 R-CNN은 지역 제안 region proposal 방법만 선택적 검색에서 지역 제안 region proposal 네트워크로 변경합니다. 나머지 모델은 변경되지 않습니다. 지역 제안 네트워크는 다음 단계로 작동합니다.

Use a 3×3 convolutional layer with padding of 1 to transform the CNN output to a new output with c channels. In this way, each unit along the spatial dimensions of the CNN-extracted feature maps gets a new feature vector of length c.

패딩이 1인 3×3 컨벌루션 레이어를 사용하여 CNN 출력을 c 채널이 있는 새 출력으로 변환합니다. 이러한 방식으로 CNN 추출 기능 맵의 공간 차원을 따라 각 단위는 길이가 c인 새로운 기능 벡터를 얻습니다.
Centered on each pixel of the feature maps, generate multiple anchor boxes of different scales and aspect ratios and label them.

기능 맵의 각 픽셀을 중심으로 다양한 크기와 가로 세로 비율의 여러 앵커 상자를 생성하고 레이블을 지정합니다.
Using the length-c feature vector at the center of each anchor box, predict the binary class (background or objects) and bounding box for this anchor box.

각 앵커 상자의 중심에 있는 길이-c 특징 벡터를 사용하여 이 앵커 상자의 이진 클래스(배경 또는 객체) 및 경계 상자를 예측합니다.
Consider those predicted bounding boxes whose predicted classes are objects. Remove overlapped results using non-maximum suppression. The remaining predicted bounding boxes for objects are the region proposals required by the region of interest pooling layer.

예측 클래스가 객체인 예측된 경계 상자를 고려하십시오. 최대가 아닌 억제를 사용하여 중첩된 결과를 제거합니다. 개체에 대한 나머지 예측 경계 상자는 관심 영역 풀링 레이어에 필요한 영역 제안입니다.

It is worth noting that, as part of the faster R-CNN model, the region proposal network is jointly trained with the rest of the model. In other words, the objective function of the faster R-CNN includes not only the class and bounding box prediction in object detection, but also the binary class and bounding box prediction of anchor boxes in the region proposal network. As a result of the end-to-end training, the region proposal network learns how to generate high-quality region proposals, so as to stay accurate in object detection with a reduced number of region proposals that are learned from data.

더 빠른 R-CNN 모델의 일부로 영역 제안 region proposal 네트워크가 모델의 나머지 부분과 공동으로 훈련된다는 점은 주목할 가치가 있습니다. 즉, 더 빠른 R-CNN의 목적 함수는 객체 감지에서 클래스 및 경계 상자 예측뿐만 아니라 영역 제안 region proposal 네트워크에서 앵커 상자의 이진 클래스 및 경계 상자 예측을 포함합니다. 종단 간 교육의 결과, 지역 제안 region proposal 네트워크는 고품질 지역 제안 region proposal을 생성하는 방법을 학습하여 데이터에서 학습된 지역 제안 region proposal의 수를 줄임과 동시 객체 감지에서 정확성을 유지합니다.

14.8.4. Mask R-CNN

In the training dataset, if pixel-level positions of object are also labeled on images, the mask R-CNN can effectively leverage such detailed labels to further improve the accuracy of object detection (He et al., 2017).

학습 데이터 세트에서 개체의 픽셀 수준 위치도 이미지에 레이블이 지정된 경우 마스크 R-CNN은 이러한 세부 레이블을 효과적으로 활용하여 개체 감지의 정확도를 더욱 향상시킬 수 있습니다(He et al., 2017).

As shown in Fig. 14.8.5, the mask R-CNN is modified based on the faster R-CNN. Specifically, the mask R-CNN replaces the region of interest pooling layer with the region of interest (RoI) alignment layer. This region of interest alignment layer uses bilinear interpolation to preserve the spatial information on the feature maps, which is more suitable for pixel-level prediction. The output of this layer contains feature maps of the same shape for all the regions of interest. They are used to predict not only the class and bounding box for each region of interest, but also the pixel-level position of the object through an additional fully convolutional network. More details on using a fully convolutional network to predict pixel-level semantics of an image will be provided in subsequent sections of this chapter.

그림 14.8.5와 같이 더 빠른 R-CNN을 기반으로 마스크 R-CNN이 수정되었습니다. 특히, 마스크 R-CNN은 관심 영역 풀링 레이어를 관심 영역(RoI) 정렬 레이어로 대체합니다. 이 관심 영역 정렬 레이어는 쌍선형 보간을 사용하여 픽셀 수준 예측에 더 적합한 기능 맵의 공간 정보를 보존합니다. 이 레이어의 출력에는 모든 관심 영역에 대해 동일한 모양의 기능 맵이 포함됩니다. 각 관심 영역에 대한 클래스 및 경계 상자뿐만 아니라 추가적인 완전 컨벌루션 네트워크를 통해 개체의 픽셀 수준 위치를 예측하는 데 사용됩니다. 이미지의 픽셀 수준 의미론을 예측하기 위해 완전 컨벌루션 네트워크를 사용하는 방법에 대한 자세한 내용은 이 장의 다음 섹션에서 제공됩니다.

14.8.5. Summary

The R-CNN extracts many region proposals from the input image, uses a CNN to perform forward propagation on each region proposal to extract its features, then uses these features to predict the class and bounding box of this region proposal.

R-CNN은 입력 이미지에서 많은 영역 제안region proposals을 추출하고, CNN을 사용하여 각 영역 제안에 대해 정방향 전파를 수행하여 해당 기능을 추출한 다음 이러한 기능을 사용하여 이 영역 제안의 클래스 및 경계 상자를 예측합니다.
One of the major improvements of the fast R-CNN from the R-CNN is that the CNN forward propagation is only performed on the entire image. It also introduces the region of interest pooling layer, so that features of the same shape can be further extracted for regions of interest that have different shapes.

R-CNN에서 빠른 R-CNN의 주요 개선 사항 중 하나는 CNN 순방향 전파가 전체 이미지에서만 수행된다는 것입니다. 또한 관심 영역 풀링 레이어를 도입하여 모양이 다른 관심 영역에 대해 동일한 모양의 특징을 추가로 추출할 수 있습니다.
The faster R-CNN replaces the selective search used in the fast R-CNN with a jointly trained region proposal network, so that the former can stay accurate in object detection with a reduced number of region proposals.

더 빠른 R-CNN은 빠른 R-CNN에서 사용되는 선택적 검색을 공동으로 훈련된 영역 제안 네트워크로 대체하므로 전자는 감소된 수의 영역 제안으로 객체 감지에서 정확성을 유지할 수 있습니다.
Based on the faster R-CNN, the mask R-CNN additionally introduces a fully convolutional network, so as to leverage pixel-level labels to further improve the accuracy of object detection.

더 빠른 R-CNN을 기반으로 하는 마스크 R-CNN은 픽셀 수준 레이블을 활용하여 개체 감지의 정확도를 더욱 향상시키기 위해 완전히 컨벌루션 네트워크를 추가로 도입합니다.

14.8.6. Exercises

Can we frame object detection as a single regression problem, such as predicting bounding boxes and class probabilities? You may refer to the design of the YOLO model (Redmon et al., 2016).
Compare single shot multibox detection with the methods introduced in this section. What are their major differences? You may refer to Figure 2 of Zhao et al. (2019).

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

D2L - 14.12. Neural Style Transfer (1)	2023.08.21
D2L - 14.11. Fully Convolutional Networks (0)	2023.08.21
D2L - 14.10. Transposed Convolution (1)	2023.08.21
D2L - 14.9. Semantic Segmentation and the Dataset (0)	2023.08.20
D2L - 14.7. Single Shot Multibox Detection (0)	2023.08.19
D2L - 14.6. The Object Detection Dataset (0)	2023.08.19
D2L - 14.5. Multiscale Object Detection (0)	2023.08.19
D2L - 14.4. Anchor Boxes (0)	2023.08.19
D2L - 14.3. Object Detection and Bounding Boxes (0)	2023.08.19
D2L - 14.2. Fine-Tuning (0)	2023.08.19

1 ··· 19 20 21 22 23 24 25 ··· 162

공지사항

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

카테고리

'분류 전체보기'에 해당되는 글 1620건

15.4. Pretraining word2vec

15.4.1. The Skip-Gram Model

15.4.1.1. Embedding Layer

15.4.1.2. Defining the Forward Propagation

15.4.2. Training

15.4.2.1. Binary Cross-Entropy Loss

15.4.2.2. Initializing Model Parameters

15.4.2.3. Defining the Training Loop

15.4.3. Applying Word Embeddings

15.4.4. Summary

15.4.5. Exercises

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

15.3. The Dataset for Pretraining Word Embeddings

15.3.1. Reading the Dataset

15.3.2. Subsampling

15.3.3. Extracting Center Words and Context Words

15.3.4. Negative Sampling

15.3.5. Loading Training Examples in Minibatches

15.3.6. Putting It All Together

15.3.7. Summary

15.3.8. Exercises¶

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

15.2.1. Negative Sampling

15.2.2. Hierarchical Softmax

15.2.3. Summary

15.2.4. Exercises

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

15.1. Word Embedding (word2vec)

15.1.1. One-Hot Vectors Are a Bad Choice

15.1.2. Self-Supervised word2vec

15.1.3. The Skip-Gram Model

15.1.3.1. Training

15.1.4. The Continuous Bag of Words (CBOW) Model

15.1.4.1. Training

5.1.5. Summary

15.1.6. Exercises

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

15. Natural Language Processing: Pretraining

'Dive into Deep Learning > D2L Natural language Processing' 카테고리의 다른 글

14.12. Neural Style Transfer

14.12.1. Method

14.12.2. Reading the Content and Style Images

14.12.3. Preprocessing and Postprocessing

14.12.4. Extracting Features

14.12.5. Defining the Loss Function

14.12.5.1. Content Loss

14.12.5.2. Style Loss

14.12.5.3. Total Variation Loss

14.12.5.4. Loss Function

14.12.6. Initializing the Synthesized Image

14.12.7. Training

14.12.8. Summary

14.12.9. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.11.1. The Model

14.11.2. Initializing Transposed Convolutional Layers

14.11.3. Reading the Dataset

14.11.4. Training

14.11.5. Prediction

14.11.6. Summary

14.11.7. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.10. Transposed Convolution

14.10.1. Basic Operation

14.10.2. Padding, Strides, and Multiple Channels

14.10.3. Connection to Matrix Transposition

14.10.4. Summary

14.10.5. Exercises

'Dive into Deep Learning > D2L Computer Vision' 카테고리의 다른 글

14.9. Semantic Segmentation and the Dataset

14.9.1. Image Segmentation and Instance Segmentation

14.9.2. The Pascal VOC2012 Semantic Segmentation Dataset

14.9.2.1. Data Preprocessing