D2L - 15.2. Approximate Training


2023. 8. 28. 00:07 | Posted by 솔웅

15.2. Approximate Training — Dive into Deep Learning 1.0.3 documentation (d2l.ai)


Recall the discussion in Section 15.1. The main idea of the skip-gram model is to use softmax operations to compute the conditional probability of generating a context word $w_o$ given the center word $w_c$, as in (15.1.4). The corresponding logarithmic loss is the negative of (15.1.7).

Because a context word may be any word in the dictionary $\mathcal{V}$, the softmax makes the negative of (15.1.7) contain a sum with as many terms as there are words in the vocabulary. Consequently, the gradient for the skip-gram model in (15.1.8) and the gradient for the continuous bag-of-words model in (15.1.15) both contain this summation. Unfortunately, computing gradients that sum over a large dictionary (often hundreds of thousands or millions of words) is very expensive.

15.2.1. Negative Sampling

Negative sampling modifies the original objective function. Given the context window of a center word $w_c$, the fact that a (context) word $w_o$ comes from this context window is treated as an event whose probability is modeled by

$$P(D=1\mid w_c, w_o) = \sigma(\mathbf{u}_o^\top \mathbf{v}_c), \tag{15.2.1}$$

where $\sigma$ is the sigmoid activation function:

$$\sigma(x) = \frac{1}{1+\exp(-x)}. \tag{15.2.2}$$

We begin by maximizing the joint probability of all such events in text sequences to train word embeddings. Specifically, given a text sequence of length $T$, denote by $w^{(t)}$ the word at time step $t$ and let the context window size be $m$. Consider maximizing the joint probability

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(D=1\mid w^{(t)}, w^{(t+j)}). \tag{15.2.3}$$

However, (15.2.3) only considers events that involve positive examples. As a result, the joint probability in (15.2.3) reaches its maximum of 1 only if all the word vectors go to infinity. Such a result is meaningless. To make the objective function more meaningful, negative sampling adds negative examples sampled from a predefined distribution.
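To see why a positive-only objective is degenerate, the following minimal sketch scales a pair of vectors and watches the modeled probability $\sigma(\mathbf{u}_o^\top \mathbf{v}_c)$ creep toward 1. The two-dimensional vectors and the scale factors are made up for illustration and are not taken from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 2-d vectors for one (center, context) pair.
v_c = [0.5, 1.0]
u_o = [1.0, 0.5]

# Scaling both vectors by s multiplies their dot product by s**2,
# so the positive-pair probability keeps growing toward 1.
scales = (1, 2, 3, 4)
probs = [
    sigmoid(dot([s * x for x in u_o], [s * x for x in v_c]))
    for s in scales
]

for s, p in zip(scales, probs):
    print(f"scale={s}  P(D=1)={p:.8f}")
```

Nothing in the positive-only objective stops this growth, which is the motivation for adding negative examples.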

Denote by $S$ the event that a context word $w_o$ comes from the context window of a center word $w_c$. For this event involving $w_o$, sample $K$ noise words that are not from this context window from a predefined distribution $P(w)$. Denote by $N_k$ the event that a noise word $w_k$ ($k=1, \ldots, K$) does not come from the context window of $w_c$. Assume that these events involving both the positive example and the negative examples, $S, N_1, \ldots, N_K$, are mutually independent. Negative sampling rewrites the joint probability (involving only positive examples) in (15.2.3) as

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}), \tag{15.2.4}$$

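where the conditional probability is approximated through the events $S, N_1, \ldots, N_K$:

$$P(w^{(t+j)} \mid w^{(t)}) = P(D=1\mid w^{(t)}, w^{(t+j)}) \prod_{k=1,\ w_k \sim P(w)}^{K} P(D=0\mid w^{(t)}, w_k). \tag{15.2.5}$$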
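The text only says that noise words come from a predefined distribution $P(w)$. One possible choice, sketched below, is the unigram frequency raised to the power 0.75, which is the choice commonly associated with word2vec. Treat the exponent, the toy corpus, and the helper name `make_noise_sampler` as assumptions made for this sketch.

```python
import random
from collections import Counter


def make_noise_sampler(token_counts, power=0.75, seed=0):
    """Return a function that draws k noise words outside a context window.

    Assumption: P(w) is proportional to count(w) ** power.
    """
    words = list(token_counts)
    weights = [token_counts[w] ** power for w in words]
    rng = random.Random(seed)

    def sample(context_words, k):
        noise = []
        # Rejection sampling: redraw whenever the candidate is in the window.
        # This loops forever if every word is in context_words.
        while len(noise) < k:
            candidate = rng.choices(words, weights=weights, k=1)[0]
            if candidate not in context_words:
                noise.append(candidate)
        return noise

    return sample


# Toy corpus, purely illustrative.
counts = Counter("the cat sat on the mat the dog sat".split())
sample = make_noise_sampler(counts)
noise = sample(context_words={"the", "cat"}, k=5)
print(noise)
```

Raising counts to a power below 1 flattens the distribution, so rare words are drawn somewhat more often than their raw frequency would suggest.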

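Denote by $i_t$ and $h_k$ the indices of the word $w^{(t)}$ at time step $t$ of a text sequence and of a noise word $w_k$, respectively. The logarithmic loss with respect to the conditional probabilities in (15.2.5) is

$$\begin{aligned}
-\log P(w^{(t+j)} \mid w^{(t)})
&= -\log P(D=1\mid w^{(t)}, w^{(t+j)}) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log P(D=0\mid w^{(t)}, w_k) \\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log\left(1-\sigma\left(\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)\right) \\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log \sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right).
\end{aligned} \tag{15.2.6}$$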

The computational cost of the gradients at each training step now has nothing to do with the dictionary size and instead depends linearly on $K$. Setting the hyperparameter $K$ to a smaller value therefore makes each training step with negative sampling cheaper.
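The per-pair loss, $-\log\sigma(\mathbf{u}_o^\top \mathbf{v}_c) - \sum_{k=1}^{K} \log\sigma(-\mathbf{u}_k^\top \mathbf{v}_c)$, can be sketched in a few lines of plain Python. The vectors below are invented for illustration, and `negative_sampling_loss` is a hypothetical helper name. The loop makes the linear dependence on $K$ visible: one dot product per noise word, regardless of the vocabulary size.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_center, u_context, u_noise):
    """Loss for one positive pair plus K noise words (K = len(u_noise))."""
    loss = -log_sigmoid(dot(u_context, v_center))   # positive example
    for u_k in u_noise:                             # K negative examples
        loss -= log_sigmoid(-dot(u_k, v_center))
    return loss


# Hypothetical 3-d vectors.
v_c = [0.1, -0.2, 0.3]
u_o = [0.2, 0.1, 0.4]
noise_vectors = [[-0.3, 0.2, 0.1], [0.0, 0.5, -0.1]]

loss = negative_sampling_loss(v_c, u_o, noise_vectors)
print(f"loss with K={len(noise_vectors)}: {loss:.4f}")
```

With all-zero vectors every sigmoid equals 1/2, so the loss is exactly $(K+1)\log 2$, which is a convenient sanity check.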

What is negative sampling in NLP?

In natural language processing (NLP), negative sampling is a technique for training word embeddings, most notably in models such as word2vec. Word embeddings are dense vectors that capture semantic relationships between words in a continuous vector space. Negative sampling trains them efficiently by contrasting positive examples (word pairs that co-occur) with negative examples (randomly sampled words that do not co-occur with the center word).

The main idea is to turn training into a binary classification task. Instead of treating every other word in the vocabulary as a negative example, negative sampling draws a small number of words that do not appear in the context of the given word. For each positive example (a co-occurring word pair), several negative examples are drawn.

The training objective is to predict whether a given word pair is positive (co-occurring) or negative (randomly sampled). This makes training cheaper and more scalable, because each update involves only the positive pair and its few sampled negatives rather than the whole vocabulary.

The goal is to optimize the embeddings so that words that often co-occur in similar contexts end up close together in the embedding space, while words that rarely or never co-occur are pushed apart. This is how the embeddings come to capture semantic relationships and contextual information.
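The pulling-together and pushing-apart effect can be observed in a single gradient step on the negative sampling loss. The sketch below uses hand-derived gradients for the loss $-\log\sigma(\mathbf{u}_o^\top \mathbf{v}_c) - \sum_k \log\sigma(-\mathbf{u}_k^\top \mathbf{v}_c)$. The vectors, the learning rate, and the helper name `sgd_step` are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(v, u_pos, u_negs):
    total = -math.log(sigmoid(dot(u_pos, v)))
    for u in u_negs:
        total -= math.log(sigmoid(-dot(u, v)))
    return total


def sgd_step(v, u_pos, u_negs, lr=0.1):
    """One plain gradient-descent step on all vectors involved."""
    g_pos = sigmoid(dot(u_pos, v)) - 1.0            # d loss / d (u_pos . v)
    g_negs = [sigmoid(dot(u, v)) for u in u_negs]   # d loss / d (u_k . v)
    grad_v = [
        g_pos * u_pos[i] + sum(g * u[i] for g, u in zip(g_negs, u_negs))
        for i in range(len(v))
    ]
    new_u_pos = [ui - lr * g_pos * vi for ui, vi in zip(u_pos, v)]
    new_u_negs = [
        [ui - lr * g * vi for ui, vi in zip(u, v)]
        for g, u in zip(g_negs, u_negs)
    ]
    new_v = [vi - lr * gi for vi, gi in zip(v, grad_v)]
    return new_v, new_u_pos, new_u_negs


# Hypothetical 2-d vectors: one center word, one context word, two noise words.
v = [0.5, -0.5]
u_pos = [0.2, 0.1]
u_negs = [[0.3, -0.4], [-0.1, 0.2]]

before = loss(v, u_pos, u_negs)
v2, u_pos2, u_negs2 = sgd_step(v, u_pos, u_negs)
after = loss(v2, u_pos2, u_negs2)
print(f"loss {before:.4f} -> {after:.4f}")
print(f"positive score {dot(u_pos, v):.4f} -> {dot(u_pos2, v2):.4f}")
```

After the step, the score of the positive pair goes up and the scores of the noise pairs go down, which is the geometric effect described above.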

Negative sampling is widely used to train word embeddings because it makes training more efficient while still producing meaningful and useful vector representations of words.

15.2.2. Hierarchical Softmax

As an alternative approximate training method, hierarchical softmax uses a binary tree, the data structure illustrated in Fig. 15.2.1, where each leaf node of the tree represents a word in the dictionary $\mathcal{V}$.

Fig. 15.2.1  Hierarchical softmax for approximate training, where each leaf node of the tree represents a word in the dictionary.

Denote by $L(w)$ the number of nodes (including both ends) on the path from the root node to the leaf node representing word $w$ in the binary tree. Let $n(w,j)$ be the $j$-th node on this path, with its context word vector being $\mathbf{u}_{n(w,j)}$. For example, $L(w_3)=4$ in Fig. 15.2.1. Hierarchical softmax approximates the conditional probability in (15.1.4) as

$$P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left( [\![ n(w_o, j+1) = \text{leftChild}(n(w_o, j)) ]\!] \cdot \mathbf{u}_{n(w_o, j)}^\top \mathbf{v}_c\right), \tag{15.2.7}$$

where the function $\sigma$ is defined in (15.2.2) and $\text{leftChild}(n)$ is the left child node of node $n$. The bracket is an indicator: if $x$ is true, $[\![x]\!]=1$; otherwise $[\![x]\!]=-1$.

To illustrate, let us calculate the conditional probability of generating word $w_3$ given word $w_c$ in Fig. 15.2.1. This requires dot products between the word vector $\mathbf{v}_c$ of $w_c$ and the non-leaf node vectors on the path from the root to $w_3$ (the path in bold in Fig. 15.2.1), which is traversed left, right, then left:

$$P(w_3 \mid w_c) = \sigma\left(\mathbf{u}_{n(w_3, 1)}^\top \mathbf{v}_c\right) \cdot \sigma\left(-\mathbf{u}_{n(w_3, 2)}^\top \mathbf{v}_c\right) \cdot \sigma\left(\mathbf{u}_{n(w_3, 3)}^\top \mathbf{v}_c\right). \tag{15.2.8}$$
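The left, right, left traversal can be written as a product of three sigmoids, with a positive sign for a left turn and a negative sign for a right turn. The sketch below is a minimal illustration: the node vectors are invented, and representing a path as a list of `(node_vector, go_left)` pairs is an assumption of this example rather than anything prescribed by the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hs_probability(v_c, path):
    """Probability of reaching a leaf.

    path: (node_vector, go_left) pairs for the non-leaf nodes visited
    from the root down to the parent of the leaf.
    """
    prob = 1.0
    for u_n, go_left in path:
        sign = 1.0 if go_left else -1.0   # +1 for a left turn, -1 for a right turn
        prob *= sigmoid(sign * dot(u_n, v_c))
    return prob


# Hypothetical center vector and non-leaf node vectors on the path to w3.
v_c = [0.4, -0.1]
path_to_w3 = [
    ([0.3, 0.2], True),    # at the root: go left
    ([-0.5, 0.1], False),  # at the second node: go right
    ([0.2, 0.6], True),    # at the third node: go left
]

p_w3 = hs_probability(v_c, path_to_w3)
print(f"P(w3 | wc) = {p_w3:.4f}")
```

Only three dot products are needed here, one per non-leaf node on the path, no matter how many leaves the tree has.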

Since $\sigma(x)+\sigma(-x)=1$, the conditional probabilities of generating all the words in the dictionary $\mathcal{V}$ given any word $w_c$ sum to one:

$$\sum_{w \in \mathcal{V}} P(w \mid w_c) = 1. \tag{15.2.9}$$
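A quick numerical check of this normalization property: at every non-leaf node the left and right branch probabilities are $\sigma(x)$ and $\sigma(-x)$, so the probability mass is split without loss all the way down to the leaves. The tree shape, the node vectors, and the nested-tuple encoding below are assumptions made for this sketch and do not reproduce Fig. 15.2.1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leaf_probabilities(node, v_c, prob=1.0):
    """Map every leaf word to its hierarchical-softmax probability.

    A leaf is a string; a non-leaf node is (node_vector, left, right).
    """
    if isinstance(node, str):
        return {node: prob}
    u_n, left, right = node
    score = dot(u_n, v_c)
    result = leaf_probabilities(left, v_c, prob * sigmoid(score))
    result.update(leaf_probabilities(right, v_c, prob * sigmoid(-score)))
    return result


# Hypothetical unbalanced tree with five leaf words.
tree = (
    [0.3, 0.2],
    ([-0.5, 0.1], "a", ([0.2, 0.6], "b", "c")),
    ([0.1, -0.4], "d", "e"),
)
v_c = [0.4, -0.1]

probs = leaf_probabilities(tree, v_c)
print(probs)
print("sum =", sum(probs.values()))
```

The same check passes for any tree shape and any vectors, because the argument only relies on the identity $\sigma(x)+\sigma(-x)=1$ at each non-leaf node.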

Fortunately, because of the binary tree structure, $L(w_o)-1$ is on the order of $\mathcal{O}(\log_2|\mathcal{V}|)$. When the dictionary size $|\mathcal{V}|$ is huge, the computational cost of each training step with hierarchical softmax is therefore significantly lower than without approximate training.

What is hierarchical softmax?

Hierarchical softmax is a technique used in NLP to make training models with a large output vocabulary, such as neural language models, more efficient. Its goal is to reduce the computational cost that the standard softmax incurs when the output vocabulary is large.

With the standard softmax, computing the probability of even a single word requires a score and an exponential for every word in the vocabulary, because the normalizing sum runs over all of them. The cost therefore grows linearly with the vocabulary size, which is expensive for large vocabularies. Hierarchical softmax offers a cheaper alternative by organizing the vocabulary into a tree.

The basic idea is to represent the vocabulary as a binary tree in which each word corresponds to a leaf node. To score a word, the model follows the single path from the root to that word's leaf, which greatly reduces the number of probabilities that must be computed. Each internal node represents a binary decision: go to the left child or to the right child, each with some probability. The word's probability is the product of the decision probabilities along its path.

With hierarchical softmax, the cost of computing a word's probability drops from $\mathcal{O}(|\mathcal{V}|)$ for the standard softmax, where $|\mathcal{V}|$ is the vocabulary size, to $\mathcal{O}(\log |\mathcal{V}|)$ for a reasonably balanced tree. This makes large vocabularies far more manageable and is especially useful when training on massive datasets.
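To get a feel for the gap, the rough arithmetic below counts the dot products needed to score one word. It assumes a perfectly balanced binary tree, which is a simplification made for this sketch, and the function names are hypothetical.

```python
import math


def full_softmax_dot_products(vocab_size):
    # One score per vocabulary word for the normalizing sum.
    return vocab_size


def hierarchical_softmax_dot_products(vocab_size):
    # One score per non-leaf node on a root-to-leaf path of a
    # balanced binary tree (assumption of this sketch).
    return math.ceil(math.log2(vocab_size))


for size in (10_000, 100_000, 1_000_000):
    full = full_softmax_dot_products(size)
    hier = hierarchical_softmax_dot_products(size)
    print(f"|V|={size:>9,}  full softmax: {full:>9,}  hierarchical: {hier}")
```

For a vocabulary of one million words this is about 20 dot products instead of one million.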

Hierarchical softmax trades some model accuracy for computational efficiency. It may lose some accuracy compared with the full softmax, but it allows larger models to be trained on larger datasets more efficiently. It is available as a training option in word-embedding tools such as word2vec and fastText, which aim to learn high-quality word representations while handling large vocabularies efficiently.

http://uponthesky.tistory.com/15

Hierarchical Softmax (HS) in word2vec

15.2.3. Summary

Negative sampling constructs the loss function by considering mutually independent events that involve both positive and negative examples. The computational cost of each training step depends linearly on the number of noise words.
Hierarchical softmax constructs the loss function using the path from the root node to the leaf node in the binary tree. The computational cost of each training step depends on the logarithm of the dictionary size.

15.2.4. Exercises

How can we sample noise words in negative sampling?
Verify that (15.2.9) holds.
How can we train the continuous bag-of-words model using negative sampling and hierarchical softmax, respectively?
