D2L - 15.7. Word Similarity and Analogy

2023. 8. 30. 02:01 | Posted by 솔웅

Source: 15.7. Word Similarity and Analogy, Dive into Deep Learning 1.0.3 documentation (d2l.ai)

15.7. Word Similarity and Analogy

In Section 15.4, we trained a word2vec model on a small dataset, and applied it to find semantically similar words for an input word. In practice, word vectors that are pretrained on large corpora can be applied to downstream natural language processing tasks, which will be covered later in Section 16. To demonstrate semantics of pretrained word vectors from large corpora in a straightforward way, let’s apply them in the word similarity and analogy tasks.

import os
import torch
from torch import nn
from d2l import torch as d2l

15.7.1. Loading Pretrained Word Vectors

The following lists pretrained GloVe embeddings of dimension 50, 100, and 300, which can be downloaded from the GloVe website. Pretrained fastText embeddings are available in multiple languages. Here we consider one English version (300-dimensional “wiki.en”) that can be downloaded from the fastText website.

#@save
d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

#@save
d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

#@save
d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                  'b5116e234e9eb9076672cfeabf5469f3eec904fa')

#@save
d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                           'c1816da3821ae9f43899be655002f6c723e91b88')

This code registers several pretrained word-embedding archives in the d2l.DATA_HUB dictionary: three GloVe embeddings and the English fastText embedding 'wiki.en'.

d2l.DATA_HUB['glove.6b.50d']:
- Stores the entry for the 'glove.6b.50d' dataset.
- The download URL and the hash of the archive are stored together as a tuple.
d2l.DATA_HUB['glove.6b.100d']:
- Stores the entry for the 'glove.6b.100d' dataset.
d2l.DATA_HUB['glove.42b.300d']:
- Stores the entry for the 'glove.42b.300d' dataset.
d2l.DATA_HUB['wiki.en']:
- Stores the entry for the 'wiki.en' dataset.

Each entry maps a dataset name to its download URL and hash, so that the data can be looked up and managed by name.
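The 40-character hexadecimal strings in these entries look like SHA-1 digests, and a hash of this kind is typically used to check that a downloaded archive is intact. The following is a minimal sketch of such a check, not the d2l implementation. The names sha1_of_file, matches_registry, and toy_hub, as well as the example URL, are hypothetical and exist only for illustration.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=1048576):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

def matches_registry(path, registry, name):
    """Compare a local file with the (url, sha1) tuple stored under name."""
    _url, expected_sha1 = registry[name]
    return sha1_of_file(path) == expected_sha1

# Toy registry (hypothetical URL); the hash is computed from the demo bytes
demo_bytes = b'hello embeddings'
toy_hub = {'toy.vec': ('https://example.com/toy.vec.zip',
                       hashlib.sha1(demo_bytes).hexdigest())}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'toy.vec.zip')
    with open(demo_path, 'wb') as f:
        f.write(demo_bytes)
    demo_ok = matches_registry(demo_path, toy_hub, 'toy.vec')

print(demo_ok)
```

If the digest of a cached file does not match the registered one, the file is presumably stale or corrupted and should be downloaded again.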

To load these pretrained GloVe and fastText embeddings, we define the following TokenEmbedding class.

#@save
class TokenEmbedding:
    """Token Embedding."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = ['<unk>'], []
        data_dir = d2l.download_extract(embedding_name)
        # GloVe website: https://nlp.stanford.edu/projects/glove/
        # fastText website: https://fasttext.cc/
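        # Read as UTF-8 explicitly so that loading does not depend on the
        # platform's default encoding
        with open(os.path.join(data_dir, 'vec.txt'), 'r',
                  encoding='utf-8') as f: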
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(elem) for elem in elems[1:]]
                # Skip header information, such as the top row in fastText
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, torch.tensor(idx_to_vec)

    def __getitem__(self, tokens):
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[torch.tensor(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)

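This code defines the TokenEmbedding class, which loads and manages token embeddings.

class TokenEmbedding:
- Defines the TokenEmbedding class.
def __init__(self, embedding_name):
- The constructor. It takes embedding_name and loads the embedding data registered under that name.
- Initializes the attributes idx_to_token, idx_to_vec, unknown_idx, and token_to_idx. Index 0 is reserved for the unknown token '<unk>'.
def _load_embedding(self, embedding_name):
- An internal method that loads the embedding data.
- Downloads and extracts the archive for embedding_name, then reads vec.txt line by line.
- Stores the tokens in idx_to_token and the vectors in idx_to_vec. Lines with at most one number after the token (such as the fastText header row) are skipped, and an all-zero vector is prepended for '<unk>'.
def __getitem__(self, tokens):
- Returns the embedding vectors for a list of tokens.
- Converts the tokens to indices (tokens not in the vocabulary map to unknown_idx) and returns the corresponding rows of idx_to_vec.
def __len__(self):
- Returns the number of tokens, including '<unk>'.

In short, the class builds and manages token embeddings for the given embedding dataset name.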
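The parsing logic in _load_embedding can be tried without downloading anything. The sketch below repeats the same steps on a small in-memory file using plain Python lists instead of a tensor. The function name parse_embedding_lines and the toy file contents are made up for this illustration.

```python
import io

def parse_embedding_lines(lines):
    """Parse 'token v1 v2 ...' lines the way _load_embedding does."""
    idx_to_token, idx_to_vec = ['<unk>'], []
    for line in lines:
        elems = line.rstrip().split(' ')
        token, values = elems[0], [float(e) for e in elems[1:]]
        # A header such as '2 3' leaves only one number after the first
        # field, so it is skipped by this length check
        if len(values) > 1:
            idx_to_token.append(token)
            idx_to_vec.append(values)
    # Prepend an all-zero vector for the unknown token at index 0
    idx_to_vec = [[0.0] * len(idx_to_vec[0])] + idx_to_vec
    return idx_to_token, idx_to_vec

# Toy file: a fastText-style header row followed by two 3-dimensional vectors
toy_file = io.StringIO('2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n')
tokens, vecs = parse_embedding_lines(toy_file)
print(tokens)
print(vecs[0])
```

Note that the length check would also drop genuine one-dimensional embeddings, which is harmless for the 50- to 300-dimensional files used here.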

Below we load the 50-dimensional GloVe embeddings (pretrained on a Wikipedia subset). When the TokenEmbedding instance is created, the specified embedding file is downloaded if it has not been downloaded already.

glove_6b50d = TokenEmbedding('glove.6b.50d')

This code creates a TokenEmbedding object named glove_6b50d for the 'glove.6b.50d' embeddings.

glove_6b50d = TokenEmbedding('glove.6b.50d'):
- Creates the TokenEmbedding object glove_6b50d from the embedding dataset name 'glove.6b.50d'.
- The object loads and manages the token embeddings of that dataset.

Downloading ../data/glove.6B.50d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.50d.zip...

Output the vocabulary size. The vocabulary contains 400000 words (tokens) and a special unknown token.

len(glove_6b50d)

We can get the index of a word in the vocabulary, and vice versa.

glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]

This code looks up the index of the token 'beautiful' and the token stored at index 3367 in the 'glove.6b.50d' embeddings.

glove_6b50d.token_to_idx['beautiful']:
- Looks up the index of 'beautiful' through the token_to_idx attribute of glove_6b50d.
- This index is the row of idx_to_vec that holds the embedding vector of the token.
glove_6b50d.idx_to_token[3367]:
- Looks up the token at index 3367 through the idx_to_token attribute of glove_6b50d.
- The result shows which token the index stands for.

(3367, 'beautiful')
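The two mappings are inverses of each other, and __getitem__ falls back to index 0 for tokens that are not in the vocabulary. The following sketch shows this behavior with a three-token toy vocabulary. The helper to_indices and the toy tokens are illustrative and not part of d2l.

```python
# Toy vocabulary with the unknown token at index 0, as in TokenEmbedding
idx_to_token = ['<unk>', 'hello', 'world']
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
unknown_idx = 0

def to_indices(tokens):
    """Map tokens to indices; out-of-vocabulary tokens map to unknown_idx."""
    return [token_to_idx.get(token, unknown_idx) for token in tokens]

indices = to_indices(['world', 'zzz', 'hello'])
round_trip = [idx_to_token[i] for i in indices]
print(indices)     # [2, 0, 1]
print(round_trip)  # ['world', '<unk>', 'hello']
```

Indexing token_to_idx directly, as in the lookup above, raises a KeyError for a token that is not in the vocabulary. Only the .get call with a default value provides the fallback to the unknown token.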

15.7.2. Applying Pretrained Word Vectors

Using the loaded GloVe vectors, we will demonstrate their semantics by applying them in the following word similarity and analogy tasks.

15.7.2.1. Word Similarity

Similar to Section 15.4.3, in order to find semantically similar words for an input word based on cosine similarities between word vectors, we implement the following knn (k-nearest neighbors) function.

def knn(W, x, k):
    # Add 1e-9 for numerical stability
    cos = torch.mv(W, x.reshape(-1,)) / (
        torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *
        torch.sqrt((x * x).sum()))
    _, topk = torch.topk(cos, k=k)
    return topk, [cos[int(i)] for i in topk]

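This code defines a function that implements a k-nearest neighbors (KNN) search over embedding vectors.

def knn(W, x, k):
- Defines the knn function. It computes the similarity between a query vector and every embedding vector, and returns the indices and similarities of the k most similar ones.
- The function takes three arguments:
  - W: the matrix whose rows are the embedding vectors.
  - x: the query embedding vector.
  - k: the number of nearest neighbors to return.
cos = torch.mv(W, x.reshape(-1,)) / (...):
- Computes the cosine similarity between x and every row of W: the dot products are divided by the product of the row norms and the norm of x.
- For numerical stability, the small value 1e-9 is added under the square root of each row norm.
_, topk = torch.topk(cos, k=k):
- torch.topk returns the k largest cosine similarities and their indices. Only the indices are kept here.
return topk, [cos[int(i)] for i in topk]:
- Returns the top k indices together with the cosine similarity at each of those indices.

In short, given a query vector and an embedding matrix, the function returns the indices of the k most similar embedding vectors and their cosine similarities.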
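To make the arithmetic concrete, here is a minimal sketch of the same computation in plain Python with toy 2-dimensional vectors. The function name cosine_topk and the toy matrix are made up for illustration. Row 0 is all zeros, like the '<unk>' vector in TokenEmbedding, which shows why the 1e-9 term matters: without it that row would produce a division of zero by zero.

```python
import math

def cosine_topk(W, x, k):
    """Return indices and cosine similarities of the k rows of W closest to x."""
    x_norm = math.sqrt(sum(v * v for v in x))
    scores = []
    for row in W:
        dot = sum(a * b for a, b in zip(row, x))
        # Add 1e-9 under the square root so that an all-zero row does not
        # cause a division by zero
        row_norm = math.sqrt(sum(a * a for a in row) + 1e-9)
        scores.append(dot / (row_norm * x_norm))
    # Sort row indices by score, highest first, and keep the first k
    order = sorted(range(len(W)), key=lambda i: scores[i], reverse=True)[:k]
    return order, [scores[i] for i in order]

W = [[0.0, 0.0],   # all-zero row, like the unknown token
     [1.0, 0.0],
     [0.9, 0.1],
     [0.0, 1.0]]
top_idx, top_cos = cosine_topk(W, [1.0, 0.0], 2)
print(top_idx)
```

The row identical to the query ranks first with a similarity of almost exactly 1, and the all-zero row receives a similarity of 0 instead of an undefined value.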

Then, we search for similar words using the pretrained word vectors from the TokenEmbedding instance embed.

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
    for i, c in zip(topk[1:], cos[1:]):  # Exclude the input word
        print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')

This code defines a function that searches for the k tokens most similar to a given token and prints them.

def get_similar_tokens(query_token, k, embed):
- Defines the get_similar_tokens function.
- The function takes three arguments:
  - query_token: the input token whose similar tokens are searched for.
  - k: the number of similar tokens to print.
  - embed: the TokenEmbedding object that manages the embeddings.
topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1):
- Looks up the embedding vector of the input token in embed and calls knn to get the indices and cosine similarities of the k + 1 most similar tokens.
- k + 1 results are requested because the most similar token is normally the input token itself, which is dropped in the next step.
for i, c in zip(topk[1:], cos[1:]):
- Iterates over the indices of the similar tokens and their cosine similarities.
- topk[1:] and cos[1:] skip the first result, that is, the input token.
print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}'):
- Prints the cosine similarity and the token for each result.
- The format specifier .3f prints float(c) with three decimal places, and embed.idx_to_token converts the index back to a token.

In short, the function prints the k tokens most similar to the input token together with their cosine similarities.

The vocabulary of the pretrained word vectors in glove_6b50d contains 400000 words and a special unknown token. Excluding the input word and the unknown token, let’s find the three words in this vocabulary that are most semantically similar to the word “chip”.

get_similar_tokens('chip', 3, glove_6b50d)

This code uses the 'glove.6b.50d' embeddings to search for the three tokens most similar to 'chip' and print them.

get_similar_tokens('chip', 3, glove_6b50d):
- Calls get_similar_tokens to find and print the three tokens most similar to 'chip'.
- 'chip' is the input token and 3 is the number of similar tokens to print.
- glove_6b50d is the TokenEmbedding object for the 'glove.6b.50d' embeddings.

cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics

The following outputs words similar to “baby” and “beautiful”.

get_similar_tokens('baby', 3, glove_6b50d)

cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl

get_similar_tokens('beautiful', 3, glove_6b50d)

cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful

15.7.2.2. Word Analogy

Besides finding similar words, we can also apply word vectors to word analogy tasks. For example, “man”:“woman”::“son”:“daughter” is the form of a word analogy: “man” is to “woman” as “son” is to “daughter”. Specifically, the word analogy completion task can be defined as: for a word analogy a:b::c:d, given the first three words a, b and c, find d. Denote the vector of word w by vec(w). To complete the analogy, we will find the word whose vector is most similar to the result of vec(c)+vec(b)−vec(a).
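Written as a formula, the completion rule described above reads as follows. The symbol \hat{d} for the predicted word is notation introduced here for illustration, and the maximum is taken over the words w in the vocabulary.

```latex
\hat{d} = \operatorname*{argmax}_{w}\;
  \cos\bigl(\mathrm{vec}(w),\; \mathrm{vec}(c) + \mathrm{vec}(b) - \mathrm{vec}(a)\bigr),
\qquad
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```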

def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed[[token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[int(topk[0])]  # Token nearest to the analogy vector

This code defines a function that completes a word analogy using the embeddings.

def get_analogy(token_a, token_b, token_c, embed):
- Defines the get_analogy function. It returns the analogy result for the three given tokens token_a, token_b, and token_c.
- The function takes four arguments:
  - token_a, token_b, token_c: the input tokens of the analogy.
  - embed: the TokenEmbedding object that manages the embeddings.
vecs = embed[[token_a, token_b, token_c]]:
- Looks up the embedding vectors of the three input tokens in embed.
- The vectors of token_a, token_b, and token_c are stored in vecs in that order.
x = vecs[1] - vecs[0] + vecs[2]:
- Computes the analogy vector x from the three embedding vectors.
- This is vec(token_b) - vec(token_a) + vec(token_c).
topk, cos = knn(embed.idx_to_vec, x, 1):
- Computes the cosine similarity between x and every embedding vector and finds the index of the most similar token.
- knn returns that index together with its cosine similarity.
return embed.idx_to_token[int(topk[0])]:
- int(topk[0]) is the index of the most similar token, and embed.idx_to_token converts it back to a token.
- The code does not explicitly filter out the unknown token. The vector of '<unk>' is all zeros, so its cosine similarity with x is 0 and it is not expected to rank first.

In short, the function returns the token that completes the analogy for token_a, token_b, and token_c.
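The vector arithmetic can be checked by hand on toy data. The sketch below uses made-up 2-dimensional vectors in which the second coordinate plays the role of the "female" direction. The names cosine, toy_analogy, and toy_vectors are hypothetical. Unlike get_analogy, this variant can leave the three input words out of the candidates, which is a common convention in analogy evaluation because the nearest neighbor of the analogy vector may otherwise be one of the inputs.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small term for numerical stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u) + 1e-9)
    norm_v = math.sqrt(sum(b * b for b in v) + 1e-9)
    return dot / (norm_u * norm_v)

def toy_analogy(a, b, c, vectors, exclude_inputs=True):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors
                  if not (exclude_inputs and w in (a, b, c))]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up vectors, chosen so that the analogy holds exactly
toy_vectors = {
    'man': [1.0, 0.0],
    'woman': [1.0, 1.0],
    'son': [2.0, 0.0],
    'daughter': [2.0, 1.0],
    'chip': [-1.0, 0.5],
}
answer = toy_analogy('man', 'woman', 'son', toy_vectors)
print(answer)  # daughter
```

Here the analogy vector is [1, 1] - [1, 0] + [2, 0] = [2, 1], which coincides with the toy vector of 'daughter'.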

Let’s verify the “male-female” analogy using the loaded word vectors.

get_analogy('man', 'woman', 'son', glove_6b50d)

This code uses the 'glove.6b.50d' embeddings to complete the analogy for 'son' based on the relation between 'man' and 'woman'.

get_analogy('man', 'woman', 'son', glove_6b50d):
- Calls get_analogy to find the word that relates to 'son' in the way 'woman' relates to 'man'.
- 'man' is the first input token, 'woman' is the second, and 'son' is the third.
- glove_6b50d is the TokenEmbedding object for the 'glove.6b.50d' embeddings.

The function returns the word whose vector is most similar to vec('woman') - vec('man') + vec('son').

'daughter'

The following completes a “capital-country” analogy: “beijing”:“china”::“tokyo”:“japan”. This demonstrates semantics in the pretrained word vectors.

get_analogy('beijing', 'china', 'tokyo', glove_6b50d)

'japan'

For an “adjective-superlative adjective” analogy such as “bad”:“worst”::“big”:“biggest”, we can see that the pretrained word vectors may capture syntactic information.

get_analogy('bad', 'worst', 'big', glove_6b50d)

'biggest'

To show the captured notion of past tense in the pretrained word vectors, we can test the syntax using the “present tense-past tense” analogy: “do”:“did”::“go”:“went”.

get_analogy('do', 'did', 'go', glove_6b50d)

'went'

15.7.3. Summary

In practice, word vectors that are pretrained on large corpora can be applied to downstream natural language processing tasks.

Pretrained word vectors can be applied to the word similarity and analogy tasks.

15.7.4. Exercises

Test the fastText results using TokenEmbedding('wiki.en').
When the vocabulary is extremely large, how can we find similar words or complete a word analogy faster?
