
https://huggingface.co/learn/nlp-course/chapter1/4?fw=pt

 


 

How do Transformers work?

 

In this section, we will take a high-level look at the architecture of Transformer models.

 

이 섹션에서는 Transformer 모델의 아키텍처를 높은 수준(high-level)에서 살펴보겠습니다.

 

A bit of Transformer history

Here are some reference points in the (short) history of Transformer models:

 

다음은 Transformer 모델의 (짧은) 역사에서 짚어 볼 만한 몇 가지 주요 시점입니다.

 

 

The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:

 

Transformer 아키텍처는 2017년 6월에 도입되었습니다. 원래 연구의 초점은 번역 작업이었습니다. 그 후 다음을 포함한 여러 영향력 있는 모델이 도입되었습니다.

 

  • June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results

  • 2018년 6월: 최초의 사전 훈련된 Transformer 모델인 GPT는 다양한 NLP 작업의 미세 조정에 사용되어 최첨단 결과를 얻었습니다.

  • October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)

  • 2018년 10월: 또 다른 대형 사전 학습 모델인 BERT는 더 나은 문장 요약을 생성하도록 설계되었습니다(자세한 내용은 다음 장에서!).

  • February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns

  • 2019년 2월: 윤리적 문제로 인해 즉시 공개되지 않은 개선된(및 더 큰) 버전의 GPT인 GPT-2

  • October 2019: DistilBERT, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance

  • 2019년 10월: 60% 더 빠르고 메모리는 40% 가벼우면서도 여전히 BERT 성능의 97%를 유지하는, BERT의 증류(distilled) 버전인 DistilBERT

  • October 2019: BART and T5, two large pretrained models using the same architecture as the original Transformer model (the first to do so)

  • 2019년 10월: 원래 Transformer 모델과 동일한 아키텍처를 사용한(이렇게 한 최초의 모델들) 두 개의 대형 사전 훈련 모델인 BART와 T5

  • May 2020: GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)

  • 2020년 5월: 미세 조정 없이도 다양한 작업을 잘 수행할 수 있는(이를 제로샷 학습이라고 합니다), GPT-2보다 훨씬 더 큰 버전인 GPT-3

This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:

 

이 목록은 포괄적이지 않으며 단지 다양한 종류의 Transformer 모델 중 몇 가지를 강조하기 위한 것입니다. 크게는 세 가지 범주로 분류할 수 있습니다.

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • BART/T5-like (also called sequence-to-sequence Transformer models)

We will dive into these families in more depth later on.

 

나중에 이러한 계열에 대해 더 자세히 살펴보겠습니다.

 

 

Transformers are language models

 

All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

 

위에서 언급한 모든 Transformer 모델(GPT, BERT, BART, T5 등)은 언어 모델로 학습되었습니다. 이는 대량의 원시 텍스트에 대해 자기 지도(self-supervised) 방식으로 훈련되었다는 것을 의미합니다. 자기 지도 학습은 모델의 입력으로부터 학습 목표가 자동으로 계산되는 훈련 유형입니다. 즉, 데이터에 레이블을 붙이는 데 사람이 필요하지 않습니다!

 

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.

 

이러한 유형의 모델은 훈련된 언어에 대한 통계적 이해를 발전시키지만 특정 실제 작업에는 그다지 유용하지 않습니다. 이로 인해 일반 사전 학습 모델은 전이 학습이라는 과정을 거칩니다. 이 프로세스 동안 모델은 주어진 작업에 대해 지도 방식, 즉 사람이 주석을 추가한 레이블을 사용하여 미세 조정됩니다.

 

An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

 

작업의 예는 n개의 이전 단어를 읽고 문장의 다음 단어를 예측하는 것입니다. 출력이 과거 및 현재 입력에 따라 달라지지만 미래 입력에는 영향을 받지 않기 때문에 이를 인과 언어 모델링이라고 합니다.
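
참고로, 🤗 Transformers의 text-generation 파이프라인을 사용하면 이런 causal language model의 "다음 단어 예측" 동작을 간단히 확인해 볼 수 있습니다. 아래 코드는 원문에는 없는 예시 스케치이며, gpt2 체크포인트와 max_new_tokens 같은 파라미터 선택은 설명을 위한 가정입니다.

```python
from transformers import pipeline

# causal LM(GPT-2)이 이전 단어들만 보고 다음 단어들을 이어서 생성하는 예시
generator = pipeline("text-generation", model="gpt2")
result = generator("In this course, we will teach you how to", max_new_tokens=20)
print(result[0]["generated_text"])
```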

 

 

 

Another example is masked language modeling, in which the model predicts a masked word in the sentence.

 

또 다른 예는 모델이 문장에서 마스크된 단어를 예측하는 마스크된 언어 모델링입니다.
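
마찬가지로 fill-mask 파이프라인으로 masked language modeling의 동작을 확인해 볼 수 있습니다. 아래는 원문에 없는 간단한 예시이며, bert-base-uncased 체크포인트 사용은 설명을 위한 가정입니다(이 모델의 마스크 토큰은 [MASK]입니다).

```python
from transformers import pipeline

# masked LM(BERT)이 문장 중간의 [MASK] 토큰을 예측하는 예시
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("This course will teach you all about [MASK] models."):
    print(pred["token_str"], pred["score"])
```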

 

 

 

Transformers are big models

 

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models’ sizes as well as the amount of data they are pretrained on.

 

DistilBERT와 같은 몇 가지 예외를 제외하면, 더 나은 성능을 달성하기 위한 일반적인 전략은 모델의 크기와 사전 학습에 사용하는 데이터의 양을 늘리는 것입니다.

 

 

 

Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources. It even translates to environmental impact, as can be seen in the following graph.

 

불행하게도 모델, 특히 대규모 모델을 훈련하려면 많은 양의 데이터가 필요합니다. 이는 시간과 컴퓨팅 리소스 측면에서 매우 비용이 많이 듭니다. 다음 그래프에서 볼 수 있듯이 이는 환경에 미치는 영향으로까지 이어집니다.

 

 

https://youtu.be/ftWlj4FBHTg?si=WsaLjWZkkn_UNqcO

 

And this is showing a project for a (very big) model led by a team consciously trying to reduce the environmental impact of pretraining. The footprint of running lots of trials to get the best hyperparameters would be even higher.

 

그리고 이것은 사전 훈련이 환경에 미치는 영향을 의식적으로 줄이려고 노력하는 팀이 이끄는 (매우 큰) 모델 프로젝트의 수치입니다. 최상의 하이퍼파라미터를 찾기 위해 수많은 시도를 실행한다면 탄소 발자국(footprint)은 훨씬 더 커질 것입니다.

 

Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs!

 

연구팀, 학생 단체 또는 회사가 모델을 훈련하려고 할 때마다 매번 처음부터 훈련한다고 상상해 보십시오. 이로 인해 막대하고 불필요한 전 세계적 비용이 발생하게 됩니다!

 

This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.

 

이것이 바로 언어 모델 공유가 중요한 이유입니다. 훈련된 가중치를 공유하고 이미 훈련된 가중치 위에 구축하면 커뮤니티의 전체 컴퓨팅 비용과 탄소 배출량이 줄어듭니다.

 

By the way, you can evaluate the carbon footprint of your models’ training through several tools. For example ML CO2 Impact or Code Carbon which is integrated in 🤗 Transformers. To learn more about this, you can read this blog post which will show you how to generate an emissions.csv file with an estimate of the footprint of your training, as well as the documentation of 🤗 Transformers addressing this topic.

 

그런데 여러 도구를 통해 모델 학습의 탄소 배출량을 평가할 수 있습니다. 예를 들어 ML CO2 Impact나 🤗 Transformers에 통합되어 있는 Code Carbon이 있습니다. 이에 대해 자세히 알아보려면 훈련의 탄소 발자국 추정치를 담은 emissions.csv 파일을 생성하는 방법을 보여주는 블로그 게시물과, 이 주제를 다루는 🤗 Transformers 문서를 읽어보세요.
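
예를 들어 Code Carbon을 사용하면 아래처럼 학습 코드 주변에 트래커를 감싸서 탄소 배출량 추정치를 emissions.csv 파일로 남길 수 있습니다. 원문에 없는 예시 스케치이며, train_model() 함수는 여러분의 학습 코드를 대신하는 가상의 자리표시자입니다.

```python
from codecarbon import EmissionsTracker

def train_model():
    # 실제 학습 코드가 들어갈 자리 (가상의 자리표시자)
    pass

tracker = EmissionsTracker(output_file="emissions.csv")
tracker.start()
train_model()
emissions_kg = tracker.stop()  # 추정된 CO2 배출량(kg)을 반환하고 emissions.csv에 기록
print(f"Estimated CO2: {emissions_kg} kg")
```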

 

 

Transfer Learning

 

https://youtu.be/BqqfQnyjmgg?si=0PU8ouwdHZNxvzh0

 

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

 

사전 훈련은 모델을 처음부터 훈련하는 행위입니다. 가중치는 무작위로 초기화되고 사전 지식 없이 훈련이 시작됩니다.

 

 

This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks.

 

이 사전 훈련은 일반적으로 매우 많은 양의 데이터에 대해 수행됩니다. 따라서 매우 많은 양의 데이터가 필요하며 훈련에는 최대 몇 주가 걸릴 수 있습니다.

 

Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait — why not simply train directly for the final task? There are a couple of reasons:

 

반면, 미세 조정은 모델이 사전 훈련된 후에 수행되는 훈련입니다. 미세 조정을 수행하려면 먼저 사전 훈련된 언어 모델을 가져온 다음, 수행하려는 작업에 특화된 데이터 세트로 추가 훈련을 진행합니다. 잠깐, 그냥 처음부터 최종 작업에 대해 직접 훈련하면 안 될까요? 여기에는 몇 가지 이유가 있습니다:

 

  • The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).

  • 사전 훈련된 모델은 미세 조정 데이터 세트와 일부 유사한 데이터 세트에 대해 이미 훈련되었습니다. 따라서 미세 조정 프로세스는 사전 훈련 중에 초기 모델에서 얻은 지식을 활용할 수 있습니다(예를 들어 NLP 문제의 경우 사전 훈련된 모델은 작업에 사용하는 언어에 대해 일종의 통계적 이해를 갖습니다).

  • Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.

  • 사전 훈련된 모델은 이미 많은 데이터에 대해 훈련되었으므로 미세 조정에 적절한 결과를 얻으려면 훨씬 적은 양의 데이터가 필요합니다.
  • For the same reason, the amount of time and resources needed to get good results are much lower.

  • 같은 이유로 좋은 결과를 얻는 데 필요한 시간과 자원의 양은 훨씬 적습니다.

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term transfer learning.

 

예를 들어, 영어 데이터로 사전 훈련된 모델을 활용한 다음 arXiv 코퍼스에서 이를 미세 조정하여 과학/연구 기반 모델을 만들 수 있습니다. 미세 조정에는 제한된 양의 데이터만 필요합니다. 사전 훈련된 모델이 획득한 지식이 "전이"되기 때문에 전이 학습(transfer learning)이라는 용어가 사용됩니다.
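
아래는 사전 훈련된 체크포인트를 가져와 작은 레이블 데이터로 미세 조정하는 흐름을 보여주는 최소한의 스케치입니다. 원문에 없는 예시이며, bert-base-cased 체크포인트와 장난감 수준의 texts/labels 데이터는 설명을 위한 가정입니다.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# 설명용 장난감 데이터 (실제로는 작업에 특화된 데이터 세트를 사용)
texts = ["Transformers are great", "I dislike this movie"]
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(texts, labels))
trainer.train()  # 사전 훈련된 가중치에서 시작하므로 적은 데이터로도 학습이 진행됩니다
```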

 

 

 

Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

 

따라서 모델을 미세 조정하면 시간, 데이터, 재정, 환경 비용이 절감됩니다. 또한 훈련이 전체 사전 훈련보다 덜 제한적이므로 다양한 미세 조정 방식을 반복하는 것이 더 빠르고 쉽습니다.

 

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.

 

또한 이 프로세스는 (데이터가 아주 많은 경우가 아니라면) 처음부터 훈련하는 것보다 더 나은 결과를 얻을 수 있습니다. 그렇기 때문에 항상 현재 진행 중인 작업과 최대한 가까운 사전 훈련된 모델을 활용해 미세 조정하는 것이 좋습니다.

 

General architecture

 

In this section, we’ll go over the general architecture of the Transformer model. Don’t worry if you don’t understand some of the concepts; there are detailed sections later covering each of the components.

 

이 섹션에서는 Transformer 모델의 일반적인 아키텍처를 살펴보겠습니다. 일부 개념을 이해하지 못하더라도 걱정하지 마세요. 나중에 각 구성 요소를 다루는 자세한 섹션이 있습니다.

 

https://youtu.be/H39Z_720T5s?si=NTFJwNVw7AuGgE6D

 

 

Introduction

The model is primarily composed of two blocks:

 

이 모델은 기본적으로 두 개의 블록으로 구성됩니다.

 

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.

  • 인코더(왼쪽): 인코더는 입력을 받아 그에 대한 표현(representation, 즉 features)을 구축합니다. 이는 모델이 입력으로부터 이해를 얻도록 최적화되어 있음을 의미합니다.

  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

  • 디코더(오른쪽): 디코더는 인코더의 표현(features)을 다른 입력과 함께 사용하여 대상 시퀀스를 생성합니다. 이는 모델이 출력 생성에 최적화되어 있음을 의미합니다.

 

 

Each of these parts can be used independently, depending on the task:

 

이러한 각 부분은 작업에 따라 독립적으로 사용될 수 있습니다.

 

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.

  • 인코더 전용 모델: 문장 분류 및 명명된 엔터티 인식과 같이 입력에 대한 이해가 필요한 작업에 적합합니다.

  • Decoder-only models: Good for generative tasks such as text generation.

  • 디코더 전용 모델: 텍스트 생성과 같은 생성 작업에 적합합니다.

  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
  • 인코더-디코더 모델 또는 시퀀스-투-시퀀스(sequence-to-sequence) 모델: 번역 또는 요약과 같이 입력이 필요한 생성 작업에 적합합니다.

We will dive into those architectures independently in later sections.

 

이후 섹션에서 이러한 아키텍처에 대해 독립적으로 살펴보겠습니다.
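
세 가지 계열을 🤗 Transformers의 Auto 클래스와 대응시켜 보면 대략 아래와 같습니다. 원문에 없는 예시이며, 체크포인트 이름들은 설명을 위해 고른 것입니다.

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")    # BERT 계열 (인코더 전용)
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")      # GPT 계열 (디코더 전용)
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")      # BART/T5 계열 (인코더-디코더)
```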

 

Attention layers

A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title of the paper introducing the Transformer architecture was “Attention Is All You Need”! We will explore the details of attention layers later in the course; for now, all you need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

 

Transformer 모델의 주요 특징은 Attention 레이어라는 특수 레이어로 구축된다는 것입니다. 실제로 Transformer 아키텍처를 소개하는 논문의 제목은 "Attention Is All You Need"였습니다! 이 과정의 뒷부분에서 Attention 레이어에 대한 세부 사항을 살펴보겠습니다. 지금으로서 알아야 할 것은 이 레이어가 각 단어의 표현을 처리할 때 전달한 문장의 특정 단어에 특별한 주의를 기울이고 다른 단어는 거의 무시하도록 모델에 지시한다는 것입니다.

 

To put this into context, consider the task of translating text from English to French. Given the input “You like this course”, a translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”, because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

 

이를 맥락에 맞게 이해하기 위해 텍스트를 영어에서 프랑스어로 번역하는 작업을 생각해 보세요. "You like this course"라는 입력이 주어지면 번역 모델은 "like"라는 단어의 적절한 번역을 얻기 위해 인접한 단어 "You"에도 주의를 기울여야 합니다. 프랑스어에서는 동사 "like"가 주어에 따라 다르게 활용되기 때문입니다. 그러나 문장의 나머지 부분은 해당 단어를 번역하는 데 유용하지 않습니다. 같은 맥락에서 "this"를 번역할 때 모델은 "course"라는 단어에도 주의를 기울여야 합니다. "this"는 관련 명사가 남성인지 여성인지에 따라 다르게 번역되기 때문입니다. 다시 말하지만, 문장의 다른 단어는 "this"를 번역하는 데 중요하지 않습니다. 더 복잡한 문장(및 더 복잡한 문법 규칙)의 경우, 모델은 각 단어를 적절하게 번역하기 위해 문장에서 더 멀리 떨어져 있을 수 있는 단어에도 특별한 주의를 기울여야 합니다.
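
본문에서 든 영어→프랑스어 번역 예시는 translation 파이프라인으로 직접 실행해 볼 수 있습니다. 원문에 없는 예시이며, t5-small 체크포인트 사용은 설명을 위한 가정입니다.

```python
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("You like this course."))
# 프랑스어에서는 주어("You")에 따라 동사("like")의 활용형이 달라집니다.
```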

 

The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.

Now that you have an idea of what attention layers are all about, let’s take a closer look at the Transformer architecture.

 

동일한 개념이 자연어와 관련된 모든 작업에 적용됩니다. 단어 자체에는 의미가 있지만, 그 의미는 문맥, 즉 해당 단어 앞뒤에 오는 다른 단어(들)에 의해 크게 좌우됩니다.

이제 어텐션 레이어가 무엇인지 알았으니 Transformer 아키텍처를 자세히 살펴보겠습니다.

 

The original architecture

 

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

 

Transformer 아키텍처는 원래 번역용으로 설계되었습니다. 학습 중에 인코더는 특정 언어로 된 입력(문장)을 수신하고 디코더는 원하는 대상 언어로 동일한 문장을 수신합니다. 인코더에서 어텐션 레이어는 문장의 모든 단어를 사용할 수 있습니다(방금 본 것처럼 주어진 단어의 번역은 문장의 앞과 뒤의 내용에 따라 달라질 수 있기 때문입니다). 그러나 디코더는 순차적으로 작동하며 이미 번역된 문장의 단어에만 주의를 기울일 수 있습니다. 즉, 현재 생성되고 있는 단어 앞의 단어에만 주의를 기울일 수 있습니다. 예를 들어, 번역된 대상의 처음 세 단어를 예측한 경우 이를 디코더에 제공하고 디코더는 인코더의 모든 입력을 사용하여 네 번째 단어를 예측하려고 시도합니다.

 

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.

 

훈련 중에는 (모델이 대상 문장에 접근할 수 있으므로) 속도를 높이기 위해 디코더에 전체 대상 문장이 입력되지만, 미래의 단어는 사용할 수 없습니다(위치 2의 단어를 예측하려고 할 때 위치 2의 단어를 볼 수 있다면 문제가 너무 쉬워지기 때문입니다!). 예를 들어 네 번째 단어를 예측하려고 할 때 attention 레이어는 위치 1~3의 단어에만 접근할 수 있습니다.
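
디코더가 "미래 단어"를 보지 못하게 하는 마스크는 아래처럼 하삼각 행렬로 나타낼 수 있습니다. 원문에 없는 작은 예시이며, PyTorch 사용은 설명을 위한 선택입니다.

```python
import torch

seq_len = 4
# i번째 행은 i번째 위치가 attention할 수 있는 위치들(자기 자신과 그 이전 위치)을 나타냄
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# False(0)인 위치는 미래 단어이므로 attention이 차단됩니다.
```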

 

The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:

 

원래 Transformer 아키텍처는 다음과 같았습니다. 인코더는 왼쪽에 디코더는 오른쪽에 있습니다.

 

 

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

 

디코더 블록의 첫 번째 attention layer는 디코더에 대한 모든 (과거) 입력에 주의를 기울이지만, 두 번째 attention layer는 인코더의 출력을 사용합니다. 따라서 전체 입력 문장에 접근하여 현재 단어를 가장 잘 예측할 수 있습니다. 이는 언어마다 단어 순서를 다르게 정하는 문법 규칙이 있을 수 있고, 문장 뒷부분에 제공되는 문맥이 특정 단어의 최상의 번역을 결정하는 데 도움이 될 수 있으므로 매우 유용합니다.

 

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.

 

어텐션 마스크는 모델이 일부 특수 단어(예: 문장을 일괄 처리할 때 모든 입력을 동일한 길이로 만드는 데 사용되는 특수 패딩 단어)에 주의를 기울이는 것을 방지하기 위해 인코더/디코더에서 사용할 수도 있습니다.
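
패딩 토큰을 무시하도록 하는 attention mask는 토크나이저가 배치 처리 시 자동으로 만들어 줍니다. 아래는 원문에 없는 간단한 확인 예시이며, bert-base-cased 토크나이저 사용은 설명을 위한 가정입니다.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(
    ["Hello!", "This is a much longer sentence than the first one."],
    padding=True,
)
print(batch["input_ids"])
print(batch["attention_mask"])  # 0이 표시된 위치가 패딩 토큰으로, 모델이 무시해야 하는 부분입니다
```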

 

Architectures vs. checkpoints

As we dive into Transformer models in this course, you’ll see mentions of architectures and checkpoints as well as models. These terms all have slightly different meanings:

 

이 과정에서 Transformer 모델을 자세히 살펴보면 모델뿐만 아니라 아키텍처와 체크포인트에 대한 언급도 볼 수 있습니다. 이러한 용어는 모두 약간 다른 의미를 갖습니다.

 

  • Architecture: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.

  • 아키텍처: 이는 모델의 뼈대입니다. 즉, 각 레이어의 정의와 모델 내에서 발생하는 각 작업입니다.

  • Checkpoints: These are the weights that will be loaded in a given architecture.

  • 체크포인트: 특정 아키텍처에 로드될 가중치입니다.

  • Model: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.

  • 모델: 이는 "아키텍처"나 "체크포인트"만큼 정확하지 않은 포괄적인 용어입니다. 두 가지 모두를 의미할 수 있습니다. 이 과정에서는 모호성을 줄이는 것이 중요한 경우 아키텍처 또는 체크포인트를 지정합니다.

For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”

 

예를 들어, BERT는 아키텍처인 반면 Google 팀이 BERT의 첫 번째 릴리스를 위해 훈련한 가중치 세트인 bert-base-cased는 체크포인트입니다. 그러나 "BERT 모델"과 "bert-base-cased 모델"이라고 말할 수 있습니다.
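
아키텍처와 체크포인트의 차이는 코드로 보면 더 분명해집니다. 아래는 원문에 없는 예시 스케치입니다.

```python
from transformers import BertConfig, BertModel

# 아키텍처만: 가중치가 무작위로 초기화된 BERT 뼈대
config = BertConfig()
random_bert = BertModel(config)

# 아키텍처 + 체크포인트: Google 팀이 학습한 bert-base-cased 가중치를 로드
pretrained_bert = BertModel.from_pretrained("bert-base-cased")
```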
