Guides - Embeddings

2023. 1. 10. 08:36 | Posted by 솔웅





An API for accessing new AI models developed by OpenAI




What are embeddings?

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are most commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)




An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
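As a rough illustration of "distance measures relatedness", the sketch below computes cosine distance between toy vectors. The three-dimensional vectors and their names are made up for the example; real embeddings come from the API and have many more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (assumption: illustrative values only)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

related = cosine_distance(cat, kitten)
unrelated = cosine_distance(cat, invoice)

# A smaller distance suggests higher relatedness
print(related < unrelated)
```

Cosine distance is only one possible choice here; the FAQ at the end of this guide discusses which distance function to use.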

Visit our pricing page to learn about Embeddings pricing. Requests are billed based on the number of tokens in the input sent.




To see embeddings in action, check out our code samples

  • Classification
  • Topic clustering
  • Search
  • Recommendations


How to get embeddings

To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.




Example requests:


import openai
response = openai.Embedding.create(
    input="Your text string goes here",
    model="text-embedding-ada-002"
)
embeddings = response['data'][0]['embedding']



Example response:


{
  "data": [
    {
      "embedding": [
        ...
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}


See more Python code examples in the OpenAI Cookbook.

When using OpenAI embeddings, please keep in mind their limitations and risks.


Embedding models

OpenAI offers one second-generation embedding model (denoted with -002 in the model ID) and sixteen first-generation models (denoted with -001 in the model ID).

We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use. Read the blog post announcement.




MODEL GENERATION    TOKENIZER      MAX INPUT TOKENS    KNOWLEDGE CUTOFF
V2                  cl100k_base    8191                Sep 2021
V1                  GPT-2/GPT-3    2046                Aug 2020


Usage is priced per input token, at a rate of $0.0004 per 1,000 tokens, or roughly 3,000 pages per US dollar (assuming about 800 tokens per page).
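The pages-per-dollar figure follows from simple arithmetic on the quoted rate. The sketch below reproduces it using the rate and the rough 800-tokens-per-page assumption from the text; the function names are illustrative, and actual prices should be checked on the pricing page.

```python
PRICE_PER_1K_TOKENS = 0.0004  # US dollars per 1,000 tokens, as quoted in the text
TOKENS_PER_PAGE = 800         # the text's rough assumption

def pages_per_dollar(price_per_1k=PRICE_PER_1K_TOKENS, tokens_per_page=TOKENS_PER_PAGE):
    """How many pages of text one US dollar covers at the given rate."""
    tokens_per_dollar = 1000 / price_per_1k
    return tokens_per_dollar / tokens_per_page

def cost_in_dollars(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Estimated cost of embedding num_tokens input tokens."""
    return num_tokens / 1000 * price_per_1k

print(round(pages_per_dollar()))   # 2,500,000 tokens per dollar / 800 tokens per page
print(cost_in_dollars(1_000_000))  # estimated cost of one million input tokens
```

At this rate one dollar covers 2,500,000 tokens, which is 3,125 pages of 800 tokens each, consistent with the "about 3,000 pages" figure.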



First-generation models (not recommended)

All first-generation models (those ending in -001) use the GPT-3 tokenizer and have a max input of 2046 tokens.



First-generation embeddings are generated by five different model families tuned for three different tasks: text search, text similarity and code search. The search models come in pairs: one for short queries and one for long documents. Each family includes up to four models on a spectrum of quality and speed:




MODEL      OUTPUT DIMENSIONS
Ada        1024
Babbage    2048
Curie      4096
Davinci    12288


Davinci is the most capable, but is slower and more expensive than the other models. Ada is the least capable, but is significantly faster and cheaper.




Similarity embeddings

Similarity models are best at capturing semantic similarity between pieces of text.



USE CASES                                                   AVAILABLE MODELS
Clustering, regression, anomaly detection, visualization    text-similarity-ada-001


Text search embeddings


Text search models help measure which long documents are most relevant to a short search query. Two models are used: one for embedding the search query and one for embedding the documents to be ranked. The document embeddings closest to the query embedding should be the most relevant.





USE CASES                                           AVAILABLE MODELS
Search, context relevance, information retrieval    text-search-ada-doc-001


Code search embeddings

Similarly to search embeddings, there are two types: one for embedding natural language search queries and one for embedding code snippets to be retrieved.



USE CASES                    AVAILABLE MODELS
Code search and relevance    code-search-ada-code-001

With the -001 text embeddings (not -002, and not code embeddings), we suggest replacing newlines (\n) in your input with a single space, as we have seen worse results when newlines are present.




Use cases

Here we show some representative use cases. We will use the Amazon fine-food reviews dataset for the following examples.



Obtaining the embeddings

The dataset contains a total of 568,454 food reviews that Amazon users left up to October 2012. We will use a subset of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).



We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.



import openai

def get_embedding(text, model="text-embedding-ada-002"):
   # Replace newlines with spaces before sending the text to the API
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

# df.combined holds the combined summary and review text described above
df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)

To load the data from a saved file, you can run the following:



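import numpy as np
import pandas as pd
df = pd.read_csv('output/embedded_1k_reviews.csv')
# Embeddings are stored as strings in the CSV; parse each one back into a NumPy array
df['ada_embedding'] = df.ada_embedding.apply(eval).apply(np.array)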

Data visualization in 2D

The size of the embeddings varies with the complexity of the underlying model. In order to visualize this high dimensional data we use the t-SNE algorithm to transform the data into two dimensions.



We colour the individual reviews based on the star rating which the reviewer has given:


  • 1-star: red
  • 2-star: dark orange
  • 3-star: gold
  • 4-star: turquoise
  • 5-star: dark green

The visualization seems to have produced roughly 3 clusters, one of which has mostly negative reviews.



import pandas as pd
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import matplotlib
df = pd.read_csv('output/embedded_1k_reviews.csv')
matrix = df.ada_embedding.apply(eval).to_list()
# Create a t-SNE model and transform the data
tsne = TSNE(n_components=2, perplexity=15, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)
colors = ["red", "darkorange", "gold", "turquoise", "darkgreen"]
x = [x for x,y in vis_dims]
y = [y for x,y in vis_dims]
color_indices = df.Score.values - 1
colormap = matplotlib.colors.ListedColormap(colors)
plt.scatter(x, y, c=color_indices, cmap=colormap, alpha=0.3)
plt.title("Amazon ratings visualized in language using t-SNE")

Embedding as a text feature encoder for ML algorithms

An embedding can be used as a general free-text feature encoder within a machine learning model. Incorporating embeddings will improve the performance of any machine learning model, if some of the relevant inputs are free text. An embedding can also be used as a categorical feature encoder within a ML model. This adds most value if the names of categorical variables are meaningful and numerous, such as job titles. Similarity embeddings generally perform better than search embeddings for this task.



We observed that generally the embedding representation is very rich and information dense. For example, reducing the dimensionality of the inputs using SVD or PCA, even by 10%, generally results in worse downstream performance on specific tasks.




This code splits the data into a training set and a testing set, which will be used by the following two use cases, namely regression and classification.



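from sklearn.model_selection import train_test_split
# Features are the embedding vectors; the target is the star rating
X_train, X_test, y_train, y_test = train_test_split(
    list(df.ada_embedding.values), df.Score, test_size=0.2, random_state=42
)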


Regression using the embedding features

Embeddings present an elegant way of predicting a numerical value. In this example we predict the reviewer’s star rating, based on the text of their review. Because the semantic information contained within embeddings is high, the prediction is decent even with very few reviews.



We assume the score is a continuous variable between 1 and 5, and allow the algorithm to predict any floating point value. The ML algorithm minimizes the distance of the predicted value to the true score, and achieves a mean absolute error of 0.39, which means that on average the prediction is off by less than half a star.



from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_train, y_train)
preds = rfr.predict(X_test)
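The mean absolute error quoted above is the average of the absolute differences between predicted and true scores. A minimal sketch of that metric, using made-up ratings rather than the actual review data (scikit-learn's `mean_absolute_error` in `sklearn.metrics` computes the same quantity):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up star ratings and predictions, for illustration only
true_scores = [5, 1, 4, 3]
predicted = [4.5, 1.5, 4.0, 3.5]

mae = mean_absolute_error(true_scores, predicted)
print(mae)
```

An error below 0.5 means the prediction is, on average, off by less than half a star.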


Classification using the embedding features

This time, instead of having the algorithm predict a value anywhere between 1 and 5, we will attempt to classify the exact number of stars for a review into 5 buckets, ranging from 1 to 5 stars.



After the training, the model learns to predict 1 and 5-star reviews much better than the more nuanced reviews (2-4 stars), likely due to more extreme sentiment expression.




from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
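preds = clf.predict(X_test)
# Per-class precision, recall and F1 on the held-out reviews
report = classification_report(y_test, preds)
print(report)


Zero-shot classification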

We can use embeddings for zero shot classification without any labeled training data. For each class, we embed the class name or a short description of the class. To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity.



from openai.embeddings_utils import cosine_similarity, get_embedding
model = 'text-embedding-ada-002'
# Drop neutral 3-star reviews and map the remaining scores to two sentiment labels
df = df[df.Score != 3]
df['sentiment'] = df.Score.replace({1:'negative', 2:'negative', 4:'positive', 5:'positive'})
labels = ['negative', 'positive']
label_embeddings = [get_embedding(label, model=model) for label in labels]
def label_score(review_embedding, label_embeddings):
   return cosine_similarity(review_embedding, label_embeddings[1]) - cosine_similarity(review_embedding, label_embeddings[0])
# Embed the review first: label_score compares embeddings, not raw strings
review_embedding = get_embedding('Sample Review', model=model)
prediction = 'positive' if label_score(review_embedding, label_embeddings) > 0 else 'negative'


Obtaining user and product embeddings for cold-start recommendation

We can obtain a user embedding by averaging over all of their reviews. Similarly, we can obtain a product embedding by averaging over all the reviews about that product. In order to showcase the usefulness of this approach we use a subset of 50k reviews to cover more reviews per user and per product.



We evaluate the usefulness of these embeddings on a separate test set, where we plot similarity of the user and product embedding as a function of the rating. Interestingly, based on this approach, even before the user receives the product we can predict better than random whether they would like the product.


user_embeddings = df.groupby('UserId').ada_embedding.apply(np.mean)
prod_embeddings = df.groupby('ProductId').ada_embedding.apply(np.mean)
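To make the averaging step concrete without pandas, here is a small sketch that groups review embeddings by user and takes the element-wise mean. The user IDs and two-dimensional vectors are invented for the example; the same idea applies per product.

```python
from collections import defaultdict

def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up (user id, review embedding) pairs with tiny 2-d vectors
reviews = [
    ("user_a", [1.0, 0.0]),
    ("user_a", [0.0, 1.0]),
    ("user_b", [0.5, 0.5]),
]

# Collect every review embedding written by the same user
by_user = defaultdict(list)
for user_id, embedding in reviews:
    by_user[user_id].append(embedding)

# One averaged embedding per user
user_embeddings = {user_id: mean_vector(vs) for user_id, vs in by_user.items()}
print(user_embeddings)
```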



Clustering

Clustering is one way of making sense of a large volume of textual data. Embeddings are useful for this task, as they provide semantically meaningful vector representations of each text. Thus, in an unsupervised way, clustering will uncover hidden groupings in our dataset.



In this example, we discover four distinct clusters: one focusing on dog food, one on negative reviews, and two on positive reviews.



import numpy as np
from sklearn.cluster import KMeans
matrix = np.vstack(df.ada_embedding.values)
n_clusters = 4
kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
# Fit the model before reading the cluster labels
kmeans.fit(matrix)
df['Cluster'] = kmeans.labels_


Text search using embeddings

To retrieve the most relevant documents we use the cosine similarity between the embedding vectors of the query and each document, and return the highest scored documents.



from openai.embeddings_utils import get_embedding, cosine_similarity
def search_reviews(df, product_description, n=3, pprint=True):
   embedding = get_embedding(product_description, model='text-embedding-ada-002')
   df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res
res = search_reviews(df, 'delicious beans', n=3)


Code search using embeddings

Code search works similarly to embedding-based text search. We provide a method to extract Python functions from all the Python files in a given repository. Each function is then indexed by the text-embedding-ada-002 model.



To perform a code search, we embed the query in natural language using the same model. Then we calculate cosine similarity between the resulting query embedding and each of the function embeddings. The highest cosine similarity results are most relevant.




from openai.embeddings_utils import get_embedding, cosine_similarity
df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
def search_functions(df, code_query, n=3, pprint=True, n_lines=7):
   embedding = get_embedding(code_query, model='text-embedding-ada-002')
   df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))
   res = df.sort_values('similarities', ascending=False).head(n)
   return res
res = search_functions(df, 'Completions API tests', n=3)


Recommendations using embeddings

Because shorter distances between embedding vectors represent greater similarity, embeddings can be useful for recommendation.



Below, we illustrate a basic recommender. It takes in a list of strings and one 'source' string, computes their embeddings, and then returns a ranking of the strings, ranked from most similar to least similar. As a concrete example, the linked notebook below applies a version of this function to the AG news dataset (sampled down to 2,000 news article descriptions) to return the top 5 most similar articles to any given source article.


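from typing import List
from openai.embeddings_utils import distances_from_embeddings, indices_of_nearest_neighbors_from_distances

def recommendations_from_strings(
   strings: List[str],
   index_of_source_string: int,
   model="text-embedding-ada-002",
) -> List[int]: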
   """Return nearest neighbors of a given string."""

   # get embeddings for all strings
   embeddings = [embedding_from_string(string, model=model) for string in strings]
   # get the embedding of the source string
   query_embedding = embeddings[index_of_source_string]
   # get distances between the source embedding and other embeddings (function from embeddings_utils.py)
   distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")
   # get indices of nearest neighbors (function from embeddings_utils.py)
   indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
   return indices_of_nearest_neighbors


Limitations & risks

Our embedding models may be unreliable or pose social risks in certain cases, and may cause harm in the absence of mitigations.



Social bias

Limitation: The models encode social biases, e.g. via stereotypes or negative sentiment towards certain groups.

We found evidence of bias in our models by running the SEAT (May et al., 2019) and Winogender (Rudinger et al., 2018) benchmarks. Together, these benchmarks consist of 7 tests that measure whether models contain implicit biases when applied to gendered names, regional names, and some stereotypes.



For example, we found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with black women.




These benchmarks are limited in several ways: (a) they may not generalize to your particular use case, and (b) they only test for a very small slice of possible social bias.




These tests are preliminary, and we recommend running tests for your specific use cases. These results should be taken as evidence of the existence of the phenomenon, not a definitive characterization of it for your use case. Please see our usage policies for more details and guidance.




Please reach out to embeddings@openai.com if you have any questions; we are happy to advise on this.




English only

Limitation: Models are most reliable for mainstream English that is typically found on the Internet. Our models may perform poorly on regional or group dialects.


Researchers have found (Blodgett & O’Connor, 2017) that common NLP systems don’t perform as well on African American English as they do on mainstream American English. Our models may similarly perform poorly on dialects or uses of English that are not well represented on the Internet.




Blindness to recent events

Limitation: Models lack knowledge of events that occurred after August 2020.

Our models are trained on datasets that contain some information about real world events up until 8/2020. If you rely on the models representing recent events, then they may not perform well.



Frequently asked questions

How can I tell how many tokens a string will have before I embed it?

For second-generation embedding models, as of Dec 2022, there is not yet a way to count tokens locally. The only way to get total token counts is to submit an API request.



  • If the request succeeds, you can extract the number of tokens from the response: response["usage"]["total_tokens"]
  • If the request fails for having too many tokens, you can extract the number of tokens from the error message: e.g., This model's maximum context length is 8191 tokens, however you requested 10000 tokens (10000 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
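For the failure case, the count has to be pulled out of the error text. The sketch below does this with a regular expression, assuming the message keeps the wording of the example quoted above; the function name is illustrative, and the pattern would need updating if the API's error wording differs.

```python
import re

def requested_tokens_from_error(message):
    """Return the requested token count found in an error message, or None."""
    # Assumption: the message contains "you requested <N> tokens" as in the example
    match = re.search(r"you requested (\d+) tokens", message)
    return int(match.group(1)) if match else None

error_message = (
    "This model's maximum context length is 8191 tokens, however you "
    "requested 10000 tokens (10000 in your prompt; 0 for the completion). "
    "Please reduce your prompt; or completion length."
)

print(requested_tokens_from_error(error_message))
```

Returning None when nothing matches lets the caller fall back to another strategy instead of failing on an unexpected message.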



For first-generation embedding models, which are based on GPT-2/GPT-3 tokenization, you can count tokens in a few ways:

  • For one-off checks, the OpenAI tokenizer page is convenient
  • In Python, transformers.GPT2TokenizerFast (the GPT-2 tokenizer is the same as GPT-3)
  • In JavaScript, gpt-3-encoder

Python example:


from transformers import GPT2TokenizerFast

def num_tokens_from_string(string: str, tokenizer) -> int:
    return len(tokenizer.encode(string))

string = "your text here"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

num_tokens_from_string(string, tokenizer)



How can I retrieve K nearest embedding vectors quickly?

For searching over many vectors quickly, we recommend using a vector database.



Vector database options include:

  • Pinecone, a fully managed vector database
  • Weaviate, an open-source vector search engine
  • Faiss, a vector search algorithm by Facebook

Which distance function should I use?

We recommend cosine similarity. The choice of distance function typically doesn’t matter much.



OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in identical rankings
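Both points follow from the identity ||a - b||^2 = 2 - 2(a · b) for unit-length vectors: a larger dot product always means a smaller Euclidean distance. The sketch below checks this on made-up vectors that are normalized by hand (API embeddings would already arrive normalized).

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors, normalized to length 1 for the comparison
query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, 0.5, 0.1], [0.2, 1.0, 4.0])]

# Rank documents by descending dot product (equal to cosine similarity for unit vectors)
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
# Rank documents by ascending Euclidean distance
by_euclid = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))

print(by_dot == by_euclid)
```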



Guides - Fine tuning

2023. 1. 10. 00:21 | Posted by 솔웅





An API for accessing new AI models developed by OpenAI





Learn how to customize a model for your application.





Fine-tuning lets you get more out of the models available through the API by providing:

  1. Higher quality results than prompt design
  2. Ability to train on more examples than can fit in a prompt
  3. Token savings due to shorter prompts
  4. Lower latency requests



GPT-3 has been pre-trained on a vast amount of text from the open internet. When given a prompt with just a few examples, it can often intuit what task you are trying to perform and generate a plausible completion. This is often called "few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests.




At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Use your fine-tuned model



Visit our pricing page to learn more about how fine-tuned model training and usage are billed.





We recommend using our OpenAI command-line interface (CLI). To install this, run



pip install --upgrade openai

(The following instructions work for version 0.9.4 and up. Additionally, the OpenAI CLI requires Python 3.)

Set your OPENAI_API_KEY environment variable by adding the following line into your shell initialization script (e.g. .bashrc, .zshrc, etc.) or running it in the command line before the fine-tuning command:


export OPENAI_API_KEY="<OPENAI_API_KEY>"





Prepare training data


Training data is how you teach GPT-3 what you'd like it to say.

Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. You can use our CLI data preparation tool to easily convert your data into this file format.




{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
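If your examples already live in Python, the JSONL format can also be produced directly with the standard library. This is a minimal sketch with invented prompt-completion pairs and a hypothetical output file name; it is not a substitute for the CLI data preparation tool described below, which also validates the data.

```python
import json

# Hypothetical training examples (assumption: replace with your own data)
examples = [
    {"prompt": "Review: Great coffee, will buy again.", "completion": "positive"},
    {"prompt": "Review: Arrived stale and broken.", "completion": "negative"},
]

def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")

# To save the result, write it to a file (hypothetical file name):
# with open("train.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```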


Designing your prompts and completions for fine-tuning is different from designing your prompts for use with our base models (Davinci, Curie, Babbage, Ada). In particular, while prompts for base models often consist of multiple examples ("few-shot learning"), for fine-tuning, each training example generally consists of a single input example and its associated output, without the need to give detailed instructions or include multiple examples in the same prompt.

For more detailed guidance on how to prepare training data for various tasks, please refer to our preparing your dataset guide.

The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality.




CLI data preparation tool

We developed a tool which validates, gives suggestions and reformats your data:




openai tools fine_tunes.prepare_data -f <LOCAL_FILE>


This tool accepts different formats, with the only requirement that they contain a prompt and a completion column/key. You can pass a CSV, TSV, XLSX, JSON or JSONL file, and it will save the output into a JSONL file ready for fine-tuning, after guiding you through the process of suggested changes.




Create a fine-tuned model

The following assumes you've already prepared training data following the above instructions.

Start your fine-tuning job using the OpenAI CLI:




openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>


Where BASE_MODEL is the name of the base model you're starting from (ada, babbage, curie, or davinci). You can customize your fine-tuned model's name using the suffix parameter.




Running the above command does several things:

  1. Uploads the file using the files API (or uses an already-uploaded file)
  2. Creates a fine-tune job
  3. Streams events until the job is done (this often takes minutes, but can take hours if there are many jobs in the queue or your dataset is large)



Every fine-tuning job starts from a base model, which defaults to curie. The choice of model influences both the performance of the model and the cost of running your fine-tuned model. Your model can be one of: ada, babbage, curie, or davinci. Visit our pricing page for details on fine-tune rates.

After you've started a fine-tune job, it may take some time to complete. Your job may be queued behind other jobs on our system, and training our model can take minutes or hours depending on the model and dataset size. If the event stream is interrupted for any reason, you can resume it by running:




openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>

When the job is done, it should display the name of the fine-tuned model.

In addition to creating a fine-tune job, you can also list existing jobs, retrieve the status of a job, or cancel a job.




# List all created fine-tunes
openai api fine_tunes.list

# Retrieve the state of a fine-tune. The resulting object includes
# job status (which can be one of pending, running, succeeded, or failed)
# and other information
openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>

# Cancel a job
openai api fine_tunes.cancel -i <YOUR_FINE_TUNE_JOB_ID>


Use a fine-tuned model

When a job has succeeded, the fine_tuned_model field will be populated with the name of the model. You may now specify this model as a parameter to our Completions API, and make requests to it using the Playground.

After your job first completes, it may take several minutes for your model to become ready to handle requests. If completion requests to your model time out, it is likely because your model is still being loaded. If this happens, try again in a few minutes.

You can start making requests by passing the model name as the model parameter of a completion request:





openai api completions.create -m <FINE_TUNED_MODEL> -p <YOUR_PROMPT>


curl https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": YOUR_PROMPT, "model": FINE_TUNED_MODEL}'


import openai
openai.Completion.create(
    model=FINE_TUNED_MODEL,
    prompt=YOUR_PROMPT)


const response = await openai.createCompletion({
  model: FINE_TUNED_MODEL,
  prompt: YOUR_PROMPT,
});

You may continue to use all the other Completions parameters like temperature, frequency_penalty, presence_penalty, etc, on these requests to fine-tuned models.




Delete a fine-tuned model

To delete a fine-tuned model, you must be designated an "owner" within your organization.





openai api models.delete -i <FINE_TUNED_MODEL>


curl -X "DELETE" https://api.openai.com/v1/models/<FINE_TUNED_MODEL> \
  -H "Authorization: Bearer $OPENAI_API_KEY"


import openai
openai.Model.delete(FINE_TUNED_MODEL)


Preparing your dataset

Fine-tuning is a powerful technique to create a new model that's specific to your use case. Before fine-tuning your model, we strongly recommend reading these best practices and specific guidelines for your use case below.




Data formatting

To fine-tune a model, you'll need a set of training examples that each consist of a single input ("prompt") and its associated output ("completion"). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
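
The formatting rules above can be sketched in a few lines of Python. This is only an illustration, not part of the OpenAI tooling: `format_example`, `to_jsonl`, `SEPARATOR`, and `STOP` are hypothetical names, and the choice of ` END` as the stop sequence is an assumption.

```python
import json

SEPARATOR = "\n\n###\n\n"   # fixed separator appended to every prompt
STOP = " END"               # assumed stop sequence; must not occur in any completion


def format_example(prompt, completion):
    """Build one training record that follows the formatting rules above."""
    if SEPARATOR in prompt:
        raise ValueError("separator must not appear inside the prompt")
    if STOP.strip() in completion:
        raise ValueError("stop sequence must not appear inside the completion")
    return {
        "prompt": prompt + SEPARATOR,
        # leading space, because most words are tokenized with a preceding space
        "completion": " " + completion.strip() + STOP,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


record = format_example("Translate to French: cheese", "fromage")
print(to_jsonl([record]))
```

At inference time the same `SEPARATOR` would be appended to the prompt and `STOP` passed as the stop sequence, so that requests match the training format.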



General best practices

Fine-tuning performs better with more high-quality examples. To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to linearly increase with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.

Classifiers are the easiest models to get started with. For classification problems we suggest using ada, which generally tends to perform only very slightly worse than more capable models once fine-tuned, whilst being significantly faster and cheaper.

If you are fine-tuning on a pre-existing dataset rather than writing prompts from scratch, be sure to manually review your data for offensive or inaccurate content if possible, or review as many random samples of the dataset as possible if it is large.




Specific guidelines

Fine-tuning can solve a variety of problems, and the optimal way to use it may depend on your specific use case. Below, we've listed the most common use cases for fine-tuning and corresponding guidelines.






Classification

In classification problems, each input in the prompt should be classified into one of the predefined classes. For this type of problem, we recommend:

  • Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
  • Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
  • Aim for at least ~100 examples per class
  • To get class log probabilities you can specify logprobs=5 (for 5 classes) when using your model
  • Ensure that the dataset used for fine-tuning is very similar in structure and type of task to what the model will be used for



Case study: Is the model making untrue statements?

Let's say you'd like to ensure that the text of the ads on your website mentions the correct product and company. In other words, you want to ensure the model isn't making things up. You may want to fine-tune a classifier which filters out incorrect ads.

The dataset might look something like the following:




{"prompt":"Company: BHFF insurance\nProduct: allround insurance\nAd:One stop shop for all your insurance needs!\nSupported:", "completion":" yes"}
{"prompt":"Company: Loft conversion specialists\nProduct: -\nAd:Straight teeth in weeks!\nSupported:", "completion":" no"}


In the example above, we used a structured input containing the name of the company, the product, and the associated ad. As a separator we used \nSupported: which clearly separated the prompt from the completion. With a sufficient number of examples, the separator doesn't make much of a difference (usually less than 0.4%) as long as it doesn't appear within the prompt or the completion.

For this use case we fine-tuned an ada model since it will be faster and cheaper, and the performance will be comparable to larger models because it is a classification task.

Now we can query our model by making a Completion request.




curl https://api.openai.com/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
  "prompt": "Company: Reliable accountants Ltd\nProduct: Personal Tax help\nAd:Best advice in town!\nSupported:",
  "max_tokens": 1,
  "model": "YOUR_FINE_TUNED_MODEL_NAME"
}'


Which will return either yes or no.




Case study: Sentiment analysis

Let's say you'd like to get a degree to which a particular tweet is positive or negative. The dataset might look something like the following:




{"prompt":"Overjoyed with the new iPhone! ->", "completion":" positive"}
{"prompt":"@lakers disappoint for a third straight night https://t.co/38EFe43 ->", "completion":" negative"}


Once the model is fine-tuned, you can get back the log probabilities for the first completion token by setting logprobs=2 on the completion request. The higher the probability for positive class, the higher the relative sentiment.
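
To make the link between log probabilities and sentiment concrete, here is a small Python sketch. The two log probabilities are copied from the example response in this case study; renormalizing over the two classes is one reasonable way to turn them into a score, not something the API does for you.

```python
import math

# top_logprobs for the first completion token, as in this case study's example response
top_logprobs = {" negative": -4.9785037, " positive": -0.03597255}

# exponentiate each log probability to recover the probability
probs = {label.strip(): math.exp(lp) for label, lp in top_logprobs.items()}

# renormalize over the two classes so the scores sum to 1
total = sum(probs.values())
scores = {label: p / total for label, p in probs.items()}

print(scores)
```

The `positive` score can then be read as the degree of positive sentiment for the tweet.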


Now we can query our model by making a Completion request.



curl https://api.openai.com/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
  "prompt": "https://t.co/f93xEd2 Excited to share my latest blog post! ->",
  "max_tokens": 1,
  "model": "YOUR_FINE_TUNED_MODEL_NAME",
  "logprobs": 2
}'

Which will return:

{
  "id": "cmpl-COMPLETION_ID",
  "object": "text_completion",
  "created": 1589498378,
  "choices": [
    {
      "logprobs": {
        "text_offset": [...],
        "token_logprobs": [...],
        "tokens": [
          " positive"
        ],
        "top_logprobs": [
          {
            " negative": -4.9785037,
            " positive": -0.03597255
          }
        ]
      },
      "text": " positive",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}


Case study: Categorization for Email triage

Let's say you'd like to categorize incoming email into one of a large number of predefined categories. For classification into a large number of categories, we recommend you convert those categories into numbers, which will work well up to ~500 categories. We've observed that adding a space before the number sometimes slightly helps the performance, due to tokenization. You may want to structure your training data as follows:




{"prompt":"Subject: <email_subject>\nFrom:<customer_name>\nDate:<date>\nContent:<email_body>\n\n###\n\n", "completion":" <numerical_category>"}




For example:

{"prompt":"Subject: Update my address\nFrom:Joe Doe\nTo:support@ourcompany.com\nDate:2021-06-03\nContent:Hi,\nI would like to update my billing address to match my delivery address.\n\nPlease let me know once done.\n\nThanks,\nJoe\n\n###\n\n", "completion":" 4"}




In the example above we used an incoming email capped at 2043 tokens as input. (This allows for a 4 token separator and a one token completion, summing up to 2048.) As a separator we used \n\n###\n\n and we removed any occurrence of ### within the email.
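
A record in this format could be assembled roughly as follows. This is a sketch under assumptions: `CATEGORIES` and `email_to_record` are hypothetical names, the category list is made up, and capping the email at a token budget is left out because it needs a tokenizer.

```python
import json

SEPARATOR = "\n\n###\n\n"

# hypothetical category list; the index in this list becomes the numeric label
CATEGORIES = ["billing", "shipping", "returns", "technical", "address change"]


def email_to_record(subject, sender, date, body, category):
    """Build one training record with a numeric category as the completion."""
    # remove any occurrence of ### so the separator stays unambiguous
    clean_body = body.replace("###", "")
    prompt = (
        f"Subject: {subject}\nFrom:{sender}\nDate:{date}\n"
        f"Content:{clean_body}{SEPARATOR}"
    )
    # leading space before the number, which sometimes helps due to tokenization
    return {"prompt": prompt, "completion": f" {CATEGORIES.index(category)}"}


record = email_to_record(
    "Update my address",
    "Joe Doe",
    "2021-06-03",
    "Hi,\nI would like to update my billing address. ### Thanks,\nJoe",
    "address change",
)
print(json.dumps(record))
```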




Conditional generation

Conditional generation is a problem where the content needs to be generated given some kind of input. This includes paraphrasing, summarizing, entity extraction, product description writing given specifications, chatbots and many others. For this type of problem we recommend:

  • Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
  • Use an ending token at the end of the completion, e.g. END
  • Remember to add the ending token as a stop sequence during inference, e.g. stop=[" END"]
  • Aim for at least ~500 examples
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
  • Ensure the examples are of high quality and follow the same desired format
  • Ensure that the dataset used for fine-tuning is very similar in structure and type of task to what the model will be used for
  • Using a lower learning rate and only 1-2 epochs tends to work better for these use cases



Case study: Write an engaging ad based on a Wikipedia article

This is a generative use case so you would want to ensure that the samples you provide are of the highest quality, as the fine-tuned model will try to imitate the style (and mistakes) of the given examples. A good starting point is around 500 examples. A sample dataset might look like this:




{"prompt":"<Product Name>\n<Wikipedia description>\n\n###\n\n", "completion":" <engaging ad> END"}



For example:

{"prompt":"Samsung Galaxy Feel\nThe Samsung Galaxy Feel is an Android smartphone developed by Samsung Electronics exclusively for the Japanese market. The phone was released in June 2017 and was sold by NTT Docomo. It runs on Android 7.0 (Nougat), has a 4.7 inch display, and a 3000 mAh battery.\nSoftware\nSamsung Galaxy Feel runs on Android 7.0 (Nougat), but can be later updated to Android 8.0 (Oreo).\nHardware\nSamsung Galaxy Feel has a 4.7 inch Super AMOLED HD display, 16 MP back facing and 5 MP front facing cameras. It has a 3000 mAh battery, a 1.6 GHz Octa-Core ARM Cortex-A53 CPU, and an ARM Mali-T830 MP1 700 MHz GPU. It comes with 32GB of internal storage, expandable to 256GB via microSD. Aside from its software and hardware specifications, Samsung also introduced a unique hole in the phone's shell to accommodate the Japanese perceived penchant for personalizing their mobile phones. The Galaxy Feel's battery was also touted as a major selling point since the market favors handsets with longer battery life. The device is also waterproof and supports 1seg digital broadcasts using an antenna that is sold separately.\n\n###\n\n", "completion":" Looking for a smartphone that can do it all? Look no further than Samsung Galaxy Feel! With a slim and sleek design, our latest smartphone features high-quality picture and video capabilities, as well as an award winning battery life. END"}




Here we used a multi line separator, as Wikipedia articles contain multiple paragraphs and headings. We also used a simple end token, to ensure that the model knows when the completion should finish.




Case study: Entity extraction

This is similar to a language transformation task. To improve the performance, it is best to either sort different extracted entities alphabetically or in the same order as they appear in the original text. This will help the model to keep track of all the entities which need to be generated in order. The dataset could look as follows:




{"prompt":"<any text, for example news article>\n\n###\n\n", "completion":" <list of entities, separated by a newline> END"}



For example:

{"prompt":"Portugal will be removed from the UK's green travel list from Tuesday, amid rising coronavirus cases and concern over a \"Nepal mutation of the so-called Indian variant\". It will join the amber list, meaning holidaymakers should not visit and returnees must isolate for 10 days...\n\n###\n\n", "completion":" Portugal\nUK\nNepal mutation\nIndian variant END"}



A multi-line separator works best, as the text will likely contain multiple lines. Ideally there will be a high diversity of the types of input prompts (news articles, Wikipedia pages, tweets, legal documents), which reflect the likely texts which will be encountered when extracting entities.
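
The ordering advice for this case study can be sketched as below. `entity_record` is a hypothetical helper; it assumes every entity occurs verbatim in the text and orders the entities by first appearance (alphabetical sorting would work equally well, as long as it is applied consistently).

```python
def entity_record(text, entities):
    """Build one training record with entities in order of first appearance."""
    # str.index raises ValueError if an entity is not found verbatim in the text
    ordered = sorted(entities, key=text.index)
    return {
        "prompt": text + "\n\n###\n\n",
        # entities separated by newlines, followed by the END stop token
        "completion": " " + "\n".join(ordered) + " END",
    }


text = "Portugal will be removed from the UK's green travel list from Tuesday."
record = entity_record(text, ["UK", "Portugal"])
print(record["completion"])
```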




Case study: Customer support chatbot

A chatbot prompt will normally contain relevant context about the conversation (order details), a summary of the conversation so far, and the most recent messages. For this use case the same past conversation can generate multiple rows in the dataset, each with a slightly different context, one for every agent reply used as a completion. This use case will require a few thousand examples, as it will likely deal with different types of requests and customer issues. To ensure high-quality performance, we recommend vetting the conversation samples to ensure the quality of agent messages. The summary can be generated with a separate fine-tuned text transformation model. The dataset could look as follows:




{"prompt":"Summary: <summary of the interaction so far>\n\nSpecific information:<for example order details in natural language>\n\n###\n\nCustomer: <message1>\nAgent: <response1>\nCustomer: <message2>\nAgent:", "completion":" <response2>\n"}
{"prompt":"Summary: <summary of the interaction so far>\n\nSpecific information:<for example order details in natural language>\n\n###\n\nCustomer: <message1>\nAgent: <response1>\nCustomer: <message2>\nAgent: <response2>\nCustomer: <message3>\nAgent:", "completion":" <response3>\n"}



Here we purposefully separated different types of input information, but maintained Customer Agent dialog in the same format between a prompt and a completion. All the completions should only be by the agent, and we can use \n as a stop sequence when doing inference.
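
One way to expand a single conversation into several training rows, one per agent reply, is sketched below. `conversation_to_rows` is a hypothetical helper, and for simplicity it reuses the same summary for every row, whereas in practice the summary would be regenerated as the conversation grows.

```python
def conversation_to_rows(summary, info, turns):
    """Return one training row per agent reply.

    `turns` is a list of (speaker, message) tuples, where speaker is
    "Customer" or "Agent".
    """
    header = f"Summary: {summary}\n\nSpecific information:{info}\n\n###\n\n"
    rows = []
    history = []
    for speaker, message in turns:
        if speaker == "Agent":
            # everything said so far becomes the prompt, ending in "Agent:"
            prompt = header + "".join(history) + "Agent:"
            # the agent reply is the completion; "\n" doubles as the stop sequence
            rows.append({"prompt": prompt, "completion": f" {message}\n"})
        history.append(f"{speaker}: {message}\n")
    return rows


turns = [
    ("Customer", "Where is my order?"),
    ("Agent", "Let me check."),
    ("Customer", "Thanks"),
    ("Agent", "It ships today."),
]
rows = conversation_to_rows("Customer asks about delivery", "Order 1 item", turns)
print(len(rows))
```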




Case study: Product description based on a technical list of properties

Here it is important to convert the input data into natural language, which will likely lead to superior performance. For example, the following format:




{"prompt":"Item=handbag, Color=army_green, price=$99, size=S->", "completion":" This stylish small green handbag will add a unique touch to your look, without costing you a fortune."}



Won't work as well as:

{"prompt":"Item is a handbag. Colour is army green. Price is midrange. Size is small.->", "completion":" This stylish small green handbag will add a unique touch to your look, without costing you a fortune."}



For high performance ensure that the completions were based on the description provided. If external content is often consulted, then adding such content in an automated way would improve the performance. If the description is based on images, it may help to use an algorithm to extract a textual description of the image. Since completions are only one sentence long, we can use . as the stop sequence during inference.
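
The conversion from a technical property list to natural language could be automated along these lines. `describe` is a hypothetical helper, and the sentence template ("X is Y.") is just one plausible choice.

```python
def describe(properties):
    """Turn key/value properties into short natural-language sentences."""
    sentences = []
    for key, value in properties.items():
        # replace underscores so "army_green" reads as "army green"
        text = str(value).replace("_", " ")
        sentences.append(f"{key.capitalize()} is {text}.")
    # "->" is the prompt separator used in this case study
    return " ".join(sentences) + "->"


prompt = describe({"item": "a handbag", "colour": "army_green", "size": "small"})
print(prompt)
```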




Advanced usage

Customize your model name

You can add a suffix of up to 40 characters to your fine-tuned model name using the suffix parameter.





openai api fine_tunes.create -t test.jsonl -m ada --suffix "custom model name"

The resulting name would be:



Analyzing your fine-tuned model

We attach a result file to each job once it has been completed. This results file ID will be listed when you retrieve a fine-tune, and also when you look at the events on a fine-tune. You can download these files:






openai api fine_tunes.results -i <YOUR_FINE_TUNE_JOB_ID>


curl https://api.openai.com/v1/files/$RESULTS_FILE_ID/content \
  -H "Authorization: Bearer $OPENAI_API_KEY" > results.csv


The _results.csv file contains a row for each training step, where a step refers to one forward and backward pass on a batch of data. In addition to the step number, each row contains the following fields corresponding to that step:

  • elapsed_tokens: the number of tokens the model has seen so far (including repeats)
  • elapsed_examples: the number of examples the model has seen so far (including repeats), where one example is one element in your batch. For example, if batch_size = 4, each step will increase elapsed_examples by 4.
  • training_loss: loss on the training batch
  • training_sequence_accuracy: the percentage of completions in the training batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67
  • training_token_accuracy: the percentage of tokens in the training batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83
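
The two accuracy definitions above can be checked against the worked example with a short Python sketch. The function names are hypothetical, and the code assumes each predicted completion has the same length as its true completion.

```python
def sequence_accuracy(true, pred):
    """Fraction of completions whose tokens all match exactly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def token_accuracy(true, pred):
    """Fraction of individual tokens predicted correctly."""
    pairs = [(a, b) for t, p in zip(true, pred) for a, b in zip(t, p)]
    return sum(a == b for a, b in pairs) / len(pairs)


true = [[1, 2], [0, 5], [4, 2]]
pred = [[1, 1], [0, 5], [4, 2]]
print(round(sequence_accuracy(true, pred), 2))  # 0.67
print(round(token_accuracy(true, pred), 2))     # 0.83
```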



Classification specific metrics

We also provide the option of generating additional classification-specific metrics in the results file, such as accuracy and weighted F1 score. These metrics are periodically calculated against the full validation set and at the end of fine-tuning. You will see them as additional columns in your results file.

To enable this, set the parameter --compute_classification_metrics. Additionally, you must provide a validation file, and set either the classification_n_classes parameter, for multiclass classification, or classification_positive_class, for binary classification.





# For multiclass classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes <N_CLASSES>

# For binary classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes 2 \
  --classification_positive_class <POSITIVE_CLASS_FROM_DATASET>

The following metrics will be displayed in your results file if you set --compute_classification_metrics:

For multiclass classification

  • classification/accuracy: accuracy
  • classification/weighted_f1_score: weighted F-1 score



For binary classification

The following metrics are based on a classification threshold of 0.5 (i.e. when the probability is > 0.5, an example is classified as belonging to the positive class.)

  • classification/accuracy
  • classification/precision
  • classification/recall
  • classification/f{beta}
  • classification/auroc - AUROC
  • classification/auprc - AUPRC
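
For intuition, the threshold-based metrics above can be computed by hand as in the following sketch. `binary_metrics` is a hypothetical helper using the standard definitions of precision, recall, and F-beta; it is not the code that produces the results file, and AUROC and AUPRC are left out because they are not tied to a single threshold.

```python
def binary_metrics(probs, labels, threshold=0.5, beta=1.0):
    """Accuracy, precision, recall and F-beta at a fixed classification threshold."""
    # an example is predicted positive when its probability exceeds the threshold
    preds = [p > threshold for p in probs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
    }


# made-up probabilities of the positive class and true labels
probs = [0.9, 0.6, 0.4, 0.2]
labels = [True, False, True, False]
metrics = binary_metrics(probs, labels)
print(metrics)
```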



Note that these evaluations assume that you are using text labels for classes that tokenize down to a single token, as described above. If these conditions do not hold, the numbers you get will likely be wrong.





Validation

You can reserve some of your data for validation. A validation file has exactly the same format as a train file, and your train and validation data should be mutually exclusive.

If you include a validation file when creating your fine-tune job, the generated results file will include evaluations on how well the fine-tuned model performs against your validation data at periodic intervals during training.





openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_ID_OR_PATH> \
  -m <MODEL>


If you provided a validation file, we periodically calculate metrics on batches of validation data during training time. You will see the following additional metrics in your results file:

  • validation_loss: loss on the validation batch
  • validation_sequence_accuracy: the percentage of completions in the validation batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67
  • validation_token_accuracy: the percentage of tokens in the validation batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83
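
Once downloaded, the results file can be inspected with the standard `csv` module. The sketch below uses a small made-up CSV string rather than a real results file, and the `step` column name is an assumption; the metric column names follow the fields listed above.

```python
import csv
import io

# made-up rows for illustration only; a real file comes from fine_tunes.results
sample = """step,training_loss,validation_loss
1,0.90,0.95
2,0.60,0.70
3,0.40,0.72
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# validation metrics are only computed periodically, so skip empty cells
scored = [r for r in rows if r["validation_loss"]]

# pick the step with the lowest validation loss
best = min(scored, key=lambda r: float(r["validation_loss"]))
print(best["step"], best["validation_loss"])
```

A validation loss that starts rising while the training loss keeps falling, as in the made-up rows, is a common sign of overfitting.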




Hyperparameters

We've picked default hyperparameters that work well across a range of use cases. The only required parameter is the training file.

That said, tweaking the hyperparameters used for fine-tuning can often lead to a model that produces higher quality output. In particular, you may want to configure the following:

  • model: The name of the base model to fine-tune. You can select one of "ada", "babbage", "curie", or "davinci". To learn more about these models, see the Models documentation.
  • n_epochs - defaults to 4. The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset.
  • batch_size - defaults to ~0.2% of the number of examples in the training set, capped at 256. The batch size is the number of training examples used to train a single forward and backward pass. In general, we've found that larger batch sizes tend to work better for larger datasets.
  • learning_rate_multiplier - defaults to 0.05, 0.1, or 0.2 depending on final batch_size. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this multiplier. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. Empirically, we've found that larger learning rates often perform better with larger batch sizes.
  • compute_classification_metrics - defaults to False. If True, for fine-tuning for classification tasks, computes classification-specific metrics (accuracy, F-1 score, etc) on the validation set at the end of every epoch.
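
As a rough illustration of the batch_size default described above, the following sketch estimates it from the dataset size. `default_batch_size` is a hypothetical helper: the text only says "~0.2%, capped at 256", so the rounding and the minimum of 1 are assumptions, and the actual service may compute the value differently.

```python
def default_batch_size(n_examples):
    """Approximate default batch size: ~0.2% of the training set, capped at 256."""
    # rounding and the floor of 1 are assumptions, not documented behavior
    return min(256, max(1, round(0.002 * n_examples)))


for n in (100, 1_000, 50_000, 500_000):
    print(n, default_batch_size(n))
```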



To configure these additional hyperparameters, pass them in via command line flags on the OpenAI CLI, for example:



openai api fine_tunes.create \
  -t file-JD89ePi5KMsB3Tayeli5ovfW \
  -m ada \
  --n_epochs 1


Continue fine-tuning from a fine-tuned model


If you have already fine-tuned a model for your task and now have additional training data that you would like to incorporate, you can continue fine-tuning from the model. This creates a model that has learned from all of the training data without having to re-train from scratch.

To do this, pass in the fine-tuned model name when creating a new fine-tuning job (e.g. -m curie:ft-<org>-<date>). Other training parameters do not have to be changed. However, if your new training data is much smaller than your previous training data, you may find it useful to reduce learning_rate_multiplier by a factor of 2 to 4.




Weights & Biases

You can sync your fine-tunes with Weights & Biases to track experiments, models, and datasets.

To get started, you will need a Weights & Biases account and a paid OpenAI plan. To make sure you are using the latest version of openai and wandb, run:



pip install --upgrade openai wandb


To sync your fine-tunes with Weights & Biases, run:

openai wandb sync

You can read the Weights & Biases documentation for more information on this integration.


Example notebooks

Classification


This notebook will demonstrate how to fine-tune a model that can classify whether a piece of input text is related to Baseball or Hockey. We will perform this task in four steps in the notebook:

  1. Data exploration will give an overview of the data source and what an example looks like
  2. Data preparation will turn our data source into a jsonl file that can be used for fine-tuning
  3. Fine-tuning will kick off the fine-tuning job and explain the resulting model's performance
  4. Using the model will demonstrate making requests to the fine-tuned model to get predictions.
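
The data preparation step can be sketched in Python roughly as follows. This is not the notebook's code: the two sample texts are made up for illustration, and the prompt separator and the leading space in the completion are conventions assumed here rather than taken from the notebook.

```python
import json

# Assumed conventions: a fixed separator ends each prompt and each
# completion starts with a space. Adjust to match your own data format.
PROMPT_SEPARATOR = "\n\n###\n\n"


def to_jsonl_lines(examples):
    """Yield one JSON line per (text, label) pair in prompt/completion form."""
    for text, label in examples:
        record = {"prompt": text + PROMPT_SEPARATOR, "completion": " " + label}
        yield json.dumps(record)


if __name__ == "__main__":
    # Hypothetical examples standing in for the real data source.
    sample = [
        ("The pitcher threw a complete game last night.", "baseball"),
        ("The goalie stopped every shot in the third period.", "hockey"),
    ]
    for line in to_jsonl_lines(sample):
        print(line)
```

Writing each yielded line to a file, one per row, gives a jsonl file in the prompt/completion shape that fine-tuning expects.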



Question answering

The idea of this project is to create a question answering model, based on a few paragraphs of provided text. Base GPT-3 models do a good job of answering questions when the answer is contained within the paragraph; however, if the answer isn't contained, the base models tend to try their best to answer anyway, often leading to confabulated answers.





To create a model which answers questions only if there is sufficient context for doing so, we first create a dataset of questions and answers based on paragraphs of text. In order to train the model to answer only when the answer is present, we also add adversarial examples, where the question doesn't match the context. In those cases, we ask the model to output "No sufficient context for answering the question".
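
One simple way to build such adversarial examples is sketched below in Python, assuming each row is a dict with context, question, and answer keys. The function name and the sample rows are hypothetical, and pairing a question with a randomly chosen other context is a simplification: a random context can occasionally still answer the question, so the notebooks may select mismatched contexts more carefully.

```python
import random

NO_CONTEXT_ANSWER = "No sufficient context for answering the question"


def add_adversarial_examples(rows, seed=0):
    """Return the original rows plus one mismatched-context row per question.

    Each adversarial row keeps the question, swaps in a context taken from
    a different row, and uses the fixed "no context" answer as its target.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    result = list(rows)
    for i, row in enumerate(rows):
        other_contexts = [
            r["context"] for j, r in enumerate(rows)
            if j != i and r["context"] != row["context"]
        ]
        if not other_contexts:
            continue  # nothing to mismatch against
        result.append({
            "context": rng.choice(other_contexts),
            "question": row["question"],
            "answer": NO_CONTEXT_ANSWER,
        })
    return result


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"context": "Section A describes the swimming events.",
         "question": "Which events does section A describe?",
         "answer": "Swimming events"},
        {"context": "Section B describes the cycling events.",
         "question": "Which events does section B describe?",
         "answer": "Cycling events"},
    ]
    for example in add_adversarial_examples(rows):
        print(example)
```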




We will perform this task in three notebooks:

  1. The first notebook focuses on collecting recent data, which GPT-3 didn't see during its pre-training. We picked the topic of Olympic Games 2020 (which actually took place in the summer of 2021), and downloaded 713 unique pages. We organized the dataset by individual sections, which will serve as context for asking and answering the questions.
  2. The second notebook will utilize Davinci-instruct to ask a few questions based on a Wikipedia section, and then to answer those questions based on that section.
  3. The third notebook will utilize the dataset of context, question and answer pairs to additionally create adversarial question and context pairs, where the question was not generated from that context. In those cases the model will be prompted to answer "No sufficient context for answering the question". We will also train a discriminator model, which predicts whether the question can be answered based on the context or not.



