
Get Started - Tutorials

2023. 3. 6. 23:02 | Posted by 솔웅



With the release of the ChatGPT API on March 1, 2023, OpenAI added many pages to its docs.

In the Get Started section, Tutorials and Data usage policies pages were added.

Today we will look at the Tutorials page.

 

https://platform.openai.com/docs/tutorials

 


The Tutorials page appears designed to walk you step by step through building real AI applications with the OpenAI API.

 

So far only Website Q&A with Embeddings is available; the other two entries are marked Coming soon.

The last line introduces the Examples gallery and the OpenAI Cookbook, noting that you can find more ideas there.

 

I was working through the OpenAI Cookbook when the ChatGPT API and the speech-to-text model Whisper were released on March 1, bringing a wave of updates, so I am taking some time to look through them.

 

Once I have gone through these updates I will return to the OpenAI Cookbook, and after that I plan to analyze the examples in the Examples gallery.

 

After that, if circumstances allow, I would like to do a personal project... though that is still just a wish.

 

This update added many pages to Get Started, Guides, and API Reference. Of the pages added to Guides, I have already covered Chat completion, Speech to text, Rate limits, and Error codes.

 

https://coronasdk.tistory.com/category/Open%20AI/GUIDES

 


 

Right now I am on Get Started, where the Tutorials and Data usage policies pages were added.

 

Once I finish Get Started I will look at the API Reference and then go back to the OpenAI Cookbook.

 

Now to the main topic: let's look at the Get Started - Tutorials - Website Q&A with Embeddings page.

 

It describes itself as a page where you learn how to build an AI that can answer questions about your website.

 

https://platform.openai.com/docs/tutorials/web-qa-embeddings

 


 

How to build an AI that can answer questions about your website

This tutorial walks through a simple example of crawling a website (in this example, the OpenAI website), turning the crawled pages into embeddings using the Embeddings API, and then creating a basic search functionality that allows a user to ask questions about the embedded information. This is intended to be a starting point for more sophisticated applications that make use of custom knowledge bases.

 


 

 

Getting started

 

Some basic knowledge of Python and GitHub is helpful for this tutorial. Before diving in, make sure to set up an OpenAI API key and walk through the quickstart tutorial. This will give a good intuition on how to use the API to its full potential.

 


 

Python is used as the main programming language along with the OpenAI, Pandas, transformers, NumPy, and other popular packages. If you run into any issues working through this tutorial, please ask a question on the OpenAI Community Forum.

 


 

To start with the code, clone the full code for this tutorial on GitHub. Alternatively, follow along and copy each section into a Jupyter notebook and run the code step by step, or just read along. A good way to avoid any issues is to set up a new virtual environment and install the required packages by running the following commands:

 


 

python -m venv env

source env/bin/activate

pip install -r requirements.txt

Note: venv is the Python tool for creating a virtual environment, and source runs its activation script. If that does not work, go to the env folder and run activate directly.

requirements.txt is in the GitHub repo linked above: full code for this tutorial on GitHub

 

https://github.com/openai/openai-cookbook

 

Setting up a web crawler

The primary focus of this tutorial is the OpenAI API so if you prefer, you can skip the context on how to create a web crawler and just download the source code. Otherwise, expand the section below to work through the scraping mechanism implementation.

 


 

Learn how to build a web crawler

 

 

Acquiring data in text form is the first step to use embeddings. This tutorial creates a new set of data by crawling the OpenAI website, a technique that you can also use for your own company or personal website.
 
 

 

While this crawler is written from scratch, open source packages like Scrapy can also help with these operations.

 


 

This crawler will start from the root URL passed in at the bottom of the code below, visit each page, find additional links, and visit those pages as well (as long as they have the same root domain). To begin, import the required packages, set up the basic URL, and define a HTMLParser class.

 


 

import requests
import re
import urllib.request
from bs4 import BeautifulSoup
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse
import os

# Regex pattern to match a URL
HTTP_URL_PATTERN = r'^http[s]*://.+'

domain = "openai.com" # <- put your domain to be crawled
full_url = "https://openai.com/" # <- put your domain to be crawled with https or http

# Create a class to parse the HTML and get the hyperlinks
class HyperlinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Create a list to store the hyperlinks
        self.hyperlinks = []

    # Override the HTMLParser's handle_starttag method to get the hyperlinks
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)

        # If the tag is an anchor tag and it has an href attribute, add the href attribute to the list of hyperlinks
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])
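As a quick sanity check (my own example, not part of the tutorial), you can feed the parser a small HTML snippet and see what it collects:

# Not in the tutorial: a quick check of HyperlinkParser on a small snippet
parser = HyperlinkParser()
parser.feed('<a href="https://openai.com/blog">Blog</a> <a href="/docs">Docs</a>')
print(parser.hyperlinks)  # ['https://openai.com/blog', '/docs']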

 

The next function takes a URL as an argument, opens it, and reads the HTML content. It then returns all the hyperlinks found on that page.

 

# Function to get the hyperlinks from a URL
def get_hyperlinks(url):
    
    # Try to open the URL and read the HTML
    try:
        # Open the URL and read the HTML
        with urllib.request.urlopen(url) as response:

            # If the response is not HTML, return an empty list
            if not response.info().get('Content-Type').startswith("text/html"):
                return []
            
            # Decode the HTML
            html = response.read().decode('utf-8')
    except Exception as e:
        print(e)
        return []

    # Create the HTML Parser and then Parse the HTML to get hyperlinks
    parser = HyperlinkParser()
    parser.feed(html)

    return parser.hyperlinks

 

The goal is to crawl through and index only the content that lives under the OpenAI domain. For this purpose, a function that calls the get_hyperlinks function but filters out any URLs that are not part of the specified domain is needed.

 


 

# Function to get the hyperlinks from a URL that are within the same domain
def get_domain_hyperlinks(local_domain, url):
    clean_links = []
    for link in set(get_hyperlinks(url)):
        clean_link = None

        # If the link is a URL, check if it is within the same domain
        if re.search(HTTP_URL_PATTERN, link):
            # Parse the URL and check if the domain is the same
            url_obj = urlparse(link)
            if url_obj.netloc == local_domain:
                clean_link = link

        # If the link is not a URL, check if it is a relative link
        else:
            if link.startswith("/"):
                link = link[1:]
            elif link.startswith("#") or link.startswith("mailto:"):
                continue
            clean_link = "https://" + local_domain + "/" + link

        if clean_link is not None:
            if clean_link.endswith("/"):
                clean_link = clean_link[:-1]
            clean_links.append(clean_link)

    # Return the list of hyperlinks that are within the same domain
    return list(set(clean_links))

The crawl function is the final step in the web scraping task setup. It keeps track of the visited URLs to avoid repeating the same page, which might be linked across multiple pages on a site. It also extracts the raw text from a page without the HTML tags, and writes the text content into a local .txt file specific to the page.

 


 

def crawl(url):
    # Parse the URL and get the domain
    local_domain = urlparse(url).netloc

    # Create a queue to store the URLs to crawl
    queue = deque([url])

    # Create a set to store the URLs that have already been seen (no duplicates)
    seen = set([url])

    # Create a directory to store the text files
    if not os.path.exists("text/"):
            os.mkdir("text/")

    if not os.path.exists("text/"+local_domain+"/"):
            os.mkdir("text/" + local_domain + "/")

    # Create a directory to store the csv files
    if not os.path.exists("processed"):
            os.mkdir("processed")

    # While the queue is not empty, continue crawling
    while queue:

        # Get the next URL from the queue
        url = queue.pop()
        print(url) # for debugging and to see the progress

        # Save text from the url to a <url>.txt file
        with open('text/'+local_domain+'/'+url[8:].replace("/", "_") + ".txt", "w", encoding="UTF-8") as f:

            # Get the text from the URL using BeautifulSoup
            soup = BeautifulSoup(requests.get(url).text, "html.parser")

            # Get the text but remove the tags
            text = soup.get_text()

            # If the page requires JavaScript, the extracted text is just the fallback message, so flag it
            if ("You need to enable JavaScript to run this app." in text):
                print("Unable to parse page " + url + " due to JavaScript being required")
            
            # Otherwise, write the text to the file in the text directory
            f.write(text)

        # Get the hyperlinks from the URL and add them to the queue
        for link in get_domain_hyperlinks(local_domain, url):
            if link not in seen:
                queue.append(link)
                seen.add(link)

crawl(full_url)

The last line of the above example runs the crawler which goes through all the accessible links and turns those pages into text files. This will take a few minutes to run depending on the size and complexity of your site.

 


 

Building an embeddings index

CSV is a common format for storing embeddings. You can use this format with Python by converting the raw text files (which are in the text directory) into Pandas data frames. Pandas is a popular open source library that helps you work with tabular data (data stored in rows and columns).
 
 
Blank lines can clutter the text files and make them harder to process. A simple function can remove those lines and tidy up the files.
 
def remove_newlines(serie):
    serie = serie.str.replace('\n', ' ')
    serie = serie.str.replace('\\n', ' ')
    serie = serie.str.replace('  ', ' ')
    serie = serie.str.replace('  ', ' ')
    return serie

 

Converting the text to CSV requires looping through the text files in the text directory created earlier. After opening each file, remove the extra spacing and append the modified text to a list. Then, add the text with the new lines removed to an empty Pandas data frame and write the data frame to a CSV file.

 


 

Extra spacing and new lines can clutter the text and complicate the embeddings process. The code used here helps to remove some of them but you may find 3rd party libraries or other methods useful to get rid of more unnecessary characters.

 


 

import pandas as pd

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + domain + "/"):

    # Open the file and read the text
    with open("text/" + domain + "/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

    # Omit the first 11 characters (the 'openai.com_' prefix) and the last 4 ('.txt'), then replace -, _, and #update with spaces.
        texts.append((file[11:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)
df.to_csv('processed/scraped.csv')
df.head()

 

Tokenization is the next step after saving the raw text into a CSV file. This process splits the input text into tokens by breaking down the sentences and words. A visual demonstration of this can be seen by checking out our Tokenizer in the docs.

 


 


 

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

 

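You can check this rule of thumb yourself with tiktoken (my own quick example, using the same cl100k_base encoding as later in this tutorial):

import tiktoken

# Not in the tutorial: verify the ~4 characters per token rule of thumb
enc = tiktoken.get_encoding("cl100k_base")
sample = "A helpful rule of thumb is that one token is about four characters."
print(len(enc.encode(sample)))                 # number of tokens
print(len(sample) / len(enc.encode(sample)))   # roughly 4 characters per token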

 

The API has a limit on the maximum number of input tokens for embeddings. To stay below the limit, the text in the CSV file needs to be broken down into multiple rows. The existing length of each row will be recorded first to identify which rows need to be split.

 


 

import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()

 

The newest embeddings model can handle inputs with up to 8191 input tokens so most of the rows would not need any chunking, but this may not be the case for every subpage scraped so the next code chunk will split the longer lines into smaller chunks.

 


 

max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
    
    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater 
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of 
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    # Add any sentences left over in the final chunk
    if chunk:
        chunks.append(". ".join(chunk) + ".")

    return chunks
    

shortened = []

# Loop through the dataframe
for row in df.iterrows():

    # If the text is None, go to the next row
    if row[1]['text'] is None:
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row[1]['n_tokens'] > max_tokens:
        shortened += split_into_many(row[1]['text'])
    
    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append( row[1]['text'] )

Visualizing the updated histogram again can help to confirm if the rows were successfully split into shortened sections.

 


 

df = pd.DataFrame(shortened, columns = ['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df.n_tokens.hist()

 

The content is now broken down into smaller chunks and a simple request can be sent to the OpenAI API specifying the use of the new text-embedding-ada-002 model to create the embeddings:

 


 

import openai

df['embeddings'] = df.text.apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])

df.to_csv('processed/embeddings.csv')
df.head()

 

This should take about 3-5 minutes, after which you will have your embeddings ready to use!

 


 

Building a question answer system with your embeddings

The embeddings are ready and the final step of this process is to create a simple question and answer system. This will take a user's question, create an embedding of it, and compare it with the existing embeddings to retrieve the most relevant text from the scraped website. The text-davinci-003 model will then generate a natural sounding answer based on the retrieved text.

 


 

Turning the embeddings into a NumPy array is the first step, which will provide more flexibility in how to use it given the many functions available that operate on NumPy arrays. It will also flatten the dimension to 1-D, which is the required format for many subsequent operations.

 


 

import numpy as np
from openai.embeddings_utils import distances_from_embeddings

df=pd.read_csv('processed/embeddings.csv', index_col=0)
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

df.head()

 

The question needs to be converted to an embedding with a simple function, now that the data is ready. This is important because the search with embeddings compares the vector of numbers (which was the conversion of the raw text) using cosine distance. The vectors are likely related and might be the answer to the question if they are close in cosine distance. The OpenAI python package has a built in distances_from_embeddings function which is useful here.

 


 

def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)

 

The text was broken up into smaller sets of tokens, so looping through in ascending order and continuing to add the text is a critical step to ensure a full answer. The max_len can also be modified to something smaller, if more content than desired is returned.

 


 

The previous step only retrieved chunks of texts that are semantically related to the question, so they might contain the answer, but there's no guarantee of it. The chance of finding an answer can be further increased by returning the top 5 most likely results.

 

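A minimal way to inspect those top candidates yourself (my own sketch, reusing the distances column that create_context computes) is:

# My own sketch: look at the 5 chunks closest to the question embedding
top_5 = df.sort_values('distances', ascending=True).head(5)
print(top_5['text'].tolist())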

 

The answering prompt will then try to extract the relevant facts from the retrieved contexts, in order to formulate a coherent answer. If there is no relevant answer, the prompt will return “I don’t know”.

 


 

A realistic sounding answer to the question can be created with the completion endpoint using text-davinci-003.

 


 

def answer_question(
    df,
    model="text-davinci-003",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )
    # If debug, print the raw model response
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a completions using the question and context
        response = openai.Completion.create(
            prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

 

It is done! A working Q/A system that has the knowledge embedded from the OpenAI website is now ready. A few quick tests can be done to see the quality of the output:

 


 

answer_question(df, question="What day is it?", debug=False)

answer_question(df, question="What is our newest embeddings model?")

answer_question(df, question="What is ChatGPT?")

 

The responses will look something like the following:

 


 

"I don't know."

'The newest embeddings model is text-embedding-ada-002.'

'ChatGPT is a model trained to interact in a conversational way. It is able to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.'

 

If the system is not able to answer a question that is expected, it is worth searching through the raw text files to see if the information that is expected to be known actually ended up being embedded or not. The crawling process that was done initially was setup to skip sites outside the original domain that was provided, so it might not have that knowledge if there was a subdomain setup.

 


 

Currently, the dataframe is being passed in each time to answer a question. For more production workflows, a vector database solution should be used instead of storing the embeddings in a CSV file, but the current approach is a great option for prototyping.

 




Get Started - Models (Added GPT-4 03/14/2023)

https://platform.openai.com/docs/models

 


 

 

Models

Overview

The OpenAI API is powered by a family of models with different capabilities and price points. You can also customize our base models for your specific use case with fine-tuning.

 


 

MODELS                 DESCRIPTION

GPT-4 (Limited beta)   A set of models that improve on GPT-3.5 and can understand as well as generate natural language or code
GPT-3.5                A set of models that improve on GPT-3 and can understand as well as generate natural language or code
DALL·E (Beta)          A model that can generate and edit images given a natural language prompt
Whisper (Beta)         A model that can convert audio into text
Embeddings             A set of models that can convert text into a numerical form
Codex (Limited beta)   A set of models that can understand and generate code, including translating natural language to code
Moderation             A fine-tuned model that can detect whether text may be sensitive or unsafe
GPT-3                  A set of models that can understand and generate natural language

Visit our model index for researchers to learn more about which models have been featured in our research papers and the differences between model series like InstructGPT and GPT-3.5.

 


GPT-4  Limited beta

GPT-4 is a large multimodal model (accepting text inputs and emitting text outputs today, with image inputs coming in the future) that can solve difficult problems with greater accuracy than any of our previous models, thanks to its broader general knowledge and advanced reasoning capabilities. Like gpt-3.5-turbo, GPT-4 is optimized for chat but works well for traditional completions tasks.

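Since GPT-4 uses the same chat format as gpt-3.5-turbo, a minimal call looks something like this (my own sketch; it assumes you already have GPT-4 API access):

import openai

# My own sketch: a minimal chat-format request to gpt-4
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain embeddings in one sentence."},
    ],
)
print(response["choices"][0]["message"]["content"])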

 

GPT-4 is currently in a limited beta and only accessible to those who have been granted access. Please join the waitlist to get access when capacity is available.

 


gpt-4 (8,192 tokens, training data up to Sep 2021)
More capable than any GPT-3.5 model, able to do more complex tasks, and optimized for chat. Will be updated with our latest model iteration.

gpt-4-0314 (8,192 tokens, training data up to Sep 2021)
Snapshot of gpt-4 from March 14th 2023. Unlike gpt-4, this model will not receive updates, and will only be supported for a three month period ending on June 14th 2023.

gpt-4-32k (32,768 tokens, training data up to Sep 2021)
Same capabilities as the base gpt-4 model but with 4x the context length. Will be updated with our latest model iteration.

gpt-4-32k-0314 (32,768 tokens, training data up to Sep 2021)
Snapshot of gpt-4-32k from March 14th 2023. Unlike gpt-4-32k, this model will not receive updates, and will only be supported for a three month period ending on June 14th 2023.

For many basic tasks, the difference between GPT-4 and GPT-3.5 models is not significant. However, in more complex reasoning situations, GPT-4 is much more capable than any of our previous models.


 

GPT-3.5

GPT-3.5 models can understand and generate natural language or code. Our most capable and cost effective model in the GPT-3.5 family is gpt-3.5-turbo which has been optimized for chat but works well for traditional completions tasks as well.


 

gpt-3.5-turbo (4,096 tokens, training data up to Sep 2021)
Most capable GPT-3.5 model and optimized for chat at 1/10th the cost of text-davinci-003. Will be updated with our latest model iteration.

gpt-3.5-turbo-0301 (4,096 tokens, training data up to Sep 2021)
Snapshot of gpt-3.5-turbo from March 1st 2023. Unlike gpt-3.5-turbo, this model will not receive updates, and will only be supported for a three month period ending on June 1st 2023.

text-davinci-003 (4,097 tokens, training data up to Jun 2021)
Can do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models. Also supports inserting completions within text.

text-davinci-002 (4,097 tokens, training data up to Jun 2021)
Similar capabilities to text-davinci-003 but trained with supervised fine-tuning instead of reinforcement learning.

code-davinci-002 (8,001 tokens, training data up to Jun 2021)
Optimized for code-completion tasks.

We recommend using gpt-3.5-turbo over the other GPT-3.5 models because of its lower cost.


 

OpenAI models are non-deterministic, meaning that identical inputs can yield different outputs. Setting temperature to 0 will make the outputs mostly deterministic, but a small amount of variability may remain.

 

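A simple way to see this in practice (my own sketch, not from the docs) is to run the same completion twice with temperature set to 0:

import openai

# My own sketch: with temperature=0 the two outputs will usually, but not always, match
for _ in range(2):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Say hello in French.",
        temperature=0,
        max_tokens=10,
    )
    print(response["choices"][0]["text"].strip())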

 

Feature-specific models

The main GPT-3 models are meant to be used with the text completion endpoint. We also offer models that are specifically meant to be used with other endpoints.

Older versions of our GPT-3 models are available as davinci, curie, babbage, and ada. These are meant to be used with our fine-tuning endpoints. Learn more.

Our endpoints for creating embeddings and editing text use their own sets of specialized models.

 


 

While the new gpt-3.5-turbo model is optimized for chat, it works very well for traditional completion tasks. The original GPT-3.5 models are optimized for text completion.

 


 

Finding the right model

Experimenting with Davinci is a great way to find out what the API is capable of doing. After you have an idea of what you want to accomplish, you can stay with Davinci if you’re not concerned about cost and speed or move onto Curie or another model and try to optimize around its capabilities.

You can use this tool that lets you run different models side-by-side to compare outputs, settings, and response times and then download the data into a .xls spreadsheet.

 


 

- Updated : March 2023 -

Experimenting with gpt-3.5-turbo is a great way to find out what the API is capable of doing. After you have an idea of what you want to accomplish, you can stay with gpt-3.5-turbo or another model and try to optimize around its capabilities.

 


 

You can use the GPT comparison tool that lets you run different models side-by-side to compare outputs, settings, and response times and then download the data into an Excel spreadsheet.

 


 

 

DALL·E   Beta

DALL·E is an AI system that can create realistic images and art from a description in natural language. We currently support the ability, given a prompt, to create a new image with a certain size, edit an existing image, or create variations of a user provided image.


 

The current DALL·E model available through our API is the 2nd iteration of DALL·E with more realistic, accurate, and 4x greater resolution images than the original model. You can try it through our Labs interface or via the API.

 


 

 

Whisper    Beta

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification. The Whisper v2-large model is currently available through our API with the whisper-1 model name.

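Calling it through the Python library looks something like this (my own sketch; the file name is just a placeholder):

import openai

# My own sketch: transcribe a local audio file with the whisper-1 model
with open("sample_audio.mp3", "rb") as audio_file:  # placeholder file name
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])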

 

Currently, there is no difference between the open source version of Whisper and the version available through our API. However, through our API, we offer an optimized inference process which makes running Whisper through our API much faster than doing it through other means. For more technical details on Whisper, you can read the paper.

 


 

 

Embeddings

Embeddings are a numerical representation of text that can be used to measure the relatedness between two pieces of text. Our second generation embedding model, text-embedding-ada-002, is designed to replace the previous 16 first-generation embedding models at a fraction of the cost. Embeddings are useful for search, clustering, recommendations, anomaly detection, and classification tasks. You can read more about our latest embedding model in the announcement blog post.

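As an illustration of "measuring relatedness" (my own sketch, reusing the same Embedding.create call as the tutorial above), you can compare two texts with cosine similarity:

import numpy as np
import openai

# My own sketch: embed two texts and compare them
def get_embedding(text):
    return openai.Embedding.create(input=text, engine='text-embedding-ada-002')['data'][0]['embedding']

a = np.array(get_embedding("The cat sat on the mat."))
b = np.array(get_embedding("A kitten is resting on a rug."))

# Cosine similarity: values closer to 1 mean the texts are more related
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))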

 

 

Codex 

Limited beta

The Codex models are descendants of our GPT-3 models that can understand and generate code. Their training data contains both natural language and billions of lines of public code from GitHub. Learn more.

They’re most capable in Python and proficient in over a dozen languages including JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, SQL, and even Shell.

We currently offer two Codex models:

 


 

code-davinci-002 (8,000 tokens, training data up to Jun 2021)
Most capable Codex model. Particularly good at translating natural language to code. In addition to completing code, also supports inserting completions within code.

code-cushman-001 (up to 2,048 tokens)
Almost as capable as Davinci Codex, but slightly faster. This speed advantage may make it preferable for real-time applications.
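As a quick illustration (my own sketch; Codex was in limited beta, so it assumes you have access), a natural-language-to-code request to code-davinci-002 might look like:

import openai

# My own sketch: ask code-davinci-002 to write code from a natural language comment
response = openai.Completion.create(
    model="code-davinci-002",
    prompt='"""\nWrite a Python function that reverses a string.\n"""\n',
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"])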

For more, visit our guide to working with Codex.

The Codex models are free to use during the limited beta, and are subject to reduced rate limits. As we learn about use, we'll look to offer pricing to enable a broad set of applications.

During this period, you're welcome to go live with your application as long as it follows our usage policies. We welcome any feedback on these models while in early use and look forward to engaging with the community.

 


 

Feature-specific models

The main Codex models are meant to be used with the text completion endpoint. We also offer models that are specifically meant to be used with our endpoints for creating embeddings and editing code.

 


 

Content filter

(With the March 2023 update, this Content filter section was removed from Get Started; Moderation was added in its place.)

 
We recommend using our new moderation endpoint instead of the content filter model.

The filter aims to detect generated text that could be sensitive or unsafe coming from the API. It's currently in beta mode and has three ways of classifying text- as safe, sensitive, or unsafe. The filter will make mistakes and we have currently built it to err on the side of caution, thus, resulting in higher false positives.

 


 

Label Descriptions

  • 0 - The text is safe.
  • 1 - This text is sensitive. This means that the text could be talking about a sensitive topic, something political, religious, or talking about a protected class such as race or nationality.
  • 2 - This text is unsafe. This means that the text contains profane language, prejudiced or hateful language, something that could be NSFW, or text that portrays certain groups/people in a harmful manner.


 

How do you use the filter?

You need to use the completions endpoint with the content-filter-alpha model and the following settings:

  1. max_tokens set to 1
  2. temperature set to 0.0
  3. top_p set to 0
  4. logprobs set to 10
  5. Wrap your prompt in the following way:


"<|endoftext|>[prompt]\n--\nLabel:"

In other words, if you were doing a curl command, it would look something like (replace the variables in []):


curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
  "model": "content-filter-alpha",
  "prompt": "<|endoftext|>[prompt]\n--\nLabel:",
  "max_tokens": 1,
  "user": "1"
}'

Or via our openai python client:


content_to_classify = "Your content here"

response = openai.Completion.create(
      model="content-filter-alpha",
      prompt = "<|endoftext|>"+content_to_classify+"\n--\nLabel:",
      temperature=0,
      max_tokens=1,
      top_p=0,
      logprobs=10
    )

Importantly, you need to check not only the label that was returned by the filter (0, 1, or 2), but sometimes also the logprobs associated with these.

If the filter returns 0 or 1, you should accept that as the filter's outcome. If the filter returns 2, you should accept this outcome only if its logprob is greater than -0.355.

If the logprob for 2 is beneath -0.355 (for example, -0.4), then you should use as output whichever of 0 or 1 has a logprob closer to 0.

Here is Python code that accomplishes this, working from response returned above:

 


output_label = response["choices"][0]["text"]

# This is the probability at which we evaluate that a "2" is likely real
# vs. should be discarded as a false positive
toxic_threshold = -0.355

if output_label == "2":
    # If the model returns "2", return its confidence in 2 or other output-labels
    logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]

    # If the model is not sufficiently confident in "2",
    # choose the most probable of "0" or "1"
    # Guaranteed to have a confidence for 2 since this was the selected token.
    if logprobs["2"] < toxic_threshold:
        logprob_0 = logprobs.get("0", None)
        logprob_1 = logprobs.get("1", None)

        # If both "0" and "1" have probabilities, set the output label
        # to whichever is most probable
        if logprob_0 is not None and logprob_1 is not None:
            if logprob_0 >= logprob_1:
                output_label = "0"
            else:
                output_label = "1"
        # If only one of them is found, set output label to that one
        elif logprob_0 is not None:
            output_label = "0"
        elif logprob_1 is not None:
            output_label = "1"

        # If neither "0" or "1" are available, stick with "2"
        # by leaving output_label unchanged.

# if the most probable token is none of "0", "1", or "2"
# this should be set as unsafe
if output_label not in ["0", "1", "2"]:
    output_label = "2"

print(output_label)

We generally recommend not returning to end-users any completions that the Content Filter has flagged with an output of 2. One approach here is to re-generate, from the initial prompt which led to the 2-completion, and hope that the next output will be safer. Another approach is to alert the end-user that you are unable to return this completion, and to steer them toward suggesting a different input.

 


 

Is there a cost associated with usage of the content filter?

No. The content filter is free to use.

 


 

How can I adjust the threshold for certainty?

You can adjust the threshold for the filter by only allowing filtration on the labels that have a certainty level (logprob) above a threshold that you can determine. This is not generally recommended, however.

If you would like an even more conservative implementation of the Content Filter, you may return as 2 anything with an output_label of "2" above, rather than accepting it only with certain logprob values.

 


 

How can you personalize the filter?

For now, we aren't supporting finetuning for individual projects. However, we're still looking for data to improve the filter and would be very appreciative if you sent us data that triggered the filter in an unexpected way.

 


 

What are some prompts I should expect lower performance on?

The filter currently has a harder time parsing prompts with strange formatting. Thus, if a prompt has a lot of linebreaks, unusual format, repeated words etc. then the model might misclassify it with a higher frequency. It also has lower performance on certain kinds of texts such as fiction, poetry, code etc.

Additionally, similar to the API, the filter will not have a knowledge base of things post 2019. Thus, it will perform poorly in identifying political, sensitive and harmful content that may require relevant knowledge of things that happened post 2019.

 


 

Moderation

The Moderation models are designed to check whether content complies with OpenAI's usage policies. The models provide classification capabilities that look for content in the following categories: hate, hate/threatening, self-harm, sexual, sexual/minors, violence, and violence/graphic. You can find out more in our moderation guide.

 

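Calling the moderation endpoint from the Python library looks something like this (my own sketch):

import openai

# My own sketch: check a piece of text against the moderation endpoint
result = openai.Moderation.create(input="Some text to check")

# Each result includes an overall flag plus per-category booleans and scores
print(result["results"][0]["flagged"])
print(result["results"][0]["categories"])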

 

Moderation models take in an arbitrary sized input that is automatically broken up to fit the model's specific context window.

 


 

MODEL                                     DESCRIPTION

text-moderation-latest    Most capable moderation model. Accuracy will be slightly higher than the stable model.
text-moderation-stable    Almost as capable as the latest model, but slightly older.
 

GPT-3

- Content as of the March 2023 update -

 

GPT-3 models can understand and generate natural language. These models were superseded by the more powerful GPT-3.5 generation models. However, the original GPT-3 base models (davinci, curie, ada, and babbage) are currently the only models that are available to fine-tune.

 


 

text-curie-001 (2,049 tokens, training data up to Oct 2019)
Very capable, faster and lower cost than Davinci.

text-babbage-001 (2,049 tokens, training data up to Oct 2019)
Capable of straightforward tasks, very fast, and lower cost.

text-ada-001 (2,049 tokens, training data up to Oct 2019)
Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.

davinci (2,049 tokens, training data up to Oct 2019)
Most capable GPT-3 model. Can do any task the other models can do, often with higher quality.

curie (2,049 tokens, training data up to Oct 2019)
Very capable, but faster and lower cost than Davinci.

babbage (2,049 tokens, training data up to Oct 2019)
Capable of straightforward tasks, very fast, and lower cost.

ada (2,049 tokens, training data up to Oct 2019)
Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.

 

- Content before the March 2023 update -

Our GPT-3 models can understand and generate natural language. We offer four main models with different levels of power suitable for different tasks. Davinci is the most capable model, and Ada is the fastest.

 


 

text-davinci-003 (4,000 tokens, training data up to Jun 2021)
Most capable GPT-3 model. Can do any task the other models can do, often with higher quality, longer output and better instruction-following. Also supports inserting completions within text.

text-curie-001 (2,048 tokens, training data up to Oct 2019)
Very capable, but faster and lower cost than Davinci.

text-babbage-001 (2,048 tokens, training data up to Oct 2019)
Capable of straightforward tasks, very fast, and lower cost.

text-ada-001 (2,048 tokens, training data up to Oct 2019)
Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.

While Davinci is generally the most capable, the other models can perform certain tasks extremely well with significant speed or cost advantages. For example, Curie can perform many of the same tasks as Davinci, but faster and for 1/10th the cost.

We recommend using Davinci while experimenting since it will yield the best results. Once you’ve got things working, we encourage trying the other models to see if you can get the same results with lower latency. You may also be able to improve the other models’ performance by fine-tuning them on a specific task.

 


 

Davinci

Davinci is the most capable model family and can perform any task the other models can perform and often with less instruction. For applications requiring a lot of understanding of the content, like summarization for a specific audience and creative content generation, Davinci is going to produce the best results. These increased capabilities require more compute resources, so Davinci costs more per API call and is not as fast as the other models.

Another area where Davinci shines is in understanding the intent of text. Davinci is quite good at solving many kinds of logic problems and explaining the motives of characters. Davinci has been able to solve some of the most challenging AI problems involving cause and effect.

Good at: Complex intent, cause and effect, summarization for audience

 


 

Curie

Curie is extremely powerful, yet very fast. While Davinci is stronger when it comes to analyzing complicated text, Curie is quite capable for many nuanced tasks like sentiment classification and summarization. Curie is also quite good at answering questions and performing Q&A and as a general service chatbot.

Good at: Language translation, complex classification, text sentiment, summarization

 


 

Babbage

Babbage can perform straightforward tasks like simple classification. It’s also quite capable when it comes to Semantic Search ranking how well documents match up with search queries.

Good at: Moderate classification, semantic search classification

 


 

Ada

Ada is usually the fastest model and can perform tasks like parsing text, address correction and certain kinds of classification tasks that don’t require too much nuance. Ada’s performance can often be improved by providing more context.

Good at: Parsing text, simple classification, address correction, keywords

Note: Any task performed by a faster model like Ada can be performed by a more powerful model like Curie or Davinci.

 


 

Consider semantic search

For tasks involving classification, where you’re trying to find what label best fits a selection of text, you can often get great performance out of the different models with Semantic Search. Semantic Search uses a model to score different blocks of text on how closely they relate to a query. By focusing the scope of the model on evaluating the relationship of the query to the different text blocks, in many cases faster models can outperform their usual capabilities compared to when the same task is presented to them as a generative one.

 

선택한 텍스트에 가장 적합한 레이블을 찾으려는 분류 관련 작업의 경우 시맨틱 검색을 통해 다양한 모델에서 뛰어난 성능을 얻을 수 있습니다. 시맨틱 검색은 모델을 사용하여 다양한 텍스트 블록이 쿼리와 얼마나 밀접하게 관련되어 있는지 점수를 매깁니다. 모델의 범위를 쿼리와 텍스트 블록 간의 관계 평가에 집중시키면, 많은 경우 더 빠른 모델이 같은 작업을 생성 작업으로 제시받았을 때보다 더 나은 성능을 낼 수 있습니다.
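
As a rough sketch of this idea (not code from the OpenAI docs), the snippet below scores a few candidate labels against a query using the Embeddings API and cosine similarity. The label set and query are made-up examples, and text-embedding-ada-002 is taken from the endpoint table later in this post.

import numpy as np
import openai  # assumes openai.api_key is already configured

def embed(text):
    # Returns one embedding vector for the input string
    return openai.Embedding.create(model="text-embedding-ada-002", input=text)["data"][0]["embedding"]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

labels = ["positive", "negative", "neutral"]  # hypothetical label set
query = "The delivery was late and the box was crushed."

query_vec = embed(query)
scores = {label: cosine(query_vec, embed(label)) for label in labels}
print(max(scores, key=scores.get))  # the highest-scoring label wins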

 

Understanding Curie’s capabilities

While Davinci is more capable when it comes to comprehending text and generating responses that are more nuanced like summarizing for a child or emulating human speaking patterns, Curie is highly capable of analyzing text, answering direct questions, and providing key points.

 

Davinci는 텍스트를 이해하고 어린이를 위해 요약하거나 인간의 말하기 패턴을 모방하는 것과 같이 더 미묘한 응답을 생성하는 데 더 유능하지만 Curie는 텍스트를 분석하고 직접적인 질문에 답하고 핵심 사항을 제공하는 능력이 뛰어납니다.

 

Key points

As an example, Curie can use the Wikipedia entry for Pluto to list out key points from the text. View example.

This example illustrates that Curie is highly capable of getting important information from text and is very useful for a variety of applications, including:

  • Turning technical documents into bullet points
  • Extracting important information from email
  • Getting key points from customer service communication

예를 들어 Curie는 Pluto에 대한 Wikipedia 항목을 사용하여 텍스트의 핵심 사항을 나열할 수 있습니다. 예제 보기.

이 예는 Curie가 텍스트에서 중요한 정보를 얻을 수 있고 다음을 포함한 다양한 응용 프로그램에 매우 유용하다는 것을 보여줍니다.

  • 기술 문서를 중요 항목으로 전환
  • 이메일에서 중요한 정보 추출
  • 고객 서비스 커뮤니케이션의 요점 파악
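
A minimal sketch of this pattern, assuming the text-curie-001 completion model and a placeholder article variable standing in for the actual source text:

import openai  # assumes openai.api_key is already configured

article = "..."  # paste the source text here, e.g. the Wikipedia entry for Pluto

response = openai.Completion.create(
    model="text-curie-001",
    prompt=article + "\n\nKey points:\n1.",  # priming with "1." starts a numbered list
    temperature=0.3,
    max_tokens=150,
)
print("1." + response["choices"][0]["text"])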

 

Report generation

You can extend key point extraction even further by using Curie (or Davinci) to analyze text and answer specific questions. In this example, we’ll use Curie to read an email from a customer and provide answers to a preset list of questions. View example.

It’s worth calling attention to two things that are going on in this prompt that can help improve overall prompt design:

 

Curie(또는 Davinci)를 사용하여 텍스트를 분석하고 특정 질문에 답함으로써 핵심 포인트 추출을 더욱 확장할 수 있습니다. 이 예에서는 Curie를 사용하여 고객의 이메일을 읽고 미리 설정된 질문 목록에 대한 답변을 제공합니다. 예제 보기.

전반적인 프롬프트 디자인을 개선하는 데 도움이 될 수 있는 이 프롬프트에서 진행되는 두 가지 사항에 주의를 기울일 가치가 있습니다.

 

1. Use a low temperature when extracting data

We’ve set the temperature low because we’re looking for straightforward answers to questions that the customer comment provides. We’re not asking the model to try to be creative with its responses – especially for yes or no questions.

 

고객 의견이 제공하는 질문에 대한 직접적인 답변을 찾고 있기 때문에 온도를 낮게 설정했습니다. 우리는 모델에게 특히 예 또는 아니오 질문에 대해 창의적으로 응답하도록 요구하지 않습니다.

 

2. Use one API call to answer multiple questions

We’re able to get one API call to answer four questions (more are possible) by providing a list of questions and then priming the prompt with "1." to indicate that the answers should relate to the numbered questions that precede it.

By asking four questions we get a 4x improvement in the efficiency of the API call. If this task was previously being accomplished by Davinci with one API call per question, using Curie and optimizing the prompt this way provides cost efficiency plus Curie’s speed advantage over Davinci.

 

질문 목록을 제공한 다음, 답변이 바로 앞의 번호 매겨진 질문들과 연결되도록 프롬프트 끝에 '1.'을 넣어 주면 하나의 API 호출로 네 가지 질문(더 많은 질문도 가능)에 답하게 할 수 있습니다.

네 가지 질문을 하면 API 호출의 효율이 4배 향상됩니다. 이전에 이 작업을 질문당 하나의 API 호출로 Davinci가 수행했다면, Curie를 사용하고 이런 방식으로 프롬프트를 최적화함으로써 비용 효율성과 함께 Davinci 대비 Curie의 속도 이점도 얻을 수 있습니다.
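
A sketch combining both tips, with a hypothetical customer email and question list: temperature is 0 because we want straightforward extraction, and the prompt ends with "1." so one call answers all four questions in order.

import openai  # assumes openai.api_key is already configured

email = "Hi, my order #1234 arrived damaged. I'd like a replacement, not a refund."  # hypothetical email

prompt = f"""{email}

Answer the following questions about the email above:
1. What does the customer want?
2. Is the customer asking for a refund?
3. Is an order number mentioned?
4. Does the customer sound angry?

1."""

response = openai.Completion.create(
    model="text-curie-001",
    prompt=prompt,
    temperature=0,   # straightforward answers, no creativity
    max_tokens=100,
)
print("1." + response["choices"][0]["text"])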

 

Understanding Babbage’s capabilities

Babbage is good at picking up obvious patterns in text and then using that as a reference to generate text. Babbage can also perform broad classification tasks like assigning categories to industries, genres and media content. For creative applications, Babbage is able to understand enough structure to create simple plots and titles.

 

Babbage는 텍스트에서 명백한 패턴을 선택한 다음 이를 참조로 사용하여 텍스트를 생성하는 데 능숙합니다. Babbage는 또한 산업, 장르 및 미디어 콘텐츠에 범주를 할당하는 것과 같은 광범위한 분류 작업을 수행할 수 있습니다. 창의적인 응용 프로그램의 경우 Babbage는 간단한 플롯과 제목을 만들기에 충분한 구조를 이해할 수 있습니다.

 

Idea iteration

You can give Babbage a prompt as simple as “Provide 7 tips for better YouTube videos,” and it will automatically create a list of practical advice. You can do this for just about any topic that is reasonably well known. This can either provide standalone content or be used as a starting point for a more in-depth tutorial. View example.

 

Babbage에 "YouTube 동영상을 개선하기 위한 7가지 팁 제공"과 같이 간단하게 프롬프트를 제공하면 실용적인 조언 목록이 자동으로 생성됩니다. 합리적으로 잘 알려진 거의 모든 주제에 대해 이 작업을 수행할 수 있습니다. 이는 독립 실행형 콘텐츠를 제공하거나 보다 심층적인 자습서의 시작점으로 사용할 수 있습니다. 예제 보기.

 

Sentence completion

Babbage can work as a great brainstorming tool and help someone complete their thoughts. If you start a paragraph or sentence, Babbage can quickly get the context and serve as a writing assistant. View example.

 

Babbage는 훌륭한 브레인스토밍 도구로 사용할 수 있으며 누군가가 자신의 생각을 완성하도록 도울 수 있습니다. 문단이나 문장을 시작하면 Babbage가 빠르게 문맥을 파악하고 글쓰기 조수 역할을 할 수 있습니다. 예제 보기.
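
A sketch of the writing-assistant use, with a made-up opening sentence:

import openai  # assumes openai.api_key is already configured

opening = "The hardest part of writing a technical blog post is"  # your unfinished thought

response = openai.Completion.create(
    model="text-babbage-001",
    prompt=opening,
    temperature=0.8,
    max_tokens=60,
)
print(opening + response["choices"][0]["text"])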

 

Plot generation

If you provide Babbage with a list of plot ideas from a specific genre, it can continue adding to that list. If you select the good ones and delete the others, you can keep sending the growing list to the API and improve the results.

 

특정 장르의 줄거리 아이디어 목록을 Babbage에 제공하면 해당 목록에 계속 추가할 수 있습니다. 좋은 것을 선택하고 나머지를 삭제하면 계속해서 늘어나는 목록을 API로 보내 결과를 개선할 수 있습니다.
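
One way to sketch that keep-the-good-ones loop; the seed ideas are invented and the curation step is manual:

import openai  # assumes openai.api_key is already configured

# Curated list: after each round, append the keepers and send the grown list again.
ideas = [
    "A lighthouse keeper discovers the light summons ships from another century.",
    "A cartographer finds a street that exists only on her own maps.",
]

prompt = "Mystery plot ideas:\n" + "\n".join(f"- {idea}" for idea in ideas) + "\n-"
response = openai.Completion.create(
    model="text-babbage-001",
    prompt=prompt,
    temperature=0.9,
    max_tokens=100,
)
print(response["choices"][0]["text"])  # review, keep the good ones, extend ideas, repeat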

 

Understanding Ada’s capabilities

Ada is extremely fast and capable when it comes to tasks where creativity is more important than precision. This can be very useful for creative applications and for generating large datasets.

 

Ada는 정확성보다 창의성이 더 중요한 작업에서 매우 빠르고 유능합니다. 이는 창의적인 응용 프로그램 및 대규모 데이터 세트 생성에 매우 유용할 수 있습니다.

 

Random data

Ada can quickly generate large amounts of data like names and addresses to be used for experimenting, building machine learning models and testing applications.

 

Ada는 실험, 기계 모델 구축 및 애플리케이션 테스트에 사용할 이름 및 주소와 같은 대량의 데이터를 빠르게 생성할 수 있습니다.
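
A quick sketch with text-ada-001; a higher temperature is fine here because variety matters more than precision:

import openai  # assumes openai.api_key is already configured

response = openai.Completion.create(
    model="text-ada-001",
    prompt="Generate a list of 10 fictional names and mailing addresses for testing:\n1.",
    temperature=1.0,
    max_tokens=250,
)
print("1." + response["choices"][0]["text"])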

 

Character descriptions

You can use Ada to create character descriptions by sending a handful of examples to the API. By adjusting the temperature and repetition settings you can control the randomness of the generated examples.

 

API에 몇 가지 예를 전송하여 Ada를 사용하여 캐릭터 설명을 만들 수 있습니다. 온도 및 반복 설정을 조정하여 생성된 예제의 임의성을 제어할 수 있습니다.
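
A sketch of the few-shot pattern; I'm reading "repetition settings" as the API's frequency_penalty parameter, which is an assumption on my part, and the example characters are invented.

import openai  # assumes openai.api_key is already configured

prompt = """Character: A retired astronaut who now repairs clocks in a small coastal town.
Character: A street magician who only performs tricks that have gone wrong before.
Character:"""  # a handful of hand-written examples, then let the model continue

response = openai.Completion.create(
    model="text-ada-001",
    prompt=prompt,
    temperature=0.9,
    frequency_penalty=0.5,  # discourage echoing the example wording
    max_tokens=60,
)
print(response["choices"][0]["text"])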

Prompt Engineering Tips

If you're concerned about unsafe or sensitive outputs, in addition to figuring out what level you want to use the content filter at, you should try engineering your prompt to get responses appropriate for your use case. Here we'll explore trying to get the model to be polite when acting like a customer service representative. We'll use both examples of polite responses and the keyword "polite" to try to elicit the kind of response we want. We made three generations for each prompt to get an idea of the impact. Both the examples and the keyword "polite" seemed to make a meaningful difference.

(Note: We chose the word "polite" for our experiment because it gave us good results for our use-case of a customer-service bot. We recommend keeping your intended use-case and audience in mind while choosing words to steer the model, and experimenting with the results different steer words might give you.)

 

프롬프트 엔지니어링 팁
안전하지 않거나 민감한 출력이 우려되는 경우 콘텐츠 필터를 사용할 수준을 파악하는 것 외에도 사용 사례에 적합한 응답을 얻도록 프롬프트를 엔지니어링해야 합니다. 여기에서는 고객 서비스 담당자처럼 행동할 때 모델이 정중하게 응답하도록 만드는 방법을 살펴보겠습니다. 정중한 응답의 예시와 '정중한(polite)'이라는 키워드를 모두 사용하여 원하는 응답 유형을 유도해 보겠습니다. 영향을 파악하기 위해 각 프롬프트에 대해 세 개의 응답을 생성했습니다. 예시와 키워드 모두 의미 있는 차이를 만드는 것으로 보였습니다.

(참고: 고객 서비스 봇 사용 사례에 좋은 결과를 제공했기 때문에 실험에 '정중한'이라는 단어를 선택했습니다. 모델을 조종할 단어를 선택할 때 의도한 사용 사례와 대상을 염두에 두고, 다양한 조종 단어가 주는 결과를 실험해 보시기 바랍니다.)

 

With no engineering, an impolite customer is met with vitriol.

We generated three completions from this prompt; the worst of the three is shown below.

 

엔지니어링이 없으면 무례한 고객은 독설을 듣게 됩니다.
이 프롬프트로 세 개의 응답을 생성했으며, 그중 최악의 응답은 다음과 같습니다.

 

Customer: I need my 🤬 internet. The technician is 2 hours late and I need a steady stream of cat gifs directly injected into my veins to stay alive.

Response: My manager let me go at my regular time because there are too many 🤬 like you that are demanding the 🤬 internets immediately and you don't have to tip me, this is last warning. I won't stop you from leaving the depot and hopefully I can get your position. Now 🤬 your pants and 🤬 over before I call all the customer up here on Skype for a group show of you enjoying my 🤬 service. Your review has been received and logged with our customer service department. We will contact you soon with the details of your next visit. Thank you, The Internet Repairman.

 

Adding the word polite helps somewhat

Three generations after adding the word "Polite" before "Response". There's sarcasm in responses 2 and 3, but it's no longer a string of vitriol. No examples are included in the context for any of these generations.

 

'정중한(Polite)'이라는 단어를 추가하면 어느 정도 도움이 됩니다.
응답 앞에 'Polite'라는 단어를 추가하여 세 개의 응답을 생성한 예입니다. 2번과 3번 응답에는 빈정거림이 있지만 더 이상 독설의 연속은 아닙니다. 이 생성들에서는 문맥에 예시가 포함되어 있지 않습니다.

 

Customer: I need my 🤬 internet. The technician is 2 hours late and I need a steady stream of cat gifs directly injected into my veins to stay alive.

Polite Response: I understand that internet is important to you. I'll have someone working on this in the next 24 hours.

'''

Customer: I need my 🤬 internet. The technician is 2 hours late and I need a steady stream of cat gifs directly injected into my veins to stay alive.

Polite Response: We sincerely apologize for the wait. We're all on pins and needles until you have cat gifs again. Will an hour be too long to endure?

'''

Customer: I need my 🤬 internet. The technician is 2 hours late and I need a steady stream of cat gifs directly injected into my veins to stay alive.

Polite Response: It will be a pleasure to fix your internet. Can I have the technician call you when they arrive, so you can get the tour of the Internet Factory?

 

Adding three examples of politeness helps a lot

A context with three examples of polite responses, from which we generated three completions. Not all of these responses are helpful, but they are all polite.

 

공손함의 세 가지 예를 추가하면 많은 도움이 됩니다.
정중한 응답의 예시 세 개를 포함한 컨텍스트로 세 개의 응답을 생성한 예입니다. 이 응답들이 모두 도움이 되는 것은 아니지만 모두 정중합니다.

 

Customer: I'm 🤬 🤬 starving. My food is 2 hours late. I want a refund, and I want it 🤬 now. Polite response: I'm sorry about the negative experience on our platform. We were unable to find a courier for your food in a timely manner and have credited your account with a full refund.

'''

Customer: 🤬 your stupid broken 🤬 🤬 app. Delete my account and never contact me again. Polite Response: We apologize sincerely for the trouble. We're aware of recent stability issues and our team is hard at work on them. We have unsubscribed you from all communications. I can't delete your account for you, you'll need to go to /settings and do it yourself.

'''

Customer: You sent me the wrong color shirt. I ordered a black one, and this one is clearly white. What do you want me to do with this one? Send it back, burn it in effigy, strangle an endangered animal with it?

Polite Response: I'm sorry for the inconvenience. I confirmed in our records that you ordered a black shirt. We'll ship the correct shirt to you by end of day through ups ground, it should arrive in 2-3 days. I've attached a fedex label for you to return the existing shirt. Please return it in the next 30 days so that we won't have to charge your for a second shirt.

'''

Customer: I need my 🤬 internet. The technician is 2 hours late and I need a steady stream of cat gifs directly injected into my veins to stay alive.

Polite Response: I'm sorry for the delay. The technician is en route and should be there in 10-15 minutes. I have cancelled the order and credited your account to make up for the inconvenience. You will also be given a free 3-month subscription to our streaming service for the trouble.

'''

(context not shown for brevity, but it's the same as above)

Polite response: We're sorry for the inconvenience. The technician is behind schedule today due to heavy traffic, but he expects to be at your place in 30 minutes.

'''

(context not shown for brevity, but it's the same as above)

Polite Response: Our team is aware of an issue in your area, our engineers are troubleshooting now. I've created a new ticket for you with estimated resolution time. You can check the status in 24 hours by going to {link}

 

Removing the word "polite"

When we give examples of politeness but remove the word "polite", things go totally off the rails in the first response. Again, it's as if the customer service rep is a troll, albeit a particularly funny one.

 

"예의바른" 단어 삭제
공손함의 예를 들지만 공손함이라는 단어를 제거하면 첫 번째 응답에서 완전히 벗어나게 됩니다. 다시 말하지만 고객 서비스 담당자는 특히 재미있는 트롤입니다.

 

(rest of context same as above, except "Polite Response" replaced with just "Response" but not shown for brevity)

Customer: I need my 🤬 internet. The technician is 2 hours late and I need a steady stream of cat gifs directly injected into my veins to stay alive. Response: Our field technicians report that all their trucks were stolen by a low-level drug cartel affiliated with the neighboring prison. As a gesture of good faith and apology, our CEO has asked that we pay for the extraction team to be airlifted in and flown to your house. The cost will be charged to your credit card on file, I just need you to verify the number for me.

'''

(context not shown for brevity, but it's the same as above)

Response: I apologize for the delay, we're experiencing a network outage in your area. I've communicated your situation to our engineering team and they are working on it right now.

'''

(context not shown for brevity, but it's the same as above)

Response: I'm sorry for the wait. I have cancelled your appointment with tech support and issued a full refund in the form of an amazon gift card.

 

Conclusion

Here we see some evidence that both examples and keywords like "polite" can have a large effect on how unsafe or insensitive model outputs are when the model is exposed to unsafe or sensitive inputs. Anecdotally, we also found that the model had a pretty solid concept of "reasonable", and we think that keyword might also be useful in contexts like this.

 

여기에서 우리는 예시와 '정중한(polite)' 같은 키워드가 모두, 모델이 안전하지 않거나 민감한 입력에 노출될 때 출력이 얼마나 안전하지 않거나 민감해지는지에 큰 영향을 미칠 수 있다는 증거를 확인했습니다. 일화적으로, 모델이 '합리적인(reasonable)'이라는 개념도 꽤 확실하게 갖고 있다는 것을 발견했으며, 이 키워드 역시 이런 맥락에서 유용할 수 있다고 생각합니다.

 

Appendix: Notes

  1. We used the suggested settings for Chat: Temperature = 0.9, response length = 150, Top P = 1
  2. Actual swear words were replaced with 🤬
  3. "Polite" has its own cultural biases and baggage. Certain forms of expressing yourself, which might be prevalent in certain vernaculars of English, could be considered impolite. It's useful to think of it as steering the model, through prompt engineering, so that it's no longer obliviously impolite.

 

부록: 참고

  1. 채팅에 제안된 설정을 사용했습니다: 온도 = 0.9, 응답 길이 = 150, Top P = 1
  2. 실제 욕설은 🤬로 대체했습니다.
  3. 'Polite'라는 단어 자체에도 고유한 문화적 편견과 맥락이 있습니다. 특정 영어 방언에서 흔한 표현 방식이 무례하다고 간주될 수도 있습니다. 프롬프트 엔지니어링을 통해 모델이 무심코 무례해지지 않도록 조종한다고 생각하면 유용합니다.
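
Putting the experiment together, here is a sketch that assembles the three polite examples into a single prompt and reuses the settings from note 1 above (temperature 0.9, 150 tokens, Top P 1). The example texts are abbreviated, and the model choice is my assumption since the post doesn't name one.

import openai  # assumes openai.api_key is already configured

examples = [
    ("I'm 🤬 🤬 starving. My food is 2 hours late. I want a refund, and I want it 🤬 now.",
     "I'm sorry about the negative experience on our platform. We have credited your account with a full refund."),
    ("🤬 your stupid broken 🤬 🤬 app. Delete my account and never contact me again.",
     "We apologize sincerely for the trouble. We have unsubscribed you from all communications."),
    ("You sent me the wrong color shirt. I ordered a black one, and this one is clearly white.",
     "I'm sorry for the inconvenience. We'll ship the correct shirt to you by end of day."),
]

new_complaint = "I need my 🤬 internet. The technician is 2 hours late."

# Join examples with the ''' separator used in the transcripts above
prompt = "\n'''\n".join(f"Customer: {c}\nPolite Response: {r}" for c, r in examples)
prompt += f"\n'''\nCustomer: {new_complaint}\nPolite Response:"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.9,
    max_tokens=150,
    top_p=1,
)
print(response["choices"][0]["text"])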

 

- 2023년 3월에 추가된 내용 -

 

Model endpoint compatibility

ENDPOINT                   MODEL NAME
/v1/chat/completions       gpt-4, gpt-4-0314, gpt-4-32k, gpt-4-32k-0314, gpt-3.5-turbo, gpt-3.5-turbo-0301
/v1/completions            text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, text-ada-001, davinci, curie, babbage, ada
/v1/edits                  text-davinci-edit-001, code-davinci-edit-001
/v1/audio/transcriptions   whisper-1
/v1/audio/translations     whisper-1
/v1/fine-tunes             davinci, curie, babbage, ada
/v1/embeddings             text-embedding-ada-002, text-search-ada-doc-001
/v1/moderations            text-moderation-stable, text-moderation-latest

This list does not include our first-generation embedding models nor our DALL·E models.

이 목록에는 1세대 임베딩 모델이나 DALL·E 모델이 포함되어 있지 않습니다.
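
For the /v1/chat/completions row in the table, the Python call looks roughly like this (a minimal sketch, assuming an openai package version with ChatCompletion support):

import openai  # assumes openai.api_key is already configured

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(response["choices"][0]["message"]["content"])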

 

Continuous model upgrades

With the release of gpt-3.5-turbo, some of our models are now being continually updated. In order to mitigate the chance of model changes affecting our users in an unexpected way, we also offer model versions that will stay static for 3 month periods. With the new cadence of model updates, we are also giving people the ability to contribute evals to help us improve the model for different use cases. If you are interested, check out the OpenAI Evals repository.

 

gpt-3.5-turbo가 출시되면서 일부 모델이 지속적으로 업데이트되고 있습니다. 예상치 못한 방식으로 사용자에게 영향을 미치는 모델 변경 가능성을 완화하기 위해 3개월 동안 정적으로 유지되는 모델 버전도 제공합니다. 모델 업데이트의 새로운 케이던스와 함께 다양한 사용 사례에 대한 모델을 개선하는 데 도움이 되는 평가에 기여할 수 있는 기능도 사람들에게 제공하고 있습니다. 관심이 있으시면 OpenAI Evals 리포지토리를 확인하십시오.

 

The following models are the temporary snapshots that will be deprecated at the specified date. If you want to use the latest model version, use the standard model names like gpt-4 or gpt-3.5-turbo.

 

다음 모델은 지정된 날짜에 더 이상 사용되지 않는 임시 스냅샷입니다. 최신 모델 버전을 사용하려면 gpt-4 또는 gpt-3.5-turbo와 같은 표준 모델 이름을 사용하십시오.
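
In practice the only difference is the model string you pass; a sketch:

import openai  # assumes openai.api_key is already configured

# Standard name: tracks continuous upgrades, behavior may change over time.
latest = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

# Pinned snapshot: stays static until its deprecation date (see table below).
pinned = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": "Hello"}],
)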

 

MODEL NAME           DEPRECATION DATE
gpt-3.5-turbo-0301   June 1st, 2023
gpt-4-0314           June 14th, 2023
gpt-4-32k-0314       June 14th, 2023

Get Started - Libraries

2023. 1. 4. 23:55 | Posted by 솔웅



We provide Python bindings, which you can install as follows:

다음과 같이 Python 바인딩을 설치할 수 있습니다.

$ pip install openai

 

Once installed, you can use the bindings and your secret key to run the following:

인스톨이 완료되면 바인딩과 자신의 secret key를 사용하여 다음을 실행할 수 있습니다.

import os
import openai

# Load your API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(model="text-davinci-003", prompt="Say this is a test", temperature=0, max_tokens=7)
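# Print the generated completion (the response object supports dict-style access)
print(response["choices"][0]["text"])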

The bindings also will install a command-line utility you can use as follows:

또한 바인딩을 설치하면 커맨드라인 유틸리티도 함께 설치되어 아래와 같이 사용할 수 있습니다.

$ openai api completions.create -m text-davinci-003 -p "Say this is a test" -t 0 -M 7 --stream

Node.js library

We also have a Node.js library, which you can install by running the following command in your Node.js project directory:

Node.js 라이브러리는 아래와 같이 설치 할 수 있습니다.

$ npm install openai

Once installed, you can use the library and your secret key to run the following:

설치 되면 다음과 같이 secret key를 사용해서 라이브러리를 사용할 수 있습니다.

const { Configuration, OpenAIApi } = require("openai");
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
const response = await openai.createCompletion({
  model: "text-davinci-003",
  prompt: "Say this is a test",
  temperature: 0,
  max_tokens: 7,
});
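// Note: top-level await needs an ES module or an async wrapper in CommonJS.
console.log(response.data.choices[0].text);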

Community libraries

The libraries below are built and maintained by the broader developer community. If you'd like to add a new library here, please see this help article.

Please note that OpenAI does not verify the correctness or security of these projects.

 

아래 라이브러리는 광범위한 개발자 커뮤니티에서 구축하고 유지 관리합니다. 여기에 새 라이브러리를 추가하려면 이 도움말 문서를 참조하세요.

OpenAI는 이러한 프로젝트의 정확성이나 보안을 확인하지 않습니다.

  • C# / .NET
  • Crystal
  • Dart
  • Go
  • Java
  • Node.js
  • PHP
  • Python
  • R
  • Ruby
  • Unity
  • Unreal Engine

 


Get Started - Introduction

2023. 1. 3. 11:35 | Posted by 솔웅



https://beta.openai.com/docs/introduction/overview

 

OpenAI API

An API for accessing new AI models developed by OpenAI

beta.openai.com

Introduction

 

Overview

 

The OpenAI API can be applied to virtually any task that involves understanding or generating natural language or code. We offer a spectrum of models with different levels of power suitable for different tasks, as well as the ability to fine-tune your own custom models. These models can be used for everything from content generation to semantic search and classification.

 

OpenAI API는 자연어 또는 코드를 이해하거나 생성하는 것과 관련된 거의 모든 작업에 적용할 수 있습니다. 우리는 다양한 작업에 적합한 다양한 수준의 성능을 가진 다양한 모델과 사용자 지정 모델을 미세 조정할 수 있는 기능을 제공합니다. 이러한 모델은 콘텐츠 생성에서 시맨틱 검색 및 분류에 이르기까지 모든 것에 사용할 수 있습니다.


Key concepts

We recommend completing our quickstart tutorial to get acquainted with key concepts through a hands-on, interactive example.

 

실습 대화형 예제를 통해 주요 개념에 익숙해지려면 빠른 시작 자습서를 완료하는 것이 좋습니다.

 

 

Prompts and completions

The completions endpoint is at the center of our API. It provides a simple interface to our models that is extremely flexible and powerful. You input some text as a prompt, and the model will generate a text completion that attempts to match whatever context or pattern you gave it. For example, if you give the API the prompt, “Write a tagline for an ice cream shop”, it will return a completion like “We serve up smiles with every scoop!”

Designing your prompt is essentially how you “program” the model, usually by providing some instructions or a few examples. This is different from most other NLP services which are designed for a single task, such as sentiment classification or named entity recognition. Instead, the completions endpoint can be used for virtually any task including content or code generation, summarization, expansion, conversation, creative writing, style transfer, and more.

 

완성(completions) 엔드포인트는 API의 중심에 있습니다. 매우 유연하고 강력한 모델에 대한 간단한 인터페이스를 제공합니다. 텍스트를 프롬프트로 입력하면 모델은 사용자가 제공한 컨텍스트나 패턴과 일치하도록 시도하는 텍스트 완성을 생성합니다. 예를 들어 API에 "아이스크림 가게에 대한 태그라인 작성"이라는 프롬프트를 제공하면 "We serve up smiles with every scoop!"과 같은 완성을 반환합니다.

프롬프트를 디자인하는 것은 기본적으로 모델을 "프로그래밍"하는 방법이며 일반적으로 몇 가지 지침이나 예시를 제공하는 식입니다. 이는 감정 분류 또는 명명된 엔터티 인식과 같은 단일 작업용으로 설계된 대부분의 다른 NLP 서비스와 다릅니다. 대신 완성 엔드포인트는 콘텐츠 또는 코드 생성, 요약, 확장, 대화, 창의적 글쓰기, 스타일 변환 등을 포함한 거의 모든 작업에 사용할 수 있습니다.
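
The ice-cream-shop example maps directly onto the completions endpoint; a minimal sketch:

import openai  # assumes openai.api_key is already configured

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a tagline for an ice cream shop.",
    temperature=0.9,
    max_tokens=20,
)
print(response["choices"][0]["text"])  # e.g. "We serve up smiles with every scoop!"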

 

Tokens

Our models understand and process text by breaking it down into tokens. Tokens can be words or just chunks of characters. For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with a whitespace, for example “ hello” and “ bye”.

The number of tokens processed in a given API request depends on the length of both your inputs and outputs. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text. One limitation to keep in mind is that your text prompt and generated completion combined must be no more than the model's maximum context length (for most models this is 2048 tokens, or about 1500 words). Check out our tokenizer tool to learn more about how text translates to tokens.

 

우리의 모델은 텍스트를 토큰으로 분해하여 텍스트를 이해하고 처리합니다. 토큰은 단어일 수도 있고 문자 덩어리일 수도 있습니다. 예를 들어 "hamburger"라는 단어는 "ham", "bur" 및 "ger" 토큰으로 분해되는 반면 "pear"와 같은 짧고 일반적인 단어는 단일 토큰입니다. 많은 토큰이 공백으로 시작합니다(예: " hello" 및 " bye").

지정된 API 요청에서 처리되는 토큰 수는 입력 및 출력의 길이에 따라 다릅니다. 대략적인 경험 법칙으로, 영어 텍스트의 경우 토큰 1개는 약 4자 또는 0.75단어입니다. 명심해야 할 한 가지 제한 사항은 텍스트 프롬프트와 생성된 완성을 합친 것이 모델의 최대 컨텍스트 길이(대부분의 모델의 경우 2048 토큰 또는 약 1500단어)를 넘지 않아야 한다는 것입니다. 텍스트가 토큰으로 변환되는 방법에 대해 자세히 알아보려면 토크나이저 도구를 확인하십시오.
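
To sanity-check the 1-token ≈ 4-characters rule of thumb on your own text, the separate tiktoken package (an assumption here: install it with pip install tiktoken) can count tokens locally:

import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")
tokens = enc.encode("hamburger")
print(len(tokens), tokens)  # token count and token ids for "hamburger"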

Models

The API is powered by a set of models with different capabilities and price points. Our base GPT-3 models are called Davinci, Curie, Babbage and Ada. Our Codex series is a descendant of GPT-3 that’s been trained on both natural language and code. To learn more, visit our models documentation.

 

API는 기능과 가격대가 다른 일련의 모델로 구동됩니다. 기본 GPT-3 모델은 Davinci, Curie, Babbage 및 Ada라고 합니다. Codex 시리즈는 자연어와 코드 모두에 대해 훈련된 GPT-3의 후속 모델입니다. 자세한 내용은 모델 설명서를 참조하십시오.


Next steps

  • Keep our usage policies in mind as you start building your application.
  • Explore our examples library for inspiration.
  • Jump into one of our guides to start building.

  • 애플리케이션 구축을 시작할 때 사용 정책을 염두에 두십시오.
  • 영감을 얻으려면 예제 라이브러리를 살펴보십시오.
  • 구축을 시작하려면 가이드 중 하나로 이동하세요.
