pythainlp.generate

The pythainlp.generate module provides Thai text generation with PyThaiNLP.

Modules

class pythainlp.generate.Unigram(name: str = 'tnc')[source]

Text generator using a unigram language model

Parameters:

name (str) – corpus name
  • tnc – Thai National Corpus (default)
  • ttc – Thai Textbook Corpus (TTC)
  • oscar – OSCAR Corpus

__init__(name: str = 'tnc')[source]
gen_sentence(start_seq: str | None = None, N: int = 3, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str[source]
Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate import Unigram

gen = Unigram()

gen.gen_sentence("แมว")
# output: 'แมวเวลานะนั้น'
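
The corpus and the generation parameters can also be set explicitly. A minimal sketch (not from the original documentation; the exact output depends on the chosen corpus):

from pythainlp.generate import Unigram

gen = Unigram("oscar")  # use the OSCAR corpus instead of the default TNC

gen.gen_sentence("แมว", N=5, output_str=False, duplicate=True)
# output: a list of words starting with 'แมว'
# (the exact words depend on the corpus)
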
class pythainlp.generate.Bigram(name: str = 'tnc')[source]

Text generator using a bigram language model

Parameters:

name (str) – corpus name
  • tnc – Thai National Corpus (default)

__init__(name: str = 'tnc')[source]
prob(t1: str, t2: str) → float[source]

Probability of a two-word sequence (bigram)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

Returns:

probability value

Return type:

float
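
Example (a minimal sketch, not from the original documentation; the returned value depends on the corpus counts and may be 0.0 for an unseen word pair):

from pythainlp.generate import Bigram

gen = Bigram()

gen.prob("ฉัน", "กิน")
# output: a float probability estimated from the corpus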

gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str[source]
Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate import Bigram

gen = Bigram()

gen.gen_sentence("แมว")
# output: 'แมวไม่ได้รับเชื้อมัน'
class pythainlp.generate.Trigram(name: str = 'tnc')[source]

Text generator using a trigram language model

Parameters:

name (str) – corpus name
  • tnc – Thai National Corpus (default)

__init__(name: str = 'tnc')[source]
prob(t1: str, t2: str, t3: str) → float[source]

Probability of a three-word sequence (trigram)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

  • t3 (str) – third word

Returns:

probability value

Return type:

float
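
Example (a minimal sketch, not from the original documentation; the returned value depends on the corpus counts and may be 0.0 for an unseen word triple):

from pythainlp.generate import Trigram

gen = Trigram()

gen.prob("ฉัน", "กิน", "ข้าว")
# output: a float probability estimated from the corpus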

gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str[source]
Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate import Trigram

gen = Trigram()

gen.gen_sentence()
# output: 'ยังทำตัวเป็นเซิร์ฟเวอร์คือ'
pythainlp.generate.thai2fit.gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True) → List[str] | str[source]

Text generator using the thai2fit language model

Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate.thai2fit import gen_sentence

gen_sentence()
# output: 'แคทรียา อิงลิช  (นักแสดง'

gen_sentence("แมว")
# output: 'แมว คุณหลวง '
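
Passing output_str=False returns the generated words as a list instead of a string. A minimal sketch (not from the original documentation; the exact words vary with the model):

from pythainlp.generate.thai2fit import gen_sentence

gen_sentence("แมว", N=5, output_str=False)
# output: a list of words starting with 'แมว'
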
class pythainlp.generate.wangchanglm.WangChanGLM[source]
__init__()[source]
is_exclude(text: str) → bool[source]
load_model(model_path: str = 'pythainlp/wangchanglm-7.5B-sft-en-sharded', return_dict: bool = True, load_in_8bit: bool = False, device: str = 'cuda', torch_dtype=torch.float16, offload_folder: str = './', low_cpu_mem_usage: bool = True)[source]

Load model

Parameters:
  • model_path (str) – model path or Hugging Face model name

  • return_dict (bool) – whether model outputs are returned as a dict

  • load_in_8bit (bool) – load the model weights in 8-bit precision

  • device (str) – device to load the model on (cpu, cuda or other)

  • torch_dtype (torch_dtype) – torch data type for the model weights (e.g. torch.float16)

  • offload_folder (str) – folder for offloading model weights

  • low_cpu_mem_usage (bool) – reduce CPU memory usage while loading
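
For example, to load the model on a GPU in 8-bit precision (a minimal sketch, not from the original documentation; 8-bit loading typically requires the bitsandbytes package to be installed):

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

# load the default WangChanGLM checkpoint on GPU with 8-bit weights
model.load_model(device="cuda", torch_dtype=torch.float16, load_in_8bit=True)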

gen_instruct(text: str, max_new_tokens: int = 512, top_p: float = 0.95, temperature: float = 0.9, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1.0, thai_only: bool = True, skip_special_tokens: bool = True)[source]

Generate a response from a prompt text

Parameters:
  • text (str) – input prompt text

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – top-p (nucleus) sampling threshold

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that are not allowed to repeat

  • typical_p (float) – typical-p sampling threshold

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – skip special tokens when decoding

Returns:

the generated answer.

Return type:

str
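
Example (a minimal sketch, not from the original documentation): gen_instruct takes the prompt text as given, while instruct_generate (shown in the example below) is the higher-level entry point that builds the prompt from an instruction and optional context. This assumes the prompt text is passed directly:

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

model.load_model(device="cpu", torch_dtype=torch.bfloat16)

print(model.gen_instruct("ขอวิธีลดน้ำหนัก", max_new_tokens=256))
# output: generated Thai text continuing the prompt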

instruct_generate(instruct: str, context: str | None = None, max_new_tokens=512, temperature: float = 0.9, top_p: float = 0.95, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1, thai_only: bool = True, skip_special_tokens: bool = True)[source]

Generate a response from an instruction with optional context

Parameters:
  • instruct (str) – instruction text

  • context (str) – optional context for the instruction

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – top-p (nucleus) sampling threshold

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that are not allowed to repeat

  • typical_p (float) – typical-p sampling threshold

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – skip special tokens when decoding

Returns:

the generated answer to the instruction.

Return type:

str

Example:

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

model.load_model(device="cpu", torch_dtype=torch.bfloat16)

print(model.instruct_generate(instruct="ขอวิธีลดน้ำหนัก"))
# output: ลดน้ําหนักให้ได้ผล ต้องทําอย่างค่อยเป็นค่อยไป
# ปรับเปลี่ยนพฤติกรรมการกินอาหาร
# ออกกําลังกายอย่างสม่ําเสมอ
# และพักผ่อนให้เพียงพอ
# ที่สําคัญควรหลีกเลี่ยงอาหารที่มีแคลอรี่สูง
# เช่น อาหารทอด อาหารมัน อาหารที่มีน้ําตาลสูง
# และเครื่องดื่มแอลกอฮอล์