pythainlp.generate

The pythainlp.generate module provides Thai text generation in PyThaiNLP. It includes n-gram generators (Unigram, Bigram, Trigram) and wrappers around neural language models (Thai2fit and WangChanGLM).

Modules

Unigram

class pythainlp.generate.Unigram(name: str = 'tnc')

Text generator using Unigram

Parameters:

name (str) – corpus name

  • tnc - Thai National Corpus (default)

  • ttc - Thai Textbook Corpus (TTC)

  • oscar - OSCAR Corpus

__init__(name: str = 'tnc')
gen_sentence(start_seq: str | None = None, N: int = 3, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str
Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

List[str], str

Example:

from pythainlp.generate import Unigram

gen = Unigram()

gen.gen_sentence("แมว")
# output: 'แมวเวลานะนั้น'

The Unigram class provides functionality for generating text based on unigram language models. Unigrams are single words or tokens, and this class allows you to create text by selecting words probabilistically based on their frequencies in the training data.
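The idea of frequency-weighted word selection can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the Unigram class's actual implementation: the word counts are hypothetical toy data, and the gen_unigram helper and its parameters are invented for this sketch.

```python
import random
from collections import Counter

# Toy word counts standing in for a real corpus (hypothetical data).
counts = Counter({"แมว": 5, "กิน": 3, "ปลา": 2, "นอน": 1})

def gen_unigram(start_word, n, counts, duplicate=False, rng=None):
    """Append up to n words to start_word, drawn in proportion to their counts."""
    rng = rng or random.Random()
    sentence = [start_word]
    for _ in range(n):
        # When duplicates are not allowed, skip words already in the sentence.
        pool = [w for w in counts if duplicate or w not in sentence]
        if not pool:
            break  # vocabulary exhausted
        weights = [counts[w] for w in pool]
        sentence.append(rng.choices(pool, weights=weights, k=1)[0])
    return sentence

words = gen_unigram("แมว", 3, counts, rng=random.Random(0))
print("".join(words))  # Thai is written without spaces between words
```

Because each word is drawn independently of its neighbours, the result follows corpus frequencies but carries no word-order information.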

Bigram

class pythainlp.generate.Bigram(name: str = 'tnc')

Text generator using Bigram

Parameters:

name (str) – corpus name

  • tnc - Thai National Corpus (default)

__init__(name: str = 'tnc')
prob(t1: str, t2: str) → float

Probability that the model assigns to the word pair (t1, t2)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str
Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

List[str], str

Example:

from pythainlp.generate import Bigram

gen = Bigram()

gen.gen_sentence("แมว")
# output: 'แมวไม่ได้รับเชื้อมัน'

The Bigram class is designed for generating text using bigram language models. Bigrams are sequences of two words, and this class enables you to generate text by predicting the next word based on the previous word’s probability.
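A rough sketch of next-word prediction from the previous word follows. It assumes a plain relative-frequency estimate of P(t2 | t1); the estimator that Bigram.prob() actually uses is not stated in this reference and may differ. The corpus is hypothetical, and bigram_prob and gen_bigram are names made up for this illustration.

```python
import random
from collections import Counter, defaultdict

# Tiny pre-tokenized corpus (hypothetical data).
corpus = [
    ["แมว", "กิน", "ปลา"],
    ["แมว", "กิน", "ข้าว"],
    ["หมา", "กิน", "ข้าว"],
    ["แมว", "นอน"],
]

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for sent in corpus:
    for prev, nxt in zip(sent, sent[1:]):
        following[prev][nxt] += 1

def bigram_prob(t1, t2):
    """Relative-frequency estimate of P(t2 | t1); 0.0 if t1 has no followers."""
    options = following.get(t1, Counter())
    total = sum(options.values())
    return options[t2] / total if total else 0.0

def gen_bigram(start_word, n, rng=None):
    """Extend start_word by up to n words, each sampled given the previous word."""
    rng = rng or random.Random()
    sentence = [start_word]
    for _ in range(n):
        options = following.get(sentence[-1])
        if not options:
            break  # no observed continuation for the last word
        words = list(options)
        weights = [options[w] for w in words]
        sentence.append(rng.choices(words, weights=weights, k=1)[0])
    return sentence

print(bigram_prob("แมว", "กิน"))
print("".join(gen_bigram("แมว", 4, random.Random(1))))
```

With so little data the generator often stops early, because many words have no observed successor. A larger corpus makes such dead ends less frequent.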

Trigram

class pythainlp.generate.Trigram(name: str = 'tnc')

Text generator using Trigram

Parameters:

name (str) – corpus name

  • tnc - Thai National Corpus (default)

__init__(name: str = 'tnc')
prob(t1: str, t2: str, t3: str) → float

Probability that the model assigns to the word triple (t1, t2, t3)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

  • t3 (str) – third word

Returns:

probability value

Return type:

float

gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str
Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • output_str (bool) – output as string

  • duplicate (bool) – allow duplicate words in sentence

Returns:

list of words or a word string

Return type:

List[str], str

Example:

from pythainlp.generate import Trigram

gen = Trigram()

gen.gen_sentence()
# output: 'ยังทำตัวเป็นเซิร์ฟเวอร์คือ'

The Trigram class extends text generation to trigram language models. Trigrams consist of three consecutive words, and this class facilitates the creation of text by predicting the next word based on the two preceding words’ probabilities.
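Conditioning on the two preceding words can be sketched in the same way. The example below again assumes a relative-frequency estimate, which may not match what Trigram.prob() computes. The corpus is hypothetical, and trigram_prob and most_likely_next are illustrative names only.

```python
from collections import Counter, defaultdict

# Tiny pre-tokenized corpus (hypothetical data).
corpus = [
    ["ฉัน", "ชอบ", "กิน", "ปลา"],
    ["ฉัน", "ชอบ", "กิน", "ข้าว"],
    ["ฉัน", "ชอบ", "นอน"],
]

# Map each pair of consecutive words to a count of the words that follow it.
continuations = defaultdict(Counter)
for sent in corpus:
    for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
        continuations[(w1, w2)][w3] += 1

def trigram_prob(t1, t2, t3):
    """Relative-frequency estimate of P(t3 | t1, t2); 0.0 for an unseen context."""
    options = continuations.get((t1, t2), Counter())
    total = sum(options.values())
    return options[t3] / total if total else 0.0

def most_likely_next(t1, t2):
    """Most frequent continuation of the context (t1, t2), or None if unseen."""
    options = continuations.get((t1, t2))
    return options.most_common(1)[0][0] if options else None

print(trigram_prob("ฉัน", "ชอบ", "กิน"))
print(most_likely_next("ฉัน", "ชอบ"))
```

A two-word context is more specific than a one-word context, so continuations tend to fit better, but more contexts are never observed in the training data.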

pythainlp.generate.thai2fit.gen_sentence

pythainlp.generate.thai2fit.gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True) → List[str] | str

Text generator using Thai2fit

Parameters:
  • start_seq (str) – word to begin sentence with

  • N (int) – number of words

  • output_str (bool) – output as string

Returns:

list of words or a word string

Return type:

List[str], str

Example:

from pythainlp.generate.thai2fit import gen_sentence

gen_sentence()
# output: 'แคทรียา อิงลิช  (นักแสดง'

gen_sentence("แมว")
# output: 'แมว คุณหลวง '

The function pythainlp.generate.thai2fit.gen_sentence() generates a sentence with the Thai2fit language model. It takes an optional starting word (start_seq) and returns the generated words as a single string, or as a list of words when output_str is False.

pythainlp.generate.wangchanglm.WangChanGLM

class pythainlp.generate.wangchanglm.WangChanGLM
__init__()
is_exclude(text: str) → bool
load_model(model_path: str = 'pythainlp/wangchanglm-7.5B-sft-en-sharded', return_dict: bool = True, load_in_8bit: bool = False, device: str = 'cuda', torch_dtype=torch.float16, offload_folder: str = './', low_cpu_mem_usage: bool = True)

Load model

Parameters:
  • model_path (str) – model path

  • return_dict (bool) – return dict

  • load_in_8bit (bool) – load model in 8bit

  • device (str) – device (cpu, cuda or other)

  • torch_dtype (torch_dtype) – torch_dtype

  • offload_folder (str) – offload folder

  • low_cpu_mem_usage (bool) – low cpu mem usage

gen_instruct(text: str, max_new_tokens: int = 512, top_p: float = 0.95, temperature: float = 0.9, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1.0, thai_only: bool = True, skip_special_tokens: bool = True)

Generate Instruct

Parameters:
  • text (str) – text

  • max_new_tokens (int) – maximum number of new tokens

  • top_p (float) – top p

  • temperature (float) – temperature

  • top_k (int) – top k

  • no_repeat_ngram_size (int) – do not repeat ngram size

  • typical_p (float) – typical p

  • thai_only (bool) – Thai only

  • skip_special_tokens (bool) – skip special tokens

Returns:

the answer from Instruct

Return type:

str

instruct_generate(instruct: str, context: str | None = None, max_new_tokens=512, temperature: float = 0.9, top_p: float = 0.95, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1, thai_only: bool = True, skip_special_tokens: bool = True)

Generate Instruct

Parameters:
  • instruct (str) – Instruct

  • context (str) – context

  • max_new_tokens (int) – maximum number of new tokens

  • top_p (float) – top p

  • temperature (float) – temperature

  • top_k (int) – top k

  • no_repeat_ngram_size (int) – do not repeat ngram size

  • typical_p (float) – typical p

  • thai_only (bool) – Thai only

  • skip_special_tokens (bool) – skip special tokens

Returns:

the answer from Instruct

Return type:

str

Example:

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

model.load_model(device="cpu", torch_dtype=torch.bfloat16)

print(model.instruct_generate(instruct="ขอวิธีลดน้ำหนัก"))
# output: ลดน้ําหนักให้ได้ผล ต้องทําอย่างค่อยเป็นค่อยไป
# ปรับเปลี่ยนพฤติกรรมการกินอาหาร
# ออกกําลังกายอย่างสม่ําเสมอ
# และพักผ่อนให้เพียงพอ
# ที่สําคัญควรหลีกเลี่ยงอาหารที่มีแคลอรี่สูง
# เช่น อาหารทอด อาหารมัน อาหารที่มีน้ําตาลสูง
# และเครื่องดื่มแอลกอฮอล์

The WangChanGLM class in the pythainlp.generate.wangchanglm module generates text with the WangChanGLM language model. Call load_model() once to load the model, then pass an instruction (and, optionally, a context) to instruct_generate() to get the answer as a string.
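The temperature, top_k and top_p parameters share their names with widely used sampling controls. Assuming they carry the usual meaning here, the sketch below shows one common formulation of how such controls narrow a next-token distribution before sampling. It is a generic illustration over hypothetical toy scores, not the code path used by WangChanGLM, and filter_distribution is a name invented for this example.

```python
import math

def filter_distribution(logits, temperature=0.9, top_k=50, top_p=0.95):
    """Return a renormalized token distribution after temperature, top-k and top-p."""
    # Temperature: dividing scores by a value below 1 sharpens the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    # Top-k: keep only the k highest-weighted tokens.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(w for _, w in ranked)
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, w in ranked:
        kept.append((tok, w))
        cumulative += w / total
        if cumulative >= top_p:
            break
    norm = sum(w for _, w in kept)
    return {tok: w / norm for tok, w in kept}

# Hypothetical raw scores for four candidate tokens.
toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
result = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
print(result)
```

In this formulation, lower values of any of the three parameters restrict sampling to fewer, more probable tokens, which makes output more repeatable and less varied.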

Usage

To use the text generation capabilities provided by the pythainlp.generate module, follow these steps:

  1. Select the appropriate class or function based on the type of language model you want to use (Unigram, Bigram, Trigram, Thai2fit, or WangChanGLM).

  2. Initialize the selected class or use the function with the necessary parameters.

  3. Call the appropriate methods to generate text based on the chosen model.

  4. Utilize the generated text for various applications, such as chatbots, content generation, and more.
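The selection step can be sketched as a small lookup table that maps a short label to the import location given in this reference. The labels, the GENERATORS table and the resolve helper are hypothetical names for illustration; only the module paths and class or function names come from the documentation above.

```python
import importlib

# Import locations as listed in this reference; the keys are hypothetical labels.
GENERATORS = {
    "unigram": ("pythainlp.generate", "Unigram"),
    "bigram": ("pythainlp.generate", "Bigram"),
    "trigram": ("pythainlp.generate", "Trigram"),
    "thai2fit": ("pythainlp.generate.thai2fit", "gen_sentence"),
    "wangchanglm": ("pythainlp.generate.wangchanglm", "WangChanGLM"),
}

def resolve(kind):
    """Import and return the class or function registered under kind."""
    try:
        module_name, attr = GENERATORS[kind]
    except KeyError:
        raise ValueError(f"unknown generator: {kind!r}") from None
    # The import happens only here, so heavy dependencies load on demand.
    return getattr(importlib.import_module(module_name), attr)

# Usage (requires PyThaiNLP to be installed):
# gen = resolve("bigram")()
# print(gen.gen_sentence("แมว"))
```

Deferring the import keeps start-up cheap when only one generator is needed, which is relevant for the neural models.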

Example

Here’s a simple example of how to generate text using the Unigram class:

from pythainlp.generate import Unigram

# Initialize the Unigram model
unigram = Unigram()

# Generate a sentence that starts with the given word
sentence = unigram.gen_sentence("สวัสดีครับ")

print(sentence)