pythainlp.generate

The pythainlp.generate module provides Thai text generation with PyThaiNLP.

Modules

class pythainlp.generate.Unigram(name: str = 'tnc')[source]

Text generator using a unigram language model

Parameters:

name (str) – corpus name
  • tnc – Thai National Corpus (default)
  • ttc – Thai Textbook Corpus (TTC)
  • oscar – OSCAR Corpus

__init__(name: str = 'tnc')[source]
gen_sentence(start_seq: str | None = None, N: int = 3, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str[source]
Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate import Unigram

gen = Unigram()

gen.gen_sentence("แมว")
# output: 'แมวเวลานะนั้น'
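
The corpus and the generation parameters can also be set explicitly. A minimal sketch (not from the original documentation; the exact output depends on the chosen corpus):

from pythainlp.generate import Unigram

gen = Unigram("oscar")  # use the OSCAR corpus instead of the default TNC

gen.gen_sentence("แมว", N=5, output_str=False, duplicate=True)
# output: a list of words starting with 'แมว'
# (the exact words depend on the corpus)
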
class pythainlp.generate.Bigram(name: str = 'tnc')[source]

Text generator using a bigram language model

Parameters:

name (str) – corpus name
  • tnc – Thai National Corpus (default)

__init__(name: str = 'tnc')[source]
prob(t1: str, t2: str) → float[source]

Probability of a two-word sequence (bigram)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

Returns:

probability value

Return type:

float
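
Example (a minimal sketch, not from the original documentation; the returned value depends on the corpus counts and may be 0.0 for an unseen word pair):

from pythainlp.generate import Bigram

gen = Bigram()

gen.prob("ฉัน", "กิน")
# output: a float probability estimated from the corpus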

gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str[source]
Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate import Bigram

gen = Bigram()

gen.gen_sentence("แมว")
# output: 'แมวไม่ได้รับเชื้อมัน'
class pythainlp.generate.Trigram(name: str = 'tnc')[source]

Text generator using a trigram language model

Parameters:

name (str) – corpus name
  • tnc – Thai National Corpus (default)

__init__(name: str = 'tnc')[source]
prob(t1: str, t2: str, t3: str) → float[source]

Probability of a three-word sequence (trigram)

Parameters:
  • t1 (str) – first word

  • t2 (str) – second word

  • t3 (str) – third word

Returns:

probability value

Return type:

float
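
Example (a minimal sketch, not from the original documentation; the returned value depends on the corpus counts and may be 0.0 for an unseen word triple):

from pythainlp.generate import Trigram

gen = Trigram()

gen.prob("ฉัน", "กิน", "ข้าว")
# output: a float probability estimated from the corpus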

gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True, duplicate: bool = False) → List[str] | str[source]
Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

  • duplicate (bool) – allow duplicate words in the sentence

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate import Trigram

gen = Trigram()

gen.gen_sentence()
# output: 'ยังทำตัวเป็นเซิร์ฟเวอร์คือ'
pythainlp.generate.thai2fit.gen_sentence(start_seq: str | None = None, N: int = 4, prob: float = 0.001, output_str: bool = True) → List[str] | str[source]

Text generator using the thai2fit language model

Parameters:
  • start_seq (str) – word to start the sentence with.

  • N (int) – number of words to generate.

  • output_str (bool) – return the output as a string (otherwise a list of words)

Returns:

a list of words or a string of words

Return type:

List[str], str

Example:

from pythainlp.generate.thai2fit import gen_sentence

gen_sentence()
# output: 'แคทรียา อิงลิช  (นักแสดง'

gen_sentence("แมว")
# output: 'แมว คุณหลวง '
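
Passing output_str=False returns the generated words as a list instead of a string. A minimal sketch (not from the original documentation; the exact words vary with the model):

from pythainlp.generate.thai2fit import gen_sentence

gen_sentence("แมว", N=5, output_str=False)
# output: a list of words starting with 'แมว'
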
class pythainlp.generate.wangchanglm.WangChanGLM[source]
__init__()[source]
is_exclude(text: str) → bool[source]
load_model(model_path: str = 'pythainlp/wangchanglm-7.5B-sft-en-sharded', return_dict: bool = True, load_in_8bit: bool = False, device: str = 'cuda', torch_dtype=torch.float16, offload_folder: str = './', low_cpu_mem_usage: bool = True)[source]

Load model

Parameters:
  • model_path (str) – model path or Hugging Face model name

  • return_dict (bool) – whether model outputs are returned as a dict

  • load_in_8bit (bool) – load the model weights in 8-bit precision

  • device (str) – device to load the model on (cpu, cuda or other)

  • torch_dtype (torch_dtype) – torch data type for the model weights (e.g. torch.float16)

  • offload_folder (str) – folder for offloading model weights

  • low_cpu_mem_usage (bool) – reduce CPU memory usage while loading
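
For example, to load the model on a GPU in 8-bit precision (a minimal sketch, not from the original documentation; 8-bit loading typically requires the bitsandbytes package to be installed):

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

# load the default WangChanGLM checkpoint on GPU with 8-bit weights
model.load_model(device="cuda", torch_dtype=torch.float16, load_in_8bit=True)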

gen_instruct(text: str, max_new_tokens: int = 512, top_p: float = 0.95, temperature: float = 0.9, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1.0, thai_only: bool = True, skip_special_tokens: bool = True)[source]

Generate a response from a prompt text

Parameters:
  • text (str) – input prompt text

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – top-p (nucleus) sampling threshold

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that are not allowed to repeat

  • typical_p (float) – typical-p sampling threshold

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – skip special tokens when decoding

Returns:

the generated answer.

Return type:

str
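
Example (a minimal sketch, not from the original documentation): gen_instruct takes the prompt text as given, while instruct_generate (shown in the example below) is the higher-level entry point that builds the prompt from an instruction and optional context. This assumes the prompt text is passed directly:

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

model.load_model(device="cpu", torch_dtype=torch.bfloat16)

print(model.gen_instruct("ขอวิธีลดน้ำหนัก", max_new_tokens=256))
# output: generated Thai text continuing the prompt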

instruct_generate(instruct: str, context: str | None = None, max_new_tokens=512, temperature: float = 0.9, top_p: float = 0.95, top_k: int = 50, no_repeat_ngram_size: int = 2, typical_p: float = 1, thai_only: bool = True, skip_special_tokens: bool = True)[source]

Generate a response from an instruction with optional context

Parameters:
  • instruct (str) – instruction text

  • context (str) – optional context for the instruction

  • max_new_tokens (int) – maximum number of new tokens to generate

  • top_p (float) – top-p (nucleus) sampling threshold

  • temperature (float) – sampling temperature

  • top_k (int) – top-k sampling cutoff

  • no_repeat_ngram_size (int) – size of n-grams that are not allowed to repeat

  • typical_p (float) – typical-p sampling threshold

  • thai_only (bool) – restrict the output to Thai text

  • skip_special_tokens (bool) – skip special tokens when decoding

Returns:

the generated answer to the instruction.

Return type:

str

Example:

from pythainlp.generate.wangchanglm import WangChanGLM
import torch

model = WangChanGLM()

model.load_model(device="cpu", torch_dtype=torch.bfloat16)

print(model.instruct_generate(instruct="ขอวิธีลดน้ำหนัก"))
# output: ลดน้ําหนักให้ได้ผล ต้องทําอย่างค่อยเป็นค่อยไป
# ปรับเปลี่ยนพฤติกรรมการกินอาหาร
# ออกกําลังกายอย่างสม่ําเสมอ
# และพักผ่อนให้เพียงพอ
# ที่สําคัญควรหลีกเลี่ยงอาหารที่มีแคลอรี่สูง
# เช่น อาหารทอด อาหารมัน อาหารที่มีน้ําตาลสูง
# และเครื่องดื่มแอลกอฮอล์