pythainlp.spell

The pythainlp.spell module finds the closest correctly spelled word for a given Thai word. It provides functions that return either a single correction or a list of candidate spellings, for individual words and for sentences that have already been split into words.

Modules

correct

pythainlp.spell.correct(word: str, engine: str = 'pn') → str

Corrects the spelling of the given word by returning the correctly spelled word.

Parameters:
  • word (str) – word to correct spelling of

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - A spell checker utilizing spylls, a port of Hunspell.

    • symspellpy - symspellpy is a Python port of SymSpell v6.5.

    • wanchanberta_thai_grammarly - WanchanBERTa Thai Grammarly

Returns:

the corrected word

Return type:

str

Example:

from pythainlp.spell import correct

correct("เส้นตรบ")
# output: 'เส้นตรง'

correct("ครัช")
# output: 'ครับ'

correct("สังเกตุ")
# output: 'สังเกต'

correct("กระปิ")
# output: 'กะปิ'

correct("เหตการณ")
# output: 'เหตุการณ์'

The correct function is designed to correct the spelling of a single Thai word. Given an input word, this function returns the closest correctly spelled word from the dictionary, making it valuable for spell-checking and text correction tasks.

correct_sent

pythainlp.spell.correct_sent(list_words: List[str], engine: str = 'pn') → List[str]

Corrects the spelling of each word in the given sentence and returns the corrected words.

Parameters:
  • list_words (List[str]) – list of words in sentence

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - A spell checker utilizing spylls, a port of Hunspell.

    • symspellpy - symspellpy is a Python port of SymSpell v6.5.

    • wanchanberta_thai_grammarly - WanchanBERTa Thai Grammarly

Returns:

the corrected list of words in sentence

Return type:

List[str]

Example:

from pythainlp.spell import correct_sent

correct_sent(["เด็", "อินอร์เน็ต", "แรง"], engine='symspellpy')
# output: ['เด็ก', 'อินเทอร์เน็ต', 'แรง']

The correct_sent function extends correct to a whole sentence. It takes a sentence that has already been split into a list of words, corrects each word, and returns the corrected words as a list. This is useful for proofreading Thai text.
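Given the list_words parameter, sentence-level correction can be pictured as applying a single-word corrector to each word in turn. The sketch below is only an illustration of the input and output shapes: the lookup table TOY_CORRECTIONS and the helpers toy_correct and toy_correct_sent are hypothetical stand-ins for a real engine, not the library's implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a spelling engine: maps misspellings to fixes.
TOY_CORRECTIONS: Dict[str, str] = {
    "เด็": "เด็ก",
    "อินอร์เน็ต": "อินเทอร์เน็ต",
}


def toy_correct(word: str) -> str:
    """Return the known fix for a word, or the word itself if none is known."""
    return TOY_CORRECTIONS.get(word, word)


def toy_correct_sent(list_words: List[str]) -> List[str]:
    """Correct a pre-tokenized sentence one word at a time."""
    return [toy_correct(word) for word in list_words]


print(toy_correct_sent(["เด็", "อินอร์เน็ต", "แรง"]))
# prints: ['เด็ก', 'อินเทอร์เน็ต', 'แรง']
```

Words that the toy engine does not recognize, such as "แรง" here, pass through unchanged.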

spell

pythainlp.spell.spell(word: str, engine: str = 'pn') → List[str]

Provides a list of possible correct spellings of the given word. The candidates are dictionary words within an edit distance of 1 or 2 of the input. The result is sorted by word frequency in the spelling dictionary, in descending order.

Parameters:
  • word (str) – word to check the spelling of

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - A spell checker utilizing spylls, a port of Hunspell.

    • symspellpy - symspellpy is a Python port of SymSpell v6.5.

    • tltk - wrapper for TLTK.

Returns:

list of possible correct words within 1 or 2 edit distance and sorted by frequency of word occurrences in the spelling dictionary in descending order.

Return type:

list[str]

Example:

from pythainlp.spell import spell

spell("เส้นตรบ", engine="pn")
# output: ['เส้นตรง']

spell("เส้นตรบ")
# output: ['เส้นตรง']

spell("เส้นตรบ", engine="tltk")
# output: ['เส้นตรง']

spell("ครัช")
# output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน', 'วรัช', 'ครัส',
# 'ปรัช', 'บรัช', 'ครัง', 'คัช', 'คลัช', 'ครัย', 'ครัด']

spell("กระปิ")
# output: ['กะปิ', 'กระบิ']

spell("สังเกตุ")
# output: ['สังเกต']

spell("เหตการณ")
# output: ['เหตุการณ์']

The spell function suggests corrections for a single Thai word. Rather than returning one answer, it returns a list of candidate spellings, ordered from most to least frequent in the spelling dictionary. When only the single best candidate is needed, use correct instead.

spell_sent

pythainlp.spell.spell_sent(list_words: List[str], engine: str = 'pn') → List[List[str]]

Provides a list of possible correct spellings of the given sentence.

Parameters:
  • list_words (List[str]) – list of words in sentence

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - A spell checker utilizing spylls, a port of Hunspell.

    • symspellpy - symspellpy is a Python port of SymSpell v6.5.

Returns:

list of possibly correct sentences, each given as a list of words

Return type:

List[List[str]]

Example:

from pythainlp.spell import spell_sent

spell_sent(["เด็", "อินอร์เน็ต", "แรง"], engine='symspellpy')
# output: [['เด็ก', 'อินเทอร์เน็ต', 'แรง']]

The spell_sent function extends spell to whole sentences. It takes a sentence that has already been split into a list of words and returns a list of candidate sentences, each of which is itself a list of words, as the List[List[str]] return type and the example output show.
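One way to arrive at the nested output shown in the example is to collect the candidate spellings for each word and then combine them into every possible sentence. The sketch below takes that approach with itertools.product; it is an assumption about how such a result can be built, not a statement of the library's code, and TOY_CANDIDATES, toy_spell, and toy_spell_sent are hypothetical names.

```python
from itertools import product
from typing import Dict, List

# Hypothetical per-word candidate lists standing in for a real engine.
TOY_CANDIDATES: Dict[str, List[str]] = {
    "เด็": ["เด็ก"],
    "กระปิ": ["กะปิ", "กระบิ"],
}


def toy_spell(word: str) -> List[str]:
    """Return candidate spellings, or the word itself if none are known."""
    return TOY_CANDIDATES.get(word, [word])


def toy_spell_sent(list_words: List[str]) -> List[List[str]]:
    """Combine per-word candidates into every possible candidate sentence."""
    per_word = [toy_spell(word) for word in list_words]
    return [list(combo) for combo in product(*per_word)]


print(toy_spell_sent(["เด็", "แรง"]))
# prints: [['เด็ก', 'แรง']]
print(toy_spell_sent(["เด็", "กระปิ"]))
# prints: [['เด็ก', 'กะปิ'], ['เด็ก', 'กระบิ']]
```

With this design, the number of candidate sentences is the product of the per-word candidate counts, so it can grow quickly for long sentences with many ambiguous words.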

NorvigSpellChecker

class pythainlp.spell.NorvigSpellChecker(custom_dict: Dict[str, int] | Iterable[str] | Iterable[Tuple[str, int]] | None = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: Callable[[str], bool] | None = _is_thai_and_not_num)

__init__(custom_dict: Dict[str, int] | Iterable[str] | Iterable[Tuple[str, int]] | None = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: Callable[[str], bool] | None = _is_thai_and_not_num)

Initializes Peter Norvig’s spell checker object. The spelling dictionary can be customized; by default, it comes from the Thai National Corpus.

Norvig’s spell checker chooses the most likely correction for a word in two steps. It first searches for candidate corrections based on edit distance, then selects the candidate with the highest word occurrence probability.

Parameters:
  • custom_dict (dict or iterable) –

    A custom spelling dictionary. This can be:

    1. a dictionary (dict), with words (str) as keys and frequencies (int) as values;

    2. an iterable (list, tuple, or set) of word (str) and frequency (int) tuples: (str, int); or

    3. an iterable of just words (str), without frequencies – in this case, a frequency of 1 is assigned to every word.

    Default is from Thai National Corpus (around 40,000 words).

  • min_freq (int) – Minimum frequency of a word to keep (default = 2)

  • min_len (int) – Minimum length (in characters) of a word to keep (default = 2)

  • max_len (int) – Maximum length (in characters) of a word to keep (default = 40)

  • dict_filter (func) – A function to filter the dictionary. Default filter removes any word with numbers or non-Thai characters. If no filter is required, use None.
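The parameters above can be read as a filtering pass over the custom dictionary. The following sketch shows one way the three accepted input forms could be normalized and filtered. The names build_dictionary and is_thai_and_not_num are hypothetical, the Unicode ranges are an assumption about what counts as a Thai word character, and the choice to exempt bare words (which all carry a frequency of 1) from min_freq is also an assumption rather than documented behavior.

```python
import re
from typing import Callable, Dict, Iterable, Optional, Tuple, Union

CustomDict = Union[Dict[str, int], Iterable[str], Iterable[Tuple[str, int]]]


def is_thai_and_not_num(word: str) -> bool:
    """Hypothetical filter: accept only Thai letters, vowels, and marks."""
    # U+0E01..U+0E3A and U+0E40..U+0E4E exclude Thai digits (U+0E50..U+0E59).
    return re.fullmatch(r"[\u0E01-\u0E3A\u0E40-\u0E4E]+", word) is not None


def build_dictionary(
    custom_dict: CustomDict,
    min_freq: int = 2,
    min_len: int = 2,
    max_len: int = 40,
    dict_filter: Optional[Callable[[str], bool]] = is_thai_and_not_num,
) -> Dict[str, int]:
    """Normalize the three accepted input forms and drop unwanted words."""
    entries = custom_dict.items() if isinstance(custom_dict, dict) else custom_dict
    kept: Dict[str, int] = {}
    for entry in entries:
        if isinstance(entry, str):
            # Bare word: frequency 1; assumed here to bypass min_freq.
            word, freq, threshold = entry, 1, 1
        else:
            word, freq = entry
            threshold = min_freq
        if freq < threshold:
            continue
        if not min_len <= len(word) <= max_len:
            continue
        if dict_filter is not None and not dict_filter(word):
            continue
        kept[word] = freq
    return kept


toy = [("หวาน", 30), ("มะนาว", 2), ("แอบ", 3223), ("ก", 50), ("abc", 10)]
print(build_dictionary(toy))
```

In this sketch, "ก" is dropped for being shorter than min_len and "abc" is dropped by the filter, while passing dict_filter=None keeps non-Thai words.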

dictionary() → ItemsView[str, int]

Returns the spelling dictionary currently used by this spell checker.

Returns:

spelling dictionary of this instance

Return type:

ItemsView[str, int]

Example:

from pythainlp.spell import NorvigSpellChecker

dictionary = [("หวาน", 30), ("มะนาว", 2), ("แอบ", 3223)]

checker = NorvigSpellChecker(custom_dict=dictionary)
checker.dictionary()
# output: dict_items([('หวาน', 30), ('มะนาว', 2), ('แอบ', 3223)])

known(words: Iterable[str]) → List[str]

Returns a list of given words found in the spelling dictionary.

Parameters:

words (list[str]) – A list of words to check if they exist in the spelling dictionary

Returns:

intersection of the given word list and words in the spelling dictionary

Return type:

list[str]

Example:

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.known(["เพยน", "เพล", "เพลง"])
# output: ['เพล', 'เพลง']

checker.known(['ยกไ', 'ไฟล์ม'])
# output: []

checker.known([])
# output: []

prob(word: str) → float

Returns the probability of an input word, according to the spelling dictionary.

Parameters:

word (str) – A word to check occurrence probability of

Returns:

word occurrence probability

Return type:

float

Example:

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.prob("ครัช")
# output: 0.0

checker.prob("รัก")
# output: 0.0006959172792052158

checker.prob("น่ารัก")
# output: 9.482306849763902e-05
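
One common way to define a word occurrence probability, used in Norvig's original spell checker, is the word's frequency divided by the total frequency of all dictionary words. The sketch below assumes that definition; word_probability and the toy frequencies are hypothetical and are not taken from the library.

```python
from typing import Dict

# Toy frequencies, reusing the words from the dictionary() example.
TOY_FREQS: Dict[str, int] = {"หวาน": 30, "มะนาว": 2, "แอบ": 3223}


def word_probability(word: str, freqs: Dict[str, int]) -> float:
    """Relative frequency of word; 0.0 for words not in the dictionary."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    return freqs.get(word, 0) / total


print(word_probability("หวาน", TOY_FREQS))  # 30 / 3255
print(word_probability("ครัช", TOY_FREQS))  # unknown word, so 0.0
```

Under this definition, a word that is missing from the dictionary has probability 0.0, which matches the first example above.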

freq(word: str) → int

Returns the frequency of an input word, according to the spelling dictionary.

Parameters:

word (str) – A word to check frequency of

Returns:

frequency of the given word in the spelling dictionary

Return type:

int

Example:

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.freq("ปัญญา")
# output: 3639

checker.freq("บิญชา")
# output: 0

spell(word: str) → List[str]

Returns a list of all correctly spelled words whose spelling is similar to the given word by edit distance metrics. The returned list is sorted by word frequency in the spelling dictionary, in descending order.

The search proceeds in stages. If the input word is spelled correctly, this method returns a list containing only that word. Otherwise, it looks for all correctly spelled words at an edit distance of 1 from the input word. If there are none, the search expands to words at an edit distance of 2. If that also fails, a list containing only the input word is returned.

Parameters:

word (str) – A word to check spelling of

Returns:

list of possibly correct words within 1 or 2 edit distance and sorted by frequency of word occurrence in the spelling dictionary in descending order.

Return type:

list[str]

Example:

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.spell("เส้นตรบ")
# output: ['เส้นตรง']

checker.spell("ครัช")
# output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน',
# 'วรัช', 'ครัส', 'ปรัช', 'บรัช', 'ครัง',
# 'คัช', 'คลัช', 'ครัย', 'ครัด']
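
The staged search described for this method (exact match, then edit distance 1, then edit distance 2, then the input itself) can be sketched in the style of Norvig's original spell checker. This is a minimal illustration under stated assumptions, not the library's code: THAI_CHARS, edits1, edits2, toy_spell, and TOY_FREQS are hypothetical, and the alphabet is limited to a subset of the Thai Unicode block.

```python
from typing import Dict, Iterable, Iterator, List, Set

# Assumed alphabet: Thai consonants, vowels, and marks
# (U+0E01..U+0E3A and U+0E40..U+0E4E).
THAI_CHARS = [
    chr(c) for c in list(range(0x0E01, 0x0E3B)) + list(range(0x0E40, 0x0E4F))
]

# Toy spelling dictionary: word -> frequency.
TOY_FREQS: Dict[str, int] = {"เส้นตรง": 50, "ครับ": 900, "ครัว": 300, "ครัน": 20}


def edits1(word: str) -> Set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in THAI_CHARS]
    inserts = [a + c + b for a, b in splits for c in THAI_CHARS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word: str) -> Iterator[str]:
    """Lazily yield strings produced by applying two single edits to word."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


def toy_spell(word: str, freqs: Dict[str, int]) -> List[str]:
    """Staged candidate search, sorted by descending frequency."""

    def known(words: Iterable[str]) -> Set[str]:
        return {w for w in words if w in freqs}

    # Each stage runs only if the previous one found nothing.
    candidates = (
        known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    )
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(candidates, key=lambda w: (-freqs.get(w, 0), w))


print(toy_spell("ครัช", TOY_FREQS))
# prints: ['ครับ', 'ครัว', 'ครัน']
```

Because the stages are chained with `or`, the expensive distance-2 search runs only when no dictionary word is found at distance 0 or 1.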

correct(word: str) → str

Returns the most probable word, using the probability from the spelling dictionary.

Parameters:

word (str) – A word to correct spelling of

Returns:

the correct spelling of the given word

Return type:

str

Example:

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.correct("ปัญชา")
# output: 'ปัญหา'

checker.correct("บิญชา")
# output: 'บัญชา'

checker.correct("มิตรภาบ")
# output: 'มิตรภาพ'

The NorvigSpellChecker class implements a spell-checking algorithm based on the work of Peter Norvig. Use it directly when the module-level functions are not flexible enough, for example to supply a custom spelling dictionary or to change the frequency and length limits.

DEFAULT_SPELL_CHECKER

pythainlp.spell.DEFAULT_SPELL_CHECKER = Default instance of the standard NorvigSpellChecker, using word list data from the Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/

The DEFAULT_SPELL_CHECKER is an instance of the NorvigSpellChecker class with default settings. It is pre-configured to use word list data from the Thai National Corpus, making it a reliable choice for general spell-checking tasks.

References