pythainlp.spell

The pythainlp.spell module finds the closest correctly spelled word to the given text.

Modules

pythainlp.spell.correct(word: str, engine: str = 'pn') -> str

Corrects the spelling of the given word by returning the correctly spelled word.

Parameters
  • word (str) – word to correct spelling

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - a spell checker utilizing spylls, a port of Hunspell

    • symspellpy - a Python port of SymSpell v6.5

Returns

the corrected word

Return type

str

Example

from pythainlp.spell import correct

correct("เส้นตรบ")
# output: 'เส้นตรง'

correct("ครัช")
# output: 'ครับ'

correct("สังเกตุ")
# output: 'สังเกต'

correct("กระปิ")
# output: 'กะปิ'

correct("เหตการณ")
# output: 'เหตุการณ์'
pythainlp.spell.correct_sent(list_words: List[str], engine: str = 'pn') -> List[str]

Corrects the spelling of each word in the given sentence and returns the corrected words.

Parameters
  • list_words (List[str]) – list of words in the sentence

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - a spell checker utilizing spylls, a port of Hunspell

    • symspellpy - a Python port of SymSpell v6.5

Returns

the list of corrected words of the sentence

Return type

List[str]

Example

from pythainlp.spell import correct_sent

correct_sent(["เด็", "อินอร์เน็ต", "แรง"], engine='symspellpy')
# output: ['เด็ก', 'อินเทอร์เน็ต', 'แรง']
pythainlp.spell.spell(word: str, engine: str = 'pn') -> List[str]

Provides a list of possible correct spellings of the given word. The candidates are dictionary words within an edit distance of 1 or 2 from the given word. The result is sorted by word frequency in the spelling dictionary, in descending order.

Parameters
  • word (str) – Word to spell check

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - a spell checker utilizing spylls, a port of Hunspell

    • symspellpy - a Python port of SymSpell v6.5

    • tltk - a wrapper for TLTK

Returns

list of possible correct words within an edit distance of 1 or 2, sorted by word frequency in the spelling dictionary in descending order.

Return type

list[str]

Example

from pythainlp.spell import spell

spell("เส้นตรบ", engine="pn")
# output: ['เส้นตรง']

spell("เส้นตรบ")
# output: ['เส้นตรง']

spell("เส้นตรบ", engine="tltk")
# output: ['เส้นตรง']

spell("ครัช")
# output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน', 'วรัช', 'ครัส',
# 'ปรัช', 'บรัช', 'ครัง', 'คัช', 'คลัช', 'ครัย', 'ครัด']

spell("กระปิ")
# output: ['กะปิ', 'กระบิ']

spell("สังเกตุ")
# output:  ['สังเกต']

spell("เหตการณ")
# output:  ['เหตุการณ์']
pythainlp.spell.spell_sent(list_words: List[str], engine: str = 'pn') -> List[List[str]]

Provides a list of possible correct spellings of the given sentence.

Parameters
  • list_words (List[str]) – list of words in the sentence

  • engine (str) –

    • pn - Peter Norvig’s algorithm [1] (default)

    • phunspell - a spell checker utilizing spylls, a port of Hunspell

    • symspellpy - a Python port of SymSpell v6.5

Returns

list of possible correct words

Return type

List[List[str]]

Example

from pythainlp.spell import spell_sent

spell_sent(["เด็", "อินอร์เน็ต", "แรง"], engine='symspellpy')
# output: [['เด็ก', 'อินเทอร์เน็ต', 'แรง']]
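The nested return type suggests that each inner list is one candidate sentence. How per-word candidates are combined is not stated here, so the following is only a sketch under that assumption: it takes the Cartesian product of per-word suggestions. The names toy_spell_sent, SUGGESTIONS and lookup are hypothetical and not part of pythainlp.

```python
from itertools import product
from typing import Callable, List

# Hypothetical per-word suggestions, standing in for a real spell checker.
SUGGESTIONS = {"teh": ["the", "ten"], "cat": ["cat"]}


def lookup(word: str) -> List[str]:
    """Return suggestions for a word, or the word itself if none are known."""
    return SUGGESTIONS.get(word, [word])


def toy_spell_sent(
    list_words: List[str], spell_word: Callable[[str], List[str]]
) -> List[List[str]]:
    """Combine per-word candidates into candidate sentences (assumed behavior)."""
    per_word = [spell_word(word) for word in list_words]
    return [list(combo) for combo in product(*per_word)]


candidates = toy_spell_sent(["teh", "cat"], lookup)
print(candidates)
```

When every word has exactly one suggestion, the result is a single inner list, which matches the shape of the example output above.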
class pythainlp.spell.NorvigSpellChecker(custom_dict: typing.Optional[typing.Union[typing.Dict[str, int], typing.Iterable[str], typing.Iterable[typing.Tuple[str, int]]]] = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: typing.Optional[typing.Callable[[str], bool]] = <function _is_thai_and_not_num>)
__init__(custom_dict: typing.Optional[typing.Union[typing.Dict[str, int], typing.Iterable[str], typing.Iterable[typing.Tuple[str, int]]]] = None, min_freq: int = 2, min_len: int = 2, max_len: int = 40, dict_filter: typing.Optional[typing.Callable[[str], bool]] = <function _is_thai_and_not_num>)

Initializes Peter Norvig’s spell checker object. The spelling dictionary can be customized. By default, the spelling dictionary is from the Thai National Corpus.

Norvig’s spell checker chooses the most likely spelling correction for a given word by searching for candidate words based on edit distance. It then selects the candidate with the highest word occurrence probability.
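The idea can be sketched in a few lines of plain Python. This is a simplified illustration in the spirit of Norvig's essay [1], not the code used by NorvigSpellChecker: the Latin alphabet, the toy frequency table and the names edits1 and toy_correct are assumptions chosen for readability, and only edit distance 1 is searched.

```python
from collections import Counter
from typing import Set

# Toy alphabet and word frequencies; both are assumptions for illustration.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
WORD_FREQ = Counter({"spell": 50, "shell": 30, "spill": 20, "smell": 10})


def edits1(word: str) -> Set[str]:
    """Return every string one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def toy_correct(word: str) -> str:
    """Pick the most frequent known word within one edit, else keep the input."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_FREQ[w])


print(toy_correct("spall"))  # both "spell" and "spill" are one edit away
```

Here "spall" has two known neighbors at distance 1, and the more frequent one wins.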

Parameters
  • custom_dict (dict or iterable) –

    A custom spelling dictionary. This can be:

    1. a dictionary (dict), with words (str) as keys and frequencies (int) as values;

    2. an iterable (list, tuple, or set) of word (str) and frequency (int) tuples: (str, int); or

    3. an iterable of just words (str), without frequencies. In this case, a frequency of 1 is assigned to every word.

    Default is from Thai National Corpus (around 40,000 words).

  • min_freq (int) – Minimum frequency of a word to keep (default = 2)

  • min_len (int) – Minimum length (in characters) of a word to keep (default = 2)

  • max_len (int) – Maximum length (in characters) of a word to keep (default = 40)

  • dict_filter (func) – A function to filter the dictionary. The default filter removes any word that contains digits or non-Thai characters. If no filter is required, use None.
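To make the parameters above concrete, here is a rough sketch of how the three accepted input forms could be normalized and filtered. The helper normalize_dict is hypothetical and is not the library's internal code. In particular, how min_freq interacts with a words-only input is not described above; this sketch applies min_freq uniformly, and its dict_filter defaults to None rather than the Thai-only filter.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple, Union

CustomDict = Union[Dict[str, int], Iterable[Tuple[str, int]], Iterable[str]]


def normalize_dict(
    custom_dict: CustomDict,
    min_freq: int = 2,
    min_len: int = 2,
    max_len: int = 40,
    dict_filter: Optional[Callable[[str], bool]] = None,
) -> Dict[str, int]:
    """Hypothetical helper: turn any accepted form into {word: frequency}."""
    if isinstance(custom_dict, dict):
        pairs = list(custom_dict.items())
    else:
        # Bare words get frequency 1; (word, frequency) tuples are kept as-is.
        pairs = [
            (item, 1) if isinstance(item, str) else tuple(item)
            for item in custom_dict
        ]
    return {
        word: freq
        for word, freq in pairs
        if freq >= min_freq
        and min_len <= len(word) <= max_len
        and (dict_filter is None or dict_filter(word))
    }


print(normalize_dict({"ab": 5, "a": 9, "abc": 1}))
print(normalize_dict(["ab", "cd"], min_freq=1))
print(normalize_dict([("ab", 3), ("x1", 4)], dict_filter=str.isalpha))
```

In this sketch a words-only list only survives when min_freq is 1, because every word is assigned a frequency of 1.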


correct(word: str) -> str

Returns the most probable word, using the probabilities from the spelling dictionary

Parameters

word (str) – A word whose spelling is to be corrected

Returns

the correct spelling of the given word

Return type

str

Example

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.correct("ปัญชา")
# output: 'ปัญหา'

checker.correct("บิญชา")
# output: 'บัญชา'

checker.correct("มิตรภาบ")
# output: 'มิตรภาพ'
dictionary() -> ItemsView[str, int]

Returns the spelling dictionary currently used by this spell checker

Returns

spelling dictionary of this instance

Return type

ItemsView[str, int]

Example

from pythainlp.spell import NorvigSpellChecker

dictionary = [("หวาน", 30), ("มะนาว", 2), ("แอบ", 3223)]

checker = NorvigSpellChecker(custom_dict=dictionary)
checker.dictionary()
# output: dict_items([('หวาน', 30), ('มะนาว', 2), ('แอบ', 3223)])
freq(word: str) -> int

Returns the frequency of an input word, according to the spelling dictionary

Parameters

word (str) – A word whose frequency is to be checked

Returns

frequency of the given word in the spelling dictionary

Return type

int

Example

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.freq("ปัญญา")
# output: 3639

checker.freq("บิญชา")
# output: 0
known(words: Iterable[str]) -> List[str]

Returns a list of the given words that are found in the spelling dictionary

Parameters

words (list[str]) – A list of words to check if they exist in the spelling dictionary

Returns

intersection of the given words list and words in the spelling dictionary

Return type

list[str]

Example

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.known(["เพยน", "เพล", "เพลง"])
# output: ['เพล', 'เพลง']

checker.known(['ยกไ', 'ไฟล์ม'])
# output: []

checker.known([])
# output: []
prob(word: str) -> float

Returns the probability of an input word, according to the spelling dictionary

Parameters

word (str) – A word whose probability of occurrence is to be checked

Returns

word occurrence probability

Return type

float

Example

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.prob("ครัช")
# output: 0.0

checker.prob("รัก")
# output: 0.0006959172792052158

checker.prob("น่ารัก")
# output: 9.482306849763902e-05
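One common way to derive such a probability, and the one used in Norvig's essay [1], is to divide a word's frequency by the total of all frequencies in the dictionary. The sketch below assumes that definition; the toy frequency table and the names toy_freq and toy_prob are hypothetical and not part of pythainlp.

```python
# Toy frequency table (romanized placeholders); an assumption for illustration.
WORD_FREQ = {"rak": 6, "narak": 3, "khrap": 1}
TOTAL = sum(WORD_FREQ.values())


def toy_freq(word: str) -> int:
    """Frequency of a word, or 0 if it is not in the dictionary."""
    return WORD_FREQ.get(word, 0)


def toy_prob(word: str) -> float:
    """Relative frequency of a word: its count over the sum of all counts."""
    return toy_freq(word) / TOTAL


print(toy_prob("rak"))
print(toy_prob("unknown"))
```

Under this definition an unknown word has probability 0.0, which is consistent with the example output for a misspelled word above.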
spell(word: str) -> List[str]

Returns a list of all correctly spelled words whose spelling is similar to the given word by edit distance metrics. The returned list is sorted in descending order of word frequency in the spelling dictionary.

First, if the input word is spelled correctly, this method returns a list containing only that word. Otherwise, it looks for all correctly spelled words whose edit distance from the input word is 1. If there is no such word, the search expands to words whose edit distance is 2. If that still fails, a list containing only the input word is returned.
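The staged fallback described above can be sketched as a chain of "or" expressions, where each stage is tried only if the previous one found nothing. This is an illustration with a Latin toy dictionary, not the library's code; edits1, edits2, known and toy_spell are names assumed for this sketch.

```python
from typing import Iterable, List, Set

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Toy frequencies; an assumption for illustration.
WORD_FREQ = {"cat": 40, "cart": 25, "coat": 10, "dog": 5}


def edits1(word: str) -> Set[str]:
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word: str) -> Set[str]:
    """All strings two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words: Iterable[str]) -> List[str]:
    """Keep only the words that exist in the toy dictionary."""
    return [w for w in words if w in WORD_FREQ]


def toy_spell(word: str) -> List[str]:
    """Exact match, then distance 1, then distance 2, then the input itself."""
    found = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return sorted(set(found), key=lambda w: WORD_FREQ.get(w, 0), reverse=True)


print(toy_spell("caat"))  # three known words are one edit away
print(toy_spell("dxxg"))  # nothing at distance 1, so distance 2 is searched
```

Note that a word such as "ct" yields only "cat" in this sketch: "cart" and "coat" are two edits away, but the distance-2 stage is never reached once distance 1 produces a match.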

Parameters

word (str) – A word to check its spelling

Returns

list of possible correct words within an edit distance of 1 or 2, sorted by word frequency in the spelling dictionary in descending order.

Return type

list[str]

Example

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()

checker.spell("เส้นตรบ")
# output: ['เส้นตรง']

checker.spell("ครัช")
# output: ['ครับ', 'ครัว', 'รัช', 'ครัม', 'ครัน',
# 'วรัช', 'ครัส', 'ปรัช', 'บรัช', 'ครัง',
# 'คัช', 'คลัช', 'ครัย', 'ครัด']
pythainlp.spell.DEFAULT_SPELL_CHECKER = Default instance of standard NorvigSpellChecker, using word list from Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/

References

[1] Peter Norvig (2007). How to Write a Spelling Corrector.