ctnx.misc module

class ctnx.misc.DictBasedOnePassStrReplacer(dictionary: dict, use_atomic_group=False, case_sensitive=True, word_boundary='')

Bases: object

A helper class that compiles a dictionary of substring replacements into a single trie-based regular expression for efficient, one-pass string replacements.
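The idea of a one-pass replacer can be illustrated with the standard library alone. The sketch below is not this class's implementation: it uses a plain alternation sorted by key length instead of a trie, and `build_one_pass_replacer` is a hypothetical name. It shows why a single compiled pattern avoids the chained-replacement problem, where the output of one substitution is picked up by the next.

```python
import re


def build_one_pass_replacer(mapping):
    """Return a function that applies every replacement in `mapping` in one scan."""
    # Longest keys first, so "category" wins over "cat" at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up in the mapping; replaced text is never rescanned.
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


swap = build_one_pass_replacer({"a": "b", "b": "a"})
print(swap("abba"))  # baab
```

Two chained `str.replace` calls on the same input would collapse it to a single repeated letter, because the second call also rewrites the output of the first.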

replace(text: str) → str

class ctnx.misc.IYNormalizer(use_atomic_group=False, ignore_likely_proper_nouns=True, h: Literal['i', 'y'] = 'y', k='y', l='y', m='y', qu='y', s='i', t='y', v='i', i='y', use_sinoviet_heuristic=True, i_override_list: Iterable[str] | None = None, max_repl_cache_size: int | None = 0)

Bases: DictBasedOnePassStrReplacer

String replacer for normalizing the placement of ‘i’ and ‘y’ in Vietnamese syllables, following configurable preset styles and exception lists.

DEFAULT_I_OVERRIDE_LIST = ['hi hi', 'hì hì', 'hí hí', 'hị hị', 'hì hục', 'hì hụi', 'hỉ hả', 'hỉ mũi', 'hí hoáy', 'hí húi', 'hí hửng', 'hí hởn', 'hủ hỉ', 'hậu hĩ', 'ki bo', 'ki cóp', 'ki-lô-gam', 'ki-ốt', 'kì cạch', 'kì cọ', 'kì kèo', 'kì cùng', 'kì đà', 'kì giông', 'kí ninh', 'kĩ tính', 'kĩ càng', 'cũ kĩ', 'cụ kị', 'ô li', 'li bì', 'li ti', 'li-ti', 'chi li', 'cu li', 'mi li', 'lâm li', 'va li', 'phẳng lì', 'nhẵn lì', 'lì loà', 'lì lợm', 'lì xì', 'lí nhí', 'lũ lĩ', 'kiết lị', 'mi-ca', 'mi-crô', 'mi mắt', 'cù mì', 'lúa mì', 'khoai mì', 'bột mì', 'mì sợi', 'mì chính', 'rễ mí', 'tỉ mỉ', 'mụ mị', 'cây si', 'nốt si', 'si-lic', 'đen sì', 'hôi sì', 'hàn sì', 'sì sụp', 'mua sỉ', 'ti hí', 'ti gôn', 'ti-tan', 'ti toe', 'đinh ti', 'ti trôn', 'ti ti', 'ti tỉ', 'ti tiện', 'tì tì', 'tì vết', 'tì tay', 'tù tì', 'tí toáy', 'tí tách', 'tí teo', 'tí hon', 'tỉ tê', 'bạc tỉ', 'tị nạnh', 'tí ti', 'ki ốt', 'si đa', 'ti tiện', 'tự ti', 'tị nạn', 'ghen tị', 'hồi tị', 'tị nạnh']

I_TO_Y_TRANS = {73: 89, 105: 121, 204: 7922, 205: 221, 236: 7923, 237: 253, 296: 7928, 297: 7929, 7880: 7926, 7881: 7927, 7882: 7924, 7883: 7925}

I_VARIANTS = 'iìíỉĩịIÌÍỈĨỊ'

LOWER_I_VARIANTS = 'iìíỉĩị'

LOWER_Y_VARIANTS = 'yỳýỷỹỵ'

ONSETS = ['qu', 'h', 'k', 'l', 'm', 's', 't', 'v']

POSSIBLE_PRESET_STYLES = ('i', 'unified_i', 'sinoviet_hklmqstv_y', 'hklmqstv_y', 'sinoviet_hklmqst_y', 'hklmqst_y', 'sinoviet_hklmqt_y', 'hklmqt_y')

SYLLABLE_PATTERN = '([hklmstv]|qu)?[iìíỉĩịyỳýỷỹỵ]'

TRANS_TABLE_ROUTER = {'i': {89: 73, 121: 105, 221: 205, 253: 237, 7922: 204, 7923: 236, 7924: 7882, 7925: 7883, 7926: 7880, 7927: 7881, 7928: 296, 7929: 297}, 'y': {73: 89, 105: 121, 204: 7922, 205: 221, 236: 7923, 237: 253, 296: 7928, 297: 7929, 7880: 7926, 7881: 7927, 7882: 7924, 7883: 7925}}

Y_TO_I_TRANS = {89: 73, 121: 105, 221: 205, 253: 237, 7922: 204, 7923: 236, 7924: 7882, 7925: 7883, 7926: 7880, 7927: 7881, 7928: 296, 7929: 297}

Y_VARIANTS = 'yỳýỷỹỵYỲÝỶỸỴ'
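The two translation tables are ordinary code-point mappings of the kind accepted by `str.translate`. The snippet below copies `I_TO_Y_TRANS` from the listing above and applies it to a whole string, purely to show what the numbers mean. How `IYNormalizer` decides which syllables to convert (onsets, override list, proper-noun heuristic) is not reproduced here.

```python
# Keys and values are Unicode code points, copied from I_TO_Y_TRANS above.
I_TO_Y_TRANS = {73: 89, 105: 121, 204: 7922, 205: 221, 236: 7923, 237: 253,
                296: 7928, 297: 7929, 7880: 7926, 7881: 7927, 7882: 7924, 7883: 7925}
# Inverting the table yields the mapping listed as Y_TO_I_TRANS.
Y_TO_I_TRANS = {y: i for i, y in I_TO_Y_TRANS.items()}

word = "k\u00ec"                       # "kì" (236 is the code point of "ì")
as_y = word.translate(I_TO_Y_TRANS)    # "kỳ" (7923 is the code point of "ỳ")
print(as_y == "k\u1ef3")               # True
print(as_y.translate(Y_TO_I_TRANS) == word)  # True, the tables are inverses
```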

classmethod from_preset_style(style: Literal['i', 'unified_i', 'sinoviet_hklmqstv_y', 'hklmqstv_y', 'sinoviet_hklmqst_y', 'hklmqst_y', 'sinoviet_hklmqt_y', 'hklmqt_y'] = 'sinoviet_hklmqt_y', use_atomic_group=False, ignore_likely_proper_nouns=True, i_override_list=None, max_repl_cache_size: int | None = 0) → IYNormalizer

property max_repl_cache_size

replace(text: str) → str

ctnx.misc.clean_slug(text: str, sep='_')

Generates an ASCII-only slug (URL-friendly string) from Vietnamese text, replacing spaces and non-word characters with a separator.

ctnx.misc.generate_tone_placement_replace_mapping(old_to_new=True, includes_rare_casing=False) → dict

Generates a mapping dictionary for converting Vietnamese tone placement between the old and new styles, in either direction.

ctnx.misc.is_even_tone(tone: str) → bool

Checks whether the given tone mark represents an even tone (ngang/unmarked or huyền/grave accent).

ctnx.misc.make_regex_str_from_tokens(tokens: list, use_atomic_group=False, case_sensitive=True, word_boundary='')

Generates a trie-based regular expression string from a list of tokens.
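A trie-based pattern merges tokens that share a prefix, so the regex engine tests each shared character once instead of once per token. The sketch below is one possible construction, not this function's source: `trie_regex` is a hypothetical name, it ignores the `use_atomic_group`, `case_sensitive`, and `word_boundary` options, and the library's exact output format may differ.

```python
import re


def trie_regex(tokens):
    """Build a regex string matching any token, sharing common prefixes."""
    trie = {}
    for token in tokens:
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node[""] = {}  # an empty key marks the end of a token

    def to_pattern(node):
        ends_here = "" in node
        branches = [re.escape(ch) + to_pattern(child)
                    for ch, child in sorted(node.items()) if ch]
        if not branches:
            return ""
        if len(branches) == 1 and not ends_here:
            return branches[0]
        group = "(?:" + "|".join(branches) + ")"
        # A token may end at this node, so whatever follows is optional.
        return group + "?" if ends_here else group

    return to_pattern(trie)


print(trie_regex(["tea", "ten", "to"]))  # t(?:e(?:a|n)|o)
```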

ctnx.misc.nfc_normalize(text: str) → str

Converts combining Unicode characters to their equivalent precomposed characters.
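For readers unfamiliar with NFC, the standard-library call below shows the effect described: a base letter followed by combining marks becomes a single precomposed character. Whether `nfc_normalize` simply wraps `unicodedata.normalize` is an assumption.

```python
import unicodedata

decomposed = "e\u0302\u0301"   # "e" + combining circumflex + combining acute
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 3 1
print(composed == "\u1ebf")            # True, the precomposed "ế"
```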

ctnx.misc.normalize_confusables(text: str) → str

Converts text containing confusable characters to normal text.

Replaces similar-looking characters and homoglyphs with their equivalent Vietnamese characters. Small cap letters are converted to lowercase.

ctnx.misc.normalize_text(text: str, clean_redudant_spaces=True, strip_punctuation=False, do_normalize_confusables=False, normalize_tone_placement=True)

Cleans and normalizes Vietnamese text.

Supports NFC normalization, removing redundant spaces, stripping punctuation, converting confusable characters, and normalizing tone placement.

ctnx.misc.place_tone_to_char(char, tone) → str

ctnx.misc.remove_diacritics(text: str) → str

Removes all diacritics from text.

Replaces characters with diacritics with their equivalent ASCII characters.
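A common way to get this behaviour with the standard library is to decompose the text and drop every combining mark. The sketch below takes that approach; `strip_diacritics` is a hypothetical name and the library may well use a lookup table instead.

```python
import unicodedata


def strip_diacritics(text):
    """Drop all combining marks, leaving plain base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ"/"Đ" have no canonical decomposition, so they are mapped by hand.
    return stripped.replace("\u0111", "d").replace("\u0110", "D")


print(strip_diacritics("Ti\u1ebfng Vi\u1ec7t"))  # Tieng Viet
```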

ctnx.misc.remove_tones(text: str) → str

Removes tone marks from text.

Replaces characters with tone marks with their equivalent non-toned characters. Other diacritics are kept.
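The difference from removing all diacritics is that only the five tone marks are dropped, while the circumflex, breve, and horn stay. A minimal standard-library sketch of that distinction, with `strip_tones` as a hypothetical name and no claim about how the library implements it:

```python
import unicodedata

# Combining grave, acute, tilde, hook above, dot below: the five written tones.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def strip_tones(text):
    """Remove tone marks but keep vowel-quality diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # Recompose so that "e" + circumflex becomes the single character "ê" again.
    return unicodedata.normalize("NFC", kept)


print(strip_tones("Ti\u1ebfng Vi\u1ec7t") == "Ti\u00eang Vi\u00eat")  # True ("Tiêng Viêt")
```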

ctnx.misc.sep_tone_from_char(char: str)

Extracts the tone mark from a character.

The returned tone is denoted as one of the following:

‘’: unmarked (ngang)
‘/’: acute accent (sắc)
‘': grave accent (huyền)
‘?’: hook above (hỏi)
‘~’: tilde (ngã)
‘.’: dot below (nặng)

Parameters:

char (str) – The character from which the tone will be extracted

Returns:

A tuple of the same character without the tone mark, and the tone mark itself

Return type:

tuple
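The (character, tone) split described above can be sketched with Unicode decomposition. This is an illustration only: `split_tone` is a hypothetical name, and because the symbol for the grave accent is not legible in the notation listed above, the backslash used here is a placeholder assumption and not the library's documented value.

```python
import unicodedata

GRAVE_SYMBOL = "\\"  # placeholder assumption, see the note above

# Combining mark -> tone symbol, following the notation listed above.
TONE_SYMBOLS = {
    "\u0301": "/",           # acute (sắc)
    "\u0300": GRAVE_SYMBOL,  # grave (huyền)
    "\u0309": "?",           # hook above (hỏi)
    "\u0303": "~",           # tilde (ngã)
    "\u0323": ".",           # dot below (nặng)
}


def split_tone(char):
    """Return (character without its tone mark, tone symbol)."""
    tone = ""
    base = []
    for ch in unicodedata.normalize("NFD", char):
        if ch in TONE_SYMBOLS:
            tone = TONE_SYMBOLS[ch]
        else:
            base.append(ch)
    return unicodedata.normalize("NFC", "".join(base)), tone


print(split_tone("\u1ebf") == ("\u00ea", "/"))  # True: "ế" splits into "ê" and "/"
```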

ctnx.misc.separate_tone(text: str, all=False)

Extracts the tone mark from text.

The returned tone is denoted as one of the following:

‘’: unmarked (ngang)
‘/’: acute accent (sắc)
‘': grave accent (huyền)
‘?’: hook above (hỏi)
‘~’: tilde (ngã)
‘.’: dot below (nặng)

Parameters:

text (str) – The text from which the tone will be extracted
all (bool, default: False) – If set to True, extracts the last tone instead of the first one

Returns:

A tuple of the text without tone marks, and the extracted tone mark

Return type:

tuple
