Owl_nlp_vocabulary
NLP: Vocabulary module
Type of vocabulary (or dictionary).
``get_w2i v`` returns word -> index mapping of ``v``.
``get_i2w v`` returns index -> word mapping of ``v``.
``exits_w v w`` returns ``true`` if word ``w`` exists in the vocabulary ``v``.
``exits_i v i`` returns ``true`` if index ``i`` exists in the vocabulary ``v``.
``word2index v w`` converts word ``w`` to its index using vocabulary ``v``.
``index2word v i`` converts index ``i`` to its corresponding word using vocabulary ``v``.
``freq_w v w`` returns the frequency of word ``w`` in the vocabulary ``v``.
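The lookups described above can be pictured as a pair of hash tables, one per direction, plus a frequency table. The following is a minimal stdlib-only sketch of that idea, not the library's implementation: the record ``vocab``, the field ``i2f``, and the helpers ``of_counts``, ``exists_word``, ``exists_index`` and ``freq_word`` are hypothetical names introduced for illustration.

```ocaml
(* Illustrative stand-in for a vocabulary; NOT the library's actual type. *)
type vocab = {
  w2i : (string, int) Hashtbl.t;  (* word  -> index *)
  i2w : (int, string) Hashtbl.t;  (* index -> word *)
  i2f : (int, int) Hashtbl.t;     (* index -> frequency (assumed layout) *)
}

(* Build a vocabulary from (word, frequency) pairs.
   Indices are assigned in list order, starting from 0. *)
let of_counts (counts : (string * int) list) : vocab =
  let n = List.length counts in
  let v =
    { w2i = Hashtbl.create n; i2w = Hashtbl.create n; i2f = Hashtbl.create n }
  in
  List.iteri
    (fun i (w, f) ->
      Hashtbl.replace v.w2i w i;
      Hashtbl.replace v.i2w i w;
      Hashtbl.replace v.i2f i f)
    counts;
  v

(* Membership tests in each direction. *)
let exists_word v w = Hashtbl.mem v.w2i w
let exists_index v i = Hashtbl.mem v.i2w i

(* Conversions; both raise Not_found for unknown keys in this sketch. *)
let word2index v w = Hashtbl.find v.w2i w
let index2word v i = Hashtbl.find v.i2w i

(* Frequency of a word, looked up through its index. *)
let freq_word v w = Hashtbl.find v.i2f (word2index v w)

let () =
  let v = of_counts [ ("the", 5); ("owl", 2); ("hoots", 1) ] in
  Printf.printf "owl -> %d, 0 -> %s, freq(the) = %d\n"
    (word2index v "owl") (index2word v 0) (freq_word v "the")
```

Keeping both directions in separate tables makes each lookup a single hash access, at the cost of storing every word twice.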
``sort_freq v`` returns the vocabulary as an ``(index, freq)`` array sorted by frequency, in increasing or decreasing order as specified by the parameter ``inc``.
``top v k`` returns the top ``k`` words in vocabulary ``v``.
``bottom v k`` returns the bottom ``k`` words in vocabulary ``v``.
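As a rough illustration of how frequency sorting and top/bottom selection relate, here is a stdlib-only sketch that works on a plain ``(index, freq)`` array rather than on the vocabulary type. The default ``inc = true``, the helper ``take``, and the argument order of ``top`` and ``bottom`` are assumptions of this sketch and may differ from the library.

```ocaml
(* Sort (index, freq) pairs by frequency without mutating the input.
   [inc] selects increasing (true) or decreasing (false) order. *)
let sort_freq ?(inc = true) (freqs : (int * int) array) : (int * int) array =
  let a = Array.copy freqs in
  let cmp (_, f1) (_, f2) = if inc then compare f1 f2 else compare f2 f1 in
  Array.stable_sort cmp a;
  a

(* First [k] entries, or fewer if the array is shorter. *)
let take k a = Array.sub a 0 (min (max k 0) (Array.length a))

(* The k most frequent and the k least frequent entries. *)
let top k freqs = take k (sort_freq ~inc:false freqs)
let bottom k freqs = take k (sort_freq ~inc:true freqs)

let () =
  let freqs = [| (0, 5); (1, 2); (2, 9); (3, 1) |] in
  Array.iter (fun (i, f) -> Printf.printf "(%d, %d) " i f) (top 2 freqs);
  print_newline ()
```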
val build :
?lo:float ->
?hi:float ->
?alphabet:bool ->
?stopwords:(string, 'a) Hashtbl.t ->
string ->
t
``build ~lo ~hi ~stopwords fname`` builds a vocabulary from a text corpus file named ``fname``. If ``alphabet=false``, the tokens are words separated by whitespace; if ``alphabet=true``, the tokens are individual characters and a vocabulary of characters is returned.
Parameters:
* ``lo``: percentage of lower bound of word frequency.
* ``hi``: percentage of higher bound of word frequency.
* ``alphabet``: build vocabulary for alphabets or words.
* ``fname``: file name of the text corpus, each line contains a doc.
val build_from_string :
?lo:float ->
?hi:float ->
?alphabet:bool ->
?stopwords:(string, 'a) Hashtbl.t ->
string ->
t
``build_from_string`` is similar to ``build`` but builds the vocabulary from an input string rather than a file.
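The difference between word tokens and character tokens can be sketched with the standard library alone. The functions ``tokens_of_line`` and ``count_tokens`` below are hypothetical helpers, not part of the module; the sketch only counts tokens, splits words on single spaces, and in character mode keeps every character (spaces included), all of which are assumptions.

```ocaml
(* Split one document into tokens: space-separated words, or
   single characters when [alphabet] is true. *)
let tokens_of_line ~alphabet (line : string) : string list =
  if alphabet then
    List.init (String.length line) (fun i -> String.make 1 line.[i])
  else
    String.split_on_char ' ' line |> List.filter (fun w -> w <> "")

(* Count token occurrences in a corpus given as one string with
   one document per line. Returns a token -> count table. *)
let count_tokens ?(alphabet = false) (corpus : string) : (string, int) Hashtbl.t =
  let counts = Hashtbl.create 64 in
  String.split_on_char '\n' corpus
  |> List.iter (fun line ->
         tokens_of_line ~alphabet line
         |> List.iter (fun tok ->
                let c = Option.value (Hashtbl.find_opt counts tok) ~default:0 in
                Hashtbl.replace counts tok (c + 1)));
  counts

let () =
  let words = count_tokens "the owl\nthe night" in
  let chars = count_tokens ~alphabet:true "aab" in
  Printf.printf "count(the) = %d, count(a) = %d\n"
    (Hashtbl.find words "the") (Hashtbl.find chars "a")
```

A full build step would additionally assign indices and apply the ``lo``, ``hi`` and ``stopwords`` filters to these counts.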
``trim_percent ~lo ~hi v`` removes extremely low- and high-frequency words based on percentage of frequency.
Parameters:
* ``lo``: the percentage of lower bound.
* ``hi``: the percentage of higher bound.
``trim_count ~lo ~hi v`` removes extremely low- and high-frequency words based on the absolute count of words.
Parameters:
* ``lo``: the lower bound of the number of occurrences.
* ``hi``: the higher bound of the number of occurrences.
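To make the two trimming modes concrete, here is a stdlib-only sketch over a plain ``(word, count)`` list. Two points are assumptions rather than documented behaviour: the count bounds are treated as inclusive, and the percentage bounds are read as positions in the frequency-sorted list (a rank-based interpretation). The library may define either differently.

```ocaml
(* Keep only entries whose count lies in [lo, hi] (inclusive: an assumption). *)
let trim_count ~lo ~hi (counts : (string * int) list) : (string * int) list =
  List.filter (fun (_, c) -> c >= lo && c <= hi) counts

(* Rank-based reading (an assumption): sort by increasing count, then keep
   the slice between the [lo] and [hi] fractions of the sorted list. *)
let trim_percent ~lo ~hi (counts : (string * int) list) : (string * int) list =
  let sorted =
    List.stable_sort (fun (_, a) (_, b) -> compare a b) counts |> Array.of_list
  in
  let n = Array.length sorted in
  let first = int_of_float (lo *. float_of_int n) in
  let last = int_of_float (hi *. float_of_int n) in
  (* Clamp both bounds so the slice is always valid. *)
  let first = max 0 (min first n) in
  let last = max first (min last n) in
  Array.sub sorted first (last - first) |> Array.to_list

let () =
  let counts = [ ("rare", 1); ("owl", 4); ("night", 6); ("the", 50) ] in
  trim_count ~lo:2 ~hi:10 counts
  |> List.iter (fun (w, c) -> Printf.printf "%s:%d " w c);
  print_newline ()
```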
``remove_stopwords stopwords v`` removes the stopwords defined in a hashtbl from vocabulary ``v``.
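Since stopwords are passed as a hash table whose values are ignored, removal amounts to deleting every stopword key from the word table. A minimal sketch on a bare ``word -> index`` table follows; returning a trimmed copy instead of mutating the input is an assumption of this sketch.

```ocaml
(* Return a copy of [w2i] without any key that appears in [stopwords].
   Only the keys of [stopwords] matter; its values are ignored. *)
let remove_stopwords (stopwords : (string, 'a) Hashtbl.t)
    (w2i : (string, int) Hashtbl.t) : (string, int) Hashtbl.t =
  let kept = Hashtbl.copy w2i in
  Hashtbl.iter (fun w _ -> Hashtbl.remove kept w) stopwords;
  kept

let () =
  let w2i = Hashtbl.create 4 in
  List.iteri (fun i w -> Hashtbl.replace w2i w i) [ "the"; "owl"; "a"; "night" ];
  let stop = Hashtbl.create 2 in
  Hashtbl.replace stop "the" ();
  Hashtbl.replace stop "a" ();
  Printf.printf "%d words kept\n" (Hashtbl.length (remove_stopwords stop w2i))
```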
``tokenise v s`` tokenises the string ``s`` according to the vocabulary ``v``.
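Tokenising against a vocabulary can be read as splitting the string into words and replacing each word by its index. The sketch below works on a bare ``word -> index`` table and assumes single-space separators and an ``int array`` result; it silently skips unknown words, which is also an assumption (the library may raise an exception instead).

```ocaml
(* Map each space-separated word of [s] to its index in [w2i].
   Words missing from the table are dropped in this sketch. *)
let tokenise (w2i : (string, int) Hashtbl.t) (s : string) : int array =
  String.split_on_char ' ' s
  |> List.filter_map (fun w -> Hashtbl.find_opt w2i w)
  |> Array.of_list

let () =
  let w2i = Hashtbl.create 4 in
  List.iteri (fun i w -> Hashtbl.replace w2i w i) [ "the"; "owl"; "hoots" ];
  tokenise w2i "the owl hoots loudly"
  |> Array.iter (fun i -> Printf.printf "%d " i);
  print_newline ()
```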
``w2i_to_tuples v`` converts vocabulary ``v`` to a list of ``(word, index)`` tuples.
``to_array v`` converts a vocabulary to an ``(index, word)`` array.
``of_array v`` converts an ``(index, word)`` array to a vocabulary.
``save v fname`` serialises the vocabulary and saves it to a file named ``fname``.
``load fname`` loads the serialised vocabulary from a file of name ``fname``.
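The save/load pair describes a serialisation round trip. The text above does not say which format is used, so the following sketch assumes OCaml's standard ``Marshal`` module and a bare ``word -> index`` table in place of the vocabulary type; the temporary file name is arbitrary.

```ocaml
(* Serialise a table to [fname] in binary form (Marshal is an assumption). *)
let save (v : (string, int) Hashtbl.t) (fname : string) : unit =
  let oc = open_out_bin fname in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> Marshal.to_channel oc v [])

(* Read a table back. Marshal is not type-safe, so the file must have
   been written by [save] with the same type. *)
let load (fname : string) : (string, int) Hashtbl.t =
  let ic = open_in_bin fname in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> (Marshal.from_channel ic : (string, int) Hashtbl.t))

let () =
  let v = Hashtbl.create 4 in
  Hashtbl.replace v "owl" 0;
  Hashtbl.replace v "night" 1;
  let fname = Filename.temp_file "vocab" ".bin" in
  save v fname;
  let v' = load fname in
  Sys.remove fname;
  Printf.printf "reloaded %d words\n" (Hashtbl.length v')
```

A marshalled file is compact but opaque and tied to the OCaml value layout, which is one reason a separate plain-text export is useful for inspection.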
``save_txt v fname`` saves the vocabulary in text format to a file named ``fname``.
Pretty printer for vocabulary type.