Owl_nlp_vocabulary
NLP: Vocabulary module
Type of vocabulary (or dictionary).
exits_w v w returns true if word w exists in the vocabulary v.
word2index v w converts word w to its index using vocabulary v.
index2word v i converts index i to its corresponding word using vocabulary v.
freq_w v w returns the frequency of word w in the vocabulary v.
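The lookup functions above can be pictured with a small stand-in built only on the standard library. This is a sketch under assumptions, not Owl's implementation: the `toy_vocab` type and every `toy_*` name are hypothetical, and the sketch assumes indices are assigned in order of first appearance and that frequency means a raw occurrence count.

```ocaml
(* Hypothetical toy model of a vocabulary; not Owl's actual implementation. *)
type toy_vocab = {
  w2i : (string, int) Hashtbl.t;  (* word -> index *)
  i2w : (int, string) Hashtbl.t;  (* index -> word *)
  i2f : (int, int) Hashtbl.t;     (* index -> occurrence count *)
}

(* Build the toy vocabulary from a list of words.
   Assumption: indices follow the order of first appearance. *)
let toy_of_words words =
  let v =
    { w2i = Hashtbl.create 16; i2w = Hashtbl.create 16; i2f = Hashtbl.create 16 }
  in
  List.iter
    (fun w ->
      match Hashtbl.find_opt v.w2i w with
      | Some i -> Hashtbl.replace v.i2f i (Hashtbl.find v.i2f i + 1)
      | None ->
        let i = Hashtbl.length v.w2i in
        Hashtbl.add v.w2i w i;
        Hashtbl.add v.i2w i w;
        Hashtbl.add v.i2f i 1)
    words;
  v

(* Toy counterparts of the four lookups described above. *)
let toy_exists v w = Hashtbl.mem v.w2i w
let toy_word2index v w = Hashtbl.find v.w2i w
let toy_index2word v i = Hashtbl.find v.i2w i
let toy_freq_w v w = Hashtbl.find v.i2f (toy_word2index v w)

let toy_v = toy_of_words [ "the"; "owl"; "the"; "nlp" ]
```

In this sketch, looking up an unknown word with `toy_word2index` raises `Not_found`; how the real module behaves for unknown words is not stated in this page.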
sort_freq v returns the vocabulary as an (index, freq) array, sorted by increasing or decreasing frequency as specified by the parameter inc.
bottom v k returns the bottom k words in vocabulary v.
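To make the ordering concrete, here is a stdlib-only sketch of sorting (index, freq) pairs and taking the k least frequent entries. The names `toy_sort_freq` and `toy_bottom` are hypothetical, the default `inc = true` is an assumption, and the sketch returns pairs rather than words, so it illustrates the idea rather than Owl's exact return types.

```ocaml
(* Hypothetical helper: sort (index, freq) pairs by frequency.
   Assumption: inc defaults to true (increasing order). *)
let toy_sort_freq ?(inc = true) (pairs : (int * int) array) =
  let sorted = Array.copy pairs in
  let cmp (_, f1) (_, f2) = if inc then compare f1 f2 else compare f2 f1 in
  (* stable_sort keeps the original order among equal frequencies *)
  Array.stable_sort cmp sorted;
  sorted

(* Hypothetical helper: the k entries with the lowest frequency. *)
let toy_bottom (pairs : (int * int) array) k =
  let sorted = toy_sort_freq ~inc:true pairs in
  Array.sub sorted 0 (min k (Array.length sorted))

let toy_pairs = [| (0, 5); (1, 1); (2, 3) |]
```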
val build :
?lo:float ->
?hi:float ->
?alphabet:bool ->
?stopwords:(string, 'a) Hashtbl.t ->
string ->
t
build ~lo ~hi ~stopwords fname builds a vocabulary from a text corpus file named fname. If alphabet=false, tokens are the words separated by whitespace; if alphabet=true, tokens are individual characters and a vocabulary of characters is returned.
Parameters:
* lo: percentage of lower bound of word frequency.
* hi: percentage of higher bound of word frequency.
* alphabet: build vocabulary for alphabets or words.
* fname: file name of the text corpus, each line contains a doc.
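The difference between the two tokenisation modes can be sketched with the standard library alone. The function `toy_tokens` is hypothetical and not part of Owl; it assumes that word mode splits on the space character only and that character mode yields every byte of the line, spaces included.

```ocaml
(* Hypothetical tokeniser illustrating the alphabet flag.
   Assumptions: words are separated by ' ' only; in character mode
   every byte (including spaces) becomes a one-character token. *)
let toy_tokens ?(alphabet = false) (line : string) : string list =
  if alphabet then
    List.init (String.length line) (fun i -> String.make 1 line.[i])
  else
    String.split_on_char ' ' line
    |> List.filter (fun s -> s <> "")  (* drop empties from repeated spaces *)
```

For example, `toy_tokens "a bc  d"` yields the words `a`, `bc`, and `d`, while `toy_tokens ~alphabet:true "ab"` yields the characters `a` and `b`.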
val build_from_string :
?lo:float ->
?hi:float ->
?alphabet:bool ->
?stopwords:(string, 'a) Hashtbl.t ->
string ->
t
build_from_string is similar to build but builds the vocabulary from an input string rather than a file.
trim_percent ~lo ~hi v removes extremely low and high frequency words based on percentage of frequency.
Parameters:
* lo: the percentage of lower bound.
* hi: the percentage of higher bound.
trim_count ~lo ~hi v removes extremely low and high frequency words based on absolute count of words.
Parameters:
* lo: the lower bound of number of occurrences.
* hi: the higher bound of number of occurrences.
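Trimming by absolute count amounts to a filter over (word, count) pairs, as in the sketch below. The name `toy_trim_count` is hypothetical, and treating both bounds as inclusive is an assumption; this page does not say whether the real bounds are inclusive or exclusive.

```ocaml
(* Hypothetical count-based trimming over (word, count) pairs.
   Assumption: both bounds are inclusive. *)
let toy_trim_count ~lo ~hi (counts : (string * int) list) =
  List.filter (fun (_, c) -> c >= lo && c <= hi) counts

let toy_counts = [ ("the", 120); ("owl", 7); ("zygote", 1) ]
```

With `~lo:2 ~hi:100`, only `("owl", 7)` survives: `the` is too frequent and `zygote` too rare.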
remove_stopwords stopwords v removes the stopwords defined in a hashtbl from vocabulary v.
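Because the stopwords are passed as a hash table keyed by word, removal reduces to a membership test. The sketch below shows that idea on a plain word list; `toy_remove_stopwords` is a hypothetical name, and the values stored in the table are ignored here, which matches the polymorphic `'a` in the stopwords type shown for build but is otherwise an assumption.

```ocaml
(* Hypothetical stopword filter: keep words that are not keys of the table.
   The values stored in the table are ignored. *)
let toy_remove_stopwords (stopwords : (string, 'a) Hashtbl.t) (words : string list) =
  List.filter (fun w -> not (Hashtbl.mem stopwords w)) words

(* A small stopword table; unit values are used only as placeholders. *)
let toy_stop : (string, unit) Hashtbl.t = Hashtbl.create 8
let () = List.iter (fun w -> Hashtbl.replace toy_stop w ()) [ "the"; "a"; "of" ]
```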
tokenise v s tokenises the string s according to the vocabulary v.
w2i_to_tuples v converts vocabulary v to a list of (word, index) tuples.
to_array v converts a vocabulary to a (index, word) array.
of_array v converts a (index, word) array to a vocabulary.
save v fname serialises the vocabulary and saves it to a file named fname.
save_txt v fname saves the vocabulary in text format to a file named fname.
Pretty printer for vocabulary type.