Owl_nlp_tfidf
NLP: TFIDF module
Type of a TFIDF model
term_freq term_count num_words calculates the term frequency weight.
doc_freq doc_count num_docs calculates the document frequency weight.
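As a rough illustration of what such weights can look like, the sketch below uses the textbook definitions of raw-count and relative term frequency and of inverse document frequency. The function names `tf_count`, `tf_frequency` and `idf` are hypothetical and are not part of this module; the formulas the module actually applies depend on the chosen term and document frequency types and may differ.

```ocaml
(* Textbook TF and IDF weights. Names are hypothetical, not this module's API. *)

(* Raw count: the weight is the number of times the term occurs. *)
let tf_count (term_count : float) (_num_words : float) : float = term_count

(* Relative frequency: term count divided by the document length. *)
let tf_frequency (term_count : float) (num_words : float) : float =
  term_count /. num_words

(* Inverse document frequency: log (N / df), assuming doc_count > 0. *)
let idf (doc_count : float) (num_docs : float) : float =
  log (num_docs /. doc_count)

let () =
  (* A term seen 3 times in a 12-word document, present in 10 of 1000 documents. *)
  let w = tf_frequency 3. 12. *. idf 10. 1000. in
  Printf.printf "tf-idf weight: %.4f\n" w
```

A term that occurs in every document gets an IDF of zero under this definition, which is why very common words contribute nothing to the final vector.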
Return the corpus contained in the TFIDF model.
Get the file handle associated with the TFIDF model.
doc_count_of tfidf w calculates the document frequency of a given word w.
doc_count vocab fname counts, for every word, its occurrences across all documents contained in the raw text corpus of file fname.
term_count count doc counts the occurrences of each term in the document doc and saves the result in the hash table count.
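The counting step for a single document can be sketched with the standard library alone. This is a minimal sketch that assumes a document is an array of integer vocabulary indices and that counts are stored as floats; `count_terms` is a hypothetical name, not the module's own function.

```ocaml
(* Hypothetical sketch: tally the term occurrences of one document into [count].
   Assumes [doc] is an array of vocabulary indices. *)
let count_terms (count : (int, float) Hashtbl.t) (doc : int array) : unit =
  Array.iter
    (fun w ->
      (* Start from 0 when the term has not been seen yet. *)
      let c = Option.value (Hashtbl.find_opt count w) ~default:0. in
      Hashtbl.replace count w (c +. 1.))
    doc

(* Usage: term 2 occurs three times, terms 5 and 7 once each. *)
let example_counts =
  let h = Hashtbl.create 16 in
  count_terms h [| 2; 5; 2; 7; 2 |];
  h
```

Because the table is passed in by the caller, it can be cleared and reused across documents instead of being reallocated each time.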
val doc_to_vec :
(float, 'a) Bigarray.kind ->
t ->
(int * float) array ->
(float, 'a) Owl_dense.Ndarray.Generic.t
doc_to_vec kind tfidf vec converts a TFIDF vector from its sparse representation to a dense ndarray vector whose length equals the vocabulary size.
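The sparse-to-dense conversion can be illustrated without the ndarray type. The sketch below uses a plain `float array` as a stand-in for the dense vector and takes the vocabulary size as an explicit argument; both choices, and the name `sparse_to_dense`, are assumptions made for illustration.

```ocaml
(* Hypothetical sketch: expand (index, weight) pairs into a dense vector.
   A plain float array stands in for the ndarray returned by the library.
   Raises Invalid_argument if an index falls outside [0, vocab_size). *)
let sparse_to_dense (vocab_size : int) (vec : (int * float) array) : float array =
  let dense = Array.make vocab_size 0. in
  Array.iter (fun (i, w) -> dense.(i) <- w) vec;
  dense

(* Usage: a 4-word vocabulary where only words 1 and 3 carry weight. *)
let example_dense = sparse_to_dense 4 [| (1, 0.5); (3, 2.) |]
```

Every index absent from the sparse vector ends up as zero in the dense one, so the dense form costs memory proportional to the vocabulary size for each document.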
Return the ith TFIDF vector in the model. The return value is an array of (vocabulary index, weight) tuples for the document.
Return the next document vector in the model. The return value is an array of (vocabulary index, weight) tuples for the document.
Return the next batch of document vectors in the model; the default batch size is 100.
Iterate over all the document vectors in a TFIDF model. Each document vector is an array of (vocabulary index, weight) tuples.
Map over all the document vectors in a TFIDF model. Each document vector is an array of (vocabulary index, weight) tuples.
This function builds up a TFIDF model according to the passed in parameters.
Parameters:
* norm: whether to normalise the vectors in the TFIDF model, default is false.
* sort: whether to sort the terms in a TFIDF vector in increasing order w.r.t their vocabulary indices. The default is false.
* tf: type of term frequency used in building TFIDF. The default is Count.
* df: type of document frequency used in building TFIDF. The default is Idf.
* corpus: the corpus built by Owl_nlp_corpus model atop of which TFIDF will be built.
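To make the effect of these parameters concrete, here is a small standalone sketch of a TFIDF build over an in-memory corpus. It is not the library's implementation: `build_tfidf` is a hypothetical function, documents are assumed to be arrays of vocabulary indices, the term frequency is fixed to a raw count, and the document frequency is fixed to the textbook log (N / df).

```ocaml
(* Hypothetical stand-in for a TFIDF build over a tokenised corpus.
   Each document is an array of vocabulary indices in [0, vocab_size). *)
let build_tfidf ?(norm = false) ?(sort = false) (vocab_size : int)
    (docs : int array array) : (int * float) array array =
  let num_docs = float_of_int (Array.length docs) in
  (* Document frequency: number of documents containing each term. *)
  let df = Array.make vocab_size 0. in
  Array.iter
    (fun doc ->
      let seen = Hashtbl.create 16 in
      Array.iter (fun w -> Hashtbl.replace seen w ()) doc;
      Hashtbl.iter (fun w () -> df.(w) <- df.(w) +. 1.) seen)
    docs;
  Array.map
    (fun doc ->
      (* Term frequency as a raw count. *)
      let tc = Hashtbl.create 16 in
      Array.iter
        (fun w ->
          let c = Option.value (Hashtbl.find_opt tc w) ~default:0. in
          Hashtbl.replace tc w (c +. 1.))
        doc;
      (* Weight = count * log (N / df). *)
      let vec =
        Hashtbl.fold
          (fun w c acc -> (w, c *. log (num_docs /. df.(w))) :: acc)
          tc []
        |> Array.of_list
      in
      (* sort: order the terms by increasing vocabulary index. *)
      if sort then Array.sort (fun (i, _) (j, _) -> compare i j) vec;
      (* norm: scale the vector to unit L2 length when it is non-zero. *)
      if norm then begin
        let n = sqrt (Array.fold_left (fun a (_, x) -> a +. (x *. x)) 0. vec) in
        if n > 0. then Array.map (fun (i, x) -> (i, x /. n)) vec else vec
      end
      else vec)
    docs

(* Usage: two tiny documents over a 3-word vocabulary. *)
let example_model = build_tfidf ~sort:true 3 [| [| 0; 1; 1 |]; [| 1; 2 |] |]
```

In this sketch, word 1 appears in both documents and therefore receives a weight of zero, while words 0 and 2 each receive log 2. Without `~sort:true` the order of terms within a vector follows hash table iteration order and should not be relied on.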
save tfidf fname saves the TFIDF model to the file named fname.
Convert a TFIDF model to its string representation, which contains summary information.
Convert a single document according to a given model
normalise x makes x a unit vector by dividing it by its L2 norm.
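The normalisation amounts to dividing every weight by the square root of the sum of squared weights. The sketch below assumes x is a sparse vector of (index, weight) pairs; `l2_normalise` is a hypothetical name, and the guard for an all-zero vector is an addition of this sketch rather than documented behaviour.

```ocaml
(* Hypothetical sketch: scale a sparse vector to unit L2 length.
   An all-zero vector is returned unchanged to avoid dividing by zero. *)
let l2_normalise (x : (int * float) array) : (int * float) array =
  let norm = sqrt (Array.fold_left (fun acc (_, w) -> acc +. (w *. w)) 0. x) in
  if norm = 0. then x else Array.map (fun (i, w) -> (i, w /. norm)) x

(* Usage: a vector with weights 3 and 4 has L2 norm 5. *)
let example_unit = l2_normalise [| (0, 3.); (1, 4.) |]
```

After normalisation the dot product of two vectors equals their cosine similarity, which is the usual reason for normalising TFIDF vectors before comparing documents.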
Wrap up a TFIDF model type. This is a low-level function that you are not supposed to use directly.
val all_pairwise_distance :
Owl_nlp_similarity.t ->
t ->
('a * float) array ->
(int * float) array
Calculate the pairwise distance for the whole model. The return format is an (id, dist) array.
val nearest :
?typ:Owl_nlp_similarity.t ->
t ->
('a * float) array ->
int ->
(int * float) array
Return the K nearest neighbours. It is very slow because it uses linear search.
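The cost of a linear search is easy to see in a standalone sketch: the query is compared against every document, the distances are sorted, and the first K are kept. The code below assumes cosine distance and an in-memory array of sparse vectors; `cosine_distance`, `all_distances` and `k_nearest` are hypothetical names, and the metrics actually available come from Owl_nlp_similarity.

```ocaml
(* Cosine distance between two sparse vectors of (index, weight) pairs.
   Hypothetical helper; the library's similarity metrics may differ. *)
let cosine_distance (x : (int * float) array) (y : (int * float) array) : float =
  let h = Hashtbl.create 16 in
  Array.iter (fun (i, w) -> Hashtbl.replace h i w) x;
  let dot =
    Array.fold_left
      (fun acc (i, w) ->
        match Hashtbl.find_opt h i with
        | Some v -> acc +. (v *. w)
        | None -> acc)
      0. y
  in
  let norm v = sqrt (Array.fold_left (fun a (_, w) -> a +. (w *. w)) 0. v) in
  let d = norm x *. norm y in
  (* Treat a zero vector as maximally distant in this sketch. *)
  if d = 0. then 1. else 1. -. (dot /. d)

(* Distance from [query] to every document, as (id, dist) pairs. *)
let all_distances (docs : (int * float) array array) (query : (int * float) array)
    : (int * float) array =
  Array.mapi (fun id doc -> (id, cosine_distance doc query)) docs

(* K nearest neighbours by a full scan followed by a sort. *)
let k_nearest docs query k =
  let d = all_distances docs query in
  Array.sort (fun (_, a) (_, b) -> compare a b) d;
  Array.sub d 0 (min k (Array.length d))

(* Usage: three documents over a 2-word vocabulary and a query on word 0. *)
let example_docs = [| [| (0, 1.) |]; [| (1, 1.) |]; [| (0, 1.); (1, 1.) |] |]
let example_nearest = k_nearest example_docs [| (0, 2.) |] 2
```

Every query touches all documents, so the running time grows linearly with the size of the model (plus the sort). For large corpora an approximate index would be the usual alternative, at the price of exactness.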