Document worth reading: “Text Similarity in Vector Space Models: A Comparative Study”

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models at this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when: 1) the target text is condensed; and 2) the similarity comparison is trivial. Otherwise, TFIDF performs surprisingly well in the remaining cases: in particular for longer and more technical texts, or for making finer-grained distinctions between nearest neighbors. Unexpectedly, extensions to the TFIDF method, such as adding noun phrases or calculating term weights incrementally, were not helpful in our context.
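As a point of reference, the TFIDF baseline the paper compares against can be sketched in a few lines with scikit-learn. This is an illustrative sketch only, not the authors' code; the example documents are invented stand-ins for patent texts.

```python
# Minimal TFIDF document-similarity baseline (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for patent abstracts.
docs = [
    "A method for charging a lithium-ion battery pack.",
    "Battery charging circuit for lithium-ion cells.",
    "A steering mechanism for an agricultural tractor.",
]

# Fit TFIDF term weights over the corpus and embed each document
# as a sparse term-weight vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Pairwise cosine similarity between the document vectors.
sims = cosine_similarity(vectors)

# Nearest neighbor of doc 0 among the others: the two battery-charging
# texts share vocabulary, so doc 1 comes out closest.
nearest = sims[0, 1:].argmax() + 1
print(nearest)  # prints 1
```

Embedding methods such as paragraph vectors replace the sparse TFIDF vectors with learned dense vectors; the paper's finding is that, for long technical texts like patents, this extra machinery often does not beat the baseline above.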