Document worth reading: “On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis”
In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a state-of-the-art text classifier based on convolutional neural networks. Despite potentially affecting the final performance of any given model, this aspect has not received substantial interest in the deep learning literature. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. Our results show that a simple tokenization of the input text is often enough, but also highlight the importance of being consistent in the preprocessing of the evaluation set and the corpus used for training word embeddings.
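To make the four preprocessing choices and the consistency point concrete, here is a minimal Python sketch. It is not the paper's pipeline: the regex tokenizer, the `LEMMAS` lookup table, the `MULTIWORDS` list, and the `oov_rate` helper are all hypothetical stand-ins chosen for illustration, and a real setup would use a proper lemmatizer and multiword lexicon.

```python
import re

# Hypothetical toy resources, for illustration only (not from the paper).
LEMMAS = {"networks": "network", "trained": "train"}
MULTIWORDS = {("neural", "network"), ("new", "york")}


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatize(tokens):
    """Map each token to its lemma via a lookup table (toy stand-in)."""
    return [LEMMAS.get(t, t) for t in tokens]


def group_multiwords(tokens):
    """Join adjacent token pairs listed in MULTIWORDS into a single token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORDS:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text, lower=False, lemma=False, multiword=False):
    """Tokenize, then optionally lowercase, lemmatize and group multiwords."""
    tokens = tokenize(text)
    if lower:
        tokens = [t.lower() for t in tokens]
    if lemma:
        tokens = lemmatize(tokens)
    if multiword:
        tokens = group_multiwords(tokens)
    return tokens


def oov_rate(tokens, vocab):
    """Fraction of tokens that have no entry in the embedding vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)


# Vocabulary built from a lowercased "embedding corpus".
vocab = set(preprocess("The neural networks trained in New York.", lower=True))
eval_text = "Neural networks trained in New York."

consistent = oov_rate(preprocess(eval_text, lower=True), vocab)
inconsistent = oov_rate(preprocess(eval_text), vocab)
```

In this toy setup, preprocessing the evaluation text the same way as the embedding corpus leaves no out-of-vocabulary tokens, while skipping the lowercasing step makes the capitalized tokens miss the vocabulary. That is one simple way a preprocessing mismatch can hurt a classifier that relies on pretrained word embeddings.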