- Tokenization: Tokenization is the process of breaking raw text into smaller chunks called tokens, the basic building blocks used for further analysis. Tokens can be words or subword units, and analysing their sequence helps capture the context of the text. Tokenization can be as simple as splitting the text on whitespace, but more advanced techniques may be used (a minimal sketch follows this list).
- Text Cleaning: This task involves removing irrelevant or noisy elements from the text. It commonly begins by converting all text to lowercase (or uppercase) to ensure uniformity, followed by removing special characters, punctuation, and numbers. Finally, stop words are removed; these are common words that carry little meaning of their own and can be dropped safely, so that only the words contributing the most information about the text remain (see the cleaning sketch after this list).
- Lemmatization and Stemming: This step aims to reduce inflections and variations of words to their base or root form. Stemming simply strips prefixes or suffixes from words, while lemmatization goes further and ensures that the root form is a valid word, leveraging language-specific knowledge to return the base dictionary form (the lemma). Both are illustrated in a sketch after this list.
- Syntactic analysis: Also known as sentence parsing, this step assigns each individual word a grammatical class (part-of-speech tagging), combines words into word groups or ‘phrases’, and ultimately establishes the syntactic relationships between those groups (see the parsing sketch after this list).
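
A minimal tokenization sketch in Python, contrasting plain whitespace splitting with a slightly more advanced regex-based approach that separates punctuation into its own tokens. The sample sentence and the regex pattern are illustrative choices, not part of the original text.

```python
import re

text = "Tokenization isn't hard: split the text, then analyse the tokens!"

# Simplest approach: split on whitespace only.
whitespace_tokens = text.split()
# ['Tokenization', "isn't", 'hard:', 'split', ...]

# Slightly more advanced: keep runs of word characters together and
# treat each punctuation mark as its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['Tokenization', 'isn', "'", 't', 'hard', ':', 'split', ...]

print(whitespace_tokens)
print(regex_tokens)
```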
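A text-cleaning sketch, assuming a small hand-picked stop word set for illustration; production pipelines usually rely on a curated list such as the one shipped with NLTK.

```python
import re

# Tiny illustrative stop word set; real pipelines use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

def clean_text(text: str) -> list[str]:
    # 1. Lowercase for uniformity.
    text = text.lower()
    # 2. Strip special characters, punctuation, and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    # 3. Split on whitespace and drop stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(clean_text("The price of 3 apples is $4.50 in the local market!"))
# ['price', 'apples', 'local', 'market']
```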
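A short sketch contrasting stemming and lemmatization, assuming NLTK is installed and that the WordNet corpora can be downloaded at runtime; the example words are illustrative.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNet data is needed for lemmatization (one-time download).
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "better"]:
    stem = stemmer.stem(word)                    # crude suffix stripping
    lemma = lemmatizer.lemmatize(word, pos="v")  # dictionary-based, needs a POS hint
    print(f"{word:>8} -> stem: {stem:<8} lemma: {lemma}")
```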
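A parsing sketch using spaCy, assuming the small English model en_core_web_sm has been installed (e.g. via `python -m spacy download en_core_web_sm`); the sentence is illustrative. It shows part-of-speech tags, dependency relations between words, and the noun phrases the parser groups together.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Part-of-speech tagging: each token gets a grammatical class,
# plus its syntactic relation to its head word.
for token in doc:
    print(f"{token.text:<6} {token.pos_:<6} {token.dep_:<6} head={token.head.text}")

# Word groups ("phrases"): noun chunks found by the parser.
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['The quick brown fox', 'the lazy dog']
```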