Updated: May 23
Artificial Intelligence (AI) algorithms are great at working with structured, tabular data, where the information is already organized. Since each row is represented by several columns, it is straightforward to answer various questions about the data, such as finding relationships between columns or rows, or predicting the value of one column from the values of the others. Humans, on the other hand, seem to approach problems in a completely different way. We only need a description of the problem in a language we know. Once we understand the problem, we build a model of it, either in our heads or on paper, and solve it. Although humans cannot match computers' speed in computation, there is no doubt that conventional AI algorithms and human problem solving differ fundamentally.
The ultimate goal of AI is to combine human-level understanding with computers' computing power. Many researchers and philosophers believe that natural language understanding (NLU) is the capability that separates AI from humans, and that if AI systems reach the level of human language understanding, they may be able to perceive as humans do. Natural Language Processing (NLP) is a subfield of AI that aims to process text at all levels, from the most fundamental tasks, like spell-checking, to the most conceptual ones, like content generation. In this blog post, we will learn more about the methods used in the NLP field for different text processing tasks, from simple to complex ones.
Rule-based vs. Statistical NLP
NLP algorithms can be divided into two main categories: rule-based and statistical. Rule-based algorithms process their input text with a series of handcrafted rules. For example, we might use a rule-based algorithm to classify the sentiment of a text into three categories (negative, neutral, and positive) based on predefined sets of keywords. The model predicts the class whose keywords appear most often in the input text. Statistical NLP, in contrast, uses machine learning algorithms to train a model that predicts text sentiment from labeled data. Unlike rule-based algorithms, statistical methods do not require handcrafted keyword lists, and their accuracy generally improves as they are given more data.
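To make the rule-based approach concrete, here is a minimal sketch of a keyword-counting sentiment classifier. The keyword sets and the function name are hypothetical and chosen only for illustration, and treating a tie (including no matches at all) as neutral is an assumption rather than a standard rule.

```python
import re

# Hypothetical keyword sets; a real system would need far larger,
# carefully curated lists.
POSITIVE_KEYWORDS = {"good", "great", "love", "excellent"}
NEGATIVE_KEYWORDS = {"bad", "terrible", "hate", "poor"}


def classify_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' by counting keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_KEYWORDS for word in words)
    negative_hits = sum(word in NEGATIVE_KEYWORDS for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    # Assumption: a tie, including zero matches, counts as neutral.
    return "neutral"


print(classify_sentiment("I love this great phone"))
print(classify_sentiment("Terrible battery and a bad screen"))
print(classify_sentiment("It arrived on Tuesday"))
```

A sketch like this also shows the weakness of handcrafted rules: a sentence such as "not good at all" would be counted as positive, because the rules know nothing about negation.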
In recent years, rule-based methods have become less common as deep neural networks have achieved strong performance on various tasks across multiple domains.
Bag-of-words and n-gram representations
In certain NLP problems, such as text classification, grammar and word order provide little information about the class of a text, and the class can largely be determined from its keywords. This simple idea gave rise to the bag-of-words methods. Single letters and grammatical words (often called stop words) are removed from the text, and the model sees the remaining words as an unordered collection. The model thus becomes significantly less complex, so the task can often be learned faster without sacrificing accuracy. The n-gram representation is a more general form of bag-of-words that preserves word order for up to n consecutive words.
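The two representations can be sketched in a few lines of Python. The stop-word list below is a tiny hypothetical one, and the function names are made up for this example. Note that building n-grams from the already filtered tokens is a simplification; in practice they are often built from the unfiltered token sequence.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}


def tokenize(text):
    """Lowercase the text, then drop single letters and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]


def bag_of_words(text):
    """Count each remaining word, ignoring word order."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The price of the ticket is high")
print(tokens)             # ['price', 'ticket', 'high']
print(ngrams(tokens, 2))  # [('price', 'ticket'), ('ticket', 'high')]
```

With n = 1, the n-grams are just the individual words, which is why bag-of-words can be seen as the special case of n-grams that keeps no order at all.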
As humans, we can easily say whether two words are near or far in concept. For example, we know that the word "ticket" is closer to "airplane" than it is to "chips." An algorithm that makes these comparisons correctly is a step closer to understanding the target language. In the word vector method, the goal is to assign a vector to each word of the language so that similar words have similar vectors. Furthermore, adding and subtracting these vectors is meaningful in the natural language space. For example, we can get a vector close to that of "king" by taking the "queen" vector, subtracting the vector of "woman," and adding the vector of "man." fastText, Facebook's well-known NLP library, is built on word vector models.
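The following sketch illustrates both ideas with hand-made toy vectors. The five dimensions and all the numbers are invented for this example. Real word vectors, such as those trained by fastText, are learned from large text corpora and typically have hundreds of dimensions.

```python
import math

# Toy vectors invented for illustration. The dimensions loosely stand
# for: royalty, male, female, travel, food.
VECTORS = {
    "king":     [1.0, 1.0, 0.0, 0.0, 0.0],
    "queen":    [1.0, 0.0, 1.0, 0.0, 0.0],
    "man":      [0.0, 1.0, 0.0, 0.0, 0.0],
    "woman":    [0.0, 0.0, 1.0, 0.0, 0.0],
    "ticket":   [0.0, 0.0, 0.0, 0.9, 0.1],
    "airplane": [0.0, 0.0, 0.0, 1.0, 0.0],
    "chips":    [0.0, 0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(vector, exclude):
    """Return the known word most similar to vector, skipping exclude."""
    candidates = (w for w in VECTORS if w not in exclude)
    return max(candidates, key=lambda w: cosine(vector, VECTORS[w]))


# "ticket" is closer to "airplane" than to "chips".
print(cosine(VECTORS["ticket"], VECTORS["airplane"]))
print(cosine(VECTORS["ticket"], VECTORS["chips"]))

# queen - woman + man lands on king in this toy space.
analogy = [q - w + m for q, w, m in
           zip(VECTORS["queen"], VECTORS["woman"], VECTORS["man"])]
print(nearest_word(analogy, exclude={"queen", "woman", "man"}))
```

In this toy space the analogy works exactly because the vectors were built by hand. With learned vectors the result is only approximately "king", which is why the nearest neighbor is looked up rather than expected to match exactly.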
Words were at the core of all the methods introduced so far. However, to fully comprehend language and to design more sophisticated intelligent systems, models must be fed actual sentences so that they can generate text on their own. To address this need, transformer-based models have been developed in recent years. In contrast to the previous methods, transformers focus on whole texts and sentences. Text is both the input and the output of these models. In other words, they "transform" text from one domain to another. Google's BERT, Facebook's RoBERTa, and OpenAI's GPT-3 are among the most prominent transformer-based models. They are used for a variety of purposes:
• Blog post and content generation
• Text translation
• Product description generation
• Ad generation
• Question answering
• AI-based chatbots
• Autocompletion
• Predictive typing
As the amount of labeled and unlabeled textual data increases, the output of transformer models is expected to soon become indistinguishable from human-written text.
As an integrated and unified system, our brain performs all the text processing tasks we need. These tasks range from basic ones, such as part-of-speech tagging (determining the grammatical role of each word in a sentence), to more sophisticated ones, such as understanding a given text and answering questions about it. The previous models, however, had to be trained separately for each task. They also needed a reasonably large amount of data to learn a new task. Human learning is not like this. With a much smaller number of samples, we can perform the desired task with high accuracy.
GPT-3 is OpenAI's latest transformer-based neural network at the time of writing. The network tries to mimic the way humans understand language. The model performs well on many NLP tasks and, more importantly, it can pick up a new task after seeing just a few samples (known as few-shot learning in the AI literature). GPT-3 has proven itself in almost all NLP fields and has become a basis for other transformer-based models.
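In practice, few-shot learning with a model like GPT-3 usually means placing a handful of solved examples in the input text, followed by the new case the model should complete. The sketch below only builds such a prompt as a string. The "Text:"/"Label:" layout and the function name are hypothetical, and no model or API is called.

```python
def build_few_shot_prompt(task, examples, query):
    """Build a prompt: task description, solved examples, open query."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left blank for the model to fill in.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    task="Classify the sentiment of each text as positive or negative.",
    examples=[
        ("I love this phone", "positive"),
        ("The battery is terrible", "negative"),
    ],
    query="The screen is excellent",
)
print(prompt)
```

The point of the sketch is that the "training data" for the new task is just two examples embedded in the input. The model's weights are not updated.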
Six months ago, at Hengam, we published RightCopy, a product description app on the Shopify platform. It is based on the GPT-3 model and produces high-quality, SEO-friendly, and accurate product descriptions in seconds. We hope this work helps AI-based content production take a step forward.