Buomsoo Kim

Attention in Neural Networks - 23. BERT (2) Introduction to BERT (Bidirectional Encoder Representations from Transformers)


Attention Mechanism in Neural Networks - 23. BERT (2)

In the previous posting, we had a brief look at BERT. As explained, BERT builds on major developments in natural language processing over the last decade, especially in unsupervised pre-training and supervised fine-tuning. Thus, it is essential to review what has been done so far in those fields and what is new in BERT (actually, this is how most academic papers are written). I won’t be going into the granular details of all the important methods in the field. Instead, I will try to intuitively explain those that are essential to understanding BERT. If you want to learn more, you can read the papers linked throughout this posting!

Unsupervised pre-training

As described in the previous posting, unsupervised word embedding models such as Word2vec and GloVe have become a crucial part in NLP. They represent each word as an n-dimensional vector, hence the name “word2vec.”

[Mikolov et al. 2013]

Those vectors are learned by a shallow neural network with a single hidden layer called the “projection layer.” The Skip-gram model, a type of word2vec, updates weights in the hidden layer while attempting to predict words close to the word of interest.

[Mikolov et al. 2013]

Since the learned vectors have the same dimensionality, they enable arithmetic operations between words. For instance, the similarity between words can be easily calculated using metrics such as cosine distance, and relationships between them can be represented as equations like the ones below.

\begin{equation} vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”) = vec(“Toronto Maple Leafs”) \end{equation}

\begin{equation} vec(“Russia”) + vec(“river”) = vec(“Volga River”) \end{equation}

\begin{equation} vec(“Germany”) + vec(“capital”) = vec(“Berlin”) \end{equation}
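To make this concrete, here is a minimal sketch of such arithmetic using the gensim package and the pre-trained Google News Word2vec vectors. The model name and the words chosen are just for illustration, and the exact neighbors returned depend on the pre-trained vectors.

# A minimal sketch of word-vector arithmetic with gensim.
# Assumes gensim is installed; the pre-trained vectors (~1.6 GB) are downloaded on first use.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pre-trained Word2vec (Google News, 300-d)

# Cosine similarity between two words
print(wv.similarity("bright", "smart"))

# Analogy-style arithmetic: vec("king") - vec("man") + vec("woman") ~ vec("queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))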

Furthermore, they can be used as input features for various machine learning models to carry out downstream NLP tasks. For instance, they can be used to classify the sentiment a speaker is expressing (opinion mining/sentiment analysis) or to find appropriate tags for a given image (image tagging). They have also been extended to represent not only words but also sentences and paragraphs in corpora.

However, they are not contextualized representations of words. That is, they are unable to model how the meanings of words can differ depending on linguistic context, i.e., they cannot model polysemy. Most words that we use frequently are highly polysemous. For instance, consider the use of the word “bright” in the two sentences below. In the first sentence, the word “bright” is synonymous with “smart” or “intelligent,” but in the second sentence, it is the opposite of “dark.”

Jane was a bright student.
The room was glowing with bright, purplish light pouring down from the ceiling.

However, it is difficult to model such context with unsupervised word embedding models such as Word2vec and GloVe since they only look at word-level patterns. Therefore, contextualized word representation methods have recently been proposed to model such patterns. Embeddings from Language Models (ELMo) is one of the successful attempts to deeply contextualize word vectors.

[Image Source]

ELMo consists of multiple bidirectional long short-term memory (LSTM) layers. However, instead of using just the outputs from the top LSTM layer, a linear combination of the vectors stacked above each word is used for the downstream task. By doing so, the model can jointly learn both syntactic and contextual features of words. My interpretation is that this is reminiscent of the hierarchy of convolutional neural network layers explained by Zeiler and Fergus (2014). This point was discussed in the earlier posting.

“Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging).”
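As a rough illustration of this “linear combination of stacked vectors,” the sketch below mixes per-layer hidden states with softmax-normalized weights, in the spirit of ELMo’s scalar mix. The layer count and dimensions are made up for the example; this is not the actual ELMo implementation.

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Weighted sum of per-layer representations, in the spirit of ELMo's scalar mix."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # per-layer weights before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, hidden) tensors, one per layer
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed

# Toy usage: 3 layers of outputs for a batch of 2 sentences, 5 tokens, 128 dimensions
layers = [torch.randn(2, 5, 128) for _ in range(3)]
mix = ScalarMix(num_layers=3)
print(mix(layers).shape)  # torch.Size([2, 5, 128])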

As a result, ELMo improved word representations significantly compared to existing methods and became one of the state-of-the-art language models in 2018. However, Devlin et al. (2019) argued that ELMo is still “not deeply bidirectional” and feature-based, i.e., not fine-tuned for the downstream task. Therefore, they utilized the Transformer architecture, which was already being used for fine-tuning language models such as OpenAI GPT. Now, let’s switch gears and have a look at supervised fine-tuning approaches.

Supervised fine-tuning

Unsupervised pre-training methods, contextualized or not, are somewhat limited in terms of applicability since they are not aligned with downstream tasks. That is, they are not specifically tuned for the supervised task of interest. Therefore, NLP researchers started to borrow insights from computer vision, in which the concept of transfer learning has been in vogue. In practice, convolutional neural networks (CNN) are rarely trained from scratch nowadays. The image recognition field has standard, widely accepted large-scale datasets such as CIFAR-10 and ImageNet. The images in those datasets are meticulously tagged and reliably verified by a number of studies. Established deep CNN architectures such as GoogLeNet and VGG are pre-trained and publicly available to anyone. Those pre-trained models show a remarkable capability for feature extraction in any given image. A classifier suitable for the downstream task, such as image segmentation or object detection, is placed on top of the CNN, and the network is then retrained. For more information on transfer learning with CNNs, please refer to this posting by CS231n (CNN for visual recognition) or Oquab et al. (2013).

[Oquab et al. 2013]
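A minimal PyTorch sketch of this recipe, assuming torchvision and a hypothetical 10-class downstream task (ResNet-18 is chosen here just for brevity): load a backbone pre-trained on ImageNet, optionally freeze the convolutional features, and replace the classifier head.

import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (weights are downloaded on first use)
model = models.resnet18(pretrained=True)

# Optionally freeze the convolutional feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for a hypothetical 10-class downstream task;
# only this new layer is trained (or the whole network, if left unfrozen)
model.fc = nn.Linear(model.fc.in_features, 10)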

Motivated by this intuition from computer vision, Howard and Ruder (2018) proposed Universal Language Model Fine-tuning (ULMFiT). ULMFiT is one of the most successful attempts to apply inductive transfer learning to NLP tasks. It consists of two components - the language model (LM) and the classifier. The LM is a three-layer LSTM network on top of an embedding layer. First, the LM is pre-trained on a general-domain corpus (usually large in scale) and then fine-tuned on the target task data. Finally, the classifier is fine-tuned on the target task.

[Howard and Ruder 2018]

OpenAI’s Generative Pre-trained Transformer (GPT) by Radford et al. (2018) takes a similar approach of generative pre-training followed by discriminative fine-tuning. Similar to ULMFiT, a standard LM is first pre-trained on an unsupervised corpus. Then, the overall model, including the classifier, is fine-tuned according to the target task. A novelty in GPT is that it uses a multi-layer Transformer decoder, which is basically a variant of the Transformer (Vaswani et al. 2017). Since different target tasks require different input structures, inputs are transformed depending on the target task. The figure below shows some examples of such transformations.

[Radford et al. 2018]

A few days ago, OpenAI announced GPT-3, the latest version of GPT. It is trained on a corpus of about 400 billion encoded tokens, which amounts to around 570GB of compressed plaintext after filtering (45TB before filtering). Further, it boasts about 175 billion parameters, roughly 10x more than any previous non-sparse model. Arguably, Transformer-based transfer learning architectures represent the state of the art of NLP in 2020.

Compared to GPT, BERT employs a similar, yet slightly different, mechanism for pre-training and fine-tuning. In the next posting, let’s see how BERT is designed and implemented.

References

Attention in Neural Networks - 22. BERT (1) Introduction to BERT (Bidirectional Encoder Representations from Transformers)


Attention Mechanism in Neural Networks - 22. BERT (1)

In a few previous postings, we looked into Transformer and tried implementing it in Pytorch. However, as we have seen in this posting, implementing and training a Transformer-based deep learning model from scratch is challenging and requires lots of data and computational resources. Fortunately, we don’t need to train the model from scratch every time. Transformer-based pre-trained models such as BERT and OpenAI GPT are readily available from Python packages such as transformers by HuggingFace. Utilizing those pre-trained models, we can achieve state-of-the-art (SOTA) performances in various natural language understanding tasks in record time!
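For instance, a pre-trained BERT can be loaded in a couple of lines with the transformers package. The sketch below only pulls contextualized representations out of the pre-trained model; it is not a full fine-tuning pipeline.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode a sentence and obtain contextualized token representations
inputs = tokenizer("Attention is all you need.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, num_tokens, 768) for bert-base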

[Image Source]

As explained in earlier postings, BERT (Bidirectional Encoder Representations from Transformers) is one of the pioneering methods for pre-training Transformer- and attention-based deep learning models. It showed SOTA results in a number of tasks in 2019 and opened a new era of natural language processing. Since then, many Transformer-based language models, such as XLNet, RoBERTa, DistilBERT, and ALBERT, have been proposed. All those variants have slightly different architectures, but it is easier to grasp and apply any of them to your project if you have a firm understanding of BERT.

So in this posting, let’s start with understanding the great BERT architecture!

Supervised learning and unsupervised learning

In the abstract, BERT combines unsupervised learning and supervised learning to provide a generic language model that can be used for virtually any NLP task. Many of you would know this, but just for recap: unsupervised learning is inferring patterns in data without a definite target label. Techniques to explore distributions in data, such as clustering analysis and principal component analysis, are classic examples of unsupervised learning. In contrast, supervised learning is concerned with predicting a labeled target response. Classification and regression are the two major tasks in supervised learning. Many machine learning models, such as linear/logistic regression, decision trees, and support vector machines, can be used for both classification and regression.

Unsupervised word embedding models such as Word2vec and GloVe have become an indispensable part of contemporary NLP systems. However, despite being highly effective in representing semantics and syntactic regularities, they are not trained in an end-to-end manner. That is, a separate classifier using pre-trained embeddings as input features has to be trained. Therefore, more recent methods have been increasingly training the embeddings and classifier simultaneously in a single neural network architecture (e.g., Kim 2014).

[Kim 2014]
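To illustrate what training the embeddings and the classifier simultaneously means, here is a minimal PyTorch sketch. The vocabulary size, dimensions, and the crude average pooling are arbitrary choices for the example; this is not Kim’s (2014) exact CNN architecture.

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """An embedding layer trained end-to-end with the classifier on top of it."""
    def __init__(self, vocab_size=10000, embed_dim=100, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # learned jointly with the classifier
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)         # crude average pooling over tokens
        return self.classifier(pooled)

# Toy forward pass: a batch of 4 "sentences" of length 12
model = TextClassifier()
logits = model(torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 2])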

Each of the two methods has its own pros and cons. We can train word embedding models with tremendously large unlabeled text data, capturing as many language features as possible. Also, the trained embeddings are generalizable - they can be used for virtually any downstream task. However, this is based on a strong assumption that syntactic and semantic patterns in corpora are largely similar across different tasks and datasets. Common sense suggests that this is not always the case. Even when the same person is writing, the language used for different tasks and objectives can differ significantly. For instance, the vocabulary that I use for Amazon reviews or Tweets will be dramatically different from the one that I use for these kinds of postings or manuscripts for academic journals.

[Devlin et al. 2019]

BERT overcomes this challenge by combining unsupervised pre-training and supervised fine-tuning. That is, the word and sentence embeddings are first trained with large-scale, generic text data. The Transformer architecture is utilized here for better representation. Then, the overall model including the embeddings is fine-tuned while performing downstream tasks such as question answering.
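A hedged sketch of that second, fine-tuning stage with the transformers package: a classification head is attached to the pre-trained encoder, and the entire model, embeddings included, is updated on labeled data. The sentence, label, and learning rate below are placeholders for illustration.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One illustrative fine-tuning step on a toy labeled example
inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
labels = torch.tensor([1])  # placeholder label

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # gradients flow into the encoder and embeddings as well
optimizer.step()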

Transfer learning

As a result, a single BERT model can achieve remarkable performance not just in one task but in many NLP tasks. A direct application of this property is transfer learning. Transfer learning, or knowledge transfer, is concerned with transferring knowledge from one domain to another. Generally, one has sufficient labeled training data in the former but not in the latter (Pan and Yang 2010).

Transfer learning has been actively researched and utilized in the image recognition field as well. In their seminal work, Zeiler and Fergus (2014) showed that different filters of convolutional neural networks (CNN) learn distinct features that are activated by common motifs in multiple images. For instance, the first layer below learns low-level features such as colors and edges, whereas the fourth and fifth layers learn more abstract concepts such as dogs and wheels. It has become standard practice in image recognition to utilize CNNs pre-trained on large-scale datasets such as ImageNet. FYI, Pytorch provides pre-trained CNN models such as AlexNet and GoogLeNet.

[Zeiler and Fergus 2014]

BERT can be used to transfer knowledge from one domain to another. Similar to using pre-trained convolutional layers for object detection or image classification, one can use pre-trained embedding layers that have been already used for other tasks. This has the potential to significantly reduce the cost of gathering and labeling new training data and improve text representations.

In this posting, we had a brief introduction to BERT for intuitive understanding. In the following postings, let’s dig deeper into the key components of BERT.

References

Recommender systems with Python - (4) Memory-based collaborative filtering - 1


So far, we have broadly reviewed the recommender systems and collaborative filtering (CF) field. Now, let’s narrow down a bit and look into memory-based (or neighborhood/heuristic) methods. As explained in the previous posting, memory-based CF systems use simple heuristic functions to infer ratings for prospective user-item interactions from previous rating records. Therefore, they have advantages such as simplicity, justifiability, efficiency, and stability. In this posting, let’s try to build an intuitive understanding of these CF methods.

Intuitive understanding of memory-based CF

In many cases, the naming of a method reveals many of its characteristics. Therefore, even though we do not know the details of memory-based CF systems yet, we can try to understand them by examining their alternative names, much like the blind men feeling different parts of an elephant.

Memory

As the name suggests, memory-based CF primarily relies on “memories” of user-item interactions. However, this statement might be misleading at first glance, since virtually every CF method has to memorize some patterns of such interactions to make inferences later. To make more sense of this, we can introduce the contrasting concept of “generalization.”

[Photo by Fredy Jacob on Unsplash]

To start with, memorization is concerned with explicit co-occurrence or correlation patterns present in previous data. In general, those patterns can be expressed as simple IF-THEN rules. For example, “if the user liked the movie Star Wars, he/she will like the movie Pulp Fiction.” Those patterns can be easily recognized and memorized.

“Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data.” (Cheng et al. 2016)
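As a toy illustration of “learning the frequent co-occurrence of items,” the sketch below counts how often Pulp Fiction is liked by users who liked Star Wars. The user histories and the resulting numbers are made up for the example.

# Toy illustration of memorizing a co-occurrence rule such as
# "IF the user liked Star Wars THEN he/she will like Pulp Fiction".
# The user histories below are made up for the example.
histories = [
    {"Star Wars", "Pulp Fiction"},
    {"Star Wars", "Pulp Fiction", "Fight Club"},
    {"Star Wars", "Fight Club"},
    {"Pulp Fiction"},
]

liked_star_wars = [h for h in histories if "Star Wars" in h]
co_liked = [h for h in liked_star_wars if "Pulp Fiction" in h]

# Conditional frequency: how often Star Wars fans also liked Pulp Fiction
confidence = len(co_liked) / len(liked_star_wars)
print(f"confidence = {confidence:.2f}")  # 0.67 with this toy data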

However, in many cases, human behavior is more complicated than that. For instance, even though liking Star Wars and liking Pulp Fiction are correlated, there might be other moderating/mediating variables. It might be the case that many users who like Star Wars like the movie Fight Club, and those who like Fight Club like Pulp Fiction. In fact, most users who like Star Wars but do not like Fight Club might dislike Pulp Fiction. Furthermore, with the number of items and users becoming astronomically large in many practical applications nowadays, the problem space gets infinitely convoluted. Such patterns are difficult not only to elucidate but also to memorize. Thus, many recent methods, such as embedding-based ones, attempt to generalize such patterns in a high-dimensional latent space.

“Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past.” (Cheng et al. 2016)

[Photo by Maxime VALCARCE on Unsplash]

There is a fine line between memorization and generalization. However, the general rule of thumb is that if “I know it when I see it” applies to a pattern, it is likely to be a memorizable pattern. Cheng et al. (2016) provide more in-depth discussions of generalization and memorization in describing their Wide & Deep learning framework.

Heuristic

Now we know that memory-based CF systems generally rely on memorizing straightforward patterns in data. In addition, they are heuristic methods that exploit simple, memorizable patterns in previous data. This characteristic makes them rely heavily on rules of thumb derived from in-depth domain knowledge.

[Photo by sk on Unsplash]

The central concept of memory-based methods is the notion of similarity, or closeness, between items and users. Based on pre-defined metrics for similarity or proximity, major tasks in recommender systems such as top-K item recommendation or rating prediction can be carried out. However, though it may sound straightforward, methodically defining similarity and quantifying it is not so simple. For instance, how would you measure the similarity between the movies Fight Club and Pulp Fiction? How would it compare to the similarity between Fight Club and Star Wars? From the user’s perspective, suppose Peter and Jane both love Star Wars, but Peter likes Pulp Fiction while Jane hates it. Would you regard those two users as similar or dissimilar?
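One common answer is cosine similarity between rating vectors. Below is a minimal sketch with made-up ratings for Peter and Jane over three movies; the numbers are invented purely for illustration.

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (1.0 = identical direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up ratings over (Star Wars, Pulp Fiction, Fight Club) on a 1-5 scale
peter = np.array([5.0, 4.0, 3.0])
jane = np.array([5.0, 1.0, 4.0])

print(cosine_similarity(peter, jane))  # higher values indicate more similar tastes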

Furthermore, even after we decide how to measure the similarity between users and items, how can we use it to make decisions and actually recommend items to users? To investigate the similarity measure more deeply, we need to understand the concept of neighborhood.

Neighborhood

Memory-based methods borrow intuition from the k-nearest neighbors (k-NN) algorithm in pattern recognition. In the abstract, k-NN assigns a value to an instance of interest by averaging the values of neighbors that are close to that instance, i.e., its nearest neighbors. k-NN can be used for both classification and regression tasks.

[Image source]

In the recommender systems context, we have two types of instances - users and items. Accordingly, nearest neighbors can be identified for both types of instances, and inferences can be made in both ways. Therefore, we have two branches of memory-based methods - user-based and item-based. Assume that we want to predict the rating for the movie Fight Club by John, who hasn’t watched the movie yet. Then, the two methods carry out the same task in slightly different ways, as follows (a minimal sketch of the user-based variant appears after the list).

  • User-based methods: predict John’s rating for Fight Club using the ratings given to Fight Club by Casey, Sean, and Mike, who have tastes similar to John’s.

  • Item-based methods: predict John’s rating for Fight Club using John’s own ratings for Star Wars and Pulp Fiction, which are items similar to Fight Club.
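Here is the promised sketch of the user-based variant: John’s rating for Fight Club is predicted as a similarity-weighted average of his neighbors’ ratings for that movie. The neighbors, similarity scores, and ratings are all made up for illustration.

# Predict John's rating for Fight Club from similar users' ratings (toy numbers).
# Each tuple: (neighbor, similarity to John, neighbor's rating for Fight Club)
neighbors = [
    ("Casey", 0.9, 4.5),
    ("Sean", 0.8, 4.0),
    ("Mike", 0.6, 3.0),
]

# Similarity-weighted average of the neighbors' ratings
numerator = sum(sim * rating for _, sim, rating in neighbors)
denominator = sum(sim for _, sim, _ in neighbors)
predicted_rating = numerator / denominator
print(f"Predicted rating for Fight Club by John: {predicted_rating:.2f}")  # ~3.93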

I deliberately avoided rigorous definitions and mathematical details since I wanted to provide an “intuitive understanding” in this posting. From the next posting, let’s see how the two memory-based CF methods are defined rigorously and (relatively easily) implemented in Python.

References

  • Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In Recommender systems handbook (pp. 1-35). Springer, Boston, MA.
  • Cheng, H. T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., … & Anil, R. (2016, September). Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems (pp. 7-10).
  • Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.

Recommender systems with Python - (3) Introduction to Surprise package in Python


In the previous posting, we overviewed collaborative filtering (CF) and two types of CF methods - memory-based and model-based methods. In this posting, before going into the details of two CF methods, let’s have a quick look at the Surprise package in Python.

What is Surprise!?

Surprise is a Python scikit specialized for recommender systems. It provides built-in public datasets, ready-to-deploy CF algorithms, and evaluation metrics.

Installing Surprise

Installing Surprise is straightforward, like any other scikit library. You can conveniently install it using pip. In a terminal console, run the command below.

pip install surprise

If you are using Google Colaboratory or Jupyter Notebook, run the code below in any cell.

!pip install surprise

After installation, let’s import the necessary submodules for this exercise.

from surprise import Dataset
from surprise import BaselineOnly
from surprise.model_selection import cross_validate

Built-in datasets

Surprise provides built-in datasets as well as tools to create custom datasets. The built-in datasets are from MovieLens, a non-commercial movie recommendation system, and Jester, a joke recommender system. Here, let’s use the Jester built-in dataset for demonstration. For more information on the Jester dataset, please refer to this page.

Load dataset

The built-in dataset can be loaded using the load_builtin() method. Just pass the argument 'jester' to it.

dataset = Dataset.load_builtin('jester')

If you haven’t downloaded the dataset before, it will ask whether you want to download it. Type in “Y” and press Enter to download.

Data exploration

You don’t need to know the details of the dataset to build a prediction model for now, but let’s briefly see what the data looks like. The raw data can be retrieved using the raw_ratings attribute. Let’s print out the first two instances.

ratings = dataset.raw_ratings

print(ratings[0])
print(ratings[1])

The first two elements in each instance refer to the user ID and joke ID, respectively. The third element shows the rating, and I honestly don’t know about the fourth element (let me know if you do!). Therefore, the first instance shows User #1’s rating for Joke #5.

('1', '5', 0.219, None)
('1', '7', -9.281, None)

Now let’s see how many users, items, and rating records are in the dataset.

print("Number of rating instances: ", len(ratings))
print("Number of unique users: ", len(set([x[0] for x in ratings])))
print("Number of unique items (jokes): ", len(set([x[1] for x in ratings])))

It seems that 59,132 users left 1,761,439 ratings on 140 jokes. That seems like a lot of ratings for jokes!

Number of rating instances: 1761439
Number of unique users: 59132
Number of unique items (jokes): 140

Prediction and evaluation

There are a few prediction algorithms that can be readily used in Surprise. The list includes widely-used CF methods such as k-nearest neighbor and probabilistic matrix factorization.

But since we haven’t looked into the details of those methods, let’s use the BaselineOnly algorithm, which predicts “baseline estimates,” i.e., ratings calculated using just the bias terms of users and items. To put it simply, it does not take into account complex interaction patterns between users and items - it considers only “averaged” preference patterns pertaining to each user and item. For more information on the baseline model, please refer to Koren (2010).
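Concretely, a baseline estimate predicts the rating of user u for item i as the overall average rating plus a user bias term b_u and an item bias term b_i, following the standard formulation in Koren (2010):

\begin{equation} \hat{r}_{ui} = \mu + b_{u} + b_{i} \end{equation}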

clf = BaselineOnly()

Let’s try 3-fold cross-validation, which partitions the dataset into three folds and uses a different fold as the test set in each round.

cross_validate(clf, dataset, measures=['MAE'], cv=3, verbose=True)

On average, the prediction shows a mean absolute error (MAE) of 3.42. Not that bad, considering that we used a very naive algorithm. In the following postings, let’s see how the prediction performance improves with more sophisticated prediction algorithms!

References

  • Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1), 1-24.
  • Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In Recommender systems handbook (pp. 1-35). Springer, Boston, MA.

Recommender systems with Python - (2) What is collaborative filtering?


Most recommendation engines can be classified as (1) collaborative filtering (CF) systems, (2) content-based (CB) systems, or (3) hybrids of the two. In the previous posting, we went through the concepts of the three and their differences. To give you a little recap, content-based systems recommend items that are close to the items that the user liked before. For example, if I liked the movie Iron Man, it is likely that I will also like the movie Avengers, which is common-sensically similar to Iron Man. In contrast, CF systems suggest items that similar users liked, i.e., they exploit “people-to-people” correlation. In general, there are two steps in CF - (1) identifying users with similar likings in the past and (2) recommending items that such users prefer.

In this posting, let’s look into the details of CF.

Pros and cons of CF recommender systems

As described in the earlier posting, CB and CF approaches have their own advantages and disadvantages - they are more like two sides of a coin. So to talk about the pros of CF, it is better to start with the shortcomings of the CB approach. One of the main limitations of applying the CB approach to practical recommendation problems is the unavailability of content data. In many cases, it is difficult to obtain content information pertaining to all items of interest. For instance, how would you (numerically) describe the content of rock ‘n’ roll songs? In other words, how can you measure the (dis)similarity of two arbitrary songs, say Led Zeppelin’s Stairway to Heaven and Deep Purple’s Smoke on the Water?

Maybe we can try very simple features such as the duration of the song and which musical instruments are played? But these do not necessarily capture the “content” of the song. How about more descriptive features, such as lyrics as text information? Then, how would you handle subtle information hidden in the context, such as euphemisms and metaphors? Things easily get complicated in the problem space of CB systems. Thus, the CF method bypasses these problems by ignoring explicit content information and modeling it indirectly based on users’ past history. Therefore, one critical assumption of the CF method is that users’ preferences are relatively stationary over time. Furthermore, preferences are assumed to be similar within a group of like-minded users but to differ between different user groups. For example, there should be some “collaborating” patterns that are time-invariant, such as “users who like the Beatles also like Nirvana.” If these assumptions do not hold, CF models are most likely to fail.

Another disadvantage of CB is that the method is highly domain-dependent. Let us assume that you found a feature extraction method to neatly analyze the music notes and lyrics of rock ‘n’ roll songs. Would that method apply to movie recommendation? What about restaurant recommendation? In practice, it is hardly possible to find a single CB method that works great in more than one domain. In contrast, CF is highly applicable since it relies on only user-item interaction patterns.

At the same time, CF is not applicable at all when user-item interaction patterns are non-existent. This is termed the “cold-start problem” since recommendations have to be made with a cold start, i.e., without sufficient information. Cold-start problems can arise in various scenarios. When we have a small amount of user-item interaction data, or no significant findings can be inferred from the data, we have a general cold-start problem. However, there can be cold-start problems even when we have a large amount of training data and inferred patterns. For example, a new item that does not have any interaction records with users is very difficult to recommend. This is usually called the item cold-start problem. The analogous problem for new users is, naturally, the user cold-start problem.

Memory-based CF systems

There are largely two branches of CF. The memory-based (aka heuristic-based or neighborhood) approach utilizes pre-computed user-item rating records, i.e., “memory,” to infer ratings for other items that the user has not encountered yet. In the abstract, the items closest to a certain user according to some metric are recommended to that user, hence the alternative name “neighborhood methods.” The problem then boils down to how to define and measure the concept of “closeness,” or “proximity,” in the user and item space. After that, what is left to do is just picking the closest ones to the user.

[Image source: Koren et al.]

Advantages of memory-based methods include, but are not limited to:

  • Simplicity: intuitive and simple to implement.
  • Justifiability: results are interpretable and the reasons for recommendation can be inferred by examining neighbors in most cases.
  • Efficiency: does not require large-scale model training and neighbors can be pre-computed and stored.
  • Stability: less affected by the addition of users, items, and ratings.

There are two ways to implement memory-based CF systems - (1) user-based and (2) item-based. The user-based approach first finds users similar to a user of interest, i.e., neighbors. Then, the rating for a new item is inferred based on the rating patterns of those neighbors. In contrast, the item-based approach predicts the rating for an item using the same user’s ratings for items that are similar to the item of interest. We will see how these two approaches differ in detail in the next postings.

[Amazon's item-based recommendation]

Model-based CF systems

Model-based CF systems use various predictive models to estimate the rating for a certain user-item pair. A wide variety of models are used in practice for estimation. One of the most salient families of methods is latent factor models, which attempt to characterize both items and users with a finite number of latent factors. Matrix factorization is one of the most established methods in this line of work. We will also discuss this in later postings.

[Image source: Koren et al.]
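To give a flavor of the latent factor idea before the dedicated posting, the toy sketch below predicts a rating as the dot product of a user’s and an item’s latent factor vectors. The factor matrices here are random placeholders; in a real model they would be learned from the observed ratings.

import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, num_factors = 4, 5, 3

# Random placeholders standing in for *learned* latent factor matrices
user_factors = rng.normal(size=(num_users, num_factors))
item_factors = rng.normal(size=(num_items, num_factors))

# Predicted rating for user u and item i: dot product of their latent vectors
u, i = 0, 2
print(user_factors[u] @ item_factors[i])

# The full predicted rating matrix is simply user_factors @ item_factors.T
print(user_factors @ item_factors.T)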

References

  • Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In Recommender systems handbook (pp. 1-35). Springer, Boston, MA.
  • Gomez-Uribe, C. A., & Hunt, N. (2015). The netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4), 1-19.
  • Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.