Buomsoo Kim

Attention in Neural Networks - 24. BERT (3) Introduction to BERT (Bidirectional Encoder Representations from Transformers)


Attention Mechanism in Neural Networks - 24. BERT (3)

In the previous posting, we took a close look at unsupervised pre-training and supervised fine-tuning, which are the fundamental building blocks of BERT. BERT essentially improves upon state-of-the-art developments in pre-training and fine-tuning approaches. If you have been able to follow the concepts and ideas so far, it will be much easier to understand the details of BERT, which this posting elaborates on.

Unsupervised pre-training

The objective of pre-training in an unsupervised fashion is similar to that of embedding methods such as Word2vec and GloVe.

[Devlin et al. 2019]

Similar to word embedding methods, vector representations of words and sentences are learned while performing two unsupervised tasks, namely the masked language model (LM) and next sentence prediction (NSP).

Masked language model

Conventional LMs such as “bidirectional” recurrent neural networks are not truly bidirectional, since they learn in one direction at a time, e.g., right-to-left or left-to-right. To overcome this and obtain deep bidirectional representations, BERT is pre-trained with a masked LM procedure, also known as the cloze task. The procedure is quite simple - some percentage of the input tokens is “masked” at random, and the model predicts the masked tokens. In a sense, it is similar to a “fill-in-the-blank” question in which the words to fill in are chosen at random. For instance, assume that we have the input sentence “To be or not to be, that is the question” and that two tokens, not and question, are masked. Then, the input and target sentences are:

Input: “To be or [MASK] to be, that is the [MASK]”

Output: “To be or not to be, that is the question”

According to the paper, 15% of the token positions are chosen at random for masking. For more information on the masked LM and a Python (Keras) implementation, please refer to this posting.
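To make the masking step concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization and replaces every selected token with [MASK]; the function name mask_tokens and the seed argument are hypothetical and not part of any BERT library. Note that the procedure in the paper is slightly more involved: of the selected tokens, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, which this sketch leaves out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK]; return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # remember what should be predicted
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Whitespace tokenization is an assumption made for illustration only.
tokens = "To be or not to be , that is the question".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(" ".join(masked))
```

Which tokens get masked depends on the seed, so the printed sentence will generally differ from the hand-picked example above.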

Next sentence prediction

The masked LM procedure models relationships between tokens. However, it does not capture relations between sentences, which can be critical for many downstream tasks such as question answering and natural language inference. NSP is essentially a binary classification task. For an arbitrary sentence pair A and B, the model is pre-trained to classify if the two sentences are adjacent (“IsNext”) or not (“NotNext”) - 50% of the time, B is actually the next sentence in the corpus and the other 50% of the time, it is a random sentence chosen from the corpus.
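The construction of NSP training examples can be sketched as follows. This is only an illustration under simplifying assumptions: the corpus is an ordered list of at least three distinct sentences, and negatives are drawn from that same list. The function name make_nsp_pairs is hypothetical.

```python
import random

def make_nsp_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, label) examples from an ordered corpus."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B really follows A in the corpus.
            examples.append((a, sentences[i + 1], "IsNext"))
        else:
            # Negative example: B is a random sentence that is neither A
            # nor the true next sentence.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((a, rng.choice(candidates), "NotNext"))
    return examples

corpus = [
    "I went to the store.",
    "I bought some milk.",
    "Penguins cannot fly.",
    "The weather was nice.",
]
for a, b, label in make_nsp_pairs(corpus, seed=0):
    print(label, "|", a, "|", b)
```

During pre-training, each pair would be fed to the model as one input, and the label would serve as the target of the binary classifier.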

Supervised fine-tuning

Supervised fine-tuning is carried out in a similar manner to previous methods such as ULMFiT. The task-specific inputs and outputs are plugged into the pre-trained BERT model, and all the parameters are fine-tuned end-to-end. The authors show in the paper that pre-trained BERT outperforms state-of-the-art methods in various end tasks, including natural language understanding and question answering.

[Devlin et al. 2019]


Neural collaborative filtering with fast.ai - Collaborative filtering with Python 17


In the previous posting, we learned how to train and evaluate a matrix factorization (MF) model with the fast.ai package. Nowadays, with rapid developments in relevant fields, neural extensions of MF such as NeuMF (He et al. 2017) and Deep MF (Xue et al. 2017) have become very popular. In this posting, let’s have a look at a very simple variant of MF that uses a multilayer perceptron.

Data Import & Preparation

These steps are identical to preparing for MF. If you haven’t yet, please have a look at this previous posting for importing and preparing the data.

Creating and training a neural collaborative filtering model

We use the same collab_learner() function that was used for implementing the MF model. The parameters that should be changed to implement a neural collaborative filtering model are use_nn and layers. Setting use_nn to True implements a neural network. Recall that the MF model had only embedding layers for users and items. The layers parameter lets us define the architecture of the neural network. Specifically, we can designate the number of nodes in each hidden layer. Here, let’s set it to [30, 30] - by doing so, we generate a neural network with two hidden layers of 30 nodes each.

learn = collab_learner(databunch, n_factors=50, y_range=(0, 5), use_nn=True, layers=[30, 30])

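You can see that the resulting model has three Linear() layers that the MF model did not have. The first two are the hidden layers that we configured, and the final one is the output layer. Note that out_features for the two hidden layers is set to 30. Also note that the first Linear() layer has in_features=175, which is the sum of the two embedding widths in the printout (74 + 101), since the user and item embeddings are concatenated before entering the network. These widths differ from the n_factors=50 that we passed, which suggests that n_factors does not determine the embedding sizes when use_nn is True.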

  (embeds): ModuleList(
    (0): Embedding(944, 74)
    (1): Embedding(1625, 101)
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(0, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=175, out_features=30, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Linear(in_features=30, out_features=30, bias=True)
    (4): ReLU(inplace=True)
    (5): BatchNorm1d(30, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Linear(in_features=30, out_features=1, bias=True)
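To see what this architecture computes for a single user-item pair, here is a rough forward-pass sketch in plain Python. It mirrors the layer sizes in the printout (74 + 101 = 175 inputs, two hidden layers of 30 nodes, one output), but it is only an illustration: the weights are random rather than trained, BatchNorm and dropout are omitted, and all function names are hypothetical. Squashing the output into y_range with a scaled sigmoid is my understanding of how fastai applies y_range.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights, purely for illustration (not a trained model)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict_rating(user_vec, item_vec, layers, y_range=(0.0, 5.0)):
    """Concatenate the embeddings, apply the MLP, squash into y_range."""
    x = user_vec + item_vec                    # list concatenation
    for weights, bias in layers[:-1]:
        x = relu(linear(x, weights, bias))     # hidden layers
    weights, bias = layers[-1]
    (score,) = linear(x, weights, bias)        # output layer has a single unit
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-score))

rng = random.Random(0)
user_vec = [rng.uniform(-1, 1) for _ in range(74)]    # stand-in for a user embedding
item_vec = [rng.uniform(-1, 1) for _ in range(101)]   # stand-in for an item embedding
layers = [init_layer(175, 30, rng), init_layer(30, 30, rng), init_layer(30, 1, rng)]
rating = predict_rating(user_vec, item_vec, layers)
print(round(rating, 3))
```

Making the model deeper or wider only changes the list passed as layers; the concatenation step and the output squashing stay the same.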

To train the model with the given data, we use the fit() function. We train for 5 epochs here.

epoch train_loss  valid_loss  time
0 0.969993  0.921394  00:07
1 0.907009  0.894607  00:06
2 0.866429  0.886609  00:06
3 0.863333  0.877121  00:06
4 0.822304  0.874548  00:06

To evaluate the model on the test data, we can use the get_preds() function to get the model predictions and convert them into a NumPy array.

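from sklearn.metrics import mean_absolute_error

# get_preds() returns a list whose first element holds the predictions
y_pred = learn.get_preds(ds_type=DatasetType.Test)[0].numpy()
# Compare the predicted ratings with the actual ratings in the test set
print(mean_absolute_error(test_df["rating"], y_pred))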

The model shows slightly improved performance compared to MF. You can experiment on other configurations, e.g., making the model deeper by adding more layers or wider by adding nodes to the hidden layers.



  • Collaborative filtering tutorial. (https://docs.fast.ai/tutorial.collab)
  • Collaborative filtering using fastai. (https://towardsdatascience.com/collaborative-filtering-using-fastai-a2ec5a2a4049)
  • He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T. S. (2017, April). Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web (pp. 173-182).
  • Xue, H. J., Dai, X., Zhang, J., Huang, S., & Chen, J. (2017, August). Deep Matrix Factorization Models for Recommender Systems. In IJCAI (Vol. 17, pp. 3203-3209).

How to concentrate by Swami Sarvapriyananda


This is a short talk by Swami Sarvapriyananda on how to concentrate. Before starting this posting, I want to emphasize that this is not a religious or spiritual talk, though some ideas from yoga and Swami Vivekananda’s philosophy are mentioned. Swami Sarvapriyananda provides very practical methods to maximize concentration and focus, drawing on his profound knowledge of science, philosophy, and yoga. Personally, I have been quite interested in concentration, and I have been intensely practicing Raja yoga and Zen during the COVID-19 pandemic, which greatly improved my focus (and my life overall). In my opinion, this lecture can be a great starting point for someone who wants to improve their ability to focus (and their life).

Secret in success

“Every problem could be solved if one could concentrate hard enough” - Von Neumann

This quote by Von Neumann, a great mathematician and physicist, which Swami Sarvapriyananda came across by chance, is reminiscent of Swami Vivekananda’s idea. According to Swami Vivekananda, the secret to all knowledge lies in concentration. The difference between an ordinary person and a great person lies in the degree of concentration. Interestingly, Warren Buffett and Bill Gates, two of the brightest and richest people of the last century, picked “focus” as the single most important factor in their success in life.

Impediments to concentration - modern technology

In short, modern information technologies, such as social media and mobile devices, are very detrimental to concentration. As competition gets fiercer and fiercer in the digital space, many IT companies increasingly offer their products free of charge. We can use many web and mobile applications without paying, such as Google search, YouTube, Facebook, Instagram, and WhatsApp, to name just a few. However, instead of charging users explicitly, these companies strive to grab their attention. The user’s engagement with the service is a new type of currency in the attention economy of the 21st century. As users use an application more frequently and subconsciously, the company creates monetary value from them through advertising, subsidies, freemium services, etc. If you want to know more about how IT companies make money in the free economy, refer to the literature on two-sided markets and the attention economy.

From the user’s perspective, companies enticing users to use their applications without any charge might look like a dream world. We can now use a large number of enormously sophisticated applications for free, ranging from social media and news to messengers. However, as always, there is no free lunch. As companies try hard to grab users’ attention, we are becoming addicted to little dopamine rushes every day. A new notification with a “ting” on your phone, e.g., a new message on WhatsApp, a new video uploaded by your favorite YouTuber, or a new photo posted by your best friend on Instagram, gives you a tiny pleasure in anticipation of something new and interesting. Thousands of brilliant people with knowledge of technology and human psychology work day and night to keep you engaged and loyal to the application. This diverts you from the work you are doing and makes you invest some mental effort in reacting to the notification. And this happens more often as we get addicted to the dopamine rush. The fact that this is a very subtle and relatively less damaging addiction makes it difficult to regulate and prevent.

This has become quite a serious issue. When I observe people nowadays, many get very anxious when they cannot check their phones or laptops periodically. A few years ago, I used to notice such patterns among young people, but recently, I see more and more people older than me becoming addicted to mobile services and periodically losing their focus. So it is becoming a universal problem for people with access to advanced IT. To be honest, I sometimes get hooked on such apps myself - I find myself binge-watching YouTube videos or unconsciously checking the Facebook newsfeed.

Then how can we regain our ability to focus? Swami Sarvapriyananda gives some hints in this talk based on the science of focus and Swami Vivekananda’s philosophy of education. Below are some of the key points of the lecture, with my comments.

The Science of FLOW

According to Mihaly Csikszentmihalyi, the author of Flow: The Psychology of Optimal Experience, our cognitive bandwidth is quite limited. Specifically, we can process about 110 bits of information every second. For example, it is difficult to hold two separate conversations at the same time. Csikszentmihalyi then argues that concentration is a matter of how much of this cognitive bandwidth we can devote to one subject at a time. Thus, if we focus deeply on one thing, there is little mental capacity left to attend to other external things.

Balance challenge and skills

An important technique to enhance the ability to concentrate is to balance the difficulty of a task against your capability. If the task is too challenging or demanding, it is likely to make you anxious and worried. In contrast, if your skill exceeds the difficulty by a great margin, you are likely to get bored easily. Hence, to maximize your attention, you should carefully choose an appropriate task after examining your own capability. Such a state of maximum concentration is called flow.

Concentration is an enjoyable experience

Flow is not only productive and engaging, but also pleasurable. After you focus deeply on something, you tend to feel very satisfied, happy, fulfilled, and integrated. In contrast, after wasting hours surfing the internet or binge-watching Netflix, you are likely to feel lethargic and disoriented. Furthermore, how much you can concentrate on one thing is related to overall happiness and satisfaction in life. If you can turn away from unnecessary worries and concerns and focus on the important things in life, you can be more successful and happier in the long run.

Practice makes perfect

Persistence is the key to concentration as well. Many people think that mental capabilities are either easily acquired or fixed at birth. However, just as when training your muscles, consistently working hard to develop the ability to focus is of great importance. In my experience, this creates a positive loop in your life. As one develops the ability to focus, things get done much more smoothly. Then, one is further motivated to develop the ability, performance improves even more, and so on. This is why some people like Gates and Buffett succeed enormously, even though they don’t look extraordinarily different from other people.

Raja yoga - the best method for concentration.

There is a snake surrounding the emblem of the Ramakrishna Order, which was designed by Swami Vivekananda himself. The serpent symbolizes (raja) yoga and the awakened kundalini. When a snake approaches its prey, it focuses intensely. It also extends its hood, which can be interpreted as cutting out all distractions. Finally, it holds onto one thing for a long period of time. To summarize, the three key aspects of concentration are (1) focusing on one thing, (2) removing everything else that can be a distraction, and (3) keeping the attention for as long as possible.

Hence, if you want to maximize your flow while working on something, you had better get rid of everything that can potentially distract you. Nowadays, the information technologies that we discussed, such as mobile phones and social media, should be avoided at all costs when you are focusing.

According to Csikszentmihalyi, the best guide to improving focus is the Yoga Sutras of Patanjali. Daniel Goleman also dedicates pages to describing the science behind meditation. I won’t go into the details of yoga and meditation here, since you can easily find them online nowadays. But I can assure you that they really work.

Be a single-tasker

This is actually quite the opposite of what we usually do nowadays. People are so used to multi-tasking - while working, they listen to music, constantly check emails and messages, think about what to do after work, and so on. This might be productive in the short run, since time is limited. However, in the long run, it is desirable to focus on one thing at a time in order to develop your focus and maintain productivity. One practical suggestion given by the Swami is to focus intensely on one thing for a short span of time and take a rest after that. Also, even when you are doing relatively petty things such as chatting with a friend or watching a movie on Netflix, focus on just the thing that you are doing.

Towards unselfishness

Swami Vivekananda emphasized unselfishness over concentration. The ability to concentrate is a great psychic power, and we have seen throughout history that if that power is misused, the consequences can be catastrophic. It is important to balance self-centeredness and unselfishness when applying your focus to your daily work.


  • The Social Dilemma
  • Mihaly Csikszentmihalyi, Flow: The Psychology of Optimal Experience
  • Daniel Goleman, Focus: The Hidden Driver of Excellence

Matrix Factorization with fast.ai - Collaborative filtering with Python 16


In this posting, let’s start getting our hands dirty with fast.ai. fast.ai is a Python package for deep learning that uses PyTorch as a backend. It provides modules and functions that make implementing many deep learning models very convenient. More information on fast.ai can be found in the documentation. Here, we will just be implementing collaborative filtering models, but if you want to learn more about deep learning and fastai, I strongly recommend starting with the Practical Deep Learning for Coders course by Jeremy Howard.

Data Import

Let’s start by importing the MovieLens 100k data that we used before with the Surprise package. You can use the functions provided by fast.ai, but let us try doing it from scratch so that you can import any data later on. If you are downloading the data manually, please download the zip file and unzip it. If you are using Google Colab or Jupyter Notebook like me, use the commands below. For more information on downloading files from the Web in Colab, please refer to this posting.

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip
--2020-11-27 22:14:57--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)...
Connecting to files.grouplens.org (files.grouplens.org)||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ml-100k.zip

ml-100k.zip         100%[===================>]   4.70M  12.2MB/s    in 0.4s    

2020-11-27 22:14:58 (12.2 MB/s) - ml-100k.zip saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test  

Just to check that the file was downloaded and unzipped properly, run the command below.

!ls

If you see an ml-100k folder, you are all set!

ml-100k  ml-100k.zip  sample_data

Finally, we can import the downloaded data with the read_csv function in Pandas.

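import pandas as pd

# u.data is tab-separated with no header row: user id, item id, rating, timestamp
data = pd.read_csv('ml-100k/u.data', sep="\t", header=None,
                   usecols=[0, 1, 2],  # keep user, item, rating; drop timestamp
                   names=['user', 'item', 'rating', 'timestamp'])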


Creating Data Bunch

The primary data structure used in fastai is the data bunch, which builds on the data loader in PyTorch. For collaborative filtering tasks, fastai provides CollabDataBunch, which makes our life much easier.

Let’s start by dividing the data frame into training and testing data. We divide the data in a 7-3 ratio.

train_df = data.iloc[:70000]
test_df = data.iloc[70000:]

Since the data is in data frame format, we use the from_df() function to create a CollabDataBunch. Note that the test data is passed to the test parameter. Other key parameters are valid_pct, which is the proportion of the data held out as the validation set, and bs, which refers to the batch size.

from fastai.collab import *  # assumes fastai v1, where CollabDataBunch lives in fastai.collab
databunch = CollabDataBunch.from_df(train_df, test=test_df, valid_pct=0.1, bs=128)

Creating and training a matrix factorization model

Simple collaborative filtering models can be implemented with collab_learner(). Note that we have to set y_range, which specifies the possible range of values that the target variable, i.e., the rating in this case, can take.

learn = collab_learner(databunch, n_factors=50, y_range=(0, 5))

The basic collab_learner model is EmbeddingDotBias - this is identical to the SVD model that we have seen before. The model has four parameters - u_weight, i_weight, u_bias, and i_bias; we will see later what these parameters refer to.

  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1622, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1622, 1)
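As a rough sketch of how these four parameters combine into one prediction, the plain Python function below takes the dot product of a user vector and an item vector, adds the two biases, and squashes the result into y_range with a scaled sigmoid. This reflects my understanding of what EmbeddingDotBias computes rather than the library code itself; the function name dot_bias_predict and the toy two-dimensional vectors are hypothetical (the actual model uses 50 factors).

```python
import math

def dot_bias_predict(u_weight, i_weight, u_bias, i_bias, y_range=(0.0, 5.0)):
    """Predict a rating from one user's and one item's latent factors and biases."""
    # Interaction term: dot product of the user and item factor vectors.
    raw = sum(u * i for u, i in zip(u_weight, i_weight)) + u_bias + i_bias
    # Scaled sigmoid keeps the prediction strictly inside y_range.
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-raw))

# Toy values for illustration only.
pred = dot_bias_predict(u_weight=[0.5, -0.2], i_weight=[0.4, 0.1],
                        u_bias=0.1, i_bias=-0.05)
print(round(pred, 3))
```

In this reading, u_weight and i_weight are the user and item latent factors, while u_bias and i_bias shift the prediction for users who rate generously and for items that are rated highly overall.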

To train the model with the given data, we use the fit() function. We train for 5 epochs here.

epoch	train_loss	valid_loss	time
0	0.953882	0.913318	00:06
1	0.808686	0.853414	00:06
2	0.677575	0.839932	00:06
3	0.551616	0.858383	00:06
4	0.422990	0.894348	00:06

To evaluate the model on the test data, we can use the get_preds() function to get the model predictions and convert them into a NumPy array.

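from sklearn.metrics import mean_absolute_error

# get_preds() returns a list whose first element holds the predictions
y_pred = learn.get_preds(ds_type=DatasetType.Test)[0].numpy()
# Compare the predicted ratings with the actual ratings in the test set
print(mean_absolute_error(test_df["rating"], y_pred))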

The model shows a test MAE of around 0.75. This seems to be on a similar level to the performance shown by MF using Surprise, although we did not run cross validation here.
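For reference, MAE is simply the average absolute difference between the actual and predicted ratings. The sketch below computes it by hand on made-up numbers; the function name and the toy ratings are hypothetical and unrelated to the MovieLens results above.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rating pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy ratings for illustration only.
y_true = [4, 3, 5, 2]
y_pred = [3.5, 3.0, 4.0, 3.0]
print(mean_absolute_error_manual(y_true, y_pred))  # 0.625
```

An MAE of 0.75 therefore means that the predictions are off by three quarters of a star on average.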


In this posting, we have seen how to import data and implement a simple matrix factorization model using fastai. In the following postings, let’s see how we can implement deep recommender models with fastai.


  • Collaborative filtering tutorial. (https://docs.fast.ai/tutorial.collab)
  • Collaborative filtering using fastai. (https://towardsdatascience.com/collaborative-filtering-using-fastai-a2ec5a2a4049)

Deep Recommender Systems - Collaborative filtering with Python 15


In previous postings, we have reviewed core concepts and models in collaborative filtering. We also implemented models that marked seminal developments in the field, including k-NN and SVD. Now, let’s switch gears and look at deep learning models that demonstrate state-of-the-art results in many recommender tasks. Deep recommender systems are such a rapidly developing sub-field that they take up a substantial part of this series.

Deep recommender systems

Photo by Alina Grubnyak on Unsplash

Recently, deep recommender systems, or deep learning-based recommender systems, have become an indispensable tool for many online and mobile service providers. The capacity of deep learning models to effectively capture non-linear patterns in data attracts many data analysts and marketers. It has been reported that deep learning is used for recommendations at YouTube (Covington et al. 2016), Google Play (Cheng et al. 2016), and Facebook (Naumov et al. 2019).

Accordingly, there has been a considerable amount of work in academia as well, proposing numerous novel architectures for deep recommender systems. RecSys, one of the most prestigious conferences in recommender systems, has organized deep learning workshops (DLRS) since 2016 and deep learning paper sessions since 2018.

Why deep learning?

There are many reasons for advocating the use of deep learning in recommender systems (or many other applications). Here, the major advantages of deep learning are highlighted. For a more comprehensive review of deep recommender systems, please refer to Zhang et al. (2019).


Modern deep neural networks have the ability to represent non-linear patterns in data. Multiple layers provide higher levels of abstraction, resembling the human cognitive process. As a result, they can capture complex collaborative (and content) patterns that simpler models such as memory-based algorithms and SVD cannot learn.


Deep learning models can flexibly learn patterns from diverse types of data structures. Also, many recently proposed architectures are flexible enough to learn from both conventional data for collaborative patterns and unstructured data, e.g., image, text, and video, in a single model. These models are often known as “hybrid models” since they combine collaborative filtering and content-based filtering. Hence, they can fully utilize side information from diverse data sources, potentially leading to improvements in predictive accuracy and recommendations.

Inductive bias

Virtually every machine learning model exploits inductive biases in data. That is, there are some assumptions about the data that make the training process efficient and effective. For instance, recurrent neural networks are well suited to sequential information such as text, and convolutional neural networks to grid-like data such as images (or maybe Transformers nowadays? :) please refer to this posting if you are interested in Transformers and attention).

Such rules of thumb do not work every single time, and most deep models still need fine-tuning, but they are generally accepted. This makes the process of choosing and designing neural networks very efficient and the resulting models easier to deploy.

Is deep learning the “silver bullet”?

Finally, I want to conclude this posting with a word of caution. In short, deep learning models are not “the silver bullet” for recommender systems or any other application. First of all, it is difficult to meticulously tune very deep models since there are a lot of model parameters. If not properly trained, deep models are likely to underperform, sometimes showing inferior performance to simpler alternatives. Also, even when they show improved prediction accuracy, there remains the issue of interpretability, or explainability. Modern deep recommender systems are too complex to be completely understood by humans.

Moreover, “more sophisticated and complicated is better” is not the mantra. As Dacrema et al. (2019) pointed out, many recently proposed methods in top-tier outlets fail to show comparable performance to simple heuristic methods. Even amongst deep learning models, I have seen ample cases where a simple multi-layer perceptron with a single hidden layer shows superior performance to sophisticated RNN models when modeling time series data. Furthermore, in many cases, it would be difficult to deploy very deep models in practice for computational reasons, even though they show great experimental performance.

Hence, it should be emphasized again that different data structures and application contexts require different algorithms - there is no one-size-fits-all solution. Although deep learning is a very powerful tool, we shouldn’t have blind faith in it. Always start with learning about the application and do as many experiments as possible!

With that said, let’s see how we can (easily) implement deep recommender systems with Python and how effective they are in recommendation tasks!


  • Covington, P., Adams, J., & Sargin, E. (2016, September). Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems (pp. 191-198).
  • Cheng, H. T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., … & Anil, R. (2016, September). Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems (pp. 7-10).
  • Dacrema, M. F., Cremonesi, P., & Jannach, D. (2019, September). Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (pp. 101-109).
  • Naumov, M., Mudigere, D., Shi, H. J. M., Huang, J., Sundaraman, N., Park, J., … & Dzhulgakov, D. (2019). Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091.
  • Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1), 1-38.