Buomsoo Kim

Downloading & unzipping compressed file formats in Google Colab

|

In previous posting, we went through downloading and importing datasets from the Web. With functions in NumPy and Pandas, we can import most datasets available on the Web, e.g., csv, tsv, spreadsheets, and txt files.

However, datasets, especially the large one, are provided in compressed formats such as .zip or .gz. In such cases, we should first unzip them to have access to the raw data. By taking advantage of terminal commands, we can conveniently hack this problem. Don’t be frightened with terminal commands, espeically if you are new to this. It is much easier and quicker than you think!

Finding the data source

First and foremost, we need to know basic details of the data source. In this tutorial, let’s try downloading and importing a dataset from MovieLens. Among many datasets, let’s try Small MovieLens Latest Datasets recommended for education and development. The dataset that we want is contained in a zip file named ml-latest-small.zip.

As before, we first need to copy the url to the zip file. FYI, the url looks like this here: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip. And if you download the zip file and open it, you will see that there are four csv files contained in a folder ml-latest-small.

Here, let’s try opening the ratings.csv file. The file has 100,836 instances with four attributes - userId, movieId, rating, and timestamp.

Download and unzip the compressed file

Then, now we can create a colab file and download and unzip the compressed file (ml-latest-small.zip). To download the compressed file (or any file in general), you can use the !wget command as below.

!wget url_to_the_zip_file

Then, you will need to unzip the compressed file to open the files contained in it. !unzip command will work for most files with extension .zip. However, for .gz or .tgz files, try the !gunzip command instead.

!unzip compressed_file_name.zip

Now check with the !ls command to check out whether the file is properly downloaded and unzipped. You should see ml-latest-small folder and ml-latest-small.zip file as below.

Import data

Finally, you can import the data using functions such as read_csv() or np.loadtxt() as we have seen in the previous posting.

In addition, if you want to move the working directory to the folder that you created while unzipping, you can use the %cd command (“change directory”).

%cd directory_name

In this posting, we have gone through the process of downloading, unzipping, and importing compressed datasets. Combining techniques outlined here and in other postings, you will be able to fetch most data to Colab with relative ease.