Ex) MovieLens 1M Data Set


Data


The MovieLens data was collected by GroupLens Research from the MovieLens website. The data contains user IDs, movie information, ratings, and so on. We use the 1M data set, which has about 1 million ratings from 6,040 users on roughly 4,000 movies. The MovieLens data set, its usage license, and other details can be downloaded from the GroupLens website.

Python Code


(1) Import the NumPy Library First
import numpy as np 
(2) Review What the Data Looks Like
Before we load the data, let’s look at the data file itself.
# 1::1193::5::978300760
# 1::661::3::978302109
# 1::914::3::978301968
# 1::3408::4::978300275
# 1::2355::5::978824291
# 1::1197::3::978302268
# ... 
The data format is UserID::MovieID::Rating::Timestamp.
The MovieLens 1M data set uses a double colon (::) as the separator.
Note that Python indexing starts at 0, so
the 1st column (index 0) = user IDs
the 3rd column (index 2) = ratings
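To make the format concrete, here is a minimal sketch that parses one raw line by hand. The line is copied from the sample above; the variable names are only for illustration.

```python
# One raw line, copied from the sample shown above.
line = "1::1193::5::978300760"

# Split on the double-colon separator and convert each field to int.
fields = [int(field) for field in line.split("::")]
user_id, movie_id, rating, timestamp = fields

print(fields[0])  # index 0: user ID
print(fields[2])  # index 2: rating
```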
(3) Read the Data
data = np.loadtxt("data/ml-1m/ratings.dat", 
                  delimiter = "::", dtype=np.int64)
Once the data is loaded, check the first few rows and the shape of the data.
data[:7, :]  # check the first 7 rows 
array([[        1,      1193,         5, 978300760],
       [        1,       661,         3, 978302109],
       [        1,       914,         3, 978301968],
       [        1,      3408,         4, 978300275],
       [        1,      2355,         5, 978824291],
       [        1,      1197,         3, 978302268],
       [        1,      1287,         5, 978302039]])
data.shape 
(1000209, 4)
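A note on NumPy versions: as far as I know, np.loadtxt in NumPy 1.23 and later accepts only single-character delimiters, so delimiter="::" may raise an error there. A minimal workaround sketch, assuming that behavior, is to replace "::" with a comma line by line before parsing. Here load_ratings is a hypothetical helper name, and the small in-memory sample stands in for ratings.dat.

```python
import io

import numpy as np


def load_ratings(lines):
    """Parse 'UserID::MovieID::Rating::Timestamp' lines into an int64 array."""
    # Swap the two-character separator for a single comma on the fly.
    cleaned = (line.replace("::", ",") for line in lines)
    return np.loadtxt(cleaned, delimiter=",", dtype=np.int64)


# Small in-memory stand-in for data/ml-1m/ratings.dat.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)
data = load_ratings(sample)
print(data.shape)
```

With the real file, the same helper could be called on an open file handle, for example inside a with open("data/ml-1m/ratings.dat") as f: block.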
(4) Calculate the Overall Mean Rating
Let’s calculate the overall mean, that is, the mean of the column at index 2.
totalmean_rate = data[:, 2].mean()
totalmean_rate
3.5815644530293169
3.58 is the mean of all ratings. But each user has at least 20 ratings in the data. What is the mean rating of each user?
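The number of ratings per user can be checked with np.unique and return_counts=True. Below is a minimal sketch on a toy array; the toy values are made up for illustration and are not taken from the data set.

```python
import numpy as np

# Toy stand-in for the ratings array.
# Columns: user ID, movie ID, rating, timestamp.
toy = np.array([
    [1, 10, 5, 0],
    [1, 11, 3, 0],
    [2, 10, 4, 0],
    [2, 12, 2, 0],
    [2, 13, 5, 0],
], dtype=np.int64)

# Unique user IDs and how many ratings each one has.
user_ids, counts = np.unique(toy[:, 0], return_counts=True)

print(user_ids)      # unique user IDs
print(counts)        # number of ratings per user
print(counts.min())  # smallest number of ratings by any user
```

Run on the real data array, counts.min() should be 20 or more if every user really has at least 20 ratings.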

(5) Calculate the Mean Rating of Each User
First, we need to extract the unique user IDs from the data to see how many (unique) users are in the data set.
ids = np.unique(data[:,0]) 
ids 
array([   1,    2,    3, ..., 6038, 6039, 6040])
So there are 6040 user IDs in the data. We will calculate the mean rating of each user, so the final result is a list of 6040 elements, each consisting of a user ID and its mean rating.
Let’s make an empty list first.
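mean_rating_user = []  # will hold [user_id, mean_rating] pairs
for user_id in ids:
    # keep only the rows that belong to this user
    data_for_user = data[data[:, 0] == user_id, :]
    # average the rating column (index 2) for this user
    mean_rating_id = data_for_user[:, 2].mean()
    mean_rating_user.append([user_id,
                            mean_rating_id])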
For example, when user_id is equal to 1, all rows for user ID 1 are stored in the data_for_user array. Because data_for_user contains only the ratings of that user, we can simply call mean() on its rating column. The user_id and mean_rating_id are then appended to the mean_rating_user list. The loop repeats this for every user from 1 to 6040.
Now let’s check the results.

mean_rating_user[:5]
[[1, 4.1886792452830193],
 [2, 3.7131782945736433],
 [3, 3.9019607843137254],
 [4, 4.1904761904761907],
 [5, 3.1464646464646466]]
The first column is the user ID, and the second column is its mean rating.
We can also convert it to an array.
mean_rating_user_array = np.array(mean_rating_user, dtype=np.float32)
mean_rating_user_array[:5]
array([[ 1.        ,  4.18867922],
       [ 2.        ,  3.7131784 ],
       [ 3.        ,  3.90196085],
       [ 4.        ,  4.19047642],
       [ 5.        ,  3.14646459]], dtype=float32)
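The loop above scans the whole array once per user, which can be slow with 6040 users. As an alternative, here is a minimal vectorized sketch using np.unique and np.bincount. This is a swapped-in technique, not the method used in this post, and the toy array is made up for illustration.

```python
import numpy as np

# Toy stand-in for the ratings array.
# Columns: user ID, movie ID, rating, timestamp.
toy = np.array([
    [1, 10, 5, 0],
    [1, 11, 3, 0],
    [2, 10, 4, 0],
    [2, 12, 2, 0],
    [2, 13, 5, 0],
], dtype=np.int64)

# inverse maps every row to the position of its user in user_ids.
user_ids, inverse = np.unique(toy[:, 0], return_inverse=True)

# Sum of ratings and number of ratings per user, in one pass each.
rating_sums = np.bincount(inverse, weights=toy[:, 2])
rating_counts = np.bincount(inverse)

means = rating_sums / rating_counts

# Two columns: user ID and mean rating, like mean_rating_user_array.
result = np.column_stack((user_ids, means))
print(result)
```

The result has the same layout as mean_rating_user_array, one row per user, but it avoids building a boolean mask for every user ID.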
(6) Save the Result as a CSV File
np.savetxt("mean_rating_user.csv",
           mean_rating_user_array, fmt="%.3f",  # three decimal places
           delimiter=",")
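To confirm that the file was written as expected, it can be read back with np.loadtxt. The sketch below writes a small made-up array to a temporary directory so it does not touch your working folder. The format string "%.3f" (three decimal places) and the sample values are assumptions of this sketch.

```python
import os
import tempfile

import numpy as np

# Small made-up result: user ID and mean rating per row.
result = np.array([[1.0, 4.1886792],
                   [2.0, 3.7131784]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mean_rating_user.csv")
    # Write with three decimal places, comma separated.
    np.savetxt(path, result, fmt="%.3f", delimiter=",")
    # Read the file back to check its contents.
    loaded = np.loadtxt(path, delimiter=",")

print(loaded.shape)
print(loaded)
```

The values read back match the originals only up to the precision chosen in fmt.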

Thanks! Have fun in Python!
