GPS Demo: predicting user involvement

Tags: sdf, nonnegativity, missing elements, imposing structure

Using real-life data, we try to predict whether a user participates in an activity at a certain location, using Tensorlab’s structured data fusion (SDF) framework. The GPS dataset consists of a tensor and several additional matrices [1]:

  • UserLocAct: a third order tensor containing how many times a user has participated in an activity at a certain location;
  • UserUser: similarity between users;
  • LocFea: each location is described by a number of features, each counting the points of interest of a certain type;
  • ActAct: correlation between activities;
  • UserLoc: how many times a user has visited a location.

We try to approximate these datasets using a low-rank model with shared components as shown in Figure 61. Following [2], we formulate the problem mathematically as

\[\begin{split}\underset{\mat{U},\mat{L},\mat{A},\mat{F},\vec{d}_{\text{uu}},\vec{d}_{\text{aa}},\vec{d}_{\text{ul}}}{\operatorname{minimize}}~~ &\frac{\omega_1}{2}\norm{\ten{T}-\cp{\mat{U},\mat{L},\mat{A}}}_{\scriptsize{\text{F}}}^2 + \frac{\omega_2}{2}\norm{\mat{M}_{\text{uu}}-\mat{U}\mat{D}_{\text{uu}}\transpose{\mat{U}}}_{\scriptsize{\text{F}}}^2 +\\ &\frac{\omega_3}{2}\norm{\mat{M}_{\text{ul}}-\mat{U}\mat{D}_{\text{ul}}\transpose{\mat{L}}}_{\scriptsize{\text{F}}}^2+ \frac{\omega_4}{2}\norm{\mat{M}_{\text{lf}}-\mat{L}\transpose{\mat{F}}}_{\scriptsize{\text{F}}}^2+\\ &\frac{\omega_5}{2}\norm{\mat{M}_{\text{aa}}-\mat{A}\mat{D}_{\text{aa}}\transpose{\mat{A}}}_{\scriptsize{\text{F}}}^2.\end{split}\]

The matrices \(\mat{U}\), \(\mat{A}\), \(\mat{L}\) and \(\mat{F}\) are the variables which are used to predict the user involvement. The extra variables \(\vec{d}_{\text{uu}}\), \(\vec{d}_{\text{ul}}\), and \(\vec{d}_{\text{aa}}\) are needed to capture scaling differences between datasets. The matrices \(\mat{D}_{\text{uu}}\), \(\mat{D}_{\text{ul}}\), and \(\mat{D}_{\text{aa}}\) are diagonal matrices with the respective vectors on their diagonal. The parameters \(\omega_i\) are the weights for each term.
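Each symmetric term such as \(\mat{U}\mat{D}_{\text{uu}}\transpose{\mat{U}}\) is simply a CPD of a matrix with a 1-by-\(R\) scaling factor, which is how it will be expressed in the SDF model below. This can be checked numerically (a quick sketch, not part of the demo itself; it assumes Tensorlab’s cpdgen is on the path and uses arbitrary example sizes):

R = 2; n = 4;
U = randn(n, R);
d = randn(1, R);               % plays the role of d_uu
M1 = U * diag(d) * U.';        % diagonally scaled symmetric product
M2 = cpdgen({U, U, d});        % CPD with the row vector d as third factor
assert(norm(M1 - M2, 'fro') < 1e-10 * norm(M1, 'fro'))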

The demo_gps.m file contains a Matlab implementation of this demo and can be downloaded here.

GPS dataset

Figure 61: The different datasets and their relations.

Obtaining the data

The GPS dataset can be obtained from Microsoft Research. The demo file tries to download the files automatically.

Preprocessing

Here, we will only predict whether or not a user participates in an activity at a certain location. Therefore, the number of participations is ignored:

UserLocAct(UserLocAct > 0) = 1;

Users without any user-location-activity data are ignored, as we cannot evaluate how well the algorithm predicts whether such a user participates in an activity at a certain location. This removes 18 users from UserLocAct, UserLoc and UserUser.

u = any(any(UserLocAct,2),3);
UserLocAct = UserLocAct(u,:,:);
UserLoc = UserLoc(u,:);
UserUser = UserUser(u,u);

Finally, we normalize the location-feature data such that each column sums to one:

LocFea = bsxfun(@rdivide,LocFea,sum(LocFea));
LocFea(~isfinite(LocFea)) = 0;
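As a quick sanity check (not part of the original demo), every column that contains at least one point of interest should now sum to one, while all-zero columns were reset by the isfinite fix:

colsums = sum(LocFea);
assert(all(abs(colsums(colsums > 0) - 1) < 1e-12))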

Predicting random events

First, we try to predict whether a user will perform an activity at a certain location by randomly removing entries from UserLocAct (see Figure 62, left). This can be achieved easily by setting the missing entries to nan. The fmt function then reformats the tensor as an incomplete tensor. (This last step is performed automatically by all algorithms if needed.)

missing = 0.80;
sel = randperm(numel(UserLocAct), round(missing*numel(UserLocAct)));
UserLocAct_missing = UserLocAct;
UserLocAct_missing(sel) = nan;
UserLocAct_missing = fmt(UserLocAct_missing);

Figure 62: The two scenarios: 20% of known entries (left) and 50 missing users (right).

Without data fusion

As a baseline test, only the UserLocAct_missing tensor will be used. We model the tensor by a rank \(R=2\) CPD and add regularization to prevent overfitting. Mathematically, we solve:

\[\begin{split}\underset{\mat{U},\mat{L},\mat{A}}{\operatorname{minimize}}~~ &\frac{\omega_1}{2}\norm{\ten{T}-\cp{\mat{U},\mat{L},\mat{A}}}_{\scriptsize{\text{F}}}^2 + \frac{\omega_2}{2}(\norm{\mat{U}}_{\scriptsize{\text{F}}}^2+\norm{\mat{L}}_{\scriptsize{\text{F}}}^2+\norm{\mat{A}}_{\scriptsize{\text{F}}}^2).\end{split}\]

In Tensorlab, we use the sdf_nls method to solve this model. (The cpd_nls method can be used as well if regularization is not required.) As usual, the SDF model is defined in three steps: variable definition, factor matrix definition, and the choice of a model for each dataset.

First, three variables are defined: u for the users, l for the locations and a for the activities. For each variable, we define a random initial guess; a better guess can be used if one is available from prior knowledge or from previously computed models.

R = 2;
model.variables.u   = rand(nbUsers,R);
model.variables.l   = rand(nbLocations,R);
model.variables.a   = rand(nbActivities,R);

Next, we define the factor matrices. To illustrate the use of constraints, we impose nonnegativity constraints on all factor matrices.

model.factors.U   = {'u', @struct_nonneg};
model.factors.L   = {'l', @struct_nonneg};
model.factors.A   = {'a', @struct_nonneg};

Finally, the tensor is modeled. A CPD will be used here. We also add L2 regularization by defining a second factorization. In the case of regularization, the data part can be omitted if it is zero. Note that regL2 is implemented to add a term \(\norm{\cdot}^2_{\scriptsize{\text{F}}}\) for each factor matrix separately.

model.factorizations.ula.data = UserLocAct_missing;
model.factorizations.ula.cpd  = {'U','L','A'};

model.factorizations.reg.regL2 = {'U','L','A'};

We can investigate the resulting model using the language parser sdf_check:

sdf_check(model, 'print');

If no errors have been made, the following overview is printed:

Factor               Size                 Comment
------------------------------------------------------------------------
U                    [146x2]
L                    [168x2]
A                    [5x2]

Factorization        Type       Factors              Comments
------------------------------------------------------------------------
ula (id = 1)         cpd        U,L,A
reg (id = 2)         regL2      U,L,A

In the Matlab Command Window, the factors are links. These links can be used to investigate the model further. For example, when clicking on the factor U, the steps in its construction are revealed:

Expansion of factor U (id = 1) (view):
  (1,1)  u               [146x2]    struct_nonneg        Factor         [146x2]

We can now call the sdf_nls routine to solve the problem, passing extra optimization options. The option that most affects the results here is RelWeights, which assigns a weight to each factorization, in this case the tensor term and the regularization term. The regularization weight lambdareg has been chosen experimentally as 0.01:

lambdareg = 0.01;

options = struct;
options.Display    = 10;
options.MaxIter    = 500;
options.TolFun     = 1e-8;
options.TolX       = 1e-4;
options.CGMaxIter  = 150;
options.RelWeights = [1 lambdareg];

% call the solver
[sol,out] = sdf_nls(model,options);

To measure the prediction quality, we use performance curves and the area-under-curve (AUC) value. Here, we compare the predicted values pred with the removed values lab. The predicted values are computed by reconstructing the tensor from the computed factor matrices. The result can be seen in Figure 1.

pred = cpdgen({sol.factors.U,sol.factors.L,sol.factors.A});
lab  = UserLocAct(sel);
pred = pred(sel);

[X,Y,~,AUC] = perfcurve(lab(:), pred(:),1);
plot(X,Y);
Figure 1: Performance curves for the regularized CPD model and the model using data fusion. The respective AUC values are 0.865 and 0.948.

With data fusion

Now, we enrich the model by using the other datasets as well. We need an extra variable f to model the features in the LocFea matrix.

model.variables.f = rand(size(LocFea, 2), R);

Also, because the data in the tensor and the matrices may be scaled differently, we introduce extra variables duu, daa and dul to capture the scaling differences of the UserUser, ActAct and UserLoc matrices. We do not need an extra variable for the LocFea matrix, as its scaling differences are captured by F.

model.variables.duu = rand(1,R);
model.variables.daa = rand(1,R);
model.variables.dul = rand(1,R);

Like before, we create the corresponding factor matrices. Here, we use a more dynamic approach to apply constraints. If constr = {}, then cat(2, 'u', constr) is equivalent to {'u'}; if constr = {@struct_nonneg}, then cat(2, 'u', constr) becomes {'u', @struct_nonneg}, as used above. This dynamic notation is useful when experimenting with different types of constraints.

constr = {};
model.factors.U   = cat(2,'u',constr);
model.factors.L   = cat(2,'l',constr);
model.factors.A   = cat(2,'a',constr);
model.factors.F   = cat(2,'f',constr);
model.factors.Duu = cat(2,'duu',constr);
model.factors.Daa = cat(2,'daa',constr);
model.factors.Dul = cat(2,'dul',constr);

Alternatively, the transform field can be used. Instead of defining the factors, we can write:

model.transform = {{'u', 'l', 'a', 'f', 'duu', 'daa', 'dul'}, ...
                   {'U', 'L', 'A', 'F', 'Duu', 'Daa', 'Dul'}, ...
                   @struct_nonneg};

We now define the models for the different datasets. Here, we choose a CP model for all matrices and apply L2 regularization to all factor matrices. Note that the factor matrices Duu, Daa and Dul can be used without any problem, because a matrix is simply a tensor with third dimension equal to 1.

model.factorizations.ula.data = UserLocAct_missing;
model.factorizations.ula.cpdi = {'U','L','A'};

model.factorizations.uu.data = UserUser;
model.factorizations.uu.cpd  = {'U','U','Duu'};

model.factorizations.lf.data = LocFea;
model.factorizations.lf.cpd  = {'L','F'};

model.factorizations.aa.data = ActAct;
model.factorizations.aa.cpd  = {'A','A','Daa'};

model.factorizations.ul.data = UserLoc;
model.factorizations.ul.cpd  = {'U','L','Dul'};

model.factorizations.reg.regL2 = {'U','L','A','F','Duu','Daa','Dul'};

Note that we used cpdi instead of cpd for the ula factorization. The cpdi factorization type activates a specialized kernel for incomplete tensors, which usually improves the convergence behavior compared to cpd.

We can again use the sdf_check method to get an overview of the model:

sdf_check(model, 'print');
Factor               Size                 Comment
------------------------------------------------------------------------
U                    [146x2]
L                    [168x2]
A                    [5x2]
F                    [14x2]
Duu                  [1x2]
Daa                  [1x2]
Dul                  [1x2]

Factorization        Type       Factors              Comments
------------------------------------------------------------------------
ula (id = 1)         cpdi       U,L,A
uu (id = 2)          cpd        U,U,Duu
lf (id = 3)          cpd        L,F
aa (id = 4)          cpd        A,A,Daa
ul (id = 5)          cpd        U,L,Dul
reg (id = 6)         regL2      U,L,A,F,Duu,Daa,Dul

The sdf_nls method is used to compute the results for the model. The relative weights have been determined using grid search.

options.RelWeights = [1 0.005 0.001 0.005 0.001 1e-6];
[sol,out] = sdf_nls(model,options);

The ROC curve of the result can be seen in Figure 1. Using the extra information clearly improved the area-under-curve (AUC).

Predicting for new persons

In this second scenario 50 random persons are removed from the tensor to simulate a cold start scenario (see Figure 62, right). We are unable to predict values for the deleted users when we only use the UserLocAct tensor. Therefore, we use the information from the other datasets to predict the missing values.

The strategy is the same as above. First, the user-location-activity information of 50 persons is removed and the interactions between the datasets are modeled. Next, the model is solved using the sdf_nls method for different parameters. Finally, we compute the performance of the different models.

nbmissingpersons = 50;
sel = randperm(size(UserLocAct,1), nbmissingpersons);
UserLocAct_missing = UserLocAct;
UserLocAct_missing(sel,:,:) = nan; % Remove persons
UserLocAct_missing = fmt(UserLocAct_missing); % format as incomplete tensor

The model is the same as in the previous section, with the following parameters, found using grid search:

constr = {@struct_nonneg};
options.RelWeights = [1 1e-4 0.1 1e-4 1e-4 0.08];

The performance can be computed as follows; the resulting ROC curve is shown in Figure 2. The area-under-curve (AUC) value is 0.968.

pred = cpdgen({sol.factors.U,sol.factors.L,sol.factors.A});
lab = UserLocAct(sel,:,:);
pred = pred(sel,:,:);
[X,Y,T,AUC] = perfcurve(lab(:),pred(:),1);

plot(X,Y);
Figure 2: Performance curves for the SDF model where 50 users are removed randomly. The AUC is 0.968.

Further exploration hints

The goal of this demo was to introduce some of the features of the modeling language used in the Structured Data Fusion framework. Many additions to these models can be explored, such as:

  • changing the model for a dataset, e.g., changing the model for the tensor to an LMLRA by defining an extra core tensor:

    model.variables.s = randn(R,R,R);
    model.factors.S = {'s'};
    model.factorizations.ula.data  = UserLocAct_missing;
    model.factorizations.ula.lmlra = {'U','L','A','S'};
    
  • allowing extra rank-1 terms for some matrices, e.g., for the UserUser matrix:

    model.variables.u2 = randn(size(UserLocAct,1), 1);
    model.variables.d2 = randn(1, 1);
    model.factors.U2   = {'u', 'u2'};   % concatenate an extra column to U
    model.factors.Duu2 = {'duu', 'd2'};
    model.factorizations.uu.cpd = {'U2', 'U2', 'Duu2'};
    
  • using L0 or L1 regularization instead of L2;

  • investigating the influence of the individual datasets;

  • using proper validation of the models;

  • changing constraints;

  • changing the weights.
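As a starting point for the regularization hint above, Tensorlab also provides the regL1 and regL0 factorization types. Swapping the L2 penalty for an L1 penalty could look as follows (a sketch building on the data fusion model above; the relative weights would need to be retuned):

% Replace the L2 penalty on the factor matrices by an L1 penalty.
model.factorizations = rmfield(model.factorizations, 'reg');
model.factorizations.reg.regL1 = {'U','L','A','F','Duu','Daa','Dul'};
sdf_check(model, 'print');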