GPS Demo: predicting user involvement¶
Tags: sdf, nonnegativity, missing elements, imposing structure
Using real-life data, we try to predict whether a user participates in an activity at a certain location using Tensorlab’s structured data fusion (SDF) framework. The GPS dataset consists of a tensor and some additional matrices [1]:
UserLocAct: a third-order tensor containing how many times a user has participated in an activity at a certain location;
UserUser: similarity between users;
LocFea: each location is described by features, each counting a certain type of point of interest;
ActAct: correlation between activities;
UserLoc: how many times a user has visited a location.
We try to approximate these datasets using a low-rank model with shared components as shown in Figure 61. Following [2], we formulate the problem mathematically as
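In the notation below, \(\mathcal{T}\) denotes the UserLocAct tensor, \(\mat{M}_{\text{uu}}\), \(\mat{M}_{\text{lf}}\), \(\mat{M}_{\text{aa}}\) and \(\mat{M}_{\text{ul}}\) denote the UserUser, LocFea, ActAct and UserLoc matrices, respectively, and \(\llbracket\cdot\rrbracket\) denotes the CPD with the given factor matrices; this objective is a sketch consistent with the factorizations defined further in this demo:

\[
\begin{aligned}
\min_{\mat{U},\mat{L},\mat{A},\mat{F},\vec{d}_{\text{uu}},\vec{d}_{\text{ul}},\vec{d}_{\text{aa}}}\;
&\frac{\omega_1}{2}\norm{\mathcal{T}-\llbracket\mat{U},\mat{L},\mat{A}\rrbracket}^2_{\scriptsize{\text{F}}}
+\frac{\omega_2}{2}\norm{\mat{M}_{\text{uu}}-\mat{U}\mat{D}_{\text{uu}}\mat{U}^{\text{T}}}^2_{\scriptsize{\text{F}}}
+\frac{\omega_3}{2}\norm{\mat{M}_{\text{lf}}-\mat{L}\mat{F}^{\text{T}}}^2_{\scriptsize{\text{F}}}\\
&+\frac{\omega_4}{2}\norm{\mat{M}_{\text{aa}}-\mat{A}\mat{D}_{\text{aa}}\mat{A}^{\text{T}}}^2_{\scriptsize{\text{F}}}
+\frac{\omega_5}{2}\norm{\mat{M}_{\text{ul}}-\mat{U}\mat{D}_{\text{ul}}\mat{L}^{\text{T}}}^2_{\scriptsize{\text{F}}}\\
&+\frac{\omega_6}{2}\left(\norm{\mat{U}}^2_{\scriptsize{\text{F}}}+\norm{\mat{L}}^2_{\scriptsize{\text{F}}}+\norm{\mat{A}}^2_{\scriptsize{\text{F}}}+\norm{\mat{F}}^2_{\scriptsize{\text{F}}}+\norm{\vec{d}_{\text{uu}}}^2+\norm{\vec{d}_{\text{ul}}}^2+\norm{\vec{d}_{\text{aa}}}^2\right)
\end{aligned}
\]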
The matrices \(\mat{U}\), \(\mat{A}\), \(\mat{L}\) and \(\mat{F}\) are the variables used to predict the user involvement. The extra variables \(\vec{d}_{\text{uu}}\), \(\vec{d}_{\text{ul}}\) and \(\vec{d}_{\text{aa}}\) are needed to capture scaling differences between the datasets. The matrices \(\mat{D}_{\text{uu}}\), \(\mat{D}_{\text{ul}}\) and \(\mat{D}_{\text{aa}}\) are diagonal matrices with the respective vectors on their diagonal. The parameters \(\omega_i\) are the weights for each term.
The demo_gps.m file contains a Matlab implementation of this demo and can be downloaded here.
Obtaining the data¶
The GPS dataset can be obtained from Microsoft Research. The demo file tries to download the files automatically.
Preprocessing¶
Here, we will only predict whether or not a user participates in an activity at a certain location. Therefore, the number of participations is ignored:
UserLocAct(UserLocAct > 0) = 1;
Users without any user-location-activity data are ignored, as we are unable to evaluate how well the algorithm predicts whether such a user participates in an activity at a certain location. This way, 18 users are removed from UserLocAct, UserLoc and UserUser.
u = any(any(UserLocAct,2),3);
UserLocAct = UserLocAct(u,:,:);
UserLoc = UserLoc(u,:);
UserUser = UserUser(u,u);
Finally, we normalize the location-feature data by normalizing the columns such that they sum to one:
LocFea = bsxfun(@rdivide,LocFea,sum(LocFea));
LocFea(~isfinite(LocFea)) = 0;
Predicting random events¶
First, we try to predict whether a user will perform an activity at a certain location by randomly removing entries of UserLocAct (see Figure 62, left). This can be achieved easily by setting the missing entries to nan. The fmt function reformats the tensor as an incomplete tensor. (This last step is performed automatically by all algorithms if needed.)
missing = 0.80;
sel = randperm(numel(UserLocAct), round(missing*numel(UserLocAct)));
UserLocAct_missing = UserLocAct;
UserLocAct_missing(sel) = nan;
UserLocAct_missing = fmt(UserLocAct_missing);
Without data fusion¶
As a baseline test, only the UserLocAct_missing tensor will be used. We model the tensor by a rank \(R = 2\) CPD and add regularization to prevent overfitting. Mathematically, we solve:
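With \(\mathcal{T}\) the incomplete tensor UserLocAct_missing (only the known entries contribute to the data term), \(\llbracket\cdot\rrbracket\) the CPD of the given factor matrices and \(\lambda\) the regularization weight (lambdareg below), the baseline objective can be sketched as:

\[
\min_{\mat{U},\mat{L},\mat{A}\,\geq\,0}\;
\frac{1}{2}\norm{\mathcal{T}-\llbracket\mat{U},\mat{L},\mat{A}\rrbracket}^2_{\scriptsize{\text{F}}}
+\frac{\lambda}{2}\left(\norm{\mat{U}}^2_{\scriptsize{\text{F}}}+\norm{\mat{L}}^2_{\scriptsize{\text{F}}}+\norm{\mat{A}}^2_{\scriptsize{\text{F}}}\right)
\]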
In Tensorlab, we use the sdf_nls method to solve this model. The cpd_nls method can be used as well if regularization is not required. As usual, the SDF model is defined in three steps: variable definition, factor matrix definition and model choice for each dataset.
First, three variables are defined: u for the users, l for the locations and a for the activities. For each variable, we define a random initial guess. A better guess can be used if available from prior knowledge or from previously computed models.
R = 2;
model.variables.u = rand(nbUsers,R);
model.variables.l = rand(nbLocations,R);
model.variables.a = rand(nbActivities,R);
Next, we define the factor matrices. To illustrate the use of constraints, we impose nonnegativity constraints on all factor matrices.
model.factors.U = {'u', @struct_nonneg};
model.factors.L = {'l', @struct_nonneg};
model.factors.A = {'a', @struct_nonneg};
Finally, the tensor is modeled. A CPD is used here. We also add L2 regularization by defining a second factorization. In the case of regularization, the data part can be omitted if it is zero. Note that regL2 is implemented to add a term \(\norm{\cdot}^2_{\scriptsize{\text{F}}}\) for each factor matrix separately.
model.factorizations.ula.data = UserLocAct_missing;
model.factorizations.ula.cpd = {'U','L','A'};
model.factorizations.reg.regL2 = {'U','L','A'};
We can investigate the resulting model using the language parser sdf_check:
sdf_check(model, 'print');
If no errors have been made, the following overview is printed:
Factor Size Comment
------------------------------------------------------------------------
U [146x2]
L [168x2]
A [5x2]
Factorization Type Factors Comments
------------------------------------------------------------------------
ula (id = 1) cpd U,L,A
reg (id = 2) regL2 U,L,A
In the Matlab Command Window, the factor names are links that can be used to investigate the model further. For example, clicking on the factor U reveals the steps in its construction:
Expansion of factor U (id = 1) (view):
(1,1) u [146x2] struct_nonneg Factor [146x2]
We can now call the sdf_nls routine to solve the problem using extra optimization options. The option that most affects the results here is RelWeights, which assigns a weight to each factorization, in this case the tensor and the regularization. The lambdareg parameter has been chosen experimentally equal to 0.01:
lambdareg = 0.01;
options = struct;
options.Display = 10;
options.MaxIter = 500;
options.TolFun = 1e-8;
options.TolX = 1e-4;
options.CGMaxIter = 150;
options.RelWeights = [1 lambdareg];
% call the solver
[sol,out] = sdf_nls(model,options);
To measure the quality, we use performance (ROC) curves and the area-under-curve (AUC) value. Here, we compare the predicted values pred with the removed values. First, the predicted values are computed by reconstructing the tensor from the computed factor matrices. The result can be seen in Figure 1.
pred = cpdgen({sol.factors.U,sol.factors.L,sol.factors.A});
lab = UserLocAct(sel);
pred = pred(sel);
[X,Y,~,AUC] = perfcurve(lab(:), pred(:),1);
plot(X,Y);
With data fusion¶
Now, we enrich the model by using the other datasets as well. We need an extra variable f to model the features in the LocFea matrix.
model.variables.f = rand(size(LocFea, 2), R);
Also, because the data in the tensor and the matrices may be scaled differently, we introduce extra variables duu, daa and dul to capture the scaling differences in the UserUser, ActAct and UserLoc matrices, respectively. We do not need an extra variable for the LocFea matrix, as its scaling differences are captured in F.
model.variables.duu = rand(1,R);
model.variables.daa = rand(1,R);
model.variables.dul = rand(1,R);
Like before, we create the corresponding factor matrices. Here, we use a more dynamic approach to apply constraints. If constr = {}, then cat(2, 'u', constr) is equivalent to {'u'}; if constr = {@struct_nonneg}, then cat(2, 'u', constr) becomes {'u', @struct_nonneg}, as used above. This more dynamic notation becomes useful when experimenting with different types of constraints.
constr = {};
model.factors.u = cat(2,'u',constr);
model.factors.l = cat(2,'l',constr);
model.factors.a = cat(2,'a',constr);
model.factors.f = cat(2,'f',constr);
model.factors.duu = cat(2,'duu',constr);
model.factors.daa = cat(2,'daa',constr);
model.factors.dul = cat(2,'dul',constr);
Alternatively, the transform field can be used. Instead of defining the factors one by one, we can write:
model.transform = {{'u', 'l', 'a', 'f', 'duu', 'daa', 'dul'}, ...
{'U', 'L', 'A', 'F', 'Duu', 'Daa', 'Dul'}, ...
@struct_nonneg};
We now define the models used for the different datasets. Here, we choose a CP model for all matrices and apply L2 regularization to all factor matrices. Note that the factor matrices Duu, Daa and Dul can be added without any problem, because a matrix is a tensor with a third dimension equal to 1.
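Since a matrix is treated as a tensor with a singleton third dimension, the CPD with a 1-by-\(R\) third factor reproduces the diagonally weighted product. A small sanity check (assuming Tensorlab is on the Matlab path):

```matlab
% cpdgen({U, U, duu}) with duu of size 1-by-R yields a 4x4x1 tensor
% whose only frontal slice equals U*diag(duu)*U.'.
U   = rand(4, 2);
duu = rand(1, 2);
M1  = cpdgen({U, U, duu});      % 4x4x1 tensor built from the CPD
M2  = U * diag(duu) * U.';      % explicit weighted outer product
norm(M1(:,:,1) - M2, 'fro')     % numerically zero
```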
model.factorizations.ula.data = UserLocAct_missing;
model.factorizations.ula.cpdi = {'U','L','A'};
model.factorizations.uu.data = UserUser;
model.factorizations.uu.cpd = {'U','U','Duu'};
model.factorizations.lf.data = LocFea;
model.factorizations.lf.cpd = {'L','F'};
model.factorizations.aa.data = ActAct;
model.factorizations.aa.cpd = {'A','A','Daa'};
model.factorizations.ul.data = UserLoc;
model.factorizations.ul.cpd = {'U','L','Dul'};
model.factorizations.reg.regL2 = {'U','L','A','F','Duu','Daa','Dul'};
Note that we used cpdi instead of cpd for the ula factorization. The cpdi factorization type activates a specialized kernel for incomplete tensors, which usually improves the convergence behavior compared to cpd.
We can again use the sdf_check method to get an overview of the model:
sdf_check(model, 'print');
Factor Size Comment
------------------------------------------------------------------------
U [146x2]
L [168x2]
A [5x2]
F [14x2]
Duu [1x2]
Daa [1x2]
Dul [1x2]
Factorization Type Factors Comments
------------------------------------------------------------------------
ula (id = 1) cpdi U,L,A
uu (id = 2) cpd U,U,Duu
lf (id = 3) cpd L,F
aa (id = 4) cpd A,A,Daa
ul (id = 5) cpd U,L,Dul
reg (id = 6) regL2 U,L,A,F,Duu,Daa,Dul
The sdf_nls method is used to compute the results for the model. The relative weights have been determined using a grid search.
options.RelWeights = [1 0.005 0.001 0.005 0.001 1e-6];
[sol,out] = sdf_nls(model,options);
The ROC curve of the result can be seen in Figure 1. Using the extra information clearly improved the area-under-curve (AUC).
Predicting for new persons¶
In this second scenario, 50 random persons are removed from the tensor to simulate a cold start (see Figure 62, right). When we only use the UserLocAct tensor, we are unable to predict values for the deleted users. Therefore, we use the information from the other datasets to predict the missing values.
The strategy is the same as above. First, the user-location-activity information for 50 persons is removed and the interactions between the datasets are modeled. Next, the model is solved using the sdf_nls method for different parameters. Finally, we compute the performance for the different models.
nbmissingpersons = 50;
sel = randperm(size(UserLocAct,1), nbmissingpersons);
UserLocAct_missing = UserLocAct;
UserLocAct_missing(sel,:,:) = nan;            % remove persons
UserLocAct_missing = fmt(UserLocAct_missing); % format as incomplete tensor
The model is the same as in the previous section, with the following parameters, found using grid search:
constr = {@struct_nonneg};
options.RelWeights = [1 1e-4 0.1 1e-4 1e-4 0.08];
The performance can be computed as follows; the ROC curve for the result is shown in Figure 2. The area under the curve (AUC) is 0.968.
pred = cpdgen({sol.factors.U,sol.factors.L,sol.factors.A});
lab = UserLocAct(sel,:,:);
pred = pred(sel,:,:);
[X,Y,T,AUC] = perfcurve(lab(:),pred(:),1);
plot(X,Y);
Further exploration hints¶
The goal of this demo was to introduce some of the features of the modeling language used in the Structured Data Fusion framework. Many additions to these models can be explored, such as:
changing the model for a dataset, e.g., change the model for the tensor to an LMLRA by defining an extra core tensor:
model.variables.s = randn(R,R,R);
model.factors.S = {'s'};
model.factorizations.ula.data = UserLocAct_missing;
model.factorizations.ula.lmlra = {'U','L','A','S'};
allowing extra rank-1 terms for some matrices, e.g., appending an extra column to U for the user-user factorization (the scaling factor Duu must then be extended accordingly):
model.variables.u2 = randn(size(UserLocAct,1), 1);
model.factors.U2 = {'u', 'u2'};
model.factorizations.uu.cpd = {'U2','U2','Duu'};
using L0 or L1 regularization instead of L2;
investigating the influence of the individual datasets;
using proper validation of the models;
changing constraints;
changing the weights.
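For instance, the L1-regularization variant mentioned above can be sketched as follows, assuming the regL1 factorization type, which penalizes the sum of absolute values instead of squared Frobenius norms:

```matlab
% Sketch: swap the L2 penalty for an L1 penalty on all factor matrices.
% regL1 is assumed available as an SDF factorization type, analogous to regL2.
model.factorizations = rmfield(model.factorizations, 'reg');
model.factorizations.reg.regL1 = {'U','L','A','F','Duu','Daa','Dul'};
sdf_check(model, 'print');
```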