## Online Learning

The online learning setting lets us model problems where data arrives as a continuous stream and we would like an algorithm to learn from it as it arrives. Today, many of the largest websites use variants of online learning algorithms to learn from the continuous stream of users visiting the site. Specifically, if you have a continuous stream of data generated by users coming to your website, you can use an online learning algorithm to learn user preferences from that stream and use what it learns to optimize decisions on your website.

Online learning is very similar to stochastic gradient descent; instead of scanning through a fixed training set, we get one example from one user, learn from it, discard it, and move on.

Another advantage of online learning is that if the pool of users is changing, or the thing you are trying to predict is slowly drifting (say, user tastes shift over time), the algorithm can gradually adapt the learned hypothesis/model to the latest user behavior. For example, if the economy changes and users become more price-sensitive, the algorithm adapts to the new preferences and keeps track of the changing population of users.

### Shipping service example

In this example, we want a learning algorithm to help us optimize the asking price we offer to users so that we earn more profit. The algorithm learns the probability that a user will buy, given a set of user properties as features, i.e. $p(y=1|x;\theta )$.

The features $x$ capture properties of the user and the offer; for example, the origin/destination of the shipment and the asking price. To choose the asking price, we feed the new user's properties into the model learned so far and pick the price that gives the user the largest chance of buying. We then record whether the user bought ($y=1$) or not ($y=0$), which together with the features gives us a new example $(x,y)$. We use this new example to update the model before serving the next user. Note that offering the asking price and learning from the outcome are two separate processes (one is inference, the other is learning). In pseudocode, the learning process (e.g. logistic regression) is:

Repeat forever {

Get $(x,y)$ corresponding to user

Update $\theta$ using $(x,y)$:

${ \theta }_{ j }:={ \theta }_{ j }-\alpha ({ h }_{ \theta }(x)-y)\cdot { x }_{ j }$

Discard $(x,y)$

}

Our website keeps running forever. Here $\theta$ is the parameter vector of the model, $j=0,1,...,n$ indexes the features, and $\alpha$ is the learning rate. If the website has a large enough continuous stream of data, we can discard each example after learning from it. Otherwise, we may need to save all the data and train on it as a fixed training set.
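The loop above can be sketched in Python. This is a minimal simulation, not a real website: the stream, the price range, and the "true" buying behavior (`sigmoid(5 - price)`) are all made-up assumptions chosen so the learner has something to recover.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """One SGD step of logistic regression on a single example (x, y).
    x[0] is the intercept feature (always 1)."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [t - alpha * (h - y) * xi for t, xi in zip(theta, x)]

# Simulated stream: users are more likely to buy at lower asking prices.
random.seed(0)
theta = [0.0, 0.0]          # [intercept, price coefficient]
for _ in range(5000):
    price = random.uniform(1, 10)
    y = 1 if random.random() < sigmoid(5 - price) else 0
    theta = online_update(theta, [1.0, price], y)   # learn, then discard (x, y)

print(theta)  # the price coefficient should come out negative
```

Each example is used for exactly one gradient step and then thrown away, which is what distinguishes this from batch training on a stored dataset.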

### Product search example (learning to search)

We want to apply a learning algorithm to give good search listings to a user (in this example, showing the user the 10 phones they are most likely to click on). The principles are nearly the same as in the shipping example. The difference is that the 10 results actually yield 10 example pairs $(x,y)$, which may require 10 gradient-descent steps. This learning problem is called learning the predicted click-through rate (CTR).
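Once a CTR model is trained, serving a query is just scoring and sorting. The sketch below assumes a hypothetical logistic-regression model with made-up features (query-match score, popularity) and made-up parameters; in a real catalog we would show the top 10.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predicted_ctr(theta, x):
    """Estimated probability that the user clicks on this listing."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Hypothetical features: x = [intercept, query-match score, popularity]
theta = [-1.0, 2.0, 1.5]
phones = {
    "phone_a": [1.0, 0.9, 0.8],
    "phone_b": [1.0, 0.2, 0.5],
    "phone_c": [1.0, 0.7, 0.1],
}
ranked = sorted(phones, key=lambda p: predicted_ctr(theta, phones[p]), reverse=True)
print(ranked)   # best matches first
```

After the user's clicks come back, each shown listing becomes a labeled $(x,y)$ pair for the online update.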

Other examples include choosing special offers to show a user, selecting customized news articles, product recommendations, etc. I'll talk about recommender systems in the next section.

## Recommender System

Recommender systems are widely used in many companies (e.g. Amazon, Netflix) to promote sales.

In machine learning, the features you choose will have a big effect on the performance of your learning algorithm. For some problems, there are algorithms that can try to automatically learn a good set of features for you rather than using hand-coded features. This is what a recommender system can do.

For example, suppose you are a company that sells or rents out movies. You let your users rate movies using one to five stars (we use 0 to 5 for easier mathematical computation).

In a realistic setting, each of your users may have rated only a minuscule fraction of your movies. So the recommender-system problem is: given the data $r(i,j)$ (which equals 1 if user $j$ has rated movie $i$) and the ratings $y^{(i,j)}$, predict the missing ratings (what the question marks should be). Then we can look at the movies a user has not yet watched and recommend new movies to that user.

### Content-based recommendations

This algorithm is called content-based recommendation because we assume we have features $({x}_{1},{x}_{2},...)$ available for the different movies. These features capture the content of the movies, such as how romantic or how action-packed each movie is.

This algorithm relies on the assumption that such features exist, but for many movies we don't actually have them, so we need an approach, called collaborative filtering, that can automatically learn the features itself.
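Concretely, in the content-based setting with known movie features $x^{(i)}$, learning all users' parameters is regularized linear regression over the movies each user has rated (here $n_u$ is the number of users and $n$ the number of features):

$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\ \sum_{i:r(i,j)=1}\left( ({\theta}^{(j)})^{T}{x}^{(i)}-y^{(i,j)} \right)^{2}+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left( {\theta}_{k}^{(j)} \right)^{2}$$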

### Collaborative filtering

Assume instead that we already know how much each user likes romantic movies and action movies (i.e. the parameters $\theta$). Then we can use $\theta$ to estimate the features $x$ of the movies.

So, combining this with the problem we talked about earlier: should we estimate the features first or the parameters first? This is a chicken-and-egg problem. What we can do is randomly guess some initial values of the parameters $\theta$, then learn features $x$ for the different movies. Then we use $x$ to better estimate $\theta$, and keep iterating; eventually the algorithm converges to a reasonable set of both features and parameters. This works only because each user rates multiple movies and, hopefully, each movie is rated by multiple users.

To improve this algorithm, we can learn the parameters and the movie features simultaneously, rather than alternating between them.
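The combined objective sums the squared errors over all observed ratings and regularizes both sets of variables at once ($n_m$ is the number of movies, $n_u$ the number of users):

$$J\left( x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)} \right)=\frac{1}{2}\sum_{(i,j):r(i,j)=1}\left( ({\theta}^{(j)})^{T}{x}^{(i)}-y^{(i,j)} \right)^{2}+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left( {x}_{k}^{(i)} \right)^{2}+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left( {\theta}_{k}^{(j)} \right)^{2}$$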

In this combined objective cost function, if you hold the features $x$ constant and minimize with respect to the parameters $\theta$, you recover the first cost function; if you do the opposite, it becomes equivalent to the second cost function.

With this formulation, we drop the convention of adding an intercept feature ${x}_{0}=1$, so we have $x\in { \mathbb{R} }^{ n }$ and $\theta \in { \mathbb{R} }^{ n }$.

So finally, if user $j$ hasn't rated movie $i$ yet, we can use $({ \theta }^{ (j) })^{ T }{ x }^{ (i) }$ to predict the rating of that movie.

Note: The random initialization in the first step serves as symmetry breaking (similar to the random initialization of a neural network's parameters) and ensures the algorithm learns features ${x}^{(1)},{x}^{(2)},..., {x}^{({n}_{m})}$ that differ from each other.
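The whole procedure, random initialization plus gradient descent on the joint cost, fits in a short NumPy sketch. The data here is randomly generated, and the dimensions, learning rate, and regularization strength are arbitrary choices for illustration, not recommended values.

```python
import numpy as np

def cofi_grad_step(X, Theta, Y, R, lam, alpha):
    """One gradient-descent step on the joint collaborative-filtering cost.
    X: (n_m, n) movie features, Theta: (n_u, n) user parameters,
    Y: (n_m, n_u) ratings, R: (n_m, n_u) indicator (1 where rated)."""
    E = (X @ Theta.T - Y) * R          # errors only on observed ratings
    X_new = X - alpha * (E @ Theta + lam * X)
    Theta_new = Theta - alpha * (E.T @ X + lam * Theta)
    return X_new, Theta_new

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 2                  # movies, users, latent features
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)
R = (rng.random((n_m, n_u)) > 0.3).astype(float)   # ~70% of entries observed

# Symmetry breaking: small random initialization for both X and Theta
X = rng.normal(scale=0.1, size=(n_m, n))
Theta = rng.normal(scale=0.1, size=(n_u, n))

for _ in range(5000):
    X, Theta = cofi_grad_step(X, Theta, Y, R, lam=0.01, alpha=0.02)

cost = 0.5 * np.sum(((X @ Theta.T - Y) * R) ** 2)
print(cost)   # squared error on observed entries should be small
```

The missing entries of `Y` are never touched by the cost; after training, `X @ Theta.T` fills them in as predictions.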

### Low rank matrix factorization

In this section, we first vectorize collaborative filtering. This vectorized version also goes by another name: **Low-Rank Matrix Factorization**.

We can also use the learned features to find related movies. More generally: if a user has recently been looking at one product, are there other related products you could recommend to this user?
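Related movies are simply those whose learned feature vectors are closest, i.e. with small $\left\| x^{(i)} - x^{(k)} \right\|$. The feature values below are made up for illustration; in practice they would come out of the factorization.

```python
import numpy as np

# Hypothetical learned 2-D features for five movies (say, romance vs action)
X = np.array([
    [0.9, 0.1],   # movie 0: romantic
    [0.8, 0.2],   # movie 1: romantic
    [0.1, 0.9],   # movie 2: action
    [0.2, 0.8],   # movie 3: action
    [0.5, 0.5],   # movie 4: mixed
])

def related_movies(i, X, k=2):
    """Return the k movies whose learned features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                      # exclude the movie itself
    return np.argsort(dists)[:k].tolist()

print(related_movies(0, X))  # → [1, 4]: the other romantic movie, then the mixed one
```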

It's actually pretty difficult to look at the learned features and come up with a human-understandable interpretation of what they really represent. But this doesn't affect the algorithm or its results.

In practice, however, there is one problematic situation: a user who has not rated any movies. Looking at the cost function, there is no entry with $r(i,j)=1$ for this user, so only the regularization term affects that user's parameters, which drives ${\theta}^{(j)}$ to the zero vector and makes all of that user's predicted ratings zero as well. To solve this problem, we can use **Mean Normalization** as a pre-processing step.

Mean normalization computes average ratings of each movie, then subtracts off the mean ratings to make each movie have an average rating of 0.

We use the normalized data as training data to learn $\theta$ and $x$, compute $({ \theta }^{ (j) })^{ T }{ x }^{ (i) }$, and then add the mean back: $({ \theta }^{ (j) })^{ T }{ x }^{ (i) }+{\mu}_{i}$.

That is to say, if a user hasn't rated any movies, we end up predicting, for each movie, the average rating that movie received. Similarly, for movies that don't have any ratings, we could average each column and normalize it to have mean 0, but this case matters less, since we probably shouldn't recommend a movie without any ratings to users anyway.
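Mean normalization itself is a couple of lines of NumPy. The small `Y` and `R` below are made up; the third user has rated nothing, which is exactly the case this pre-processing step rescues.

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean observed rating; unrated entries stay 0."""
    mu = np.sum(Y * R, axis=1) / np.maximum(np.sum(R, axis=1), 1)
    Y_norm = (Y - mu[:, None]) * R         # only normalize observed entries
    return Y_norm, mu

Y = np.array([[5.0, 1.0, 0.0],
              [0.0, 4.0, 0.0]])
R = np.array([[1.0, 1.0, 0.0],             # the third user has rated nothing
              [0.0, 1.0, 0.0]])
Y_norm, mu = mean_normalize(Y, R)
print(mu)        # → [3. 4.]
# For a user with no ratings, theta ends up ~0, so the prediction
# theta.T @ x + mu[i] falls back to each movie's average rating.
```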