Support Vector Machines are popular models in machine learning applications. Although they follow simple principles, they have proven to be very powerful across a variety of tasks and datasets. In this article I want to demonstrate how to implement an SVM capable of handling data that arrives in real time, without having to store it in memory.

## Main ways to train a machine learning model

There are three popular ways to train a model: batch learning, mini-batch learning, and stochastic learning.

**Batch Learning**: In the first mode, we store all the training data in an array and feed it to the algorithm, minimizing the loss function over all the examples at once. This is not always possible due to the size of the dataset. In those cases we have to resort to the other two modes.

**Mini-Batch Learning**: In this case, we choose a batch size N and divide the training set into blocks of that size, training the model on one block at a time.

**Stochastic Learning**: This is a variation of mini-batch learning with N = 1: we use a single example at a time to train the model. This is the preferred mode in big data solutions.

Orthogonal to these, there are two main settings:

**Offline**: All the data is stored in a database, but it does not fit in memory, so we need to train on one example (or batch) at a time.

**Online**: We receive the examples in real time, train on them, and discard them, without any need to store them.

Besides not requiring all the data to be loaded into memory, algorithms that train on one example at a time can, in some cases, be faster than those that use all the data at once.
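As an illustrative sketch (not from the article), the three modes differ only in how many examples are passed to each update step. Here `gradient_step` is a hypothetical placeholder for one optimization step on a list of examples, using a simple squared-error loss:

```python
def gradient_step(w, examples, lr=0.1):
    # average the gradient of a squared-error loss over the examples,
    # then take one update step
    grads = [0.] * len(w)
    for x, y in examples:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(len(w)):
            grads[i] += err * x[i] / len(examples)
    return [wi - lr * g for wi, g in zip(w, grads)]

data = [([1., 2.], 1.), ([1., -1.], -1.), ([1., .5], 1.), ([1., -2.], -1.)]

w = gradient_step([0., 0.], data)          # batch: all examples at once

w = [0., 0.]
for i in range(0, len(data), 2):           # mini-batch: blocks of N = 2
    w = gradient_step(w, data[i:i + 2])

w = [0., 0.]
for example in data:                       # stochastic: one example at a time
    w = gradient_step(w, [example])
```

In the stochastic case the data can come from a generator or a stream, so nothing beyond the current example needs to be in memory.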

## Dataset

This article will use the Skin Segmentation Data Set, which can be downloaded from the UCI repository. It is a classification task in which the RGB values of random points are extracted from pictures of people's faces, and the goal is to determine, from these values, whether a point corresponds to skin or to another part of the image.

According to the researchers, images of people of various ages, races, and genders were collected. A practical application of this task would be identifying images with content inappropriate for minors on the internet.

There are about 245,000 examples, and it’s a binary classification problem.

## Training Algorithm

The algorithm used for training will be PEGASOS. The academic paper with the details is "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM", by Shalev-Shwartz, Singer, and Srebro.

In its simplest version, we minimize the primal objective of the SVM using one example at a time. As this method uses a subgradient, it does not guarantee that the loss decreases in every iteration. Still, there are convergence guarantees.
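Concretely, at step t, with learning rate η_t = 1/(λt), the Pegasos update for a single example (x_t, y_t) is:

```latex
w_{t+1} =
\begin{cases}
(1 - \eta_t \lambda)\, w_t + \eta_t\, y_t\, x_t, & \text{if } y_t \langle w_t, x_t \rangle < 1 \\
(1 - \eta_t \lambda)\, w_t, & \text{otherwise,}
\end{cases}
\qquad \eta_t = \frac{1}{\lambda t}
```

The first case fires when the example is misclassified or falls inside the margin; the second simply shrinks the weights toward zero, which is the effect of the regularization term.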

The only parameter to choose is the constant that controls the degree of regularization. It is usually chosen with some validation method. Since the purpose of this article is to demonstrate an SVM implementation that can be easily adapted to data that does not fit in memory, or that arrives in real time, I will simply set it to 0.1.

In a real application, it may be worth tuning this parameter offline, with cross-validation, on a smaller sample of the data.
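As a rough sketch of that offline tuning step (assuming scikit-learn is available and the sample fits in memory; `SGDClassifier` with hinge loss and L2 penalty optimizes the same kind of objective as Pegasos, and the synthetic data here is just a stand-in for a sample of the real dataset):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 3)                       # stand-in for a small data sample
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # synthetic binary labels

# grid-search the regularization constant with 5-fold cross-validation
best_alpha, best_score = None, -1.
for alpha in [1e-3, 1e-2, 1e-1, 1.]:
    clf = SGDClassifier(loss='hinge', penalty='l2', alpha=alpha,
                        random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
```

The chosen constant can then be fixed for the online phase, as done in this article.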

## Evaluation of the result

In a real online environment, one way to evaluate the performance of the model is to compute the loss on each new example before training on it, averaging over a time window. This method is called progressive validation.
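A minimal sketch of progressive validation: score each example with the current model *before* training on it, and keep a running average of the loss over a sliding window. The `OnlineSVM` class here is a hypothetical stand-in for the model built later in the article:

```python
from collections import deque

class OnlineSVM:
    # tiny stand-in for the article's SVM, for illustration only
    def __init__(self, lmbd, D):
        self.lmbd, self.w = lmbd, [0.] * (D + 1)
    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))
    def train(self, x, y, alpha):
        factor = 1. - alpha * self.lmbd
        if y * self.predict(x) < 1:
            self.w = [factor * wi + alpha * y * xi
                      for wi, xi in zip(self.w, x)]
        else:
            self.w = [factor * wi for wi in self.w]

def progressive_validation(model, stream, window=1000):
    recent = deque(maxlen=window)               # losses in the current window
    for t, (x, target) in enumerate(stream):
        pred = model.predict(x)                 # predict first...
        recent.append(max(0, 1 - target * pred))  # hinge loss on this example
        model.train(x, target, 1. / (model.lmbd * (t + 1.)))  # ...then train
        yield sum(recent) / len(recent)         # mean loss in the window

# toy stream: positives have x1 > 0, negatives x1 < 0 (bias term prepended)
stream = [([1., 1.0], 1.), ([1., -1.0], -1.)] * 50
losses = list(progressive_validation(OnlineSVM(0.1, 1), stream, window=20))
```

On this toy stream the windowed loss falls toward zero as the model learns, which is the signal progressive validation is designed to surface.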

In our case, we are simulating an online environment, and to make the evaluation more reliable, I will use a separate test set with 20% of the data.

In the repository of this article you will find the files train.csv and test.csv, already prepared. Each contains approximately 25% positive examples, so in addition to the loss, I want to evaluate the hit rate in each class (true positives and true negatives).

The training set has 196,045 examples, and the test set, 49,012. We will use each example of the training set only once, and then evaluate the performance of the model in the test set.

Since we are not going to use this test set to adjust hyperparameters, it gives a legitimate estimate of the model's performance on new examples. If you use it to tune parameters, you will need another held-out set.

```python
# requires: from random import shuffle

def data(self, test=False):
    filename = 'test.csv' if test else 'train.csv'
    with open(filename, 'r') as f:
        samples = f.readlines()
    if not test:
        shuffle(samples)  # present the training examples in random order
    for t, row in enumerate(samples):
        row = row.replace('\n', '').split(',')
        target = 1. if row[3] == '1' else -1.
        del row[3]  # remove the label, leaving only the features
        x = [1.] + [float(c) for c in row]  # bias + inputs
        yield t, x, target
```

Next comes the implementation of the algorithm that trains the SVM. We create some helper functions: one to compute the hinge loss, which is the loss we are minimizing, and another to determine the sign of the SVM output. The initial weights of the SVM can all be 0, since we do not have symmetry problems, as in neural networks.

```python
def __init__(self, lmbd, D):
    self.lmbd = lmbd        # regularization constant
    self.D = D + 1          # number of features plus the bias term
    self.w = [0.] * self.D  # weights can start at zero

def sign(self, x):
    return -1. if x <= 0 else 1.

def hinge_loss(self, target, y):
    return max(0, 1 - target * y)
```

Then we create a function to generate predictions. It simply takes the dot product between the weight vector and the feature vector.

```python
def predict(self, x):
    wTx = 0.
    for i in range(len(x)):
        wTx += self.w[i] * x[i]
    return wTx
```

Now we come to the function that trains the model. We update the weights using the learning rate (alpha) and the gradient. This function trains on one example at a time. The learning rate decays, becoming smaller with each new example seen.

The gradient differs between the case where the current prediction is wrong (the example falls on the wrong side of the hyperplane, or inside the margin) and the case where it is correct.

It would be possible to implement this function in fewer lines, but here I opted for clarity. For the compact form, see the formula in the original publication of the algorithm.

```python
def train(self, x, y, alpha):
    # if the example is misclassified or falls inside the margin
    if y * self.predict(x) < 1:
        for i in range(len(x)):
            self.w[i] = (1. - alpha * self.lmbd) * self.w[i] + alpha * y * x[i]
    else:
        for i in range(len(x)):
            self.w[i] = (1. - alpha * self.lmbd) * self.w[i]
```
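For reference, the two branches can be folded into one expression, since the only difference is whether the `alpha * y * x[i]` term is added. A sketch of that compact form, written as a standalone function over plain lists (equivalent in behavior to the method above):

```python
def pegasos_step(w, x, y, alpha, lmbd):
    # 1 if the example is misclassified or inside the margin, else 0
    violated = 1. if y * sum(wi * xi for wi, xi in zip(w, x)) < 1 else 0.
    # shrink the weights (regularization) and, only on a margin
    # violation, push them toward the example
    return [(1. - alpha * lmbd) * wi + violated * alpha * y * xi
            for wi, xi in zip(w, x)]
```

For example, starting from zero weights every example violates the margin, so `pegasos_step([0., 0.], [1., 2.], 1., 10., 0.1)` returns `[10., 20.]`.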

Now we create the function that trains and evaluates the model on the data. In true online learning we would receive the data through a stream, so each example would be seen only once.

For each example in the training set we call 'train', which adjusts the weights according to the learning rate. Since the method uses a subgradient, the loss is not guaranteed to decrease in every iteration.

```python
def fit(self):
    ...
    # training pass: each example is seen exactly once
    for t, x, target in self.data(test=False):
        alpha = 1. / (self.lmbd * (t + 1.))
        self.train(x, target, alpha)

    # evaluation pass on the held-out test set
    for t, x, target in self.data(test=True):
        pred = self.predict(x)
        loss += self.hinge_loss(target, pred)
        pred = self.sign(pred)
    ...
```

We predict an example as positive if its distance to the hyperplane is positive, and negative otherwise. It is extremely rare for the output to be exactly 0, but if it happens, the example is classified as negative.

In addition, due to the stochastic nature of the algorithm, we will train it several times, modifying the order of the examples, and average the performance. To make the results reproducible, we set a seed for the RNG.

```python
from random import seed

for i in range(100):
    seed(i)                    # fix the RNG so each shuffle is reproducible
    svm = SVMPegasos(1e-1, 3)  # lambda = 0.1, three input features
    l, tp, tn = svm.fit()
```

## Training Set Influence

Because the algorithm considers one example at a time, the order in which the examples are presented changes its behavior, causing it to take different paths toward the minimum. All numbers reported in this section are evaluated on the "test.csv" samples.

Training in our original training set, without shuffling the data, the performance is as follows:

```
Loss            0.857527620114
True Negatives  95.43%
True Positives   6.66%
Accuracy        77.05%
```

Now, to get a realistic estimate of the performance in practice, let's run it 100 times, modifying the order of arrival of the examples. The average performance in this case is:

```
Mean Loss       0.589555758345
True Positives  63.86%
True Negatives  92.20%
```

A detail of this dataset is that the class proportion is unbalanced (75/25). A simple remedy, if we have plenty of data, is to subsample the larger class until the proportions are balanced (50/50). In addition, some practitioners suggest that ensuring that the next example seen by the algorithm belongs to a different class than the previous one may improve performance. Let's try these two options by adding the following code:

```python
last = 0                                    # new: class of the previous example
for t, x, target in self.data(test=False):
    if target == last:                      # new: skip consecutive examples
        continue                            #      of the same class
    alpha = 1. / (self.lmbd * (t + 1.))
    self.train(x, target, alpha)
    last = target                           # new
```

This causes the algorithm to always receive an example from a different class than the previous one, and makes the class proportions approximately equal. We do not touch the test set; it keeps the original ratio (75/25).

In this case, using only the order of the original training set, the performance becomes:

```
Loss            70.5074554527
True Negatives  91.55%
True Positives  95.87%
Accuracy        92.45%
```

And averaging 100 shuffled datasets:

```
Mean Loss       6.59906602428
True Positives  94.16%
True Negatives  90.48%
```

We see a significant improvement in the true positive rate, despite a large increase in the loss. The increase happens because the loss is computed over all the examples: as we "pull" the separating hyperplane to accept more errors in the negative class, in exchange for better performance on the positive one, we increase the distance of some incorrectly classified points.

An alternative would be to keep all the data and assign a proportionally greater loss to errors on minority-class examples.
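A sketch of that class-weighted variant (a hypothetical extension, not from the article): the update for the minority class is scaled by a cost factor roughly equal to the class ratio, so no data needs to be discarded:

```python
def weighted_step(w, x, y, alpha, lmbd, pos_weight=3.0):
    # pos_weight ~ (negative count / positive count); here 75/25 = 3
    cost = pos_weight if y > 0 else 1.0
    if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:
        # margin violation: scale the push toward the example by the class cost
        return [(1. - alpha * lmbd) * wi + cost * alpha * y * xi
                for wi, xi in zip(w, x)]
    return [(1. - alpha * lmbd) * wi for wi in w]
```

This keeps the full training set while making each positive error count as much as three negative errors.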

In cases with disproportionate classes, such as this one, it is best to use metrics that are not sensitive to the number of examples per class; hence the choice to report true positive and true negative rates.

## Results and Characteristics

The SVM primal objective is a convex function, so it is possible to reach the global minimum. However, receiving examples in random order, training only once on each, with a learning rate that decreases over time, can prevent us from reaching that point in a practical application. Even so, there are theoretical guarantees of convergence to the global minimum.

In machine learning we seek to minimize the generalization error, that is, the error on examples outside the training set. In most cases, the best generalization performance is reached before the minimum on the training set.