Support vector machines are considered by some to be among the best classifier models available, and with minimal tweaking they can achieve low error rates.

However, implementing an SVM from scratch is a hard task that I’d recommend leaving to those who have spent years writing highly optimised code.

This 101 is not about SVM theory, so you should be at least familiar with the concept of kernels and what **C** and **gamma** mean.

This post is about using LIBSVM efficiently.

Note: All instructions are for *nix based OSes - I’ll make special notes for Windows users.

## Background

LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC; multi-class classification is supported), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM), written by Chih-Chung Chang and Chih-Jen Lin.

The package includes the source code of the library in C++ and Java, and a simple program for scaling training data. A README file with detailed explanation is provided.

It comes with interfaces and extensions for Java, MATLAB/Octave, Python, R, Weka, Ruby, C#, .NET, etc.

For MS Windows users, there is a subdirectory in the zip file containing binary executable files. Precompiled Java class archive is also included.

## Getting LIBSVM

The current release (Version 3.17, April 2013) of LIBSVM can be obtained from the authors’ website.

## Minimum Requirements

In order to get full advantage of LIBSVM you need the following:

- C++ compiler (mandatory). Mac users additionally need to download Xcode and install the Command Line Tools. Instructions on how to install the CLT are available here.
- Python 2.7.x (optional)
- gnuplot (optional). Instructions for Mac Users here.

## Installing LIBSVM

- Download LIBSVM and unzip the contents to any convenient location on your computer.
- Move into the LIBSVM folder
- On Unix systems, type **make** to build the **svm-scale**, **svm-train** and **svm-predict** programs. Run them without arguments to show their usage.

If you don’t get any errors after running **make** (a few warnings are OK!) then you are almost ready to go.

## Adding LIBSVM to your Bash Profile

The last thing we need is to make **svm-train** and friends available globally. We’ll also be using some of the tools available in `libsvm-3.17/tools` (hence the necessity of having Python installed).

There are several ways of doing this. I normally use **alias** - see mine below:

```
# Create alias for some LIBSVM functions
alias svm-scale='/path2urLIBSVM/libsvm-3.17/svm-scale'
alias svm-predict='/path2urLIBSVM/libsvm-3.17/svm-predict'
alias svm-train='/path2urLIBSVM/libsvm-3.17/svm-train'
alias checkdata='python /path2urLIBSVM/libsvm-3.17/tools/checkdata.py'
alias grid='python /path2urLIBSVM/libsvm-3.17/tools/grid.py'
alias subset='python /path2urLIBSVM/libsvm-3.17/tools/subset.py'
alias easy-svm='python /path2urLIBSVM/libsvm-3.17/tools/easy.py'
```

## Preprocessing Data

The format of training, cross-validation and testing data files is:

```
<label> <index1>:<value1> <index2>:<value2> ...
.
.
.
```

Labels and feature indices must be numeric (integer for the class label). There is no header row.

As an example let’s use the data frame **iris** in R.

The first row of the data frame is:

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |

- The Species class is a character column, so we need to convert it to integers:
  - setosa = 1
  - versicolor = 2
  - virginica = 3
- Species is also the last column, so we need to move it to the first position in the data frame.
- Values need to be formatted as index:value.
- Last, headers need to be removed.

So when we export iris to a txt file, the first row should look like this:

```
1 1:5.1 2:3.5 3:1.4 4:0.2
```

I know that for some of you converting files to the above format may seem daunting, but if you use R it is a pretty easy task. Just use the following snippet:

```
library(e1071)
if (require(SparseM)) {
  data(iris)
  x <- as.matrix(iris[, 1:4])
  y <- iris[, 5]
  xs <- as.matrix.csr(x)
  write.matrix.csr(xs, y = y, file = "iris.txt", fac = FALSE)
}
```

Notice in the last line I’m using **fac = FALSE**. Use it only when your class label (a.k.a. y) is a factor (character); otherwise omit it.
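If you are not an R user, the same conversion is easy in plain Python using only the standard library. Here is a minimal sketch (the row tuple and the `classes` mapping are illustrative, and file I/O is left out):

```python
def to_libsvm(label, features):
    """Format one sample as a LIBSVM line (indices are 1-based)."""
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"

# Map the character class labels to integers, as above.
classes = {"setosa": 1, "versicolor": 2, "virginica": 3}

row = (5.1, 3.5, 1.4, 0.2, "setosa")  # first row of iris
print(to_libsvm(classes[row[4]], row[:4]))  # 1 1:5.1 2:3.5 3:1.4 4:0.2
```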

If you’re using MATLAB/Octave you must `sparse()` your matrix X before writing out txt files. See libsvmread/libsvmwrite in the matlab folder.

### checkdata.py

If you really want to ensure that your file is in the correct format then you can use the utility **checkdata.py** from the tools folder. If you set up your .bash_profile like mine, it is as easy as:

```
$checkdata iris.txt
No error.
```
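For the curious, the kind of checks the script performs can be approximated in a few lines of Python. This is a toy version, not the real checkdata.py:

```python
import re

# A line is: <label> <index1>:<value1> <index2>:<value2> ...
LINE = re.compile(r"^\S+(\s+\d+:\S+)*\s*$")

def check(lines):
    """Return 'No error.' or a message for the first bad line."""
    for n, line in enumerate(lines, 1):
        if not LINE.match(line):
            return f"line {n}: wrong format"
        idx = [int(t.split(":", 1)[0]) for t in line.split()[1:]]
        if idx != sorted(idx) or len(set(idx)) != len(idx):
            return f"line {n}: indices must be increasing"
    return "No error."

print(check(["1 1:5.1 2:3.5 3:1.4 4:0.2"]))  # No error.
print(check(["1 2:3.5 1:5.1"]))              # line 1: indices must be increasing
```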

## Size Matters

SVM is computationally expensive and can take a long time to train when the number of samples is large (50k+). Using cross-validation adds extra cost on top.

Luckily, LIBSVM comes with **subset.py**, which lets you apply stratified sampling (for classification) or random sampling (for regression).

The syntax for subset.py is as follows:

*subset.py [options] dataset subset_size [output1] [output2]*

This script randomly selects a subset of the dataset.

options:

```
-s method : method of selection (default 0)
0 -- stratified selection (classification only)
1 -- random selection
output1 : the subset (optional)
output2 : rest of the data (optional)
```
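To see what stratified selection means in practice, here is a rough sketch of the idea in Python (not the actual subset.py implementation): each class is sampled in proportion to its frequency, so the class balance of the subset matches the original file.

```python
import random
from collections import defaultdict

def stratified_subset(lines, size, seed=0):
    """Sample `size` lines, drawing from each class in proportion
    to its frequency (the label is the first token of each line)."""
    random.seed(seed)
    by_label = defaultdict(list)
    for line in lines:
        by_label[line.split()[0]].append(line)
    subset = []
    for group in by_label.values():
        subset += random.sample(group, round(size * len(group) / len(lines)))
    chosen = set(subset)
    rest = [l for l in lines if l not in chosen]
    return subset, rest

# 150 fake iris-style lines: 50 per class
lines = [f"{c} 1:{i}" for c in (1, 2, 3) for i in range(50)]
sub, rest = stratified_subset(lines, 75)
print(len(sub), len(rest))  # 75 75
```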

For our **iris** data set this is not required, as it has only 150 rows; however, let’s use it to split the data into training (100 rows) and cross-validation (50 rows) sets.

Your syntax should look like this:

```
subset -s 0 iris.txt 100 iris.train iris.cv
```

and your new files should contain 100 and 50 rows respectively:

```
$wc -l iris.train iris.cv
100 iris.train
50 iris.cv
150 total
```

## Training

So far we’ve done these steps:

- Transfer data to a format that LIBSVM can read.
- Split original file into train & cross validation set (this step is optional).

Before training our SVM model we still require one more step: **feature scaling**

Feature scaling is a key step in SVM, not only because it can improve the convergence speed of the algorithm, but also because it makes each feature’s contribution to the final score approximately equal, rather than letting the score be governed by one or more features with a broad range of values.

Note: Scaling parameters should be computed only on the training set. The same scaling can then be applied to the cross-validation and test sets.

LIBSVM comes to the rescue with **svm-scale** for this task.

The syntax of svm-scale is as follows:

*svm-scale [options] data_filename*

options:

```
-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename
```
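The core of what svm-scale does can be sketched in a few lines of Python (a simplification with the default limits of -1 and +1, ignoring the sparse format and y scaling): learn each feature’s min/max from the training set, then reuse those parameters unchanged on any other set.

```python
def fit_scaling(rows, lower=-1.0, upper=1.0):
    """Learn per-feature min/max from the training rows (like -s)."""
    cols = list(zip(*rows))
    return [(min(c), max(c)) for c in cols], lower, upper

def apply_scaling(rows, params):
    """Map each feature into [lower, upper] using stored min/max (like -r)."""
    ranges, lower, upper = params
    return [[lower + (upper - lower) * (v - lo) / (hi - lo)
             for v, (lo, hi) in zip(row, ranges)]
            for row in rows]

train = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.2]]   # toy training rows
params = fit_scaling(train)                    # computed on training data only
cv = [[5.0, 2.8]]
print(apply_scaling(cv, params))               # scaled with the *training* parameters
```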

Thus for scaling our training set your code should look like this:

```
svm-scale -s scaling_parameters iris.train > scaled_iris.train
```

Note: if you ever get warnings about nonzero values, use `-l 0`.

We have saved our scaling parameters in the file scaling_parameters, and now we are ready to use them on the cross-validation set:

```
svm-scale -r scaling_parameters iris.cv > scaled_iris.cv
```

Now we are ready to train our SVM model. The way we do this in LIBSVM is through **svm-train**

Although I mentioned earlier that you should know about **C** and **gamma** for RBF kernels, **svm-train** also supports other kernel types: linear (no kernel), sigmoid and polynomial.

We’ll focus on RBF kernels only.

The syntax of svm-train is:

*svm-train [options] training_set_file [model_file]*

options:

```
-s svm_type : set type of SVM (default 0)
0 -- C-SVC (multi-class classification)
1 -- nu-SVC (multi-class classification)
2 -- one-class SVM
3 -- epsilon-SVR (regression)
4 -- nu-SVR (regression)
-t kernel_type : set type of kernel function (default 2)
0 -- linear: u'*v
1 -- polynomial: (gamma*u'*v + coef0)^degree
2 -- radial basis function: exp(-gamma*|u-v|^2)
3 -- sigmoid: tanh(gamma*u'*v + coef0)
4 -- precomputed kernel (kernel values in training_set_file)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-q : quiet mode (no outputs)
```

Let’s spend some time understanding **svm-train**

- `-s` tells svm-train what type of SVM you want to implement (classification or regression). The default is multi-class classification, `-s 0`.
- `-t` tells svm-train what kernel we want to use to train our model. The default is RBF, `-t 2`.
- The rest of the parameters depend on the SVM/kernel chosen. For RBF kernels we need to focus on `-g` and `-c`.
- `-v` sets the number of folds for cross-validation, e.g. 5-fold -> `-v 5`.
- `-m` sets the cache memory; if your sample size is 10k+ I’d recommend setting this to 500 MB (based on personal experience).
- `-q` is for when you are only interested in seeing the accuracy displayed.

For implementing an RBF multi-class classification SVM we only need to provide the `-c`, `-v` and `-g` parameters to **svm-train**.

So your code and results to train an SVM with cost = 0.5, gamma = 0.5 on 5-fold cross-validation should look like this:

```
$svm-train -c 0.5 -g 0.5 -v 5 -q scaled_iris.train
Cross Validation Accuracy = 94%
```

Not bad! What if I try a different set of parameters?

```
$svm-train -c 1 -g 0.5 -v 5 -q scaled_iris.train
Cross Validation Accuracy = 96%
```

Brilliant! Let’s try again:

```
$svm-train -c 2 -g 0.5 -v 5 -q scaled_iris.train
Cross Validation Accuracy = 95%
```

OK, not as good as before, but still a decent result. However, you may not be that lucky with your own data set, and you may spend hours tuning these parameters.

Luckily again, LIBSVM comes with **grid.py**, which will run **svm-train** over a wide range of `-c` and `-g` parameters.

Syntax for grid.py is as follows:

*grid.py [grid_options] [svm_options] dataset*

grid_options :

```
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
"null" -- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
"null" -- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : set svm executable path and name
-gnuplot {pathname | "null"} :
pathname -- set gnuplot executable path and name
"null" -- do not plot
-out {pathname | "null"} : (default dataset.out)
pathname -- set output file path and name
"null" -- do not output file
-png pathname : set graphic output file path and name (default dataset.png)
-resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
This is experimental. Try this option only if some parameters have been checked for the SAME data.
svm_options : additional options for svm-train
```
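To get a feel for how wide the default search is, the documented default c and g ranges can be expanded directly in Python (this just reproduces the defaults listed above, not grid.py itself):

```python
# Defaults: -log2c -5,15,2 and -log2g 3,-15,-2
c_range = [2.0 ** e for e in range(-5, 15 + 1, 2)]
g_range = [2.0 ** e for e in range(3, -15 - 1, -2)]

print(len(c_range), c_range[0], c_range[-1])  # 11 0.03125 32768.0
print(len(g_range), g_range[0], g_range[-1])  # 10 8.0 3.0517578125e-05
```

So the default grid already tries 11 x 10 = 110 parameter pairs, which is why narrowing the ranges afterwards pays off.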

My strategy is normally to run **grid.py** with default parameters first, and then, once the grid task has found the best parameters, search over small ranges of `-log2c` and `-log2g` around them.

```
grid -q scaled_iris.train
```

Result is:

```
$grid -q scaled_iris.train
[local] 5 -7 96.0 (best c=32.0, g=0.0078125, rate=96.0)
[local] -1 -7 68.0 (best c=32.0, g=0.0078125, rate=96.0)
[local] 5 -1 93.0 (best c=32.0, g=0.0078125, rate=96.0)
[local] -1 -1 94.0 (best c=32.0, g=0.0078125, rate=96.0)
[local] 11 -7 98.0 (best c=2048.0, g=0.0078125, rate=98.0)
[local] 11 -1 93.0 (best c=2048.0, g=0.0078125, rate=98.0)
---- Output omitted -----
[local] 13 -5 96.0 (best c=2048.0, g=0.0078125, rate=98.0)
[local] 13 -15 96.0 (best c=2048.0, g=0.0078125, rate=98.0)
[local] 13 3 91.0 (best c=2048.0, g=0.0078125, rate=98.0)
[local] 13 -9 98.0 (best c=2048.0, g=0.0078125, rate=98.0)
[local] 13 -3 94.0 (best c=2048.0, g=0.0078125, rate=98.0)
2048.0 0.0078125 98.0
```

Best C = 2048, g = 0.0078125, with an accuracy of 98% on 5-fold cross-validation. With this information we can now train our model on the full training set and produce a model file to be used when predicting the cross-validation set.

```
$svm-train -c 2048 -g 0.0078125 -q scaled_iris.train model.train
```

## Predicting

Once we have our model file, predicting cross-validation or test sets is really easy. For this we’ll use **svm-predict**.

Syntax is as follows:

*svm-predict [options] test_file model_file output_file*

options:

```
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported
-q : quiet mode (no outputs)
```

Our prediction code should look like this:

```
$svm-predict scaled_iris.cv model.train iris.predicted
Accuracy = 100% (50/50) (classification)
```

Predictions are saved in iris.predicted.

We’ve got 100% accuracy; this is definitely not always the case.

## Wrap-Up

In summary, these are the steps for using LIBSVM efficiently:

- Convert data to LIBSVM format
- Conduct simple scaling on the training data. Apply the same scaling to the CV and test sets.
- Consider RBF kernel
- Use cross-validation to find the best parameter C and gamma
- Use the best C and gamma to train the whole training set
- Test

I hope you have found this tutorial useful.