In this machine learning and pattern recognition series, we have already discussed regression, where the predicted output is a continuous value. Predicting a discrete output for a given input is called classification. When there are two possible outputs, we call it binary classification: for example, predicting whether a bank transaction is fraudulent or not, whether a cancer is benign or malignant, or whether it will rain tomorrow. When there are more than two possible outputs, we call it multi-class classification: for example, recognizing whether a hand gesture moves right, left, down, or up, or classifying digits 0 to 9.

Even though classification is similar to regression, the only difference being that the classification output is discrete while the regression output is continuous, we can’t use exactly the same regression method for classification. The reasons are: (1) it performs badly when we classify an input into **many classes**, and (2) it lacks robustness to **outliers**. To use a regression approach for classification, we need a so-called activation function. This method is called logistic regression, and we will discuss it later. It is called logistic “regression” because we proceed much as we did in regression, but instead of taking the output directly as the prediction, we feed it into a logistic function. The logistic function most often used is the sigmoid function. Despite the word “regression” in its name, it is used for classification problems, not regression problems.
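To make the idea of feeding a regression-style output into a sigmoid more concrete, here is a minimal sketch. The function names, the weights, the bias, and the 0.5 decision threshold are all assumptions chosen for illustration; nothing here is learned from data.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    # Two branches so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, weights, bias):
    """Linear score (as in regression) passed through the sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def predict_label(x, weights, bias, threshold=0.5):
    """Return class 1 if the sigmoid output reaches the threshold, else 0."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0

# Hypothetical parameters, made up for this example (not trained).
weights = [0.8, -0.4]
bias = -0.1
x = [2.0, 1.0]

p = predict_proba(x, weights, bias)          # linear score is 1.1
label = predict_label(x, weights, bias)
print(f"probability of class 1: {p:.3f}, predicted label: {label}")
```

The linear part is the same as in regression; only the final squashing step and the threshold turn it into a classifier.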

Some approaches to classification are: maximizing likelihood (e.g., logistic regression, naive Bayes), minimizing cross-entropy (e.g., neural networks), maximizing the margin (e.g., SVM), finding the nearest neighbors (e.g., kNN), building decision trees (e.g., decision tree and random forest), and so on. For general classifiers such as logistic regression, SVM, and neural networks, whether for binary or multi-class classification, the model typically has as many output nodes as there are classes. The output value of each node typically ranges from 0 to 1 (just like a probability), although some models, such as a plain SVM, produce unbounded scores instead. The predicted class is the one whose node has the highest value.
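The "pick the node with the highest value" rule can be sketched as follows. The softmax normalization used here is an assumption: it is one common way to turn raw scores into values between 0 and 1 (they also sum to 1), but other classifiers may produce their per-class outputs differently. The class names and raw scores are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into values in (0, 1) that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores, class_names):
    """Return the class whose output node has the highest value."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical output scores for a four-class hand gesture problem.
class_names = ["right", "left", "down", "up"]
raw_scores = [0.3, 1.2, -0.7, 2.5]

predicted, probs = predict_class(raw_scores, class_names)
print(predicted, [round(p, 3) for p in probs])
```

Since softmax preserves the ordering of the scores, taking the highest probability gives the same class as taking the highest raw score.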

## Evaluating how good our classifier is: using a confusion matrix

To measure how good our classifier is, we can use a confusion matrix. From the confusion matrix, we can calculate several evaluation metrics, such as accuracy, precision, recall, false alarm rate, and so on. See the picture below.

In the picture, our predictions fall into four conditions: TN (True Negative), FP (False Positive), FN (False Negative), and TP (True Positive). TN is when we predict “no” for an actual “no”, FP is when we predict “yes” for an actual “no”, FN is when we predict “no” for an actual “yes”, and TP is when we predict “yes” for an actual “yes”. Using those four conditions, we can define the following evaluation metrics.

- Accuracy: (TN+TP)/total data. In the picture, this is (50+100)/165.
- Misclassification rate/error rate: (FP+FN)/total data
- Recall/sensitivity: TP/actual yes, where actual yes = TP+FN
- Precision: TP/predicted yes, where predicted yes = TP+FP
- False positive rate (false alarm rate): FP/actual no, where actual no = TN+FP
- Specificity: TN/actual no
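
The metrics listed above can be computed directly from the four counts. The sketch below is one possible way to do it; the function name is hypothetical. TN=50, TP=100, and the total of 165 come from the accuracy example in the text, while FP=10 and FN=5 are assumed values chosen only so that the four counts add up to 165.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute evaluation metrics from the four confusion matrix counts.

    Assumes every denominator (total, actual yes, actual no, predicted yes)
    is nonzero.
    """
    total = tn + fp + fn + tp
    actual_yes = tp + fn
    actual_no = tn + fp
    predicted_yes = tp + fp
    return {
        "accuracy": (tn + tp) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / actual_yes,
        "precision": tp / predicted_yes,
        "false_positive_rate": fp / actual_no,
        "specificity": tn / actual_no,
    }

# TN, TP and the total (165) are from the text; FP=10 and FN=5 are ASSUMED.
metrics = confusion_metrics(tn=50, fp=10, fn=5, tp=100)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy and error rate always sum to 1, as do specificity and false positive rate, which is a quick sanity check on any implementation.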