Why XGBoost?
XGBoost is short for eXtreme Gradient Boosting.
BTW, what is boosting?
Two common terms used in ML are Bagging and Boosting.
Bagging: It is an approach where you draw random samples of the data (with replacement), fit a model to each sample, and take a simple mean of the models' predictions to get the bagged probabilities.
Boosting: Boosting is similar, but the sample selection is made more intelligently. Each successive round gives more weight to the observations that earlier rounds found hard to classify.
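To make the contrast concrete, here is a minimal sketch in plain Python. It is only an illustration of the two ideas as described above, not how XGBoost itself is implemented: the "model" is just the positive-class rate of a sample, and the function names and the `factor` parameter are hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Bagging step: draw len(data) rows uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_probability(data, n_models, rng):
    """Fit a trivial 'model' (the positive-class rate) on each bootstrap
    sample and average the results with a simple mean."""
    rates = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        rates.append(sum(label for _, label in sample) / len(sample))
    return sum(rates) / len(rates)

def reweight(weights, misclassified, factor=2.0):
    """Boosting step: multiply the weight of each misclassified observation
    by `factor` (an assumed value), then renormalise so the weights sum to 1."""
    raised = [w * factor if miss else w for w, miss in zip(weights, misclassified)]
    total = sum(raised)
    return [w / total for w in raised]
```

For example, `reweight([0.25] * 4, [True, False, False, False])` raises the weight of the first observation relative to the other three, so the next model pays more attention to it.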
Now coming back to XGBoost, why is it so important?
In broad terms, it’s the efficiency, accuracy and feasibility of this algorithm.
It has both a linear model solver and tree learning algorithms. What makes it fast is its capacity to do parallel computation on a single machine.
It also has additional features for doing cross-validation and finding important variables.
Features – XGBoost
- Speed: it can automatically do parallel computation on Windows and Linux, with OpenMP. It is generally over 10 times faster than the classical gbm.
- Input Type: it takes several types of input data:
  - Dense Matrix: R’s dense matrix, i.e. matrix;
  - Sparse Matrix: R’s sparse matrix, i.e. Matrix::dgCMatrix;
  - Data File: local data files;
  - xgb.DMatrix: its own class (recommended).
- Sparsity: it accepts sparse input for both tree booster and linear booster, and is optimized for sparse input;
- Customization: it supports customized objective functions and evaluation functions.
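As an illustration of the customization point, a custom objective for XGBoost is expressed through the first and second derivatives (gradient and hessian) of the loss with respect to the raw prediction. The sketch below computes those two quantities for logistic loss in plain Python, together with a simple error-rate evaluation function. The function names are hypothetical, and the code is not wired into the library; it only shows the kind of arithmetic such functions carry out.

```python
import math

def logistic_grad_hess(raw_scores, labels):
    """Gradient and hessian of logistic loss with respect to the raw score.

    raw_scores: untransformed model outputs (log-odds)
    labels:     0/1 targets
    """
    grads, hessians = [], []
    for score, y in zip(raw_scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # sigmoid turns the raw score into a probability
        grads.append(p - y)                 # first derivative of the loss
        hessians.append(p * (1.0 - p))      # second derivative of the loss
    return grads, hessians

def error_rate(raw_scores, labels):
    """Evaluation function: fraction of rows whose predicted class is wrong.
    A raw score above 0 corresponds to a probability above 0.5."""
    wrong = sum(1 for s, y in zip(raw_scores, labels) if (s > 0.0) != (y == 1))
    return wrong / len(labels)
```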
Numeric vs. categorical variables
XGBoost handles only numeric vectors.
What to do when you have categorical data?
A simple way to convert a categorical variable into a numeric vector is one-hot encoding: each category becomes its own 0/1 column.
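A minimal sketch of one-hot encoding in plain Python follows. The function name `one_hot_encode` is hypothetical, and sorting the categories to fix the column order is an assumption made here for reproducibility.

```python
def one_hot_encode(values):
    """Turn a list of category labels into 0/1 indicator rows.

    Returns the column order (sorted unique categories) and one row per
    input value, with a single 1 in the column of that value's category.
    """
    categories = sorted(set(values))
    column = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows
```

Calling `one_hot_encode(["red", "green", "red", "blue"])` yields three columns, one per colour, and every row contains exactly one 1.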
Tree Boosting in a Nutshell
We first briefly review the learning objective in tree boosting. For a given data set with n examples and m features, a tree ensemble model (shown in the figure above) uses K additive functions to predict the output.
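Written out, the additive model takes the following form. The symbols $\mathcal{D}$, $\phi$, $f_k$ and $\mathcal{F}$ are notation assumed here, with $\mathcal{F}$ standing for the space of regression trees; only n, m and K come from the text above.

```latex
\mathcal{D} = \{(x_i, y_i)\}, \quad |\mathcal{D}| = n, \quad x_i \in \mathbb{R}^m, \quad y_i \in \mathbb{R}

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}
```

Each $f_k$ is a single tree, and the prediction for example $i$ is the sum of the scores that the K trees assign to $x_i$.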
XGBoost has also been widely adopted by industry users, including Google, Alibaba and Tencent, and by various startup companies. According to a popular article in Forbes, XGBoost can scale smoothly to hundreds of workers (each using multiple processors) and solve machine learning problems involving terabytes of real-world data.