How to Win a Data Science Competition

Posted on 2022-05-10 | Edited on 2022-08-07 | In Machine Learning

How to Win a Data Science Competition: Learn from Top Kagglers

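linear model: suited to high-dimensional, sparse, linearly separable spaces

decision tree: struggles to capture linear dependencies; splits the space into boxes

k-NN:

NN: non-linear decision boundaries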

Tips

1. feature preprocessing

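categorical data:
- one hot encoding
- random forests do not need one-hot transformation of categorical data
numeric features:
scale
- tree models do not need the data rescaled
- other models may need it
  - feature scale can affect how well gradient descent works
  - MinMaxScaler
  - StandardScaler
  - sometimes it helps to rescale a single feature on purpose and check whether the model improves; for scale-sensitive models a larger scale makes that feature more important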
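A minimal sketch of the two scalers on a toy matrix (the values are made up for illustration), plus the trick of up-weighting one feature for a scale-sensitive model:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # each column mapped to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column: mean 0, std 1

# Deliberately up-weight the first feature, e.g. before k-NN.
# The factor 3.0 is an arbitrary choice to be tuned on validation.
X_weighted = X_minmax.copy()
X_weighted[:, 0] *= 3.0
```

In a competition the scaler would be fit on train (or train plus test features) once and reused, so that both sets share the same transformation.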
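outliers:
- clip to lower and upper bounds chosen by percentile
  - winsorization: common for financial data
  - a histogram helps to choose the bounds
rank
- effective for linear models, k-NN, and NN
- scipy.stats.rankdata
for non-tree-based models:
- log transform
- square root: np.sqrt(x+2/3)
- both shrink the influence of outliers and widen differences near 0
- models trained on differently preprocessed features can be blended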
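The outlier and transform options above can be sketched on one toy feature. The data and the 1st/99th percentile bounds are assumptions for illustration; `np.log1p` is used here as one common form of the log transform.

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # toy feature with one outlier

# Winsorization: clip to the 1st and 99th percentiles.
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# Rank transform: spacing between sorted values becomes uniform.
x_rank = rankdata(x)

# Log and square-root transforms compress large values.
x_log = np.log1p(x)
x_sqrt = np.sqrt(x + 2 / 3)
```

With only five points the percentile clip barely moves the outlier; on real data the bounds are usually picked after looking at the histogram.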
feature generation:
- add, subtract, multiply, divide
- take the fractional part
- requires understanding the data
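categorical features / ordinal features
- label encoding
  - works well with trees
  - categories must be converted to numbers
    - alphabetical order: sklearn.preprocessing.LabelEncoder
    - order of appearance: pandas.factorize
    - frequency encoding: use each category's frequency as the feature (common for trees)
      # share of rows that fall in each category
      encoding = titanic.groupby('Embarked').size()
      encoding = encoding/len(titanic)
      titanic['enc'] = titanic.Embarked.map(encoding)
    - rank
      from scipy.stats import rankdata
- one hot encoding: common for non-tree models
  - pandas.get_dummies
  - sklearn.preprocessing.OneHotEncoder
  - may not work well with trees
  - may need a sparse matrix (e.g. for words)
  - categories can be combined into new ones, e.g. passenger class plus sex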
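The encodings above, side by side on a toy frame. The columns only mimic the Titanic `Embarked` and `Sex` fields; the values are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Titanic 'Embarked' and 'Sex' columns.
df = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "S", "C"],
    "Sex": ["m", "f", "f", "m", "m", "f"],
})

# Alphabetical label encoding: C -> 0, Q -> 1, S -> 2.
df["emb_label"] = LabelEncoder().fit_transform(df["Embarked"])

# Order-of-appearance encoding: S -> 0, C -> 1, Q -> 2.
df["emb_fact"] = pd.factorize(df["Embarked"])[0]

# Frequency encoding: share of rows in each category.
freq = df.groupby("Embarked").size() / len(df)
df["emb_freq"] = df["Embarked"].map(freq)

# Interaction of two categoricals, then one-hot encoded.
df["emb_sex"] = df["Embarked"] + "_" + df["Sex"]
one_hot = pd.get_dummies(df["emb_sex"])
```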
Datetime
- periodicity
- point in time
- time span (since a past event, until a future event)
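A small sketch of these datetime features with pandas. The timestamps and column names are hypothetical, and using the first row as the reference point is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(
    ["2022-05-10 08:30", "2022-05-14 23:10", "2022-08-07 12:00"])})

# Periodicity: position within a repeating cycle.
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek      # Monday = 0
df["is_weekend"] = df["dayofweek"] >= 5

# Time since a fixed reference point (here: the earliest timestamp).
df["days_since_start"] = (df["ts"] - df["ts"].min()).dt.days

# Time span between consecutive events (NaN for the first row).
df["days_since_prev"] = df["ts"].diff().dt.days
```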
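Coordinates
- single coordinates
  - centre of a region
  - special locations
- statistics
  - statistics over a region
  - distances
- rotating the coordinates can work surprisingly well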
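Why rotation can help: a tree splits on one axis at a time, so a diagonal boundary needs many splits, while after a 45 degree rotation it may need only one. A minimal sketch, with a made-up reference point at the origin standing in for a "special location":

```python
import numpy as np

def rotate(xy, degrees):
    """Rotate an (n, 2) array of coordinates counter-clockwise around the origin."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return xy @ rot.T

xy = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # points on the diagonal
xy_rot45 = rotate(xy, 45)   # the diagonal now lies along the second axis

# Distance to a reference point (hypothetical centre at the origin).
dist_to_center = np.linalg.norm(xy - np.array([0.0, 0.0]), axis=1)
```

Adding a few rotated copies (for example 15, 30, 45 degrees) as extra features and letting validation decide is a cheap way to try this.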
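missing values
- fillna
  - -999, -1
    - note that this strongly affects any later normalization
    - filling NaNs before feature generation can cause problems
  - median, mean
  - reconstruct the value
    - time series: can be estimated
    - otherwise hard to estimate
- add an isNull indicator feature; works for neural networks and trees
- outliers can be treated as missing values
- xgboost can handle NaN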
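A sketch of the fill options on a toy `age` column (hypothetical data). The indicator is created first so the information about missingness survives the fill.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 58.0]})

# Indicator feature, created before the NaNs are overwritten.
df["age_isnull"] = df["age"].isnull().astype(int)

# Option 1: median fill.
df["age_median"] = df["age"].fillna(df["age"].median())

# Option 2: out-of-range sentinel; fine for trees, but it drags the
# column mean far from the real values, which hurts later scaling.
df["age_sentinel"] = df["age"].fillna(-999)
```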
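text
- bag of words
- sklearn.feature_extraction.text.CountVectorizer
- TF-IDF
  - term frequency
  - inverse document frequency
    # x: dense document-term count matrix (rows = documents)
    tf = 1/x.sum(axis=1)[:,None]            # normalize each row to sum to 1
    x = x*tf
    idf = np.log(x.shape[0]/(x>0).sum(0))   # log(n_docs / n_docs containing the term)
    x = x*idf
    # sklearn.feature_extraction.text.TfidfVectorizer
    down-weights high-frequency words
- N-grams
  - sequences of n characters (or words, depending on the analyzer)
  - sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
- lowercase
- lemmatization: convert words to their base form, e.g. cars becomes car
- stemming: reduce words to their stem
- stopwords:
  - NLTK: natural language toolkit library
  - sklearn.feature_extraction.text.CountVectorizer: max_df (drop terms by document frequency)
- embedding (word2vec)
  - words used in similar contexts get nearby vectors; vectors can be added and subtracted; a few hundred dimensions
  - Word2vec, GloVe, FastText
  - Doc2vec
  - pretrained models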
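The TF-IDF steps from the text section can be run end to end on three toy documents (invented for illustration). A word that appears in every document ends up with weight zero.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ate the fish"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray().astype(float)
vocab = vectorizer.vocabulary_            # term -> column index

# Term frequency: normalize each row to sum to 1.
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency: log(n_docs / n_docs containing the term).
idf = np.log(counts.shape[0] / (counts > 0).sum(axis=0))

tfidf = tf * idf
```

`TfidfVectorizer` uses a smoothed idf and L2 row normalization by default, so its numbers will differ from this plain version.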
image
- CNN
- fine-tuning a pretrained model is sometimes very effective on small datasets
  - available in libraries such as Keras
- image augmentation:
  - rotate images to add samples
  - add noise

2. EDA

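check the logic of the data, look for errors, and investigate their causes
understanding how the data was collected helps in setting up a sound validation scheme
plt.hist(x)
plt.plot(x)
plt.scatter(range(len(x)), x, c=y)
x.describe()
look at pairs/groups
check whether train and test have the same distribution
generate new features based on the EDA findings
pd.plotting.scatter_matrix(df)
plt.matshow()
sort_values
check for duplicated or constant features (drop_duplicates)
traintest[f] = traintest[f].factorize()[0]
.select_dtypes(include=['object']) selects columns by dtype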
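A sketch of the constant and duplicated feature check. The frame and column names are hypothetical; label-encoding the object columns first lets two columns that differ only in their labels be caught as duplicates.

```python
import pandas as pd

traintest = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],                                    # constant
    "c": pd.Series(["x", "y", "x", "z"], dtype="object"),
    "d": pd.Series(["p", "q", "p", "r"], dtype="object"),  # same grouping as "c"
})

# Constant features: a single unique value.
constant_cols = [c for c in traintest.columns
                 if traintest[c].nunique(dropna=False) == 1]

# Label-encode object columns so relabelled duplicates become identical.
encoded = traintest.drop(columns=constant_cols).copy()
for f in encoded.select_dtypes(include=["object"]).columns:
    encoded[f] = encoded[f].factorize()[0]

# Duplicated columns: transpose, then flag repeated rows.
duplicate_cols = encoded.columns[encoded.T.duplicated()].tolist()
```

Transposing is fine for a sketch but slow on wide data; comparing column hashes is a common alternative.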

3. Validation and overfitting

validation:
- holdout: sklearn.model_selection.ShuffleSplit
- k-fold: sklearn.model_selection.KFold
- leave-one-out: sklearn.model_selection.LeaveOneOut
Stratification: keep the same target distribution in every fold
Data splitting:
- random: rows are independent
- time-based splits
  - moving window validation: sklearn.model_selection.TimeSeriesSplit
- by id
- combined
  - https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4
- sometimes the distribution gap between the train and test sets needs checking
data leakage:
- time series
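How the splitters above behave on a tiny imbalanced target (toy data, ordered so that the difference is visible):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # one third positives

# Plain K-fold without shuffling: folds ignore the class balance.
kf_pos = [y[val].mean() for _, val in KFold(n_splits=4).split(X)]

# Stratified K-fold: every fold keeps the overall positive rate.
skf_pos = [y[val].mean() for _, val in StratifiedKFold(n_splits=4).split(X, y)]

# Time-based split: validation indices always come after training indices.
ts_ok = all(train.max() < val.min()
            for train, val in TimeSeriesSplit(n_splits=3).split(X))
```

For time-ordered data the last check is the point: any split where training rows come after validation rows leaks future information.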

4. metrics evaluation

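regression metrics:
- the loss function can be changed:
- MSE: the best constant prediction is the mean
- RMSE: root mean squared error
- R-squared
- MAE: mean absolute error; less sensitive to outliers than MSE
  - the best constant prediction is the median
    - robust to outliers
  - XGBoost cannot optimize it directly, because the second derivative is 0
  - LightGBM can
  - Huber loss is similar, especially when the error is large
- MSPE:
  - the best constant prediction is the weighted target mean
- MAPE:
  - the best constant prediction is the weighted target median
  - outliers can get very high weight; rarely used
- RMSLE: RMSE in log space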
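The "best constant" claims for MSE and MAE can be checked numerically with a brute-force search over constant predictions. The targets below are toy values with one outlier.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # toy targets with one outlier
candidates = np.linspace(0, 100, 10001)      # constants to try, step 0.01

# Error of every candidate constant against all targets.
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_mse_const = candidates[mse.argmin()]    # lands next to y.mean()
best_mae_const = candidates[mae.argmin()]    # lands next to np.median(y)
```

The outlier pulls the MSE optimum far away from the bulk of the data, while the MAE optimum stays at the median, which is the robustness mentioned above.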
classification metrics:
- accuracy
- log loss
  - binary
  - multiclass
- AUC: area under the ROC curve; depends only on the ordering of the predictions
- ROC
- Cohen's Kappa
- confusion matrix
- weighted error
- quadratic and linear weighted kappa: mostly used in medical applications
optimization
- some models cannot optimize certain loss functions directly:
  - XGBoost cannot use MSPE
  - custom loss for XGBoost:
    import numpy as np
    def logregobj(preds, dtrain):
        labels = dtrain.get_label()
        preds = 1.0 / (1.0 + np.exp(-preds))  # raw scores sometimes need transforming first
        grad = preds - labels                  # first derivative of log loss
        hess = preds * (1.0 - preds)           # second derivative of log loss
        return grad, hess
- early stopping: prevents overfitting
- transform the target: e.g. exponent, log
- calibrate predictions
  - Platt scaling: fit logistic regression to predictions (as in stacking)
  - isotonic regression: fit isotonic regression to predictions
  - stacking: fit XGBoost or neural net to predictions
- AUC (ROC) optimization:
  - pointwise loss: \(\min \sum_{i} L_{point}(\hat y_i; y_i)\)
  - pairwise loss:
    - \(\min \sum_{i,j} L_{pair}(\hat y_i, \hat y_j; y_i, y_j)\)
    - \[ \operatorname{Loss}=-\frac{1}{N_{0} N_{1}} \sum_{j: y_{j}=1}^{N_{1}} \sum_{i: y_{i}=0}^{N_{0}} \log \left(\operatorname{prob}\left(\hat{y}_{j}-\hat{y}_{i}\right)\right) \]
  - XGBoost and LightGBM support this
- quadratic weighted Kappa:
  - optimize MSE
  - \[ \begin{aligned} \operatorname{Kappa}(y, \hat{y}) & \approx 1-\frac{\frac{1}{N} \sum_{i=1}^{N}\left(\hat{y}_{i}-y_{i}\right)^{2}}{\text{hard to deal with part}} \\ &=1-\frac{\operatorname{MSE}(y, \hat{y})}{\text{hard to deal with part}} \end{aligned} \]
- optimize thresholds
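When writing a custom objective such as the XGBoost log loss above, it is worth checking the gradient and Hessian against finite differences before training. A self-contained sketch with NumPy only (the scores and labels are made up, and no XGBoost call is involved):

```python
import numpy as np

def logloss(margin, label):
    """Binary log loss as a function of the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + np.exp(-margin))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

margin = np.array([-2.0, -0.5, 0.3, 1.5])
label = np.array([0.0, 1.0, 1.0, 0.0])

# Analytic derivatives, same formulas as in the custom objective.
p = 1.0 / (1.0 + np.exp(-margin))
grad = p - label
hess = p * (1.0 - p)

# Central finite differences as an independent check.
eps = 1e-4
num_grad = (logloss(margin + eps, label) - logloss(margin - eps, label)) / (2 * eps)
num_hess = (logloss(margin + eps, label) - 2 * logloss(margin, label)
            + logloss(margin - eps, label)) / eps ** 2
```

The same check applied to MAE shows the problem noted earlier: its second derivative is zero almost everywhere.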
mean encoding: useful when there are many categorical features
StratifiedKFold
inspecting the trees may reveal new insights
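One common way to combine mean encoding with `StratifiedKFold` is to compute the category means out of fold, so that no row is encoded with its own target. A sketch on a hypothetical `city` column; the fallback to the global mean is an assumption, not something stated in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({
    "city": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "target": [1, 1, 0, 0, 1, 0, 0, 0],
})
global_mean = df["target"].mean()
df["city_mean_enc"] = np.nan

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for fit_idx, enc_idx in skf.split(df, df["target"]):
    # Category means come only from the rows outside the current fold.
    means = df.iloc[fit_idx].groupby("city")["target"].mean()
    df.loc[df.index[enc_idx], "city_mean_enc"] = (
        df["city"].iloc[enc_idx].map(means).to_numpy()
    )

# Categories unseen in the fitting part fall back to the global mean.
df["city_mean_enc"] = df["city_mean_enc"].fillna(global_mean)
```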

5. hyper parameter tuning

libraries:
- hyperopt
- scikit-optimize
- spearmint
- gpyopt
- robo
- smac3
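GBDT
- green increases the fit, red reduces it (colour coding of the parameters in the course material)
- min_child_weight is important; increasing it makes the model more conservative (try 0, 5, 15, 300, ...)
- sklearn RandomForest / ExtraTrees
  - n_estimators (the higher the better)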
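The libraries listed above each have their own API. As a dependency-light stand-in (not one of the listed libraries), scikit-learn's `RandomizedSearchCV` shows the general shape of a search over tree parameters. The dataset is synthetic and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions={
        "max_depth": [2, 4, 8, None],       # deeper trees fit more
        "min_samples_leaf": [1, 5, 15],     # larger leaves are more conservative
    },
    n_iter=5,            # number of sampled parameter combinations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

`n_estimators` is left out of the search on purpose: for forests more trees rarely hurt, so it is usually fixed at whatever the time budget allows.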
neural networks
data can be stored in HDF5 or NPY format, or pickled with cPickle

6. pipeline

EDA:
- histogram
- feature vs target
- univariate predictability metrics (IV, R, AUC)
- binning numerical features and correlation matrices
Decide the CV strategy
feature engineering:
Modeling
- read the literature
Ensemble

7. feature

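Bray-Curtis metric
KNN features
matrix factorization: use matrix factorization and arithmetic combinations to add or reduce features
feature interactions:
- f1 and f2: combine into a single one-hot matrix: one-hot encode each, then multiply pairwise
- integrals and derivatives
- random forest feature importance can be used to select features
- create new features from the tree structure:
  - xgboost: predict(pred_leaf=True)
  - sklearn: apply()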
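A sketch of the tree-structure idea using sklearn's `apply()`: each sample is described by the leaf it lands in, per tree, and those leaf indices become categorical features. The data is synthetic and the forest settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# apply() returns, for every sample, the index of the leaf reached in each tree.
leaves = forest.apply(X)                      # shape: (n_samples, n_trees)

# One-hot encode the leaf indices; each row has exactly one active leaf per tree.
leaf_features = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)
```

Each leaf corresponds to a conjunction of splits, so these columns act as learned high-order interactions. To avoid leakage the forest should be fit on different rows than the model that consumes the leaf features.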
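tSNE
- often used in EDA
- great tool for visualization
- but the hyperparameters need tuning
- distill.pub: a good site for learning about it
- libraries: tsne, sklearn (slow)
- results may be hard to interpret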

8. Ensemble

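bagging:
- sklearn: BaggingClassifier, BaggingRegressor
boosting:
- weight-based boosting: focus more on the samples with larger errors
  - AdaBoost
- residual-based boosting
  - learning rate
  - number of estimators
  - models:
    - DART: uses the predictions of only a subset of the previous models
    - fully gradient based
    - xgboost
    - lightgbm
    - catboost
    - sklearn GBM
stacking:
- feed the predictions of all models into a new model
- respect time ordering
- StackNet
  - each model can be trained in K-fold fashion
  - the original features can be added to the stacking input
- beware of data leakage
- StackNet: its documentation shows which parameters matter for each model
- Xcessiv
- stacked ensembles from H2O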
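A minimal stacking sketch with out-of-fold predictions, which is the K-fold training mentioned above and the main guard against leakage. The base models, the logistic regression meta-model and the synthetic data are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Meta-model trained on the base-model predictions.
meta = LogisticRegression().fit(oof, y)
```

To add the original features to the stack, `np.hstack([oof, X])` can be passed to the meta-model instead of `oof`. For time-ordered data the random folds here would need to be replaced by a time-based split.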
catboost
- fast, with many built-in functions
- overfitting detector