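Having recently gotten hands-on with a few machine learning tasks, I have come to feel that doing machine learning is a lot like being a doctor: it takes a deep store of knowledge and accumulated experience. A beginner facing a pile of data (sometimes you even have to work out how to obtain the data yourself) has no idea where to start. Plenty of ready-made frameworks exist, but plugging data into them blindly leads to all kinds of unexpected problems, with no sense of where they come from or how to fix them.

So while studying the theory (getting to know the various algorithms and models, and remembering the scenarios each one suits even if you do not understand the math behind it), you also need to build up "clinical" experience.

A large part of that clinical experience, I think, is prescribing the right remedy for the illness. Beyond choosing a suitable machine learning or statistical algorithm, hyperparameter tuning is just as important. Think of it as writing a prescription: the mix of drugs and their dosage decide whether the result is a poison or a cure.

So machine learning problems cannot be solved by brute force alone. Thinking that combines reasoning with experience gets twice the result for half the effort. Blindly trying different algorithms and tweaking different parameters is perhaps one of the great taboos of this science (or rather engineering, as my teacher puts it). On the other hand, it also resembles Shennong tasting the hundred herbs: some new problems do call for continual exploration, feeling for the stones while crossing the river.

This post can be read as a doctor's log of diagnosed cases, or as a ladder built step by step up to the shoulders of giants, so that one can see farther and wider.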

**Common ways to increase the accuracy of your model**

1、Feature Scaling and/or Normalization: For many linear machine learning algorithms, normalization is essential when the features differ by orders of magnitude. Otherwise the features with large values tend to dominate the others in a classifier such as Logistic Regression.
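As a minimal sketch of the scaling step, assuming scikit-learn's `StandardScaler` and two made-up features on very different scales, the scaler is fitted on the training data only and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (values are made up).
X_train = np.array([[1.0, 20000.0],
                    [2.0, 50000.0],
                    [3.0, 80000.0]])
X_test = np.array([[2.5, 65000.0]])

scaler = StandardScaler()
# Fit on the training data only, then reuse the same statistics on the test data.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After scaling, each training column has zero mean and unit variance, so neither feature dominates merely because of its units.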

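2、Class Imbalance: An imbalance between classes often leads a model astray: it can reach a high accuracy simply by always predicting the majority class. You can fix this by rebalancing the data or by giving the classes different weights. (Most classifiers in scikit-learn, including LogisticRegression, have a class_weight parameter.)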
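A small sketch of class weighting, assuming a made-up dataset with 90 negatives and 10 positives. `compute_class_weight` is used here only to show what `class_weight="balanced"` works out to:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 90 negatives, 10 positives (made-up data).
y = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + y[:, None]  # shift the positive class slightly

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# The classifier applies the same weighting when class_weight="balanced".
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Here the rare class gets a weight of 5.0 and the common class about 0.56, so mistakes on the rare class cost roughly nine times more.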

3、Optimize other scores - You can also optimize other metrics such as Log Loss and F1-Score. The F1-Score is useful when the classes are imbalanced.
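To illustrate why accuracy alone can mislead, here is a sketch with made-up labels in which a "model" always predicts the majority class. The constant probability of 0.1 is an assumption chosen only for the log-loss call:

```python
from sklearn.metrics import accuracy_score, f1_score, log_loss

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100      # always predict the majority class
y_prob = [0.1] * 100    # assumed probability of class 1 for every sample

acc = accuracy_score(y_true, y_pred)             # looks good: 0.9
f1 = f1_score(y_true, y_pred, zero_division=0)   # reveals the problem: 0.0
ll = log_loss(y_true, y_prob)                    # probability-based metric
```

The accuracy is 0.9 even though not a single positive example is found, which the F1-Score of 0.0 makes obvious.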

4、Hyperparameter Tuning - Grid Search
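A minimal grid search sketch, assuming scikit-learn's `GridSearchCV`, the built-in iris dataset, and an arbitrary small grid for an SVM (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```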

5、Explore more classifiers: SVMs can often learn more complex decision boundaries than ordinary linear models, and so solve more complex problems. Decision Trees can learn rules from your data.

6、Error Analysis: For each of your models, go back and look at the cases where it fails. For binary or multi-class classification, a confusion matrix is a good way to see which classes the model gets right and which it gets wrong. (You might find that some of your models work well on one part of the parameter space while others work better on other parts. If so, ensemble techniques such as scikit-learn's VotingClassifier often give the best results.)
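A short sketch of a confusion matrix on made-up three-class labels. In scikit-learn's convention, rows are true classes and columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a 3-class problem.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)
# Off-diagonal entries show which classes get confused with which.
```

Reading the off-diagonal cells tells you, for example, that one class-0 sample was mistaken for class 1.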

7、Feature engineering - look for more features

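8、Handling missing values: For continuous variables, you can replace missing values with the mean, median, or mode (the value that appears most often), or simply drop the whole row if the dataset is large and few values are missing. For categorical variables, you can fill in the most frequent value, predict the missing value with another model (e.g., KNN), or drop the whole row.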
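The replacement strategies above can be sketched with scikit-learn's `SimpleImputer`. The tiny arrays are made up, and the model-based (KNN) option is not shown:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Continuous column: fill the gap with the median of the observed values.
X_num = np.array([[1.0], [np.nan], [3.0], [8.0]])
median_imputer = SimpleImputer(strategy="median")
X_num_filled = median_imputer.fit_transform(X_num)

# Categorical column: fill the gap with the most frequent category.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
mode_imputer = SimpleImputer(strategy="most_frequent")
X_cat_filled = mode_imputer.fit_transform(X_cat)
```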

9、Handling outliers: You can remove them outright or apply a transformation that reduces their influence.
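One possible way to do either, sketched on made-up values: the 1.5 × IQR rule is an assumed (common) criterion for flagging outliers, and `log1p` is just one example of a transformation.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])

# Option 1: flag points outside 1.5 * IQR and drop them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
values_without_outliers = values[~is_outlier]

# Option 2: keep every point but compress large values with a log transform.
values_log = np.log1p(values)
```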

10、Use ensemble models (e.g., bagging or boosting); a random forest is one example.

**Machine Learning Project Checklist**

*Reference: Aurelien Geron, Hands-On Machine Learning with Scikit-Learn & TensorFlow*

There are eight main steps in a Machine Learning project (feel free to adapt this checklist to your needs):

- 1.Frame the problem and look at the big picture.
- 2.Get the data
- 3.Explore the data to gain insights
- 4.Prepare the data to better expose the underlying data patterns to Machine Learning algorithms
- 5.Explore many different models and short-list the best ones
- 6.Fine-tune your models and combine them into a great solution
- 7.Present your solution
- 8.Launch, monitor, and maintain your system

**1.Frame the problem and look at the big picture**

- Define the objective in business terms.
- How will your solution be used?
- What are the current solutions/workarounds (if any)?
- How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
- How should performance be measured?
- Is the performance measure aligned with the business objective?
- What would be the minimum performance needed to reach the business objective?
- What are comparable problems? Can you reuse experience or tools?
- Is human expertise available?
- How would you solve the problem manually?
- List the assumptions you(or others) have made so far.
- Verify assumptions if possible

**2.Get the data**

Note: automate as much as possible so you can easily get fresh data.

- List the data you need and how much you need
- Find and document where you can get that data
- Check how much space it will take
- Check legal obligations, and get authorization if necessary
- Get access authorizations
- Create a workspace(with enough storage space)
- Get the data
- Convert the data to a format you can easily manipulate(without changing the data itself)
- Ensure sensitive information is deleted or protected(e.g., anonymized)
- Check the size and type of data(time series, sample, geographical, etc)
- Sample a test set, put it aside, and never look at it(no data snooping!)
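Setting a test set aside can be sketched with scikit-learn's `train_test_split`. The data here is made up, and the 20% test size, the fixed seed, and the stratification are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples, 2 features, 20% positive labels.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 40 + [1] * 10)

# Hold out 20% once, with a fixed seed so the split is reproducible,
# and stratify so the class ratio is the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```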

**3.Explore the Data**

Note: try to get insights from a field expert for these steps.

- Create a copy of the data for exploration (sampling it down to a manageable size if necessary)
- Create a Jupyter notebook to keep a record of your data exploration
- Study each attribute and its characteristics:
    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc)
    - % of missing values
    - Noisiness and type of noise(stochastic, outliers, rounding errors, etc)
    - Possibly useful for the task?
    - Type of distribution (Gaussian, uniform, logarithmic, etc)

- For supervised learning tasks, identify the target attribute(s)
- Visualize the data
- Study the correlations between attributes.
- Study how you would solve the problem manually
- Identify the promising transformations you may want to apply
- Identify extra data that would be useful(go back to "Get the data" step)
- Document what you have learned

**4.Prepare the data**

Note:

- Work on copies of the data (keep the original dataset intact)
- Write functions for all data transformations you apply, for five reasons:
    - So you can easily prepare the data the next time you get a fresh dataset
    - So you can apply these transformations in future projects
    - To clean and prepare the test set
    - To clean and prepare new data instances once your solution is live
    - To make it easy to treat your preparation choices as hyperparameters.

Steps:

- Data cleaning:
    - Fix or remove outliers (optional)
    - Fill in missing values (e.g. with zero, mean, median...) or drop their rows (or columns).

- Feature selection (optional)
    - Drop the attributes that provide no useful information for the task

- Feature engineering, where appropriate:
    - Discretize continuous features
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc)
    - Aggregate features into promising new features.

- Feature scaling: standardize or normalize features.
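The advice to wrap transformations in functions can be sketched with a scikit-learn `Pipeline`. The helper name `build_preparation_pipeline` and the toy data are hypothetical; the imputation strategy is exposed as an argument so it can later be treated as a hyperparameter:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_preparation_pipeline(impute_strategy="median"):
    """Hypothetical helper: fill missing values, then standardize."""
    return Pipeline([
        ("imputer", SimpleImputer(strategy=impute_strategy)),
        ("scaler", StandardScaler()),
    ])

X_train = np.array([[1.0, 100.0], [2.0, np.nan], [3.0, 300.0]])
X_new = np.array([[np.nan, 200.0]])  # e.g. test data or a live instance

prep = build_preparation_pipeline()
X_train_prepared = prep.fit_transform(X_train)  # learn medians, means, stds
X_new_prepared = prep.transform(X_new)          # reuse them unchanged
```

The same fitted object prepares the training set, the test set, and new instances, which is what the five reasons above ask for.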

**5.Short-list promising models**

Notes:

- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time(be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible

Steps:

- Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
- Measure and compare their performance
    - For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N-folds

- Analyze the most significant variables for each algorithm
- Analyze the types of errors the models make.
    - What data would a human have used to avoid these errors?

- Have a quick round of feature selection and engineering
- Have one or two more quick iterations of the five previous steps
- Short-list the top three to five most promising models, preferring models that make different types of errors.
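The "train quick models, then compare mean and standard deviation over N folds" steps can be sketched as follows. The iris dataset and the three candidate models are assumptions chosen only to keep the example small:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Quick models from different families, mostly with default parameters.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    results[name] = (scores.mean(), scores.std())
```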

**6.Fine-tune the system**

Notes:

- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning
- As always automate what you can.

Steps:

- Fine-tune the hyperparameters using cross-validation
    - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g. should I replace missing values with zero or with median value? Or just drop the rows?)
    - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g. using Gaussian process priors)
    - Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.

- Try Ensemble methods. Combining your best models will often perform better than running them individually
- Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
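A minimal random search sketch, assuming scikit-learn's `RandomizedSearchCV` with log-uniform distributions from SciPy. The dataset, the ranges, and `n_iter=10` are illustrative choices:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sample hyperparameters from distributions instead of a fixed grid.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-3, 1e0),
}
search = RandomizedSearchCV(
    SVC(kernel="rbf"), param_distributions,
    n_iter=10, cv=5, random_state=0,
)
search.fit(X, y)
```

With a fixed budget of `n_iter` trials, random search tries a different value of every hyperparameter on each trial, whereas a grid repeats the same few values.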

**7.Present your solution**

- Document what you have done
- Create a nice presentation
    - Make sure you highlight the big picture first

- Explain why your solution achieves the business objective
- Don't forget to present interesting points you noticed along the way
    - Describe what worked and what did not.
    - List your assumptions and your system's limitations

- Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g. "the median income is the number-one predictor of housing prices")

**8.Launch**

- Get your solution ready for production (plug into production data inputs, write unit tests, etc)
- Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
    - Beware of slow degradation too: models tend to "rot" as data evolves
    - Measuring performance may require a human pipeline (e.g. via a crowdsourcing service)
    - Also, monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.

- Retrain your models on a regular basis on fresh data(automate as much as possible).

**Positive & Negative**

*Source: https://en.wikipedia.org/wiki/Receiver_operating_characteristic*

### Learning Rate of Gradient Descent

Different learning rates give different costs and thus different prediction results.

- If the learning rate is too large (0.01 in this example), the cost may oscillate up and down. It may even diverge (though in this example, using 0.01 still eventually ends up at a good value for the cost).
- A lower cost doesn't mean a better model. You have to check for possible overfitting, which happens when the training accuracy is a lot higher than the test accuracy.
- In deep learning, we usually recommend that you:
    - Choose the learning rate that better minimizes the cost function.
    - If your model overfits, use other techniques to reduce overfitting.
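The effect of the learning rate can be seen on a toy problem. This sketch minimizes f(w) = w² with plain gradient descent. The function, the starting point, and the three rates are assumptions for illustration, and the thresholds for "too large" are specific to this toy function:

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize f(w) = w**2 and return the cost after each step."""
    w = w0
    costs = []
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # gradient descent update
        costs.append(w ** 2)
    return costs

costs_small = gradient_descent(0.01)  # converges, but slowly
costs_good = gradient_descent(0.1)    # converges quickly
costs_large = gradient_descent(1.1)   # overshoots more each step and diverges
```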

### Activation Function

The tanh (hyperbolic tangent) function almost always works better than the sigmoid function as a hidden-layer activation in a neural network. Its output lies between -1 and 1 with a mean close to 0, which centers the data for the layers after it. The one exception is the output layer: when the output should lie between 0 and 1, it is better to use sigmoid or another function with that output range. A downside of both sigmoid and tanh is that when the input is very large or very small, the gradient becomes very small, which slows down gradient descent. The Rectified Linear Unit (ReLU), `a = max(0, z)`, avoids this saturation for positive inputs, so ReLU has become the default choice of activation function for hidden layers.
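A small NumPy sketch of the saturation argument: it evaluates the derivative of each activation at a very negative, a zero, and a very positive input. The helper function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Derivative is 1 for positive inputs and 0 otherwise.
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
grads = {
    "sigmoid": sigmoid_grad(z),  # nearly 0 at both extremes
    "tanh": tanh_grad(z),        # nearly 0 at both extremes
    "relu": relu_grad(z),        # stays 1 for any positive input
}
```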

### Bias and Variance

Generally speaking, traditional machine learning algorithms face a bias-variance trade-off. In the big data era, however, deep learning lets you both use more data and build a bigger network, and training a bigger neural network almost never hurts so long as you regularize.

High bias usually goes with underfitting. Look at the performance (or error) on the training data to see whether high bias exists. If it does, try a bigger network, longer training, or a more suitable algorithm. To diagnose high variance, compare the performance (or error) on the training and development data; then use more data, regularization, or another algorithm.

### Regularization

**Dropout**

Intuition: A unit can't rely on any one feature, so it has to spread out its weights; it can't put all its bets on one or a few particular nodes. This helps shrink the squared norm of the weights (similar to L2 regularization, but more adaptive to the scale of different inputs). The keep_probability of dropout can differ from layer to layer, depending on the size or position of the layer. Dropout is usually not applied to the input layer. The more layers you apply dropout to, the more hyperparameters you have to tune, so it is sometimes applied to only some of the layers.

During training, divide the activations of each dropout layer by keep_prob to keep the same expected value. For example, if keep_prob is 0.5, then on average half the nodes are shut down, so the output is scaled by 0.5 because only the remaining half contribute. Dividing by 0.5 is equivalent to multiplying by 2, so the output regains the same expected value. You can check that this also works for values of keep_prob other than 0.5.

Do not apply dropout at test time, because it would make your predictions random. And because you already divided the output of each dropout layer by keep_prob during training, you don't need any extra scaling at test time. During training, apply dropout (with the same mask) in both forward and backward propagation.
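A minimal NumPy sketch of this "inverted dropout" scheme on a layer whose activations are all 1, an assumption that makes the expected value easy to check. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5

a = np.ones((1000, 100))              # toy activations of one layer
mask = rng.random(a.shape) < keep_prob

# Forward pass (training): drop units, then rescale by 1 / keep_prob.
a_train = a * mask / keep_prob

# Backward pass (training): reuse the same mask and the same scaling.
da = np.ones_like(a)                  # toy upstream gradient
da_train = da * mask / keep_prob

# Test time: no mask and no extra scaling.
a_test = a
```

The mean of `a_train` stays close to the mean of `a`, which is why nothing needs to be rescaled at test time.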

**Data Augmentation**
Sometimes, getting additional training data is expensive, so you can use data augmentation to generate more. For example, you can flip a picture horizontally, rotate it, distort it, or zoom in or out, and then add the results to the training data.
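As a rough sketch, flips and 90-degree rotations can be done with NumPy alone. The 3×4 integer array stands in for a real image and is purely illustrative:

```python
import numpy as np

image = np.arange(12).reshape(3, 4)   # toy "image": height 3, width 4

flipped = np.fliplr(image)            # horizontal flip
rotated = np.rot90(image)             # rotate 90 degrees counter-clockwise

# Same-shape variants can be stacked with the original as extra samples
# that share the original's label.
augmented = np.stack([image, flipped])
```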

**Early Stopping**

Early stopping halts training once the dev-set error stops improving. One reason it is not used more often is that it affects training data performance and dev data performance of the model at the same time, which violates the principle of orthogonalization.

### Initialization

A well-chosen initialization can:

- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error

So:

- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don't initialize to values that are too large
- "He" initialization works well for networks with ReLU activations.
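A minimal NumPy sketch of "He" initialization for one fully connected layer, assuming the usual form in which weights are drawn from a normal distribution with variance 2 / n_in. The function name and the layer sizes are hypothetical.

```python
import numpy as np

def he_initialize(n_in, n_out, rng):
    """Hypothetical helper: He-initialized weights and zero biases."""
    # Random values break symmetry; the sqrt(2 / n_in) factor keeps
    # them from being too large.
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(0)
W, b = he_initialize(n_in=500, n_out=400, rng=rng)
```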

### Baseline

A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes what you compare any other machine learning algorithm against.

In more detail:

A machine learning algorithm tries to learn a function that models the relationship between the input (feature) data and the target variable (or label). When you test it, you will typically measure performance in one way or another. For example, your algorithm may be 75% accurate. But what does this mean? You can infer this meaning by comparing with a baseline's performance.

Typical baselines include those supported by scikit-learn's "dummy" estimators:

Classification baselines:

- “stratified”: generates predictions by respecting the training set’s class distribution.
- “most_frequent”: always predicts the most frequent label in the training set.
- “prior”: always predicts the class that maximizes the class prior.
- “uniform”: generates predictions uniformly at random.
- “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.

Regression baselines:

- “mean”: always predicts the mean of the training set.
- “median”: always predicts the median of the training set.
- “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.
- “constant”: always predicts a constant value that is provided by the user.

In general, you will want your approach to outperform the baselines you have selected. In the example above, you would want your 75% accuracy to be higher than any baseline you have run on the same data.
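A short sketch of such a baseline with scikit-learn's `DummyClassifier`, on made-up labels where 70% of the samples belong to class 0:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: the features are ignored by the dummy estimator.
X = np.zeros((100, 1))
y = np.array([0] * 70 + [1] * 30)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)  # always predicts class 0
```

On this data the baseline already reaches 70% accuracy, so a model that is 75% accurate is only a small improvement over always guessing the majority class.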