This post is a reflection on my Mechanism of Action Classification project. It describes how I made decisions on exploratory analysis, cleaning, and modeling. My code is available on GitHub. If you want to discuss anything or find an error, please email me at Lvqingchuan@gmail.com :)

This project predicts 206 Mechanism of Action (MoA) response targets for different samples, given 875 features such as gene expression data and cell viability data. Features prefixed with g- are gene expression data, and those prefixed with c- are cell viability data. Control perturbations (cp_type = ctrl_vehicle) have no MoAs; cp_time and cp_dose indicate treatment duration (24…
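As a sketch of how that control flag can be used, here is a minimal pandas example; the column names follow the description above, but the rows and feature values are made up, and only a couple of the 875 feature columns are shown:

```python
import pandas as pd

# Toy frame mimicking the layout described above (illustrative rows only).
df = pd.DataFrame({
    'cp_type': ['trt_cp', 'ctrl_vehicle', 'trt_cp'],
    'cp_time': [24, 48, 72],
    'cp_dose': ['D1', 'D2', 'D1'],
    'g-0': [0.52, -1.21, 0.33],   # a gene expression feature
    'c-0': [1.10, 0.44, -0.75],   # a cell viability feature
})

# Control perturbations have no MoAs, so they can be separated out
# (or their 206 targets forced to zero) before modeling.
treated = df[df['cp_type'] != 'ctrl_vehicle']
print(len(treated))  # 2
```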

In this post, we’ll walk through the basic concepts, properties, and practical uses of the Normal distribution. If you want to discuss anything or find an error, please email me at Lvqingchuan@gmail.com :)

Before getting started, take a close look at the diagram above. It depicts the density of 10,000 random samples drawn from the standard Normal distribution N(0, 1). What do you see? The graph is fairly symmetric, with most data points centered around the mean of 0. The further from the mean, the fewer data points there are. …
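The picture can be reproduced numerically. A minimal sketch with NumPy (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)  # N(0, 1)

# Symmetry: the sample mean sits near 0, and about half the points
# fall on each side of it.
print(round(samples.mean(), 2))
print(round((samples < 0).mean(), 2))

# Thinner tails away from the mean: roughly 68% of points lie within
# one standard deviation of the mean, and roughly 95% within two.
print(round((np.abs(samples) < 1).mean(), 2))
print(round((np.abs(samples) < 2).mean(), 2))
```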

In this post, we’ll walk through the basic concepts behind Random Forest and discuss practical implementation problems such as highly correlated features, feature sparsity, and imbalanced classes. Then we’ll compare Random Forest to Boosting Trees and Decision Trees in both concept and performance! If you want to discuss anything or find an error, feel free to email me at lvqingchuan@gmail.com :)
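To preview the Random Forest vs. single Decision Tree comparison, here is a minimal scikit-learn sketch; the synthetic dataset and hyperparameters are made up for illustration, not taken from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Averaging many decorrelated trees usually reduces variance, so the
# forest tends to generalize better than a single deep tree.
tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(tree_acc, forest_acc)
```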

This post is part of my effort to improve my technical projects. Each post describes how I identified suspicious spots and updated them. You can find Part I: Regressors (Price) and Part II: Time Series (Sales) here. If you want to discuss anything, please feel free to leave a comment or message me on LinkedIn.

This project extracts support phrases for the sentiment labels of English tweets. The training data includes 27,481 tweets with sentiments and selected texts (support phrases). The sentiment breakdown is 40% neutral, 31% positive, and 28% negative.

- Investigate the most common words, special characters, and proportional length…

This post is part of my effort to improve my technical projects. Each post describes how I identified suspicious spots and updated them. You can find Part I: Regressors and Part III: NLP here. If you want to discuss anything, please feel free to leave a comment or message me on LinkedIn.

This project predicted daily sales for food categories at Walmart stores in California. The input data (06/19/2015–06/19/2016) covered item IDs, item sales, item prices, departments, product categories, store IDs, and holiday/special events.

- Investigated the impact of holidays on daily sales
- Explored trends for each state, category, and store
- Trained…

In this post, I discuss how I edited the feature engineering and modeling parts. This post is the first in the “Redoing Projects” series. You can find Part II: Time Series (Sales) and Part III: Tweet Sentiment Extraction (NLP) here.

This project aims to predict house prices in Ames, Iowa using 79 features (2006–2010). The available dataset had 1,460 observations.

- Engineered features for tree algorithms and linear regression separately: used a Label Encoder for trees and a One-Hot Encoder for linear regression with regularization
- Checked whether missing values are missing (completely) at random
- Built PCA pipelines within cross-validation and gradient descent frameworks
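The first bullet can be sketched concretely. A minimal example of the two encodings on a hypothetical categorical column (not taken from the Ames data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column for illustration.
df = pd.DataFrame({'quality': ['low', 'high', 'medium', 'high']})

# Label encoding: a single integer column. Fine for trees, which only
# need split points, not a meaningful numeric scale.
df['quality_label'] = LabelEncoder().fit_transform(df['quality'])

# One-hot encoding: one indicator column per level. Avoids imposing a
# fake ordering on a regularized linear model.
one_hot = pd.get_dummies(df['quality'], prefix='quality')
print(one_hot.shape)  # (4, 3)
```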

In this post, I’ll show you the assumptions needed for linear regression coefficient estimates to be **unbiased**, and discuss other “nice to have” properties. There are many versions of the linear regression assumptions on the internet; hopefully, this post will make them clear.

**“Must have” Assumption 1: the conditional mean of the errors is zero**

E(ε | X) = 0 means that, given the observed data, the errors of our regression average out to zero. This is very straightforward if you think of the definition of unbiasedness: the mean of an estimator equals the true value of the parameter it estimates. …
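This can be checked by simulation: when the errors are mean-zero given X, the OLS slope averages to the true coefficient across repeated samples. A minimal sketch with made-up numbers:

```python
import numpy as np

# Simulate y = 2x + eps with E(eps | x) = 0, and check that the OLS
# slope averages to the true value 2 across many repetitions.
rng = np.random.default_rng(42)
true_slope = 2.0
estimates = []
for _ in range(2000):
    x = rng.normal(size=100)
    eps = rng.normal(size=100)      # mean-zero, independent of x
    y = true_slope * x + eps
    slope = (x @ y) / (x @ x)       # OLS slope, no intercept
    estimates.append(slope)

# Individual estimates scatter around 2, but their average is close to it.
print(round(np.mean(estimates), 2))
```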

In this post, we’ll walk through basic techniques of data manipulation and simulation with Python.

**Data manipulation**

A. Group aggregation

Given a data frame, say we want to summarize customer orders by gender. We can use a simple groupby and agg:

```python
values_gender = (
    csv_file
    .groupby(['gender'])
    .agg(avg_order_values=('value', 'mean'),
         count_order=('value', 'size'))
    .reset_index()
)
```

In the code, we create a new data frame, values_gender, from the original data, csv_file. groupby specifies which variable to group by, similar to GROUP BY in SQL. agg lets you specify which variable to summarize at the gender level, and which summary statistics to use. In…
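Here is the same groupby/agg pattern run end-to-end on a made-up orders table (the values are invented for illustration):

```python
import pandas as pd

# A toy orders table; column names match the snippet above.
csv_file = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F'],
    'value':  [10.0, 20.0, 30.0, 40.0, 50.0],
})

values_gender = (
    csv_file
    .groupby(['gender'])
    .agg(avg_order_values=('value', 'mean'),   # mean order value per gender
         count_order=('value', 'size'))        # number of orders per gender
    .reset_index()
)
print(values_gender)
```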

In this post, we’ll learn what power, significance level, and Type I/II errors are, and how they relate to each other.

**Power**

Once we set up the null and alternative hypotheses, we collect data and compute a test statistic. The power of a test (or simply “power”) is the probability of rejecting the null hypothesis when the alternative hypothesis is true.

For example, your null hypothesis is that the population mean is no less than 22, and your alternative hypothesis is that the population mean is less than 22. In this case, you collect data on n samples and compute the sample mean. You will reject the null hypothesis if your…
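Continuing that example, here is a sketch of computing power for a one-sided z-test using only the standard library; the sd, sample size, and true mean under the alternative are all made-up numbers:

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# H0: mu >= 22 vs H1: mu < 22, with a (made-up) known sd.
mu0, sd, n, alpha = 22.0, 4.0, 50, 0.05
se = sd / n ** 0.5

# Reject H0 when the sample mean falls below this cutoff; the cutoff
# is chosen so the Type I error rate equals alpha.
cutoff = mu0 + std_normal.inv_cdf(alpha) * se

# Power if the true mean is actually 21 (a value under H1): the
# probability the sample mean lands below the cutoff.
power = std_normal.cdf((cutoff - 21.0) / se)
print(round(power, 2))
```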

Just wanted to explain the p-value in a very simple way.

Idea: the p-value is the smallest significance level at which you would reject the null hypothesis given the observed data. You reject the null hypothesis if and only if the p-value is smaller than a pre-determined significance level (usually 0.05).

Example: you want to test whether the average height of the 550 boys at your elementary school exceeds 5'. In this case, you have the null and alternative hypotheses:

H_0: h = 5'; H_1: h > 5'

Then, you collect data and get 6.5' as the average height of the boys (pretty tall!). But does it mean…
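A sketch of the p-value calculation for a one-sided z-test like this, using only the standard library; the population sd and the (more modest) sample mean below are made-up numbers:

```python
from statistics import NormalDist

# H0: h = 5' vs H1: h > 5'. Suppose (made-up numbers) the population
# sd is 0.5' and the 550 boys average 5.1'.
h0, sd, n, xbar = 5.0, 0.5, 550, 5.1
z = (xbar - h0) / (sd / n ** 0.5)

# p-value: the probability, under H0, of seeing a sample mean at
# least this large.
p_value = 1 - NormalDist().cdf(z)
print(p_value < 0.05)  # True -> reject H0 at the 5% level
```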

Data Analysis, Machine Learning & Causal Inference