Projects

	Nov 2019 - Dec 2019	Columbia University
	Movie Recommendation System
	Collected movie information and user rating data from IMDB, TMDB and GroupLens, and loaded them into Google BigQuery using SQL. Ran exploratory data analysis by R, gave insight into the data. Performed ssociation rule and Single Value Decomposition methods by PySpark, give out personalized recommended movies. Used HTML, Django and SVG to build a graphical user interface system, and deployed it on Google App Engine.
	Nov 2019 - Dec 2019	Columbia University
	Cooking Recipe Cuisine Type Prediction
	Collected cooking recipe data from Kaggle and performed exploratory data analysis by R, reveal the difference between each cuisine type. Used ‘Tokenizer’ to break list ingredients into individual words, then compared two feature extraction methods ‘Word2Vec’ and ‘TF-IDF’, found out ‘TF-IDF’ is better in capturing the hidden feature. Compared four machine learning methods SVM, logistic regression, decision tree and random forest by using `Pyspark`, and found out logistic regression gives the highest prediction accuracy.
	Oct 2019 - Oct 2019	Columbia University
	Twitter Steaming Data Analysis
	Used `PySpark` and TCP protocol to capture twitter streaming data for 10 minutes through API, and loaded the result into Google BigQuery. Used SQL, Django and SVG to build a interactive website gadget to visualize the exploratory analysis.
	Mar 2019 - Apr 2019	Columbia University
	Predictive Model For Down Syndrome Disease
	Used R to build a predictive model based on logistic regression to facilicate down syndrome diagnosis. Compared methods including and Pathwise Coordinate Descent with regularized logistic regression and smoothed bootstrap estimation, found out that Smoothed Bootstrap Estimation provides a more accurate classification result. Identified a subset of proteins that are significantly associated with Down syndrome using the best method.
	Dec 2018 - Dec 2018	Columbia University
	Model Selection For Better Cancer Diagnose
	Collected the cancer statistics dataset containing information from cancer.gov, census.gov and clinicaltrials.gov, performed different model selection strategies like forward selection, backward selection and LASSO with cross validation in R, selected a set of predictors that ‘best’ predicts cancer mortality.
	Nov 2018 - Dec 2018	Columbia University
	Data Science Job Marketing Analysis
	Launched exploration from a general trend finding through data from National Occupational Employment Status and Wage Estimates from the US Bureau of Labour Statistics. Found out most required skills and background majors for data-related jobs, and the difference in qualification between Fortune 500 and non-Fortune 500 companies. Used `shiny` and `flexdashboard` R package to create Dashboard and multiple visualization tools to represent our findings.
	Oct 2018 - Oct 2018	Columbia University
	Congestive Heart Failure Profile Analysis
	Launched exploratory data analysis to study the patient’s profile, found out basic pattern of his activity. Applied linear regression and ANOVA on the activity data by using R. Found out the patterns of physical activity over long periods of time and helped to improve dynamic update of patient status.
	Dec 2015 - May 2017	Shanghai Jiao Tong University
	Assessment of Zero-Inflated Model
	Proposed a biological network module-based zero-augmented generalized linear model for differential expression analysis. Conducted two simulation studies by using parallel computing on cluster to compare the efficacy between zero-inflated model, hurdle model and negative binomial model, found out zero-inflated Poisson model gave out best performance. Applied models to CPTAC colorectal cancer data set, and identified subtler protein pathways and modules that can not been detected by normal model.

Course Works

Algorithm	Leetcode	string [easy]	string [medium]
		linked list [easy]	linked list [medium]
		tree [easy]	tree [medium]
		sort and search [easy]	sort and search [medium]
		dynamic programming [easy]	dynamic programming [medium]
		design [easy]	design [medium]
		math [easy]	math [medium]
		other [easy]	other [medium]
Machine Learning	Kaggle	decision tree	random forest
		Pandas	XGBoost
		visualization	machine learning
Data Science	P8105 Data Science I	data wrangling	data manipulation
		tidying data	visualization
		group by	iteration
		linear model	shiny
Biostatistics	P8131 Biostatistical Methods II	HW1	HW2
		HW3	HW4
		HW5	HW6
		HW7	HW8
		HW9	HW10

Projects

Movie Recommendation System

Cooking Recipe Cuisine Type Prediction

Twitter Steaming Data Analysis

Predictive Model For Down Syndrome Disease

Model Selection For Better Cancer Diagnose

Data Science Job Marketing Analysis

Congestive Heart Failure Profile Analysis

Assessment of Zero-Inflated Model

Course Works

Algorithm

Leetcode

Machine Learning

Kaggle

Data Science

P8105 Data Science I

Biostatistics

P8131 Biostatistical Methods II