A common mistake when working on large-scale problems

Question:

I have a big data problem with a large dataset (take for example 50 million rows and 200 columns). The dataset consists of about 100 numerical columns and 100 categorical columns and a response column that represents a binary class problem. The cardinality of each of the categorical columns is less than 50.

I want to know a priori whether I should go for deep learning methods or ensemble tree based methods (for example gradient boosting, adaboost, or random forests). Are there some exploratory data analysis or some other techniques that can help me decide for one method over the other?

Answer:

Why restrict yourself to those two approaches? Because they're cool? I would always start with a simple linear classifier / regressor. So in this case a Linear SVM or Logistic Regression, preferably with an implementation that can take advantage of sparsity, given the size of the data.

It will take a long time to run a DL algorithm on that dataset, and I would normally only try deep learning on specialist problems where there's some hierarchical structure in the data, such as images or text. It's overkill for a lot of simpler learning problems, takes a lot of time and expertise to learn, and DL algorithms are very slow to train.

Additionally, just because you have 50M rows doesn't mean you need to use the entire dataset to get good results. Depending on the data, you may get good results with a sample of a few hundred thousand rows or a few million. I would start simple, with a small sample and a linear classifier, and get more complicated from there if the results are not satisfactory. At least that way you'll get a baseline. We've often found simple linear models to outperform more sophisticated models on most tasks, so you always want to start there.
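To make the "small sample plus linear classifier" advice concrete, here is a rough sketch in standard-library Python. It is not from the original answer: the function names (`reservoir_sample`, `to_sparse`, `train_logreg`, `predict`, `synthetic_rows`) are hypothetical, and the synthetic data with 3 numeric and 2 categorical columns is only a stand-in for the real 50M-row table. It draws a uniform sample from a row stream, one-hot encodes the categorical columns as sparse dictionaries, and fits a logistic regression with plain SGD to get a baseline number.

```python
import math
import random

def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    rng.shuffle(sample)  # remove any leftover stream ordering
    return sample

def to_sparse(numeric, categorical):
    """Encode a row as {feature: value}; one-hot stores only the active level."""
    feats = {f"num{i}": v for i, v in enumerate(numeric)}
    feats.update({f"cat{i}={c}": 1.0 for i, c in enumerate(categorical)})
    feats["bias"] = 1.0
    return feats

def train_logreg(data, epochs=5, lr=0.1, seed=0):
    """Logistic regression fitted with plain SGD over sparse dict features."""
    rng = random.Random(seed)
    data = list(data)
    w = {}
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, y in data:
            z = sum(w.get(f, 0.0) * v for f, v in feats.items())
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            for f, v in feats.items():
                w[f] = w.get(f, 0.0) - lr * grad * v
    return w

def predict(w, feats):
    """Predict class 1 when the linear score is positive (probability > 0.5)."""
    return 1 if sum(w.get(f, 0.0) * v for f, v in feats.items()) > 0 else 0

def synthetic_rows(n, seed=1):
    """Hypothetical stand-in for the real table: 3 numeric, 2 categorical columns."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        c = [rng.choice("abc"), rng.choice("xy")]
        score = 2.0 * x[0] - x[1] + (1.5 if c[0] == "a" else -0.5)
        y = 1 if score + rng.gauss(0, 0.5) > 0 else 0
        yield to_sparse(x, c), y

# Sample 5,000 rows out of a 50,000-row stream, then hold out 1,000 for testing.
sample = reservoir_sample(synthetic_rows(50_000), k=5_000)
train, test = sample[:4_000], sample[4_000:]
weights = train_logreg(train)
accuracy = sum(predict(weights, f) == y for f, y in test) / len(test)
print(f"baseline accuracy on held-out sample: {accuracy:.3f}")
```

On real data one would more likely use a library implementation with sparse-matrix support, such as scikit-learn's logistic regression or linear SVM, rather than hand-rolled SGD. The point of the sketch is only the workflow: sample first, get a linear baseline, and then decide whether anything heavier is worth the training time.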

Source: http://datascience.stackexchange.com/questions/2504/deep-learning-vs-gradient-boosting-when-to-use-what

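I was the same. When I first came across Deep Learning, my reaction was quite literally just WOW. Most of the results you see come from papers or challenges, and because those report only the results that turned out really well, it is easy to get carried away.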

So a mistake that is very easy to make early on is to assume that a large-scale classification problem must always be solved with Deep Learning or a boosting algorithm. That is a no-no.
