
> There is so much focus on the machine learning algorithms rather than getting data ready for the algorithms.

Generally, once a problem at work has been reduced to a "kaggle problem", it's trivially easy. The real problem is unstructured data: countless ways of encoding the same attribute, and plenty of leeway to build an unmaintainable data pipeline between the data-generation process and the model at the end.
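The "countless ways of encoding the same attribute" point can be made concrete with a toy sketch (column names and units are made up for illustration): three sources report the same temperature under different field names and units, and the cleanup step has to collapse them into one canonical column before any model sees the data.

```python
import pandas as pd

# Hypothetical example: the same attribute arrives from three sources
# under different names and units.
raw = pd.DataFrame({
    "source": ["a", "b", "c"],
    "temp_c": [21.5, None, None],
    "temperature_f": [None, 70.7, None],
    "tmp_kelvin": [None, None, 294.65],
})

def to_celsius(row):
    # Collapse the variants into one canonical unit (Celsius).
    if pd.notna(row["temp_c"]):
        return row["temp_c"]
    if pd.notna(row["temperature_f"]):
        return (row["temperature_f"] - 32) * 5 / 9
    if pd.notna(row["tmp_kelvin"]):
        return row["tmp_kelvin"] - 273.15
    return None

raw["temperature_c"] = raw.apply(to_celsius, axis=1)
```

In a real pipeline each of these unification rules tends to accrete special cases, which is exactly where the maintainability problem comes from.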



Not all Kaggle problems are created equal. Some look like a training matrix, a single target, and a test matrix.

Others are far more complex and start with much messier data and/or complex formulations.

Examples:

- www.kaggle.com/c/nips-2017-non-targeted-adversarial-attack/
- www.kaggle.com/c/the-allen-ai-science-challenge


I disagree that a "kaggle problem"-style problem is trivially easy, but I strongly agree with the sentiment that dealing with unstructured data is often a much bigger, deeper, and broader problem than the choice of a particular algorithm or ensemble of algorithms.

The ability to efficiently and effectively derive insights from such data is scarce.


Right, by "kaggle problem" I mean the general case where we roughly know what we want on the right-hand side of the model we're going to run (plus or minus some feature engineering, model choice, and other hyperparameter specification).
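Once you're at that point, the problem really does have a standard shape: a feature matrix X (the right-hand side) and a target y, and the remaining choices are which features and which estimator. A minimal sketch with synthetic data (all values made up for illustration):

```python
import numpy as np

# Once the data is structured, the "kaggle problem" shape is just a
# feature matrix X and a target vector y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # three hypothetical engineered features
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

# Fit a plain least-squares model (with an intercept column).
design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Everything upstream of building X and y, i.e. the unstructured-data work, is where the bulk of the effort goes.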



