Companies already do pay close to $200k/year for entry level data scientists. *(...

bearmf · on April 28, 2012

The problem as I see it is that most companies are looking for the all-in-one perfect candidate. There is indeed a shortage of such people.

Say you need someone who knows a lot about Hadoop and Amazon EC2, is intimately familiar with most learning algorithms, and has a PhD. You are having trouble finding the guy. You start crying about "the big data talent shortage".

And here is the problem. Most PhDs have no experience with Hadoop or Amazon EC2. Some of them might know Java well enough.

Now, consider a smart guy with a PhD who knows Java and has done something parallel with it, working on real "dirty" data. He can pick up Hadoop in no time from your software engineers. He will learn to tweak and optimize in time - that is domain specific and cannot be learned off the job.

Will he be hired? Probably not. But people will keep crying about shortage.

Tichy · on April 28, 2012

How hard can it be, though? Like taking a normal CS person and making them versatile with Hadoop and so on? Could it be done for $20K?

yummyfajitas · on April 28, 2012

Making a CS person versatile with Hadoop is not that hard. Making a CS person versatile with statistics is much harder.

See Zed Shaw's seminal article "Programmers Need To Learn Statistics Or I Will Kill Them All".

http://www.zedshaw.com/essays/programmer_stats.html

Making a math/science person versatile in CS is somewhat easier, but even that can be tricky. Many of them are bored by file formats, architecture, etc., and simply don't have the engineering mindset.

achompas · on April 28, 2012

> How hard can it be?

Very hard. You run into all types of candidates who just aren't there yet: people working on research that's irrelevant to real-world applications, people who have done data analysis/BI work and brand themselves as "data scientists," those who have the pedigree but cannot process and explore real-world data, those who have good analytical chops but not the distributed or advanced modeling experience, etc.

I've witnessed it first-hand, and it's tough to find the right person.

bearmf · on April 28, 2012

If it is that hard the bar is probably set too high. Most of the skills are learned on the job after all. Most smart PhDs who can program well and have sound knowledge of statistics can learn to do this stuff.

achompas · on April 28, 2012

Given enough time, anyone smart enough to finish a PhD can acquire a set of skills. :)

But it's more than just solid statistics. We're talking about having enough mathematical fluency to develop models rigorously (not just "oh, we'll minimize MSE!!"), test those models, then implement those models--possibly using a distributed algorithm.

From what I hear, these skills take years to develop. Choosing to groom the wrong person is an extremely costly mistake, so making the choice is difficult.

bearmf · on April 29, 2012

All mathematics consists of rigorous models. But choosing and tweaking a model is more of an art. Most data scientists apply existing models to new data, they do not develop new ones.

I am sure it takes much less than "years" for any smart PhD in applied mathematics to learn most of data analysis tricks. It is not theoretical physics after all.

achompas · on April 29, 2012

> Most data scientists apply existing models to new data, they do not develop new ones.

I meant "develop" in the software sense. Data scientists use off-the-shelf libraries during initial research, but those libraries usually lack an important feature preventing them from going into production (typically, no support for concurrency).

> I am sure it takes much less than "years" ... to learn most of data analysis tricks.

I used to be cynical about "data science," too. After four months of working on a data science team, though, I'm a believer.

A data scientist is really a "full-stack data developer." He or she needs the ability to work with advanced models, use them to analyze large amounts of data, and modify those models to work concurrently or in a distributed system if desired (and it's often desired). It's more than just "analysis tricks."

pnathan · on April 28, 2012

> do statistics on TBs worth of data, derive useful conclusions

That's gonna be the hard part. Most CS people I've met flee from math and, more generally, theory.

groth · on April 28, 2012

They do? Which companies? How do I find them? :p

yummyfajitas · on April 28, 2012

Build a demo project showing you can do data analysis and they will find you.

earl · on April 28, 2012

Who is paying $200k for entry level besides maybe Google?

anothermachine · on April 28, 2012

Startups.*

*Equity value, may vary unpredictably.

And it's entry post-PhD, not entry from college.

earl · on April 28, 2012

So $120k and (very expensive) lottery tickets is what you're saying =P

reinhardt · on April 28, 2012

$120K as a startup employee? Damn, I live in the wrong country.