Companies already do pay close to $200k/year for entry-level data scientists.
(What other kind of scientists are there? The tea-leaf-reading kind?)
"Data scientist" refers to the guy who can set up a Hadoop cluster, do statistics on TBs worth of data, derive useful conclusions, and speed it all up by tweaking the low-level data formats or micro-optimizing the calculation.
The issue is rarely paying these guys an extra $20k, it's simply finding them.
Setting up some lasers and a photonic crystal, imaging the output, making a graph in Excel or MATLAB and drawing conclusions is a different skillset. Someone who can do the latter is a scientist who uses data, but he is not a data scientist.
The problem as I see it is that most companies are looking for the all-in-one perfect candidate. There is indeed a shortage of such people.
Say you need someone who knows a lot about Hadoop and Amazon EC2, is also intimately familiar with most learning algorithms, and has a PhD. You are having trouble finding the guy. You start crying about "the big data talent shortage".
And here is the problem. Most PhDs have no experience with Hadoop or Amazon EC2. Some of them might know Java well enough.
Now, consider a smart guy with a PhD who knows Java and has done something parallel with it, working on real "dirty" data. He can pick up Hadoop in no time from your software engineers. He will learn to tweak and optimize in time; that part is domain-specific and cannot be learned off the job.
Will he be hired? Probably not. But people will keep crying about a shortage.
Making a math/science person versatile in CS is somewhat easier, but even that can be tricky. Many of them are bored by file formats, architecture, etc., and simply don't have an engineering mindset.
Very hard. You run into all types of candidates who just aren't there yet: people working on research that's irrelevant to real-world applications, people who have done data analysis/BI work but brand themselves as "data scientists," those who have the pedigree but cannot process and explore real-world data, those who have good analytical chops but not the distributed or advanced modeling experience, etc.
I've witnessed it first-hand, and it's tough to find the right person.
If it is that hard, the bar is probably set too high. Most of the skills are learned on the job after all. Most smart PhDs who can program well and have sound knowledge of statistics can learn to do this stuff.
Given enough time, anyone smart enough to finish a PhD can acquire a set of skills. :)
But it's more than just solid statistics. We're talking about having enough mathematical fluency to develop models rigorously (not just "oh, we'll minimize MSE!!"), test those models, then implement those models--possibly using a distributed algorithm.
From what I hear, these skills take years to develop. Choosing to groom the wrong person is an extremely costly mistake, so making the choice is difficult.
All mathematics consists of rigorous models. But choosing and tweaking a model is more of an art. Most data scientists apply existing models to new data, they do not develop new ones.
I am sure it takes much less than "years" for any smart PhD in applied mathematics to learn most data analysis tricks. It is not theoretical physics after all.
> Most data scientists apply existing models to new data, they do not develop new ones.
I meant "develop" in the software sense. Data scientists use off-the-shelf libraries during initial research, but those libraries usually lack an important feature preventing them from going into production (typically, no support for concurrency).
> I am sure it takes much less than "years" ... to learn most data analysis tricks.
I used to be cynical about "data science," too. After four months of working on a data science team, though, I'm a believer.
A data scientist is really a "full-stack data developer." He or she needs the ability to work with advanced models, use them to analyze large amounts of data, and modify those models to work concurrently or in a distributed system if desired (and it's often desired). It's more than just "analysis tricks."
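To make "modify those models to work concurrently" a bit more concrete, here is a minimal sketch of the idea, not anything from a real production system. It fits a straight line by least squares, but restructured so each chunk of data is reduced to a few sums independently and the partial results are merged at the end. The function names (`partial_stats`, `merge`, `fit_line`) and the toy data are made up for illustration, and the thread pool is only a stand-in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partial_stats(chunk):
    """Map step: sufficient statistics for one chunk of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Reduce step: statistics from two chunks combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(chunks, workers=4):
    """Fit y = intercept + slope * x by least squares over chunked data."""
    # Each chunk is summarized independently; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(partial_stats, chunks))
    n, sx, sy, sxx, sxy = reduce(merge, stats)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical toy data: y = 1 + 2x exactly, split into three chunks.
data = [(x, 1 + 2 * x) for x in range(12)]
chunks = [data[0:4], data[4:8], data[8:12]]
intercept, slope = fit_line(chunks)
```

The point is that the model itself is unchanged; what changes is that the computation is rewritten in terms of quantities that can be added across machines, which is the part off-the-shelf single-machine libraries often leave to you.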