> NOTE: there are people in the world who would laugh at my definition and say that big data starts at 1 PB.
I commend them for having a larger penis ^H^H^H^H^H^H data stack than you.
I thought big data was less about the actual size of the data store and more about where it comes from (typically passive collection from user activity) and how it's accessed (through some kind of large map-reduce style framework) and used (to inform product decisions or learn more about human behavior)?
From the technology standpoint, where it comes from and how it's used don't really make a difference - if it can be processed on a single very beefy machine when done properly, then the appropriate/efficient way to work with this data is to avoid big data techniques.
If it cannot, then you pay the price of all the complexity and overheads of big data processing techniques so that you can get your processing done.
It's correlated with data size, but not so strictly - you can get, for example, NLP problems where you need a painful pipeline split over a huge cluster for a single GB of input data, and you can have problems where the best way to process a petabyte dataset is just to stick with a single powerful machine to get the performance benefits of locality and low latency, and avoid managing splits/failed nodes/whatever.
So in the first problem you would need Big Data techniques and in the second you wouldn't: the second isn't really a big data problem, and recommendations on how best to handle it won't help people who do need big data processing.
Yeah, for everyone but physicists it's really "big enough" data: it's a big enough data set that you've started recording things you didn't even try to record.
An excellent example was on HN the other day, using the NYC taxi data to determine which drivers are observant Muslims. It's not something anyone set out to record, but the data set has gotten so large that if you turn it sideways and shake, random facts like that fall out.
Do you think that when people make 1000+ table relational databases it's because a) it's fun b) they're stupid or c) because it's modelling something that is inherently complex?