Big Data is going to be for all of us according to ClearStoryData


Earlier tonight I was thinking about SeekingAlpha’s illuminating article warning about the hype over Big Data, just before I attend GigaOM’s Structure Data this week. I’m looking forward to hearing what the experts on Big Data (and Big Analytics) are forecasting, while keeping in mind the reality that there aren’t enough Data Scientists to work on Big Data with the tools we have in place today.

The gap between the skills needed and what the average analyst, let alone the average user, can muster is too wide. This much is certain, according to the New York Times.

So, what is the answer? According to this morning’s Bits/New York Times article, it’s ClearStory. The article says:

ClearStory, which was formed last summer, is making software for ordinary business professionals. These customers will be able to blend their own corporate data with the large amounts of publicly available data, in search of new statistical insights. There just aren’t enough statisticians, the company figures, to address the demand corporations will have for Big Data.

I think that’s true.

Think about it, and keep in mind the SeekingAlpha story I linked to above. Here are some quotes from it that I found relevant to this discussion.

  • “..Any industry that sees an immediate payoff from analyzing data has been doing it at least for the last 20+ years, if not earlier.” AND “…It’s clear that collecting and analyzing large quantities of data is not really a new phenomenon.”
  • “…data collection and analytics is quite mature in a number of industries. The low hanging fruits in the form of industries that benefit most from data analytics are already doing it.”  For there to be untapped value, one would then have to look for industries which a) collect data and b) don’t know the value of the data they collect and c) don’t know how to analyze the data they are collecting. It’s a pretty steep hurdle.
  • “…Despite the brouhaha, most data analytics work does not particularly necessitate a very high level of knowledge. The backbone of the industry is a recently graduated and poorly paid STEM graduate sitting in Bangalore with a few months of “data analytics” training. Those Ph.D.’s in math from MIT? Merely used to reel in the sales.”
  • “….any reasonably competent person is able to perform tasks that a few years ago would have taken an entire team of analysts. That’s also the reason why vendors can mint “Analytics Experts” from STEM graduates with a few weeks of training.”
  • As a result …”….Given the low barriers to doing the analysis in house, I don’t see all these companies like IBM and Opera selling data analytic services for the long-term, especially at $100-200 per hour.”
  • “…There are a staggering number of industries with a staggering number of processes in each industry. It’s not possible for an IBM or EMC to have experts in every industry. Consulting firms by nature have people who are not experts in one industry but can be moved from project to project and industry to industry. “
  • “….Once you have a statistically sufficient amount of data, more data is of steeply diminishing value. A very big data set is not really that more valuable than a sufficient and representative data set, since both of them will be analyzed using sampling techniques, yielding very similar results.”
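
The diminishing-returns point in that last quote can be made concrete with a back-of-the-envelope calculation. This sketch is mine, not the SeekingAlpha article’s: it assumes simple random sampling from a population with a known standard deviation, and the function name `standard_error` and the value of `SIGMA` are made up purely for illustration. Under those assumptions, the standard error of a sample mean shrinks with the square root of the sample size.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean for a simple random sample of size n.

    Assumes independent draws from a population with standard deviation sigma.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sigma / math.sqrt(n)


# Hypothetical population spread, chosen only for illustration.
SIGMA = 50.0

# Each step multiplies the data volume by 100 but shrinks the error only by 10.
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  standard error = {standard_error(SIGMA, n):.4f}")
```

In this simplified setting, going from ten thousand rows to a million rows improves the precision of an average by only a factor of ten. That covers sampling error only; it says nothing about rare events or unrepresentative data, where more rows can still matter.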

This set of statements from the SeekingAlpha post is particularly illuminating, but also troubling where Big Data is concerned:

  1. Statistical techniques work best on what is known as coherent data sets.
  2. Yet, different pools of data in a company are usually different for a functional reason.
  3. The notion that all cross functional data within organizations, along with voicemail, email, chats and web stats can be linked for extra value is way overblown.


The SeekingAlpha article cites examples of areas where Big Data has just taken hold, and makes some related predictions:

  1. Law firms’ revenues have been destroyed by revolutionary data analytics technology, which has replaced law associates with software, leading savvy clients to pay very much less for the same work. Where data analytics has been revolutionary, the results have not been as expected, i.e. with law firms.
  2. The relatively high pay for statistical skills in the U.S. is more a reflection of the American aversion to such careers than indicative of exceptional training, skills, talent or value they bring. Crack open the visa gates a bit and you’ll have a zillion statisticians from India, Russia, China and Philippines and the pay will come crashing down.
  3. According to the article – “...Data Analytics (Big Data) services will quickly become a commodity and prices will race to the bottom. After all, there is really no barrier to entry. Buy a statistical package or two, hire a few programmers in Bangalore, a couple of ex-cheerleaders for sales and you are in business. Any existing software services company should be doing this, if they are not already doing this.”

Now think back to the ClearStoryData site, and what it claims is the future of Big Data Analytics:

“We’re about making data consumable,” says Sharmila Shahani-Mulligan, ClearStory’s founder and chief executive. “The world is talking about the size of Big Data sources, but at the end of the day it will be about the ease of consumption.” Indeed, the McKinsey Global Institute has projected that the United States needs up to 190,000 more people with “analytical expertise” than it has.

“There aren’t enough experts at companies to handle all these data sources,” Ms. Shahani-Mulligan says. “Brick-and-mortar stores in the Midwest can’t get the kind of data scientists LinkedIn has. We can give them that.” The company offers a range of algorithms to wrangle the data. Longer term, ClearStory hopes to get companies to start sharing more of their own internal data too, what Ms. Shahani-Mulligan calls “the gold locked up in 30 years of relational databases.”

So there you have it: Big Data is “too big”. It won’t scale unless the platform tools allow the average Joe and Jane with some analytics skill to work with it, much as Google Analytics took something complicated (and it basically still is) and gave the average Joe and Jane the feeling that they are all “Web Analysts” (even if they really aren’t).

And now I want to share something else about my own experience as a Web Analyst and Social Media Analyst.

  • Often, at the companies where I worked, I found that the Analytics work I did (back then, 5+ years ago) wasn’t particularly advanced, and that most of the reports I produced for various groups in IBM were circumscribed by the tools and platforms I used (at the time, mostly IBM Surfaid, which no longer exists as far as I know). In other words, what I actually did was about 50% mastery of a platform (Surfaid) and 50% understanding how to work through a stakeholder’s questions and create meaningful reporting (to the extent that meaningful reporting could take place, given the various asks of stakeholders and the information that could be gathered via the platforms we had at our disposal within IBM).
  • In other kinds of Analyst work, particularly at MARCOM-type companies, I found that very little Analytics was actually required, or in fact possible, given the kind of reporting that Public Relations firms can understand or deliver. Analytics was totally submerged in “Storytelling”. Whereas at IBM and in other jobs reporting was still limited by the platforms, in the PR space the only reporting you could do at all was through the platforms, mainly Radian6 and Sysomos (these days), and therefore what the platforms presented to you as information (based on your queries) was pretty much all you had/have to work with. In other words, you could triangulate other data (Wikipedia, Gigya, Facebook, Twitter and the like) along with a listening platform, but at the end of the day, all you actually did was make up a story and then find some data to message around it.

These articles make me think about just how much Analysis depends on the available tools: if you didn’t have the right tools (and data), it was hard to do any kind of analysis that meant anything, really. But the other part, thinking through how to present and arrange the information, is the part I found the most difficult, and it is what this post is really about.

SeekingAlpha’s article warning about the hype over Big Data brought up the two basic kinds of Analytics, and the new, simpler platforms for manipulating Big Data show promise of bringing some changes:

  1. “….One type is purely statistically based analytics, which require little or no industry knowledge. These utilize various statistical techniques to identify trends, events and relationships. Finding those Target customers who look for maternity clothes and/or are pregnant, falls under this category.”
  2. “….The second type is model-based analytics. In this type of analytics, an industry expert develops a model for a particular process and the model is fed with data to calibrate the model. Various parameters are added and removed to fine tune the model. Finally, the calibrated model is used for predictive purposes. The most valuable type of analytics is this second type, because it directly provides information on a particular process, allowing it to be optimized and improved.”
      • ……skeptical that outside companies can really step in and extract any really deep insights of the second kind without deep subject matter expertise.”
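
To show what “calibrating a model” means in the second type, here is a deliberately tiny sketch of my own, not anything from the SeekingAlpha article. It assumes the expert has already decided that the process is roughly linear; the data, the variable names (`store_hours`, `daily_sales`) and the helper functions are all hypothetical. Ordinary least squares is used only because it is the simplest calibration method, and a real process model would have more parameters and much messier data.

```python
from statistics import mean


def calibrate_linear_model(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Use the calibrated parameters to predict y for a new x."""
    return intercept + slope * x


# Made-up, noise-free observations: opening hours per day vs. daily sales.
store_hours = [8, 10, 12, 14]
daily_sales = [210, 250, 290, 330]

intercept, slope = calibrate_linear_model(store_hours, daily_sales)
print(f"calibrated model: sales = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted sales at 16 hours: {predict(intercept, slope, 16):.1f}")
```

The arithmetic is the easy part. Knowing that opening hours is the right input, and that a straight line is a sensible shape for the process, is where the subject matter expertise comes in.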

Now, I’ll argue that most jobs ask for skills that lead to the first kind of Analytics, whereas what they really need, and lack, is the second type of Analytics and the second type of Analyst: essentially a subject-area expert who also knows how to use the tools and platforms at his or her disposal. This was totally the case with several companies I have come across. Not one truly understood the issue; most mixed up both types and made entirely unrealistic projections based on hype, and still do, because for the most part, particularly in the MARCOM space, they don’t understand data but do understand story and hype, and they use the data to support the stories they weave for their clients. This wasn’t Analytics at all. It was more a form of storytelling, and it was the basis of the stern warning I put into Chapter 10 of my book.

The reality I see is that platforms and tools (and the knowledge/mastery of them), together with deep industry knowledge (along with a good amount of storytelling), are what most Analytics work requires, and most of that doesn’t require anything close to the Analytics and Programming skills that are being asked for almost routinely. I won’t say those skills are never used; that is exactly the work of Statisticians and Data Scientists. But that’s not what you see in most Analyst work, and it’s not those people who are doing the work for clients (in most cases, the Data Scientists are being used for other things, not so much reporting or even insights work, as the SeekingAlpha article pointed out; see below).

“……Those Ph.D.’s in math from MIT? Merely used to reel in the sales. They are usually far too few and expensive to actually work on the nitty gritty of client data. It’s the law firm associate model again. The employee churn at data analytics firms is about 30-40% a year. There are no pools of deep institutional knowledge at data analytics firms.”

Nuff said on these points; I hope I managed to get them across. Platforms that work with “Big Data”, such as ClearStory, will open up the possibility for Analysts of the second type to use Big Data and produce actionable analytics. They will still be somewhat circumscribed by the platforms they can master, but at least they’ll be able to work with a much richer palette.

Anyway, join me at StructureData on Wednesday and Thursday – there’s also a discount if you click on the link in my sidebar.

On Wednesday I’m looking forward to:

8:40 AM

Structuring Decisions from Unstructured Data

What’s the holy grail of big data? Cracking the code of mining unstructured data. Text and otherwise, unstructured data represents the majority of the big data universe’s “dark matter.” But what approaches will work? What are the real benefits, and how can your enterprise start making sense of the vastness of the data? Hear from leaders in the field with real-world approaches, and thoughts about the future of technologies for unstructured data.

Moderated by: Seth Grimes – Principal Consultant, Alta Plana
Speakers: Jason Hunter – Deputy CTO, MarkLogic; Paul Speciale – VP, Products, Amplidata; Staffan Truve – CTO and Co-Founder, Recorded Future


10:00 AM

Smart Tools: Dissect, Digest and Deliver on Big Data

No one really needs an army of IT analysts. A new generation of tools is empowering business users of all abilities to derive value from big data, one digestible bite at a time. Intuitive interfaces on affordable and powerful cloud services mean that the right tools can be effective for the jobs at hand.

Speakers: Rachel Delacour – CEO and Co-Founder, We Are Cloud


12:45 PM

The Trillion-Row Spreadsheet™

We consider the question: What if your spreadsheet could handle a trillion rows? This clarifies the distinction between which aspects of big data problems can be solved by better software engineering within an existing paradigm — and which problems need a new paradigm for data management. We also consider some of the implications of adopting a spreadsheet metaphor for managing large amounts of data.

Speakers: Robert Lefkowitz – Director of Web Development, 1010data

2:50 PM

Realizing Real-Time Value on the Real-Time Web

To date, big data has been mostly focused on batch and data science workloads. This is about to change with the advent of the real-time web. We’ll discuss why the next big sea-change in big data will be focused on real-time analytics, and why this is critical for delivering compelling user experiences based on consumer intelligence.

Speakers: Todd Papaioannou – Founder and CEO, Continuuity


On Day 2 (Thursday)

12:05 PM

Applying Mars Mission Rocket Science to Real-Time Decision-Making

This session will trace how a combinatorial programming language developed by a team of MIT scientists to power NASA’s Mars Mission project has successfully been repurposed for an unlikely commercial use – digital advertising. Learn how the original inventor of this programming language applied “signals and systems” thinking to facilitate a digital advertising platform that uses scalable, intelligent machine-learning techniques for big data analytics and automated real-time decision making.

Speakers: Dr. Bill Simmons – CTO and Co-Founder, DataXu

4:40 PM

Mining the Mobile Data Deluge

The most successful complex technology of all time is the mobile phone. As everyone on the planet gets a phone, connects to a network and creates data, we sink deeper into an ocean of data. This ocean of data represents a huge opportunity for those willing to submerge into its depth and fish for the insights. We talk to the most innovative new thought leaders in this space about how they are creating new values from the insights they generate and what is still left to explore in the uncharted depths of the mobile data ocean.

Moderated by: Ryan Kim – Staff Writer, GigaOM
Speakers: Michael Driscoll – CTO, Metamarkets; Raj Aggarwal – CEO and Co-Founder, Localytics


I’ll try to take in as much of the content here as I can, while admitting that some of it will be over my head and, I bet, over the heads of most of the attendees.

But it doesn’t change the fact that only with a set of tools that makes it easy to work with Big Data will the benefits actually reach the many other kinds of businesses and enterprises that need it, despite all the hype to the contrary.

5 thoughts on “Big Data is going to be for all of us according to ClearStoryData”

  1. I agree with  “A very big data set is not really that more valuable than a sufficient and representative data set, since both of them will be analyzed using sampling techniques, yielding very similar results.”   As a local social media strategist my data stream is quite small.  Tools like radian6 are not useful as they cannot register the small amount of noise my data streams produce on a local level.  I do find that the data I collect is significant and I use basic techniques and work with limited  data.  

