Book Review: Big Data Demystified

Last year, our economics department launched a data analytics minor program. The first class is a simple 2-credit course called Foundations of Data Analytics. Originally, the idea was that liberal arts majors would take it and that this class would be a soft, non-technical intro to terminology and history.

However, it turned out that liberal arts majors didn’t take the class and that the most common feedback was that the class lacked technical challenge. I’m prepping to teach the class and it will have two components. The first is a Python training component where students simply learn Python. We won’t do super complicated things, but they will use Python extensively in future classes. The second component is still in the vein of the old version of the course.

I’ll have the students read and discuss “Big Data Demystified” by David Stephenson. He spends 12 brief chapters introducing the reader to the importance of modern big data management, analytics, and how it fits into an organization’s key performance indicators. It reads like it’s for business majors, but any type of medium-to-large organization would find it useful.

Stephenson starts with some flashy stories that illustrate the potential of data-driven business strategies. For example, Target Corporation used predictive analytics to advertise baby and pregnancy products to mothers who didn’t even know that they were pregnant yet. He whets the appetite of the reader by noting that the supercomputers that could play chess or Go relied on fundamentally different technologies.

The first several chapters of the book excite the reader with thoughts of unexploited potentialities. This is what I want to impress upon the students. I want them to know the difference between artificial intelligence (AI) and machine learning (ML). I want them to recognize which tool is better for the challenges that they might face and to see clear applications (and limitations).

AI uses brute force, iterating through possible next steps. There are multiple online tic-tac-toe AIs that keep track of their records. If a student can play the optimal set of strategies 8 games in a row, then they can get the general idea behind testing a large variety of statistical models and explanatory variables, then choosing the best.
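To make the brute-force idea concrete, here’s a minimal sketch in Python of a tic-tac-toe player that simply checks every possible continuation before moving. This is my own illustration, not something from the book: the board encoding (a 9-character string) and the function names `winner`, `value`, and `best_move` are assumptions I made for the example.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board whose cells are indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def other(player):
    return "O" if player == "X" else "X"


@lru_cache(maxsize=None)
def value(board, player):
    """Outcome under perfect play, from X's point of view.

    +1 means X wins, 0 means a draw, -1 means O wins.
    `player` is whoever moves next on `board`.
    """
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if " " not in board:
        return 0
    # Brute force: try every empty cell and recurse on the resulting board.
    scores = [value(board[:i] + player + board[i + 1:], other(player))
              for i, cell in enumerate(board) if cell == " "]
    return max(scores) if player == "X" else min(scores)


def best_move(board, player):
    """Pick the empty cell that leads to the best outcome for `player`."""
    moves = [i for i, cell in enumerate(board) if cell == " "]

    def score(i):
        return value(board[:i] + player + board[i + 1:], other(player))

    return max(moves, key=score) if player == "X" else min(moves, key=score)


print(value(" " * 9, "X"))           # 0: perfect play from an empty board is a draw
print(best_move("XX OO    ", "X"))   # 2: X completes the top row
```

The point for students is that nothing here is learned. The program is only as good as its exhaustive search, which works for tic-tac-toe because the game is tiny.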

But ML is responsive to new data, according to what worked best on previous training data. There are multiple YouTubers out there who have used ML to beat Super Mario Brothers. Programmers identify an objective function and the ML program is off to the races. It tries a few things on a level, and then uses the training rounds to perform quite well on new levels that it has never encountered before.
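A Mario-playing agent is too big for a blog post, but the core loop of "pick an objective function, improve on training data, then face new data" can be sketched in a few lines. The example below is my own toy illustration, not the book’s and not what those YouTubers used: it fits a straight line by gradient descent, with mean squared error as the objective. The names `fit_line` and `predict`, the learning rate, and the made-up training data are all assumptions.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the objective (mean squared error) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Nudge the parameters in the direction that lowers the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return w * x + b


# Made-up training data that follows y = 2x + 1.
train_x = [0, 1, 2, 3, 4]
train_y = [1, 3, 5, 7, 9]

w, b = fit_line(train_x, train_y)

# An input the model never saw during training.
print(round(predict(w, b, 10), 3))
```

The programmer never tells the program that the slope is 2. They only define what "good" means, and the training rounds do the rest.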

There are a couple of chapters in the middle of the book that didn’t appeal to me. They discuss the question of how big data should inform a firm’s strategy and how data projects should be implemented. These chapters read like they are written for MBAs or for management. They were boring for me. But that’s ok, given that Stephenson is trying to appeal to a broad audience.

The final chapters are great. They describe the limitations of big data endeavors. Big data is not a panacea and projects can fail for a variety of what are very human reasons.

Stephenson emphasizes the importance of transaction costs (though he doesn’t say it that way). Medium-sized companies should outsource to experts who can succeed (or fail) quickly such that big capital investments or labor costs can be avoided. Or, if in-house staff will be hired instead, he discusses the trade-offs between using open source software, getting locked in, and reinventing the wheel. These are a great few chapters that remind the reader that data scientists and analysts are not magicians. They are people who specialize and can waste their time just as well as anyone else.

Overall, I strongly recommend this book. I kinda sorta knew what machine learning and artificial intelligence were prior to reading, but this book provides a very accessible introduction to big data environments, their possible uses, and organizational features that matter for success. Mid- and upper-level managers should read this book so that they can interact with these ideas prudently. Those with a passing interest in programming should read it for greater clarity and to get a better handle on the various sub-fields. Hopefully, my students will read it and feel inspired to be on one side or the other of the manager-data analyst divide with greater confidence, understanding, and a little less hubris.