The Curse of Simplification
A short, non-technical narrative on complexity in Data Science projects and how to communicate it. Essentially, on problems that cannot be solved even in principle, and how to live with them. An illustrative example will probably do in place of a proper introduction.
Your football club played a decisive match just a day ago. The match ended in a draw; no goals were scored at all. A week ago, they were defeated badly by the same opponents who scored 3:0 at home. They made you feel miserable, and it took some time to talk out the emotions with your mates in a local pub. And now that the bad feeling is gone, forgiveness and hope are on the rise again, together with the recovering part of your identity still tightly coupled with your team's colours, what do you start doing next? Almost as a rule: you start analysing. Probably quite unaware of it, you're already in the Data Science business; more precisely, your inner, intuitive statistician is.
Let's see. The coach didn't lay out the strategy well. There was absolutely no need to have two men chasing the tail of the opponent's dangerous striker just for the sake of keeping him out of play for all 90 minutes. Couldn't we have played at least a bit more offensively? The new kids in the midfield were confused and kept the ball for too long. They were simply not experienced enough for such an important match; it was probably not their skill but their attitude that prevented them from getting more involved and keeping the ball moving. Finally, last year they paid half of the nation's GDP to get that Brazilian guy who kept on falling in the penalty box. Such a shame - it always ends up like that with those who earn their fame too early! And yes, the left winger could have just stayed at home; he would have remained just as invisible as he was during the match. They should have played 3-4-3! In the meantime, in the parallel universes of other pubs, tens of thousands of your countrymen share literally hundreds of thousands of similar explanatory narratives in the search for relief. Some think, contrary to your opinion, that they would have won if they had played 3-5-2 - or at least if they had done so during the whole course of the second half...
Ask yourself: could it be that all your explanations hold true at the same time? Logically, no. Our world, however complex it is, has an underlying causal structure, and one that is not always immediately accessible to us (well, philosophically: it is never immediately accessible to us). At least in the Laplacian world - the one with a structure more than good enough to describe the cause-effect relations that are of practical importance in our everyday lives - there are no two, three, or four causal explanations for an event. There is always, under any circumstances, only one causal explanation that holds true. Only one! But we are somehow able to come up with many narratives that each seem to explain the same event. Sometimes, we call such a narrative a viewpoint. Well, let me disappoint you: a narrative doesn't make an explanation just because it is a product of one's mind.
The naked truth, No. 1: (a) your team played an extra two matches in the national cup this season and a charity friendly only a month ago; they were too tired, and the responsibility lies solely with the club's management; (b) the left winger was invisible because of the confusion in the midfield, while he himself was perfectly prepared for the match; (c) the kids in the midfield were really bad - turns out that you were right about that one; (d) the coach opposed the managerial decision to play the aforementioned friendly match, but they wouldn't listen; he has also been contemplating leaving to coach abroad for at least a year or so already, a fact of which the management was never aware; with the match already lost in his mind even before it started, he decided to experiment a bit; (e) finally, the Brazilian striker learned just yesterday that his supermodel girlfriend is having an affair with an Argentinian golf star; tormented by this thought, he couldn't get himself together to provide a decent performance, and kept falling deliberately, in the hope of provoking a penalty just for the sake of not letting you down completely. Of course this is all made up, and presents a hypothesis at best - but how do you tell whether it holds true or not?
The naked truth, No. 2: the world - especially the social world - is too complex for any model to encompass all of the causal factors (aka: variables, features, traits) that are needed to provide a full explanation, prediction, or the power to manipulate the events in it successfully. In the early days of the 20th-century Cybernetics movement, people were very aware of this fact. You have probably heard rumors of the End of Theory* that should somehow take place with the advent of Big Data and the IoT. Presumably, we will soon have access to such an immense amount of data from social interactions online, online sensors, and other sources, that we will be able to predict the future by simply doing number crunching on a purely empirical basis, without any need for some "explanatory model" to rely on. Of course - if you have a proper background in mathematical statistics - you are already aware of the mountain of ignorance present in such claims... Nothing really delivers in Data Science without some model underneath it; the only question is how much one chooses to focus on it and its interpretation. On the other hand, the claim that all the relevant data will ever be made available to us - given all the networks, sensors, and the semantics of social interactions under analysis - only proves that someone seriously lost her or his mind during the very initial sketch of the idea. Take a closer look at the examples above and let me know what methodology exactly you propose for collecting the data on all relevant events. You can e-mail your suggestions to email@example.com.
The moral of the story. First, there are always models and explanations; the fact that sometimes in Data Science we look for a model that simply performs well and are less interested in its interpretation does not mean that we have not assumed any model at all. The model is there; whether you wish to take a closer look at it or not is up to you. Second, the fact that we work with a lot of data certainly does not imply that we have somehow reached the population of events where we do not need any mathematical statistics and models at all; any claim similar to this one should be considered seriously laughable. Size does matter, but Big Data cannot match pure intelligence under any circumstances.
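To make the second point a bit more tangible, here is a minimal Python sketch; it is a toy, not taken from any real project. It assumes a made-up population split roughly evenly on some yes/no question, a hypothetical "big" data set in which the yes-people are twice as likely to be recorded as the no-people, and a small random sample of 1,000. The recording probabilities 0.6 and 0.3, the sample sizes, and all names are illustrative assumptions.

```python
import random

rng = random.Random(7)  # fixed seed so the toy is repeatable

# Hypothetical population: about half say "yes" (1), half say "no" (0).
population = [1 if rng.random() < 0.5 else 0 for _ in range(200_000)]
true_rate = sum(population) / len(population)

# "Big" data: whoever happens to get recorded. Assumption: yes-people are
# recorded with probability 0.6, no-people with probability 0.3.
big_sample = [v for v in population if rng.random() < (0.6 if v == 1 else 0.3)]

# Small data: a plain random sample of 1,000 people.
small_sample = rng.sample(population, 1_000)

big_estimate = sum(big_sample) / len(big_sample)
small_estimate = sum(small_sample) / len(small_sample)

print(f"true rate:        {true_rate:.3f}")
print(f"big, biased data: {big_estimate:.3f} (n = {len(big_sample)})")
print(f"small, random:    {small_estimate:.3f} (n = {len(small_sample)})")
```

Under these assumptions the large data set settles near two thirds, while the small random sample stays close to the true one half. More rows do not repair a biased collection mechanism; only a model of how the data came about does.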
Finally, what's your take as a Data Scientist once you accept all this (again, N.B. - if you were given proper training in probability, statistics, and methodology, you have already accepted all this, and most probably at a very young age)?
Top advice: educate your clients. Tell them that we're not magicians. Explain that all the narratives about models and our predictive machinery hold only in a closed world of our assumptions and within the limits of sampling and computational power. Find examples that will illustrate how a hit rate of 95% can drop below 80% just around the corner if the data structure changes too rapidly and starts violating the assumptions that you've made. Make them read the collected works of Nassim Nicholas Taleb (sure thing: even if they don't understand a word of it, they will claim that they do). Avoid mathematics altogether; it is not that complicated to explain such things even without a two-semester-long Introduction to Probability and The Scientific Method 101. They will understand and learn to act preventively.
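One such example can be sketched in a few lines of Python. Everything in it is an assumption made for illustration: two made-up classes on a single feature, a "model" that is nothing more than a threshold halfway between the two class means, and a world that later shifts by 1.65 units. The function names are hypothetical, too.

```python
import random
import statistics

def make_data(n, shift, rng):
    """Draw n labelled points: class 0 ~ N(0 + shift, 1), class 1 ~ N(3.3 + shift, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = (3.3 if label == 1 else 0.0) + shift
        data.append((rng.gauss(mean, 1.0), label))
    return data

def fit_threshold(data):
    """The whole 'model': the midpoint between the two class means."""
    mean0 = statistics.fmean(x for x, y in data if y == 0)
    mean1 = statistics.fmean(x for x, y in data if y == 1)
    return (mean0 + mean1) / 2

def hit_rate(data, threshold):
    """Share of points whose predicted class (x > threshold) matches the label."""
    hits = sum((x > threshold) == (y == 1) for x, y in data)
    return hits / len(data)

rng = random.Random(42)  # fixed seed so the toy is repeatable

train = make_data(20_000, 0.0, rng)
threshold = fit_threshold(train)

same_world = make_data(20_000, 0.0, rng)      # assumptions still hold
shifted_world = make_data(20_000, 1.65, rng)  # the data structure has moved

rate_before = hit_rate(same_world, threshold)
rate_after = hit_rate(shifted_world, threshold)

print(f"threshold learned:    {threshold:.2f}")
print(f"hit rate, same world: {rate_before:.1%}")
print(f"hit rate, shifted:    {rate_after:.1%}")
```

With these assumed numbers the hit rate sits around 95% as long as the world looks like the training data, and falls to roughly 75% after the shift, although the model itself has not changed at all. That is the kind of story a client can follow without a single formula.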
Learn to live with it. Model once, try to understand and express your conclusions verbally, change your assumptions, transform the data, model again, see whether your narrative about the data has changed. Do data storytelling, but avoid becoming a news reporter to your clients. Modeling is exact, under given constraints; understanding, on the other hand, is slippery. Of course, none of your clients care about the details of the former, but they're always eager to learn about the latter - which is, obviously, a demand almost impossible to satisfy in its totality. That is the curse of simplification.
Oh, and what was that that you read about the median Data Science salaries last year?
* In relation to Anderson’s original The End of Theory hypothesis, published in Wired on 06/23/2008, and to this blog post, the reader might be interested in its most recent re-invention, Weinberger’s Alien Knowledge: When Machines Justify Knowledge (04/18/2017), and in John Timmer’s response on Ars Technica (05/28/2017): First the cloud, now AI takes on the scientific method.