Big Data, No Data, and Metadata

Near-universal consensus has it that, sometime around 9/11, the world passed from the Age of Aquarius, through some vernal equinox noticed by few, straight into the Age of Big Data. That passage brought about a seismic epistemological shift. To be sure, any links to the events surrounding 9/11 are coincidental: the real reason for this transition was the coming of age of enabling technology. Still, whatever one may think of 9/11 conspiracy theorists who conjecture that the tragic events were brought on, or at least aided and abetted, by someone or something other than al Qaeda, the acts and omissions after 9/11 point to its utility for the advancement of surveillance, for which political and civic tolerance could otherwise not have been expected. Very much the same goes for the speed with which authorizing legislation was whipped through the formalities of democratic rule-making processes, purportedly under the influence of those events. But a pounce on an opportunity of this magnitude must no doubt have been incubated for quite some time, in lockstep with deep insights into the progress of technology and entirely independent of whatever statistically unpredictable Black Swan event would one day trigger its sudden political viability. It did not matter which event it would be, or who or what would cause it. That, in all likelihood, was indeed not known, and it did not need to be known. It was, in Donald Rumsfeld’s immortal dictum, one of the “known unknowns.”

The extent of surveillance capabilities that became available as a result to the U.S. government and to the other “Five Eyes” – Canada, the UK, Australia and New Zealand, which do not spy on each other (at least in theory and at least for now) and otherwise cooperate to secure the endurance of occidental civilization – would have been every totalitarian regime’s wet dream. Perhaps one day, cloning technology may enable resurrection of Feliks Dzierżyński’s or Lavrentiy Beria’s DNA, or Heinrich Himmler’s, Erich Mielke’s or Klemens Metternich’s, not to mention Joseph Fouché’s or Philip II’s or Kang Sheng’s or Pol Pot’s – and I predict the greatest possible unanimity of consensus among all these distinguished oppressors of the unrestrained human mind: no government can ever be secure in power without surveillance. So, does it really matter whether the chicken or the egg existed first, whether surveillance technology eviscerates pre-existing democratic structures and aspirations (those uncontrollable by the powers that be) or whether it is created by a totalitarian ambition already thus entrenched? The bottom line remains crisp and clear: information is power.

Another heretical lesson from history is that power corrupts, and absolute power corrupts absolutely. This trite aphorism is not criticism of the fact that an open democratic society develops technology potentially capable of abuse. Every technology is. But serious concern arises out of the near-total absence of intelligent systems design aimed at establishing transparent and accountable checks, balances and procedures to safeguard against the patently obvious risk of creeping abolition of civil rights and liberties for the purported preservation of “security” – a term that is defined neither qualitatively nor quantitatively nor statistically and therefore lacks any significant measure of transparency and accountability.

After all, creeping abolition of previously existing expectations has considerable tradition in the way we treat information. In many ways, technology’s relationship with power and its preservation has been ambivalent and multifaceted throughout history. Before the creation of commercial postal services in Northern Italy circa 1200 by the enterprising Tasso family of Bergamo – later the princely house of Thurn and Taxis – organized conveyance of information was a purely private matter, often enough done ad hoc and characteristically affordable only to sovereigns and military commanders who could maintain relay stations for couriers and for their horses. But literacy, and thus education and presumably the leveraged use of information, was a matter of substantial privilege in the medieval world in the first place. The printing press vastly increased access to knowledge, but with it came, perhaps inevitably, censorship of its use. As civilization progressed, the cost of transferring information came down somewhat, but not dramatically, while it was still largely based on, and limited by, the economics of the horse. Not until trade and thereby competition intensified dramatically, leading to the exchange of multiple scheduled couriers simultaneously on a given route, did cost decline. Another limiting factor was the physical condition of the medium – the weight of books and paper and the means and durability of storage. Disembodiment of information came only in the 19th century: first by rendering the horse obsolete, initially through pneumatic delivery through a letter chute, followed by sea-, land- and air-based applications of the steam and combustion engine, and soon thereafter by reducing information itself to electric signals and electromagnetic waves, analog at first and later digital. Breaches of confidentiality of information in transit by interception of couriers and involuntary extraction of information had been a well-known risk since antiquity.
Cicero already complained about the trials and tribulations of finding trustworthy carriers for his letters. Simultaneously, this gave rise to encryption, reported, inter alia, by Plutarch. Encoding of information, too, was at first limited in its use by the same two closely related factors: quality and cost. Encryption simply told the unintended reader to mind his own business. During the dawn of encryption, surveillance faced a multitude of challenges of which almost none were merely nominal: above all, and pretty much until the end of WWI, the fact of the transmission itself became known to interceptors only under highly serendipitous circumstances – through sheer luck, treason, or incompetence. Only once data transmission occurred through electromagnetic signals carried outside dedicated cables did encryption become the headache it is known to be today. Wars were lost due to timely code breaking: the Red Army’s operational designs in post-WWI Poland[1] and Hitler’s communications passing through the Enigma system did not remain as secret as intended.[2] Since the days of the Enigma machine, secrecy has correlated near-perfectly with the maintenance of crypto-technological superiority, and has lasted only as long as that superiority. To the extent that as yet undecipherable communications are being recorded and stored, they remain available for future use following additional advances in cryptology as well as quantum leaps in computational resources for ‘brute force’ attempts. Despite the likelihood of obsolescence for its primary purpose, any recorded signal may still serve as legal or at least historical evidence. As for encryption itself, brute force attacks are put in perspective by Edward Snowden’s remark: “Assume that your adversary is capable of a trillion guesses per second.”
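
To get a feel for what a trillion guesses per second means in practice, a rough back-of-the-envelope sketch may help. Only the guessing rate comes from the remark quoted above; the example keyspaces (an eight-letter lowercase password, a 56-bit key, a 128-bit key) and all function names are illustrative assumptions, and the calculation considers nothing but exhaustive search of the full keyspace.

```python
# Rate taken from the quotation above; everything else is an illustrative assumption.
GUESSES_PER_SECOND = 10**12
SECONDS_PER_YEAR = 365 * 24 * 3600


def keyspace_size(alphabet_size: int, length: int) -> int:
    """Number of distinct secrets of a given length over a given alphabet."""
    return alphabet_size ** length


def seconds_to_exhaust(keyspace: int, rate: int = GUESSES_PER_SECOND) -> float:
    """Worst-case time in seconds to try every candidate once."""
    return keyspace / rate


def years_to_exhaust(keyspace: int, rate: int = GUESSES_PER_SECOND) -> float:
    """Same worst-case time, expressed in years."""
    return seconds_to_exhaust(keyspace, rate) / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical examples, chosen only to span very different magnitudes.
    examples = {
        "8 lowercase letters": keyspace_size(26, 8),
        "56-bit key": 2**56,
        "128-bit key": 2**128,
    }
    for label, space in examples.items():
        print(f"{label}: {seconds_to_exhaust(space):.3g} s "
              f"({years_to_exhaust(space):.3g} years)")
```

Under these assumptions the short password falls in a fraction of a second and the 56-bit key within a day, while the 128-bit keyspace remains out of reach by many orders of magnitude, which is why stored ciphertext is worth keeping only in anticipation of better cryptanalysis or much faster hardware.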

In public and scholarly discourse, Big Data is for the most part viewed from a perspective of privacy. While privacy is an important concern, it must not be the litmus test for pushing technology further along. By far the greatest value of Big Data applications lies not in surveillance of individualized data but in the superstatistical analysis of large anonymized data pools. Notions like the ‘Internet of things’ and ‘networked factories’ cannot be realized without Big Data. The concept also has significant implications for life sciences and for public health.

While life expectancy rises around the world, people live more years with more diseases. Output of new drugs has dropped measurably, in part because the cost of regulation in a wider sense has multiplied: if commercializing a viable drug used to require a budget of $1 billion not too long ago, the benchmark figure has now reached about $5 billion, necessitating change. For this and other reasons, the growth rate of health care cost has surpassed the growth of GDP in virtually all developed markets. It appears that not only do the understanding and increasingly individualized treatment of systemic disorders such as cell malignancies require vastly increased computational tools and resources, but the macroeconomic analysis of public health phenomena will also need to rely on complex analysis of very large data pools if it is to tailor meaningful approaches to prevention and avoid misallocation of increasingly limited resources.

For example, different malignancies have very different causation and are thus not susceptible to similar approaches: while brain tumors appear to have purely genetic causes, skin cancer and Hodgkin’s lymphoma are predominantly caused by environmental factors. Still, 85-95% of the incidence of disease within a decade can now be predicted with considerable exactitude by statistical means. This permits successful prevention, or at least delay of onset, once certain known precursor constellations manifest. With this in mind, individual privacy issues are losing ground – if all I need to know about a person is essentially public information such as age, gender, zip code and (maybe) profession to arrive at often shockingly accurate guesses of ‘privileged data,’ then the value of protecting specific details by means of privilege is certainly diminished. The alternative of perfect privacy, thought through to its logical conclusion, would mean No Data, as it would engender a presumption of non-disclosure or non-use of information that is in plain view. Due process issues will remain subjects of intense debate, even with regard to ‘mere’ metadata and the form of their production.
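
How much a combination of seemingly public attributes narrows things down can be sketched by counting how many records share the same age, gender and zip code. The smallest such group size is what the privacy literature calls k-anonymity, a notion brought in here for illustration rather than taken from this post. The records and function names below are entirely hypothetical toy data, not real figures.

```python
from collections import Counter

# Hypothetical toy records: (age, gender, zip code, diagnosis).
RECORDS = [
    (34, "F", "10001", "asthma"),
    (34, "F", "10001", "diabetes"),
    (34, "F", "10001", "asthma"),
    (51, "M", "10002", "hypertension"),
    (51, "M", "10002", "hypertension"),
    (67, "F", "10003", "melanoma"),
]


def group_sizes(records):
    """Count records sharing the same (age, gender, zip code) combination."""
    return Counter(record[:3] for record in records)


def k_anonymity(records):
    """Size of the smallest group; k == 1 means someone is uniquely identifiable."""
    return min(group_sizes(records).values())


def unique_fraction(records):
    """Share of records whose attribute combination occurs exactly once."""
    sizes = group_sizes(records)
    singles = sum(1 for record in records if sizes[record[:3]] == 1)
    return singles / len(records)


if __name__ == "__main__":
    print("k-anonymity:", k_anonymity(RECORDS))
    print("uniquely identifiable share:", unique_fraction(RECORDS))
```

In this toy set the 67-year-old woman is alone in her group, so her diagnosis is exposed by public attributes alone, and the two 51-year-old men share a diagnosis, so theirs can be inferred even without singling either of them out.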

Very similar conclusions are imaginable in the social sciences. There, Big Data permits reasonably accurate forecasts of societal developments arising out of imbalances and trends, but also significantly improved structural planning of health care, energy and transportation infrastructures based on substantially reliable quantitative models. In this context, it matters little that a significant part of the data is inaccurate, as errors and other data flaws tend to cancel each other out due to their (overall) random nature. Yes, there is a percentage of unreliable information, but it merely accounts for background static. That does not mean that attempts to improve data quality can or should be forgone, but in a lot of situations, partially flawed results still enable valuable conclusions. As Big Data advances as a field, our ability to filter, smooth and otherwise enhance input data or improve the quality of their analysis, sometimes by complexity management tools such as the laws of entropy, will continue to correlate strongly with the power of computational resources available, and with their growth rate.
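
The claim that random errors largely cancel out in large pools can be illustrated with a small simulation. This is a minimal sketch under a strong assumption: every observation is the true value plus independent, zero-mean Gaussian noise. The true value, the noise level and the function names are made up for illustration, and errors that are systematically biased would not average out in this way.

```python
import math
import random
import statistics


def noisy_observations(true_value, noise_sd, n, rng):
    """Draw n observations of true_value, each with independent zero-mean noise."""
    return [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]


def standard_error(noise_sd, n):
    """Theoretical standard deviation of the mean of n such observations."""
    return noise_sd / math.sqrt(n)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
    TRUE_VALUE = 100.0      # hypothetical quantity being measured
    NOISE_SD = 10.0         # hypothetical per-record error level
    for n in (100, 10_000, 100_000):
        sample = noisy_observations(TRUE_VALUE, NOISE_SD, n, rng)
        estimate = statistics.fmean(sample)
        print(f"n={n:>7}: estimate={estimate:.3f}, "
              f"expected error scale={standard_error(NOISE_SD, n):.3f}")
```

The expected error of the pooled estimate shrinks with the square root of the number of records, so a hundredfold increase in data buys roughly a tenfold gain in precision even though each individual record stays just as unreliable.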

Under the umbrella of Science, Technology and Society (STS), a field of research loosely termed Critical Data Studies (CDS) has emerged. It deals with questions of data assemblages. Data assemblages assume a view of data as already constituted or composed within data structures that interact with society, its forms and methods of organization, and with the daily lives of individuals.[3] Critical Data Studies are based on the demonstrably accurate assumption that data are almost never simply objective, neutral, or transparent entities of information. CDS raise questions reinterpreting well-established concepts in light of data-specific analysis:

  • Causality: how should we find causes in the era of ‘data-driven science’? Do we need a new conception of causality to fit with new practices?
  • Quality: how should we ensure that data are of good enough quality for the purposes for which we use them? What should we make of the open access movement; what kind of new technologies might be needed?
  • Security: how can we adequately secure data, while making it accessible to those who need it? How do we protect databases?
  • Uncertainty: can Big Data help with uncertainty, or does it generate new uncertainties? What technologies are essential to reduce uncertainty elements in data-driven sciences?

More about complexity management tools (that do not necessarily, or even predominantly, lead to ‘complexity reduction’ in a Big Data context) and paraconsistent logic will be forthcoming in one of the iterations of this blog. Big Data brings to our attention a sharper vision of the relative value and at the same time of a measure of insignificance of detail (and of any shielding thereof) depending on the purpose of the application. It is time we started talking about data and data structures on many levels and from multiple perspectives.

[1] Richard Woytak, Colonel Kowalewski and the Origins of Polish Code Breaking and Communication Interception, 21 East European Quarterly no. 4, Jan. 1988, at 497–500.
[2] Robert J. Hanyok, Appendix B: Before Enigma: Jan Kowalewski and the Early Days of the Polish Cipher Bureau (1919–22), in Enigma: How the Poles Broke the Nazi Code (2004).
[3] See Andrew Iliadis, Comparative Review: Big Data, 46 Communication Booknotes Quarterly no. 2, May 21, 2015, at 54–57.
