Issue 1/2016 - New Materialism


"100 billion rows per second"

Culture Industry in the Early 21st Century

Lev Manovich


“Scuba is Facebook's fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.” Facebook Top Open Data Problems, 2014, https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-data-problems/ .

“Interested parties like to explain culture industry in technological terms. Its millions of participants, they argue, demand reproduction processes that inevitably lead to the use of standard processes to meet the same needs at countless locations… In reality, the cycle of manipulation and retroactive need is unifying the system ever more tightly.” Theodor Adorno and Max Horkheimer, “The Culture Industry: Enlightenment as Mass Deception,” in Dialectic of Enlightenment, 1944, http://web.stanford.edu/dept/DLCL/files/pdf/adorno_culture_industry.pdf.

“Being able to iterate quickly on thousands of models will require being able to train and score models simultaneously. This approach allows Cisco (an H2O customer) to run 60,000 propensity to buy models every three months, or to allow Google to not only have a model for every individual, but to have multiple models for every person based on the time of the day.” Alex Woodie, “The Rise of Predictive Modeling Factories,” February 9, 2015, http://www.datanami.com/2015/02/09/rise-predictive-modeling-factories/.

One of the important aspects of the “big data” revolution is how it has affected media and culture industries. (Note that I am not saying “digital culture” or “digital media,” because today all culture industries create digital products that are disseminated online. This includes games, movies, music, TV shows, e-books, online advertising, apps, etc. So I don’t think we need to add the word “digital” anymore.)

The companies that sell cultural goods and services online (for example, Amazon, Apple, Spotify, Netflix), organize and make searchable information and knowledge (Google), provide recommendations (Yelp, TripAdvisor), and enable social communication and information sharing (Facebook, QQ, WhatsApp, Twitter, etc.) and media sharing (Instagram, Pinterest, YouTube, etc.) all rely on computational analysis of massive media data sets and data streams. This includes information about online behavior (browsing pages, following links, sharing and “liking” posts, purchasing), traces of physical activity (posting to a social media network from a particular place and time), records of interaction (online gameplay), and cultural “content” – songs, images, books, movies, messages, and posts. Similarly, parts of human-computer interaction – for example, using a voice interface – also depend on computational analysis of countless hours of voice commands.

(I use “data sets” to refer to static or historical data organized in a database prior to automatic analysis. “Historical” in industrial data analytics applications means everything that is more than a second in the past. “Data streams” refers to data that arrives and is analyzed continuously, using technologies such as Spark Streaming or Storm. In either case, the data is also stored using fast-access technologies such as Cassandra, HBase, and MongoDB.)
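To make the distinction between data sets and data streams concrete, here is a minimal sketch using Spark’s Python API. The log file path and the socket source are hypothetical placeholders, and real industrial pipelines are of course vastly larger and more elaborate:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="datasets_vs_streams")

# Data set: a static, "historical" collection loaded once and analyzed as a whole.
historical = sc.textFile("hdfs:///events/archive/*.log")  # hypothetical path
print(historical.filter(lambda line: "purchase" in line).count())

# Data stream: the same logic applied continuously to records as they arrive,
# processed here in one-second micro-batches.
ssc = StreamingContext(sc, batchDuration=1)
live = ssc.socketTextStream("localhost", 9999)            # hypothetical source
live.filter(lambda line: "purchase" in line).count().pprint()

ssc.start()
ssc.awaitTermination()
```

The point of the sketch is only that the same analysis can run over data at rest or over data in motion; the surrounding storage layer (Cassandra, HBase, MongoDB) is what keeps the results quickly accessible.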

For example, to make its search service possible, Google continuously analyzes the full content and markup of billions of web pages. It looks at every page on the web it can reach – its text, layout, fonts used, images, and so on – using over 200 signals in total. To be able to recommend music, streaming services analyze the characteristics of millions of songs. For example, The Echo Nest, which powers Spotify, has used its algorithms to analyze 36,774,820 songs by 3,230,888 artists. Spam detection involves analysis of the texts of numerous emails. Amazon analyzes the purchases of millions of people to recommend books. Contextual advertising systems such as AdSense analyze the content of web pages and automatically select the relevant ads to show. Video game companies capture the gaming actions of millions of players to optimize game design. Facebook’s algorithms analyze all updates by all your friends to automatically select which ones to show in your feed – and they do this for every one of its 1.5 billion users. According to one estimate, in 2014 Facebook was processing 600 TB of fresh data per day.
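To give a sense of what such content analysis looks like at its simplest, here is a toy sketch of the kind of text classification used in spam detection, written with scikit-learn. The handful of example messages is invented, and production systems learn from billions of messages and far richer features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy training data (real spam filters learn from millions of messages).
emails = [
    "win a free prize now",
    "cheap pills, limited time offer",
    "meeting moved to 3pm tomorrow",
    "draft of the article attached",
]
labels = ["spam", "spam", "not spam", "not spam"]

# Turn each text into a vector of word counts, then fit a simple classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
classifier = MultinomialNB().fit(X, labels)

# Score a new, unseen message.
new_message = vectorizer.transform(["free prize offer, act now"])
print(classifier.predict(new_message))   # -> ['spam']
```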

The development of the algorithms and software systems that make all this analysis possible is carried out by researchers in a number of academic fields, including machine learning, data mining, computer vision, music information retrieval, computational linguistics, natural language processing, and other areas of computer science. The newer term “data science,” which has become popular in recent years, refers to professionals with advanced computer science degrees who know contemporary algorithms and methods for data analysis (described by the overlapping umbrella terms “data mining,” “machine learning,” and “AI”) as well as classical statistics, and who can implement the gathering, analysis, reporting, and storage of big data using current technologies (see the examples above). To speed the progress of research, most top companies share many parts of their key code. For example, on November 9, 2015, Google open-sourced TensorFlow, the data and media analysis system that powers many of its services. Companies have also open-sourced their software systems for organizing massive datasets, such as Cassandra and Hive (both developed at Facebook).

The practices of massive analysis of content and interaction data across media and culture industries were established approximately between 1995 (early web search engines) and 2010 (when Facebook reached 500 million users). Today they are routine, with every large media company doing this daily, and increasingly in real time.

This is the new “big data” stage of modern technological media. It follows previous stages such as massive reproduction (1500-), broadcasting (1920-), and the web (1993-). Since the industry does not have a single term to refer to all the practices we are describing, we can go ahead and coin a temporary name. Let’s call them media analytics.

To the best of my knowledge, this novel aspect of contemporary media has not yet been clearly described by media and communication scholars. After approximately 2013, we start to see more discussion of the social and political issues around the use of large-scale consumer and social media data and automatic algorithms: data and law, data and privacy, data and labor, etc. (The events at the NYC-based Data & Society Institute offer many examples of such discussions. See also the programs of the Governing Algorithms conference at NYU, 2013, and the Digital Labor conference at the New School for Social Research, 2014, as well as publications in the academic journal Big Data & Society, 2014-.)

However, I have not yet seen these discussions or publications on social, legal, and economic issues include the idea I am proposing here – thinking of media analytics as the new condition of the culture industry and also as a new stage in media history. Algorithmic analysis of “cultural data” and customization of cultural products is at work not only in a few visible areas such as Google Search and Facebook news updates that have already been discussed – it is also at work in all platforms and services where people share, purchase, and interact with cultural goods and with each other. (At the time when Adorno and Horkheimer were writing their analysis, interpersonal interactions were not yet directly part of the culture industry. But in “software culture,” they have now also become “industrialized” – organized by the interfaces, conventions, and tools of social networks and messaging apps, and influenced in certain ways by algorithms that process all interaction data and make decisions about what content, updates, and information to show and when to show it.)

Why do I call it a “stage,” as opposed to just a trend or one element of the contemporary culture industry? Because in many cases media analytics involves automatic computational processing and analysis of every cultural artifact in a particular industry (such as the music industry as represented by music streaming services) and of every user interaction inside services that hundreds of millions of people use daily (e.g., Facebook or Baidu). It is the new logic of how media works and how it functions in society. In short, it is crucial both practically and theoretically. Any future discussion of media, media theory, or communication has to start by dealing with this situation.

(Of course, I am not saying that nothing else has happened since 1993 with media technologies. I can list many other important developments, such as the move from hierarchical organization of information to search, the rise of social media, the integration of geolocation information, mobile computing, the ubiquity of consumer computing and of media capture, viewing, and processing devices such as mobile phones, and the switch to supervised machine learning across media analytics applications and other areas of data analysis after 2010.)

The companies that are key players in “big media” data processing are newer ones that developed with the web – Google, Amazon, eBay, Facebook, etc. – rather than older 20th-century culture industry players such as movie studios or book publishers. Therefore, what has been analyzed and optimized between 1995 and today is mostly distribution, marketing, advertising, discovery, and recommendations, i.e., the parts where customers find, purchase, and “use” cultural products. As I already noted, the same computational paradigms are also implemented in social networks. From this perspective, the users of these networks become “products” to each other. For example, Amazon’s algorithms analyze data about what goods people look at and what they purchase, and use this analysis to provide personalized recommendations to each Amazon user. At the same time, Facebook’s algorithms analyze what people do on Facebook to select what content appears in each person’s News Feed. (According to the current default setting, Facebook shows you only some of these posts – the ones it calls “Top Stories” – and they are selected by its algorithms. This setting can be changed by going to the News Feed tab on the left and selecting “Most Recent” instead of “Top Stories.”)
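To illustrate the underlying mechanics in the simplest possible form, here is a toy sketch of “customers who bought this also bought that” logic based on co-purchase counts. The four invented baskets stand in for the purchase histories of hundreds of millions of customers, and real recommendation engines use far more sophisticated models:

```python
from collections import Counter
from itertools import combinations

# Invented toy purchase histories (real systems work with hundreds of millions).
baskets = [
    {"book_a", "book_b"},
    {"book_a", "book_b", "book_c"},
    {"book_b", "book_c"},
    {"book_a", "book_d"},
]

# Count how often each pair of items is bought together.
co_bought = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        co_bought[pair] += 1

def recommend(item, top_n=2):
    """'Customers who bought this also bought...' via simple co-purchase counts."""
    scores = Counter()
    for (a, b), count in co_bought.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]

print(recommend("book_a"))   # -> ['book_b', 'book_c']
```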

Media analytics is a key aspect of the “materiality” of media today. In other words: materiality now is not only about hardware, or databases, or media authoring, publishing, and sharing software, as it was in the early 2000s (see my own book Software Takes Command for a history and analysis of this software). It is about technologies such as Hadoop, Storm, and computing clusters, paradigms such as supervised machine learning, particular data analysis trends such as “deep learning,” and basic machine learning algorithms such as k-means, decision trees, and kNN. Materiality is Facebook “scanning 100 billion rows per second,” Google processing 100+ TB of data per day (a 2014 estimate), and the automatic creation of “multiple [predictive] models for every person based on the time of the day.”
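As a reminder of how basic these algorithmic building blocks are, here is a minimal sketch of one of the algorithms just named – k-nearest neighbors classification – using scikit-learn. The tiny table of “user behavior” features is invented, and industrial systems fit such models to billions of rows on computing clusters:

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented toy data: each row describes a user by two behavioral features,
# e.g. minutes listened per day and number of shares per week.
X_train = [[5, 0], [10, 1], [90, 10], [180, 30], [200, 25]]
y_train = ["casual", "casual", "regular", "heavy", "heavy"]

# k-nearest neighbors: label a new user by the labels of the most similar known users.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(model.predict([[150, 20]]))   # -> ['heavy']
```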

At this point you, the reader, may get impatient and wonder when I will deliver what critics and media theorists are supposed to deliver when they talk about contemporary life and, in particular, technology: a critique of what I am describing. Why am I not invoking “capitalism,” “commodity,” “fetishism,” or “resistance”? Does not the media analytics paradigm represent another step in capitalism’s rationalization of everything? Where is my moralistic judgment?

None of this is coming. Why? Because, in contrast to what media commentators like to tell you, I believe that computing and data analysis technologies are neutral. They don’t come with built-in social and economic ideologies and effects, and they are hardly the tools of capitalism and profit making. Exactly the same analytics algorithms (k-means cluster analysis, Principal Component Analysis, and so on) and massive data processing technologies (Cassandra, MongoDB, etc.) are used to analyze people’s behavior in social networks, look for a cure for cancer, spy on potential terrorists, select the ads that appear next to your YouTube video, study the human microbiome, motivate people to lead healthy lifestyles, get more people to vote for a particular candidate during presidential elections (think Obama in 2012), and so on. They are used by for-profits and non-profits, by the USA, Russia, Brazil, China, and everybody else, in many thousands of applications. They are used to control and to liberate, to create new knowledge and to limit what we know, to help us find love and to encourage us to consume more.
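To underline this neutrality, consider the following sketch using scikit-learn’s implementations of PCA and k-means, two of the algorithms named above. The feature matrix here is randomly generated, because the code runs identically whether its rows describe songs, shoppers, patients, or voters – the algorithm never knows what the numbers mean:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# A feature matrix: 1,000 items described by 20 numerical features.
# The rows could equally be songs, shoppers, gene profiles, or voters --
# the algorithms below are indifferent to what the numbers represent.
X = np.random.rand(1000, 20)

# Principal Component Analysis: compress the 20 features down to 2 dimensions.
X_reduced = PCA(n_components=2).fit_transform(X)

# k-means: group the items into 5 clusters by similarity in the reduced space.
labels = KMeans(n_clusters=5).fit_predict(X_reduced)
print(labels[:10])   # cluster assignment for the first 10 items
```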

This does not mean that the adoption of large-scale data processing and analysis across the culture industry does not change it in any significant ways. Nor does it mean that it is now any less of an “industry,” in the sense of having distinct forms of organization. On the contrary – some of the marketing and advertising techniques, the ways of interacting with customers, and the ways of presenting cultural products are very new, and in the last few years they have all come to rely on large-scale media analytics. The cultural (as opposed to economic or social) effects of these developments have not yet been systematically studied by either industry or academic researchers, but one thing is clear – the same data analysis methods and data gathering techniques that are used in the culture industry can be used to research at least some of its cultural effects. Such analysis will gradually emerge, and we can already give it a name: computational media studies.