Archive for the 'the data mining age' Category

visualizing explosion of digital data

Sunday, April 8th, 2012

The World’s Technological Capacity to Store, Communicate, and Compute Information.

Martin Hilbert and Priscila López.

Science, February 10, 2011.

Abstract:

We estimate the world’s technological capacity to store, communicate, and compute information, tracking 60 analog and digital technologies during the period from 1986 to 2007. In 2007, humankind was able to store 2.9 × 10^20 optimally compressed bytes, communicate almost 2 × 10^21 bytes, and carry out 6.4 × 10^18 instructions per second on general-purpose computers. General-purpose computing capacity grew at an annual rate of 58%. The world’s capacity for bidirectional telecommunication grew at 28% per year, closely followed by the increase in globally stored information (23%). Humankind’s capacity for unidirectional information diffusion through broadcasting channels has experienced comparatively modest annual growth (6%). Telecommunication has been dominated by digital technologies since 1990 (99.9% in digital format in 2007), and the majority of our technological memory has been in digital format since the early 2000s (94% digital in 2007).
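To get a feel for what these annual rates imply, here is a minimal sketch of the compounding arithmetic. It assumes each rate held constant over the whole 1986 to 2007 period (21 years), which is a simplification; the function names are illustrative and do not come from the paper.

```python
import math

# Annual growth rates quoted in the abstract (as fractions).
RATES = {
    "general-purpose computing": 0.58,
    "bidirectional telecommunication": 0.28,
    "stored information": 0.23,
    "broadcasting": 0.06,
}

YEARS = 2007 - 1986  # assumption: the rate applies uniformly to all 21 years


def growth_factor(annual_rate, years):
    """Total multiplication of capacity after compounding for `years`."""
    return (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed for capacity to double at a constant annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


for name, rate in RATES.items():
    print(f"{name}: x{growth_factor(rate, YEARS):,.1f} over {YEARS} years, "
          f"doubling every {doubling_time(rate):.1f} years")
```

The point of the comparison is that modest-looking differences in annual rate (6% versus 23% versus 58%) compound into very different totals over two decades.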

Illustration from the article in Washington Post about this research:

Rise-of-Digital-Information

Against Search

Friday, July 22nd, 2011

Lev Manovich, July 21, 2011

keywords: search, Google, knowledge discovery, digital library, database, classification, folksonomy, information retrieval, HCI, interface, information visualization, digital humanities, cultural analytics, visual analytics, software studies, Manovich

Early 21st century humanities and media studies researchers have access to unprecedented amounts of media – more than they can possibly study, let alone simply watch or even search. (For examples of large media collections, see the list of repositories made available to the participants of Digging Into Data 2011 Competition, www.diggingintodata.org). The basic method of humanities and media studies which worked fine when the number of media objects was small – see all images or video, notice patterns, and interpret them – no longer works. For example, how do you study 167,000 images on Art Now Flickr gallery, 236,000 professional design portfolios on coroflot.com (both numbers as of 7/2011), or 176,000 Farm Security Administration/Office of War Information photographs taken between 1935 and 1944 digitized by Library of Congress (http://www.loc.gov/pictures/)?

Given the size of typical contemporary digital media collections, simply seeing what’s inside them is impossible.

Although it may appear that the reasons for this are the limitations of human vision and human information processing, I think that it is actually the fault of current interface designs and web technology. Standard interfaces for massive digital media collections such as list, gallery, grid, and slide do not allow us to see the contents of a whole collection. These interfaces usually display only a few items at a time (regardless of whether you are in a browsing mode or in a search mode). This access method does not allow us to understand the “shape” of the overall collection and notice interesting patterns.

The popular media access technologies of the 19th and 20th century such as slide lanterns, film projectors, microfilm readers, Moviola and Steenbeck, record players, audio and video tape recorders, VCR, and DVD players were designed to access single media items at a time at a limited range of speeds. This went hand in hand with the media distribution mechanisms: record and video stores, libraries, television and radio would all only make available a few items at a time. For instance, you could not watch more than a few TV channels at the same time, or borrow more than a few videotapes from a library. At the same time, hierarchical classification systems used in library catalogs made it difficult to browse a collection or navigate it in orders not supported by catalogs. When you walked from shelf to shelf, you were typically following a classification based on subjects, with books organized by author names inside each category.

Together, these distribution and classification systems encouraged 20th century media researchers to decide beforehand what media items to see, hear, or read. A researcher usually started with some subject in mind – films by a particular author, works by a particular photographer, or categories such as “1950s experimental American films” and “early 20th century Paris postcards.” It was impossible to imagine navigating through all films ever made or all postcards ever printed. (One of the first media projects which organizes its narrative around navigation of a media archive is Jean-Luc Godard’s “Histoire(s) du cinéma” which draws samples from hundreds of films.) The popular social science method for working with larger media sets in an objective manner – content analysis, i.e. tagging of semantics in a media collection by several people using a predefined vocabulary of terms – also requires that a researcher decide beforehand what information would be relevant to tag.

Unfortunately, the current standard in media access – computer search – does not take us out of this paradigm. A search interface is a blank frame waiting for you to type something. Before you click the search button, you have to decide what keywords and phrases to search for. So while search brings a dramatic increase in speed of access, it assumes that you already know something about the collection worth exploring further.

We need techniques for efficient browsing of content and discovery of patterns in massive media collections. Consider this definition of “browse”: “To scan, to casually look through in order to find items of interest, especially without knowledge of what to look for beforehand” (“Browse”, Wiktionary). Consider also one of the meanings of the word “exploration”: “to travel somewhere in search of discovery” (“Exploration”, Wiktionary). How can we discover interesting things in massive media collections? I.e., how can we browse through them efficiently and effectively, without knowing in advance what we want to find?

new article “Trending: The Promises and the Challenges of Big Social Data”

Saturday, April 23rd, 2011

**************************************************************************

DOWNLOAD:

Trending: The Promises and the Challenges of Big Social Data (PDF).

 

**************************************************************************

In this article I address some of the theoretical and practical issues raised by emerging “big data”-driven social science and humanities. My observations are based on my own experience over the last three years with big data projects carried out in my lab at UCSD and Calit2 (softwarestudies.com). The issues discussed include the differences between “deep data” about a few and “surface data” about the many; getting access to transactional data; and the new “data analysis divide” between data experts and the rest of us.

digital humanities++ | syllabus for Lev Manovich’s spring 2011 course at UCSD

Monday, March 28th, 2011

digital humanities++ | syllabus

presentation at the National Energy Research Scientific Computing Center (NERSC)

Tuesday, December 21st, 2010

Photo: Daniela, Lev, and Jeremy in front of one of the supercomputers at NERSC (The National Energy Research Scientific Computing Center).

 Daniela, Lev and Jeremy at NERSC

the myth of user-generated content

Tuesday, November 23rd, 2010

what percentage of the videos people watch on YouTube are user-generated? answer: 17%

(according to tubemogul report on january 31, 2010)

youtube_stats.jpg

fashion shopping with machine learning and computer vision

Wednesday, November 17th, 2010

From the Google blog:

“Boutiques uses computer vision and machine learning technology to visually analyze your taste and match it to items you would like.”

“First we partnered with taste-makers of all types. We asked them not just to curate 10-50 great items they loved, but also to teach our site their style and taste. They did this by telling us what colors, patterns, brands and silhouettes they loved and they hated. They took a visual quiz that taught the site to understand their style genre: Classic, Boho, Edgy, etc. Our machine learning algorithms use this information to enable you to shop all of the inventory in the style of that taste-maker, on top of the 50 items they’ve hand-curated.”

“We analyze the photograph of an item for its color, shape and pattern and try to help you find visually similar items.”

new york times: the next big idea in humanities is data

Wednesday, November 17th, 2010

PATRICIA COHEN, Digital Keys for Unlocking the Humanities’ Riches, New York Times, November 17, 2010:

“A history of the humanities in the 20th century could be chronicled in “isms” — formalism, Freudianism, structuralism, postcolonialism — grand intellectual cathedrals from which assorted interpretations of literature, politics and culture spread.

The next big idea in language, history and the arts? Data.

Members of a new generation of digitally savvy humanists argue it is time to stop looking for inspiration in the next political or philosophical “ism” and start exploring how technology is changing our understanding of the liberal arts. This latest frontier is about method, they say, using powerful technologies and vast stores of digitized materials that previous humanities scholars did not have.”

a tour of our Mapping Time exhibition

Friday, November 12th, 2010

visualizing “Software Takes Command”

Thursday, November 11th, 2010

I have uploaded the Software Takes Command book manuscript (88,000 words; released online 11/2008 under CC license) to manyeyes, and used Phrase Net to create a network graph connecting the key concepts in the text.

Using [space] between the words:

interactive version

softbook map - using SPACE between the words

Using [“and”] between the words:

interactive version

softbook map - using AND between the words

The links above will take you to interactive versions on manyeyes - you can change how many top terms are shown, the type of connection between them, etc.

I find this mapping very useful - you can check whether your text is actually about what you think it is about (are the most frequently appearing words the ones you want?). (For instance, I did not realize that “animation” and “3D” were that prominent).

The maps also are really good at summarizing the “semantic space” of your text (more objective than the writer or the critics.) I think that Phrase Net should be every writer’s essential tool.
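For readers who want to try the same idea on their own text without manyeyes, here is a rough sketch of the kind of counting behind such a map. It is not the actual Phrase Net implementation: it simply counts pairs of words joined by a connector pattern (the word “and” or plain whitespace, mirroring the two maps above). The function name, the regular expressions, and the sample sentence are my own assumptions.

```python
import re
from collections import Counter


def phrase_net_edges(text, connector=r"\s+and\s+"):
    """Count directed word pairs (left, right) joined by `connector`.

    `connector` is a regular expression for what must appear between two
    words, e.g. r"\s+and\s+" for "X and Y" or r"\s+" for adjacent words.
    A lookahead is used so the right-hand word of one pair can also be the
    left-hand word of the next ("a and b and c" yields two pairs).
    """
    pattern = re.compile(r"\b(\w+)(?=" + connector + r"(\w+)\b)")
    return Counter(
        (match.group(1), match.group(2))
        for match in pattern.finditer(text.lower())
    )


# Hypothetical sample text; replace with the full manuscript as a string.
sample = (
    "Software and media and animation shape culture. "
    "Media and software are studied together."
)

# Strongest "X and Y" links first; these pairs are the edges of the graph.
for (left, right), count in phrase_net_edges(sample).most_common(5):
    print(f"{left} -> {right}: {count}")
```

Passing `connector=r"\s+"` gives the “space between the words” variant. A real tool would also filter out stop words and draw the resulting pairs as a network graph.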