Monday, December 7, 2009

Graph Representation

The final example in Visualizing Data was a graph representation. To start off, we took a short text sample and graphed it in such a way that each word merited its own box, and a line was drawn between each pair of words that appeared in sequence in the text. The placement of the words on the page followed a physics simulation (specifically mimicking springs), causing the words to arrange themselves at the lowest-energy state among their connections. If the user clicked and dragged on a word, this would a) fix its position, b) turn it red, and c) shift the other words into a new lowest-energy arrangement.
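As a sketch of the underlying idea (this is my own minimal spring-embedding toy, not Fry's code), each edge behaves like a spring with a rest length, every pair of nodes repels, and damping lets the system settle into a low-energy layout. Dragging a node pins it and turns it red:

```processing
int N = 6;
float[] x = new float[N], y = new float[N];
float[] vx = new float[N], vy = new float[N];
boolean[] fixed = new boolean[N];
// edges between consecutive words in the sample text (made-up adjacency)
int[][] edges = { {0,1}, {1,2}, {2,3}, {3,4}, {4,5}, {5,1} };
int dragged = -1;

void setup() {
  size(400, 400);
  for (int i = 0; i < N; i++) {
    x[i] = random(width);
    y[i] = random(height);
  }
}

void draw() {
  background(255);
  // spring force along each edge, pulling toward a rest length of 80
  for (int[] e : edges) {
    float dx = x[e[1]] - x[e[0]], dy = y[e[1]] - y[e[0]];
    float d = max(1, sqrt(dx*dx + dy*dy));
    float f = 0.01 * (d - 80);
    vx[e[0]] += f * dx / d;  vy[e[0]] += f * dy / d;
    vx[e[1]] -= f * dx / d;  vy[e[1]] -= f * dy / d;
  }
  // pairwise repulsion keeps unconnected nodes apart
  for (int i = 0; i < N; i++) {
    for (int j = i + 1; j < N; j++) {
      float dx = x[j] - x[i], dy = y[j] - y[i];
      float d2 = max(25, dx*dx + dy*dy);
      vx[i] -= 500 * dx / d2;  vy[i] -= 500 * dy / d2;
      vx[j] += 500 * dx / d2;  vy[j] += 500 * dy / d2;
    }
  }
  // integrate with damping; pinned nodes stay put
  for (int i = 0; i < N; i++) {
    if (!fixed[i]) { x[i] += vx[i]; y[i] += vy[i]; }
    vx[i] *= 0.9;  vy[i] *= 0.9;
  }
  stroke(0);
  for (int[] e : edges) line(x[e[0]], y[e[0]], x[e[1]], y[e[1]]);
  for (int i = 0; i < N; i++) {
    fill(fixed[i] ? color(255, 0, 0) : color(255));
    rect(x[i] - 20, y[i] - 10, 40, 20);  // word boxes, unlabeled here
  }
}

void mousePressed() {
  for (int i = 0; i < N; i++)
    if (dist(mouseX, mouseY, x[i], y[i]) < 20) { dragged = i; fixed[i] = true; }
}
void mouseDragged() {
  if (dragged >= 0) { x[dragged] = mouseX; y[dragged] = mouseY; }
}
void mouseReleased() { dragged = -1; }
```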


While that doesn't look terrible with such a small data set, it is easy to see how this would quickly become too much. In the next iteration, every object becomes a small yellow circle, changing into its word only when selected. The example text for the next three images is the first chapter of "Huckleberry Finn."


Same idea with this one, but the radius of each circle is now weighted by the word's frequency in the text. Also, less jarring colors.
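The counting behind the frequency weighting is simple; here is a hedged sketch, where "chapter1.txt" is a placeholder filename and the radius scaling is my own choice (square root, so that circle area, not diameter, tracks the count):

```processing
import java.util.HashMap;

HashMap<String, Integer> counts = new HashMap<String, Integer>();

void setup() {
  String[] lines = loadStrings("chapter1.txt");  // placeholder file
  for (String line : lines) {
    // crude tokenization on whitespace and punctuation
    for (String word : splitTokens(line.toLowerCase(), " .,;:!?\"'-")) {
      Integer c = counts.get(word);
      counts.put(word, c == null ? 1 : c + 1);
    }
  }
}

// area ~ frequency, so radius grows with the square root of the count;
// assumes the word actually appears in the table
float radiusFor(String word) {
  return 3 + 2 * sqrt(counts.get(word));
}
```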


The possibilities for this type of representation are pretty big. I am excited to start playing around with it. In closing, here's a picture of how another program rendered this same data set: (Graphviz, filetype .dot)

Thursday, December 3, 2009

Treemaps

For the latest chapter, we harnessed the power of Treemaps (history here) to generate a quick-and-dirty visual comparison of objects. The magnitude of some attribute determines the area given to that object in the representation, producing a visual patchwork that conveys the weight of each object relative to the others in its class.
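As a minimal illustration of that area rule (the book uses Ben Fry's treemap library; this hand-rolled "slice" layout just demonstrates the principle), each rectangle's area comes out proportional to its value:

```processing
float[] values = { 50, 30, 12, 5, 3 };  // e.g. word counts, largest first

void setup() {
  size(400, 300);
  noLoop();
}

void draw() {
  background(255);
  stroke(255);
  float total = 0;
  for (float v : values) total += v;
  layout(0, 0, 0, width, height, total);
}

// Slice a strip off the longer side for item i, then recurse on the rest.
// Each strip's area is (values[i] / total) of the full window.
void layout(int i, float x, float y, float w, float h, float total) {
  if (i >= values.length) return;
  float frac = values[i] / total;
  fill(map(i, 0, values.length, 60, 220));
  if (w > h) {
    rect(x, y, w * frac, h);
    layout(i + 1, x + w * frac, y, w * (1 - frac), h, total - values[i]);
  } else {
    rect(x, y, w, h * frac);
    layout(i + 1, x, y + h * frac, w, h * (1 - frac), total - values[i]);
  }
}
```

Slicing along the longer remaining side keeps the rectangles from degenerating into thin strips, which is the same concern that motivates the fancier "squarified" layouts.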

To quickly illustrate the power of this method, the chapter starts off with the example of displaying word frequency in Mark Twain's "Following the Equator":


After laying the conceptual foundation, we turn the idea of Treemaps to a more complex and useful application. The final project asks the user to select a starting directory, and then maps the files and folders contained within to their relative disk usage. Opening window:


I selected my "Pictures" folder, and the first screen appears as follows:


As one might gather from the picture, the boxes are assigned a hue based on their position from the top-left corner to the bottom-right corner. In this screenshot, focus is on the "2009-11" folder, which raises the brightness of that box and simultaneously dims the other boxes in the field. Clicking on a box causes it to display a recursive Treemap of that folder's contents.
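The coloring rule might look something like this (a guess on my part, not the book's code): hue follows the box's position along the top-left-to-bottom-right diagonal, and the box under the mouse is drawn at full brightness while the rest are dimmed.

```processing
void setup() {
  size(400, 400);
  colorMode(HSB, 360, 100, 100);
}

void draw() {
  background(0, 0, 100);
  int cols = 8, rows = 8;
  float w = width / (float) cols, h = height / (float) rows;
  for (int i = 0; i < cols; i++) {
    for (int j = 0; j < rows; j++) {
      float x = i * w, y = j * h;
      // diagonal position in [0,1] drives the hue
      float t = (x + y) / (width + height);
      boolean hover = mouseX >= x && mouseX < x + w &&
                      mouseY >= y && mouseY < y + h;
      fill(t * 360, 60, hover ? 100 : 55);  // hovered box at full brightness
      rect(x, y, w, h);
    }
  }
}
```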

This process is demonstrated in the screenshot below, which zooms in on the folder "Snapshots" and then highlights the folder "2009_05_24", showing a Treemap of all the files found within it. Emphasis is on the file IMG_0087.JPG.


One final touch that I want to mention: the value of each hue is adjusted based on the most recent modification date of each folder or file. The timescale for this uses an algorithm that evaluates all objects displayed on the screen and computes a logarithmic scale across the set. This is displayed in the following screenshot: (mouse emphasis on the folder "2008-12-15")
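I won't reproduce the exact algorithm, but here is one plausible reading of that logarithmic timescale as a helper function (the ages array and brightness range are my own stand-ins): take each visible item's age, then map log(age) between the youngest and oldest items on screen.

```processing
// ages[] holds (now - modificationTime) in seconds for the visible items;
// assumes at least two distinct ages so the denominator is nonzero
float[] ages = { 3600, 86400, 604800, 31536000 };  // 1 hour ... 1 year

float brightnessFor(float age) {
  float lo = log(min(ages)), hi = log(max(ages));
  float t = (log(age) - lo) / (hi - lo);  // 0 = newest, 1 = oldest
  return lerp(100, 40, t);                // HSB value: recent = bright
}
```

The log keeps a folder untouched for a year from flattening the distinction between files modified an hour ago and a week ago.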

Tuesday, December 1, 2009

Zip Code Project

A couple days ago I finished reading the Processing Handbook, which gave a very solid overview of the Processing language. I now have a good feel for what the language is capable of doing, and thus a better understanding of what I can put to use in my own visualizations.

After the diversion into theory, I finished the Zip Code project in Visualizing Data. In this example, we create a faux population-density map by plotting the latitude and longitude of the center of each zip code in the US. The data pre-processing that went into this project was quite interesting. In addition to putting the data into a friendly text format (no commas, reformatting the city names out of all-caps), we had to convert the latitude and longitude points to a projected view of the US, since this is the view that most people are used to seeing. (Map below and algorithm from here)
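The linked algorithm does the real projection; as a naive stand-in, here is the simplest version of the coordinate mapping (the longitude/latitude bounds for the lower 48 are rough guesses on my part). This linear mapping is an equirectangular projection, which is exactly why real maps need the fancier algorithm:

```processing
// map longitude (west to east) across the window
float mapX(float longitude) {
  return map(longitude, -125, -66, 0, width);
}

// latitude runs north to south down the screen, so the bounds are flipped
float mapY(float latitude) {
  return map(latitude, 50, 24, 0, height);
}
```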


After formatting the data, we arrive at a scatterplot of zip codes. This example gets interesting, however, by allowing the user to type in digits, which in turn highlight the zip codes that start with those digits. Below is the map for "9":
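The highlighting itself boils down to string prefix matching; storing the zip codes as Strings rather than ints also keeps their leading zeros intact. A minimal sketch of the idea (not Fry's actual code):

```processing
String typed = "";  // built up one digit at a time

void keyPressed() {
  if (key >= '0' && key <= '9' && typed.length() < 5) typed += key;
  if (key == BACKSPACE && typed.length() > 0)
    typed = typed.substring(0, typed.length() - 1);
}

// called once per zip code with its projected screen position
void drawZip(String zip, float x, float y) {
  if (typed.length() > 0 && zip.startsWith(typed)) {
    stroke(255, 80, 0);  // matching codes pop forward
  } else {
    stroke(180);         // the rest recede
  }
  point(x, y);
}
```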


If one types in all five digits of a zip code, the name of the town with that zip code is displayed on the screen:


The final component of this code is a zoom function, activated by clicking "zoom" at the bottom right, which zooms the viewing window in on the zip codes matching the typed digits. Here is "4":
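I'd guess the zoom works by finding the bounding box of the matching zip codes and re-mapping the projection onto it (smoothly animated in the real sketch; instantly in this hedged version, with array names of my own):

```processing
float minLon, maxLon, minLat, maxLat;

void zoomToMatches(String[] zips, float[] lons, float[] lats, String typed) {
  minLon = Float.MAX_VALUE;  maxLon = -Float.MAX_VALUE;
  minLat = Float.MAX_VALUE;  maxLat = -Float.MAX_VALUE;
  for (int i = 0; i < zips.length; i++) {
    if (zips[i].startsWith(typed)) {
      minLon = min(minLon, lons[i]);  maxLon = max(maxLon, lons[i]);
      minLat = min(minLat, lats[i]);  maxLat = max(maxLat, lats[i]);
    }
  }
  // mapX()/mapY() would now interpolate against these bounds
  // instead of the full-US bounds
}
```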

Wednesday, November 18, 2009

Processing Interim

I decided that since I have now created several examples of fully-fledged programs in Processing, I should go learn about the full capabilities of the language. To do so, I am taking a break to read through "Processing: A Programming Handbook," the official guide to Processing, written by its creators.

I'm learning a lot about the full capabilities of the language and, perhaps equally important, getting a solid review of key concepts in computer programming.

Thursday, November 12, 2009

Real World Data

The next chapter of "Visualizing Data" addresses real-world data, and how it can often be messy and difficult to parse. As such, most of the information in this chapter deals with background processes. Topics included: sifting through website source code to find the files that actually contained the desired data, using regular expressions to parse data files, and building strings to store the relevant information for future use.
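Processing exposes Java's regular expressions through match() and matchAll(); to give the flavor of the parsing, here is a hedged example against a made-up line format (the real data files are laid out differently):

```processing
String line = "New York Yankees,103,59,201449289";
// capture team name, wins, losses, and payroll from the made-up format
String[] m = match(line, "^(.+),(\\d+),(\\d+),(\\d+)$");
if (m != null) {
  String team = m[1];          // m[0] is the whole match; groups follow
  int wins = int(m[2]);
  int losses = int(m[3]);
  long salary = Long.parseLong(m[4]);
  println(team + ": " + (wins - losses) + " games over .500");
}
```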

The project for this chapter dealt with correlating baseball teams' win-loss records to their salaries. After going through all of the data, we reached this basic sketch:


There wasn't a whole lot of work put into refining the image, but the following version does use color to differentiate the sign of the correlation and line width for its magnitude, along with a couple of spacing and typography improvements:

Tuesday, November 3, 2009

Time Series Graphs

Today I worked through Chapter 4 of Visualizing Data, titled "Time Series." In this chapter, we got into the actual mechanics of creating a solid graph, one that contains all of the components necessary to clearly illustrate the data. Half of the chapter covered nitty-gritty details, such as axis labels, tick marks on the axes, and small lines on the graph to give the viewer a sense of scale.

In addition, the chapter covered various methods of actually presenting the data in question: a series of points, a series of line segments, a smoothed line, a combination of points and lines, a solid-color area, or a bar graph.
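Each of these presentation methods is just a different way of sending the same (year, value) pairs to the renderer. A condensed sketch of three of the modes, with made-up data (press any key to cycle):

```processing
float[] years = { 2000, 2001, 2002, 2003, 2004 };
float[] vals  = { 22, 30, 26, 35, 28 };
int mode = 0;  // 0 = points, 1 = line segments, 2 = solid-color area

void setup() {
  size(400, 300);
}

void draw() {
  background(255);
  if (mode == 2) {  // filled area under the curve, closed along the bottom
    fill(200, 220, 255);
    noStroke();
    beginShape();
    vertex(plotX(years[0]), height);
    for (int i = 0; i < years.length; i++)
      vertex(plotX(years[i]), plotY(vals[i]));
    vertex(plotX(years[years.length - 1]), height);
    endShape(CLOSE);
  }
  stroke(0);
  noFill();
  if (mode == 1) {  // connected line segments
    beginShape();
    for (int i = 0; i < years.length; i++)
      vertex(plotX(years[i]), plotY(vals[i]));
    endShape();
  }
  if (mode == 0) {  // a series of points
    for (int i = 0; i < years.length; i++)
      ellipse(plotX(years[i]), plotY(vals[i]), 6, 6);
  }
}

// map data coordinates into the plot area, leaving margins for labels
float plotX(float year) { return map(year, 2000, 2004, 40, width - 20); }
float plotY(float v)    { return map(v, 0, 40, height - 30, 20); }

void keyPressed() { mode = (mode + 1) % 3; }
```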



We even added a rollover function that displays the value of a data point when the mouse hovers over it.


The original graphs drew from a table of three different subjects: Coffee, Tea, and Milk. Originally the user could cycle through the different graphs with the "[" and "]" keys, but in the final exercise we added tabs up top that respond to mouse clicks.
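Both the bracket keys and the tabs amount to bookkeeping on a single column index. Roughly (the tab geometry here is a guess, not the book's layout):

```processing
String[] tabs = { "Coffee", "Tea", "Milk" };
int current = 0;  // index of the column being graphed

void keyPressed() {
  if (key == '[') current = (current + tabs.length - 1) % tabs.length;
  if (key == ']') current = (current + 1) % tabs.length;
}

void mousePressed() {
  // hypothetical geometry: three 80px-wide tabs across the top
  if (mouseY < 25) {
    int hit = mouseX / 80;
    if (hit < tabs.length) current = hit;
  }
}
```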

Wednesday, October 14, 2009

Today I Met Processing

Today I started on the O'Reilly book "Visualizing Data," written by Ben Fry. I breezed through the first couple of introductory chapters and then started to get my feet wet in Processing. Fry's approach is to provide the reader with the data, files, and source code to get an example project up and running, and then to teach new aspects of programming by adding to this example. Processing is neat! Its creators set out to build a visual programming language that follows the form of a scripting language. I've never done anything involving scripting languages, so it was fun to see a different approach to code.

At the end of my reading today, I was up to a map of the US (provided) with a data point plotted at the center of each state (data provided), the point's color and size corresponding to the sign and magnitude, respectively, of a (provided) table of random numbers. When the user mouses over a circle, it displays the value of that point, along with the name of the state (table provided). Neat!
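The heart of that sketch is one loop over the table rows; the names and value range below are my own stand-ins, not Fry's:

```processing
// x[i], y[i]  screen position of state i's center (from the location table)
// value[i]    the random number for state i, assumed here to lie in ±10
void drawDataPoints(float[] x, float[] y, float[] value) {
  for (int i = 0; i < x.length; i++) {
    float radius = map(abs(value[i]), 0, 10, 2, 20);  // size from magnitude
    if (value[i] >= 0) fill(100, 100, 255);           // positive: blue
    else               fill(255, 100, 100);           // negative: red
    noStroke();
    ellipse(x[i], y[i], radius * 2, radius * 2);
  }
}
```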


Fry also explained the approach that he takes to creating a data visualization, and categorized several of the Processing functions used according to these categories. I look forward to learning more of them!

Tuesday, October 13, 2009

To Clarify, Add Detail

Today I read Edward Tufte's "Envisioning Information," which is intended as a sequel of sorts to his "Visual Display of Quantitative Information." While largely addressing the same body of material, EI is less axiomatic than VDQI, tending towards the "immersion technique" of teaching. In short, it is the coffee-table version.

Nevertheless, EI did provide a solid review of the concepts presented in VDQI, along with more examples of Tufte's principles in action. For the purposes of this thesis, I imagine it will largely serve as a source of information for my visualizations. In learning Tufte's ideas a second time, I do feel like I reached a deeper understanding. Some highlights:

- The utmost importance of multivariable comparisons of data, and the way that a sparse display raises questions about the creator's intentions. To create the most effective visualizations, I should strive to increase data density. (I liked that this idea was presented as "escaping flatland," a nice reference to a well-known math book.)

- The role of color within a visualization. Bright colors should be used sparingly, to avoid visually overwhelming the viewer. While this was fairly intuitive, Tufte also recommends against white backgrounds and against relying too heavily on black, opting instead for a neutral grey color scheme, with bright colors reserved for emphasis on top of that base. I personally lean towards clean black-and-white designs, so this is something new that I will have to incorporate into my work. The muted color scheme does strike me as a little dated, however, so I will also have to find a way to reconcile these two ideas.

- Given the pop nature of this book, I did find several visualizations that completely captivated me. I am intrigued by the idea of movement notation, which is a symbolic representation of dance choreography. My absolute favorite, though, was learning about Oliver Byrne's visual reinterpretation of Euclid's Elements, available online here.

Thursday, October 8, 2009

The Revelation of the Complex

While the first half of "The Visual Display of Quantitative Information" addressed data visualization in the practical realm, the second half of the book approached it from the theoretical one. The chapters addressed the process of creating visualizations from the particular (eliminate moiré vibrations, avoid grids) to the philosophical (maximize data density when working with large data sets). I wanted to point out a few particularly interesting points:

- Quartile Plots
The "box plot" is a standard sight in statistics classes, but Tufte provides an alternative way of depicting the same idea, all in the name of cutting down on non-data-ink. The alternative, a quartile plot, astounded me with its visual simplicity while still effectively conveying all of the information. (See the sketch after this list.)
- Different levels of depth found in graphics
"Graphics can be designed to have at least three viewing depths:

1) what is seen from a distance, on overall structure usually aggregrated from an underlying microstructure

2) what is seen up close and in detail, the fine structure of the data

3) what is seen implicitly, underlying the graphic - that which is behind the graphic"


- Shrink Principle
The idea of the shrink principle is that effective graphics can be shrunk way down and still retain their information. Tufte also included an illustration from famed visualist Bertin, who demonstrated several techniques of shrinking data while maintaining the given relationship between variables. I find this beautiful.
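Here is the quartile-plot sketch promised above, as I understand Tufte's redesign (a hedged reconstruction with a made-up five-number summary): the box disappears, the whiskers remain as thin lines, a dot marks the median, and the interquartile range is simply the gap.

```processing
void setup() {
  size(200, 300);
  noLoop();
}

void draw() {
  background(255);
  // min, first quartile, median, third quartile, max (made-up values)
  quartilePlot(100, 5, 25, 50, 75, 95);
}

void quartilePlot(float x, float lo, float q1, float med, float q3, float hi) {
  stroke(0);
  line(x, plotY(lo), x, plotY(q1));  // lower whisker
  line(x, plotY(q3), x, plotY(hi));  // upper whisker
  noStroke();
  fill(0);
  ellipse(x, plotY(med), 5, 5);      // median dot; the IQR is the gap
}

float plotY(float v) { return map(v, 0, 100, height - 20, 20); }
```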

Monday, October 5, 2009

Data Visualization Standards

To start off my project in data visualization, I began reading Edward Tufte's "The Visual Display of Quantitative Information." The book is split into two halves, Graphical Practice and Theory of Data Graphics. Graphical Practice deals with data visualizations from a practical, real-world perspective, specifically addressing their unique ability to convey complex relationships between data variables, and how often this ability is exploited in modern journalism. As a student who was planning to approach this project from a purely theoretical perspective (primarily due to time constraints), I was pleasantly surprised to encounter these ideas as my introduction. Tufte lists six guidelines for data visualization integrity, as follows:


- (p. 55) The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.
- (p. 55) Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
- (p. 60) Show data variation, not design variation.
- (p. 67) In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
- (p. 70) The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
- (p. 73) Graphics must not quote data out of context.


I have a friend working on his master's in journalism who is currently trying to create a set of minimum standards for journalistic integrity (Jonathan Stray, Journalism Commons). My friend's work and Tufte's guidelines dovetail nicely, and I was pleased to have a reminder of the real-world applications and repercussions of data visualization before heading into the theory.