Data Sketch|es: A Visualization A Month – Shirley Wu and Nadieh Bremer

>>Thank you very much, Irene.>>Yes!>>Yes, cool. Good morning, everyone. My name is Shirley and this is...
>>Nadieh.>>And we are super excited to be back at OpenVis
Conf this year to talk about our project, data sketches. We met online in a data visualization in 2015
and didn’t meet each other in person until OpenVis Conf in Boston last year. Those are our happy faces. And at OpenVis Conf we had the pleasure of
giving talks and hanging out for three whole days, where we hit it off super well, so that
two months later, when Nadieh put up her SVG tutorials from her OpenVis talk, I dived
in with vigor. And I started chatting with Nadieh about the
tutorials, and that conversation led into us lamenting the fact that we hadn’t finished
as many full-blown visualization projects as we would like. And then I had an idea. And I was like, hey, Nadieh (fully expecting
to be rejected), do you want to collaborate on something? And she’s like, yes! Yay! And that’s how data sketches was born.>>In the following week, we both liked the
idea that we would create a visualization around the same topic and do that for a year. See how two people create two visualizations
starting from the same seed, the topic, but diverging into different paths based on our own
interests and history. We want to share the histories and write about
the process. It’s the three pillars that are most important: data, sketching, and coding. And initially we thought we could pull data
sketches off in five to six hours a week. But real life doesn’t agree with plans. Especially coding plans. Starting in July of 2016 we have clocked many,
many hours into creating a visualization each month. And during this talk we’d like to talk about
the lessons learned, the challenges, and the insights we gathered along the way.>>So let’s start with the data. We often get asked this question, do you get
the data first and then come up with your ideas, or get the idea first and go and find
the data from that? And for us, the answer is always, always idea
first. So, for example, for my November data sketches, I wanted to look at every line in the musical,
Hamilton. I could filter by any set of characters as
well as their conversations and any themes and then be able to dig into the set of lines
from the songs that were left over. And the idea came from a question of how do
the relationships between the characters change throughout the whole musical? And then what are the reoccurring phrases
and who are they associated with? Now, as you can imagine, this data set is
not available anywhere online save for the lyrics themselves. So I had to go through all of the lyrics and
note all of the reoccurring phrases that appear across more than one song. Group them into broader themes. Go through the lyrics manually again so that
I can enter them into the computer, associating them with the right song and line numbers. And also do the same thing for the characters
and conversations. Write the script to aggregate all of that
information together to get the final data set. And a more extreme example: for October
data sketches I wanted to put emojis on the former... Mr. and Mrs. Obama’s
faces and built this tool for exploration where you can go through any of the videos
I found of late night talk show interviews and then go through the whole entire YouTube
video to look at the emojis that I put on their faces. And this idea came from a conversation with
Eric Cunningham, right there, where he was like, wouldn’t it be cool if you could just
run facial detection on the videos and correlate the emotions with what they are saying. Hey, Eric, you realize I only have one month
for this project. And then I was like, challenge accepted. And so, I started with, first, manually gathering
all of the late night talk show appearances off of IMDb. I then went and found all of the videos corresponding
to those talk show interviews from the hosts’ channels. I used a Node package to download all the videos
and the captions and got the timestamps from the captions so I could take a screenshot every
time somebody talked. I uploaded those to the Google Vision API because it
gives me information about the faces, their boundaries and emotions, how happy or angry they are,
and if they’re wearing a hat. And I took that data and aggregated it with the
caption data to get my final data set. So what these two months taught me was, if
I’m just curious, if I have a curiosity, there is some way I’m going to be able to
get my hands on the data set, whether it’s to manually go through and enter
the data, or write a script to automate it. I think there’s a Node package for literally
anything I can imagine under the sun. As long as I do all of this responsibly and
legally. [ Laughter ]
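
A minimal sketch of the aggregation step described above, written in Python rather than the Node tooling mentioned in the talk: each screenshot's face annotation is matched to the caption being spoken at that moment. The record shapes, field names, and emotion labels are assumptions made for illustration; they are not the Google Vision API's actual response format.

```python
from bisect import bisect_right

# Hypothetical caption records: (start time in seconds, caption text),
# sorted by start time.
captions = [
    (0.0, "Welcome back to the show."),
    (4.5, "Thank you for having me."),
    (9.0, "So, tell us about the garden."),
]

# Hypothetical face annotations, one per screenshot. The "emotion" labels
# stand in for whatever a face-detection service would return.
faces = [
    {"time": 4.6, "emotion": "joy"},
    {"time": 9.2, "emotion": "surprise"},
    {"time": 2.0, "emotion": "neutral"},
]


def attach_captions(captions, faces):
    """Pair every face annotation with the caption active at its timestamp."""
    starts = [start for start, _ in captions]
    merged = []
    for face in sorted(faces, key=lambda f: f["time"]):
        # Index of the last caption that started at or before this screenshot.
        idx = bisect_right(starts, face["time"]) - 1
        if idx < 0:
            continue  # screenshot taken before the first caption
        merged.append({**face, "caption": captions[idx][1]})
    return merged


rows = attach_captions(captions, faces)
for row in rows:
    print(row["time"], row["emotion"], "|", row["caption"])
```

Because the captions are sorted by start time, a binary search finds the active caption for each screenshot without scanning the whole list.
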
>>Well, thankfully, not every month is as data intensive as that. So for August the obvious theme was the Olympics. Especially since we’re both big fans. I decided to visualize all 5,000 medal winners
since the first Olympics. Each circle is a group of sports, like water sports or ball
sports. And each slice within a circle represents
one. The reddish backgrounds are female events,
and the bluish are male events. And each medal is given the color of the continent
of the country that won the medal. The Americas are red, Europe is blue, and so on. And you can see here there were no female
events in the first editions of the Olympics, but they have been catching up since then. So I actually found the data for this piece
in two articles published around the 2012 games in London. I noticed obvious medals were missing, like
hockey from 2012. So my confidence in the data set dropped,
even though it came from a respectable source like The Guardian. I had to get a sense of the accuracy of the
data set. But I didn’t want to go through all 5,000
medals manually, maybe Shirley would have. So I found a proxy instead. On the Wikipedia, I could find the number
of events that happened. And I compared that to the number of gold
medals. For some of the years the horses were in the
data set winning golds. Which makes an interesting read: Princess,
Sissy and Lady winning gold in the Olympics. I worked through each of the adjustments to get
it to the point where I trusted it again. So my lesson here was, even if I have data
from a respectable source, I need to get a sense of its accuracy and completeness. Missing data can be harder to find than wrong
data. You don’t have to check every value, but think
about sums and counts and averages and compare those to plain sums. Or even better, to a different data source. So many people dive straight from data to
final visual. But take some time and actually sit back and
sketch out your ideas on paper. We have filled many pages of our notebooks since starting,
because it helps us think and lay out ideas beforehand. But my sketches often are very simple. Only focusing on the main abstract shape that
I want to fit my data into. Colors and layout and design, these are things
I only vaguely think about, but don’t act on until I have the data on my screen. There’s just no use for me to think about
these things until I figure out that the data works after I have morphed it into a shape. For the Olympics, I had the idea of feathers, placing emphasis on the more recent editions. But I had no idea if that would look all right
when I placed all 5,000 together. I had to see how it would look. It took a few steps before I saw that luckily
it worked out with the data. But sometimes there’s no use in even starting to
sketch on paper. Although I will say that’s very rare for me. But networks are an exception. And for October I decided to dive into royalty. I have been intrigued by how intermarried
the royals are. Are they all cousins twice removed? I found a genealogy of the royal houses. It was from 1992, so I had to add one or two
more into the line of succession. Which was a fun night on Wikipedia. Not. So here’s all 3,000 people in the family tree. The biggest circles are the current rulers. And everybody is connected to the parents
or partners or children. And you can hover over everybody to see the
six degrees of separation and how they reach into the Web. But you can click on any person. Let’s see if I can get it to work here. And any other person, to see the shortest
path between these two people. Because the entire Web is connected. They’re all family in one way or another. But when I started out with this data set,
I had no idea what it contained. So I just sort of plotted everybody using
the most basic network settings. And then this happened. An explosion of points and lines going out of
my screen. So I pulled in gravity a bit (could have
used D3 express) and then I ended up with a useless hairball. Color the points by year of birth. Still not helping. You can have gravity depend on variables. So I pulled the graph apart by year of birth
as well. Which was better, but still a rather uninsightful
bundle. And at this point I had invested several hours
playing with the network settings, adjusting the data. And I was really at a point where I thought
about giving up. Maybe a different angle, how much they’re
spending each year. But I gave it one last shot and decided to
focus on the current royal leaders. I placed them in a line, and let the vertical gravity
depend on the one you were most closely related to. Insights: the Queen of Denmark is central,
but the Prince of Monaco’s line separated from the rest 200 years ago. And it was around this time that I started
thinking about the general design aspects. And networks often remind me of constellations. And with my astronomy background, I have a
bias for all things space. So I turned it into a starry night. But I could have never designed this visualization
beforehand or sketched it. I had to go hand in hand with the actual data
and apply the design choices to all the data simultaneously so that I could see if
the results were both interesting and engaging.>>So Nadieh gave a really great example of
when you land on the right visualization early on. But that’s not always the case. For our March data sketches we had the opportunity
to work with Google News Lab and their data going back to 2004. Which, by the way, launched this morning. Please check it out. So with access to all of that data, I wanted
to look at what people were searching for and specifically what people in a country
search for around the world. So each of these blocks is a topic that the
U.S. searched for in spring. And I can toggle between all the different
weather seasons and see the topics as well as dig into a specific country and look at
top places there. And I can also expand on the topics so I can
see the search interests and seasonality for that particular topic. And the question I really started out with was,
what are the top searched countries? Which turned out to be Brazil. And then I was like, who is searching for
Brazil? Can I see actually what kind of topics are
being searched for around the world in each country, including Brazil? And can I see the distribution of those searches
across the years? And then let me actually get a little bit
greedy, because I want to see the search interests also. Okay, maybe not. Okay. So let’s step back and let’s try for circles
for each of the topics and size them according to the search interest, and maybe I can show
the countries searching for each of those topics by overlapping circles. And this is kind of pretty and bubbly. So I kind of like it. But does geography play a part in who searches
for a country’s topics? So let’s maybe try sorting all of the topics
by distance for a year. And that doesn’t look really great. So maybe I can just concentrate on one topic
over all of the years. Maybe that lends itself to a heat map. Maybe not. This is not going anywhere. I need to step back. But I did notice in an earlier exploration
that seasonality is quite common in a lot of the topics. So maybe I can keep sorting by distance. That geography is quite interesting. Maybe this time around I can start to filter
and group by the seasons. And that sounds quite promising, right? Nope. So it turns out that all the topics are searched
for across all of the seasons, making all the bar graphs look exactly the same. That’s pretty sad. But wait! Then I realized that because there is seasonality,
the search interest is actually different across all of the seasons, so if I could just
size the heights of the blocks by search interest, I actually start to get very interesting
insights like this one, that the U.S. searches for travel more often in the spring than in
the fall. And finally, finally, tada! My final visualization form, which I’m quite
happy about, where each of the topics is grouped by the country that it belongs to. And there’s interesting insights like, for
example, if we just look at the topics for summer, Mexico is actually not searched for
that often, but Canada is. And the hotter countries around the world
like Thailand and the Philippines aren’t searched for that often. But if question go instead to winter, the
opposite is true. Where Mexico peaks and Canada drops. And Belize and Thailand and Philippines, the
hotter countries actually go up in search interest. So the last thing here was sometimes we don’t
get the right visual from our first switch or second or even third try. But be patient and go back and forth with
the sketch and code. That will help figure out what works, and
more importantly, what doesn’t work, so that we can go on to our next step.>>So as expected, most of our hours are actually
spent on getting the data on the screen. And here are some of our maybe less obvious
coding lessons. So in the very first month, the topic was
movies. And therefore, it was immediately clear to
me I wanted to do something with the Lord of the Rings. And I found a super interesting data set that
had the number of words spoken by each character in each scene in all three extended editions of Lord of
the Rings. Amazing. So I decided to focus on the members
of the fellowship and see how much they spoke at each location. Not surprisingly, Gandalf speaks the most. But Boromir, who is only alive in the first
movie, managed to speak more than Legolas does in three. But anyway, when I looked at the sketch for
this project, I found it was very similar to a chord diagram. I could start there and slowly transition
the chord diagram to the sketch. And I wanted it to flow inward, and it took less
time than anticipated, which is very rare for me in coding. But it worked. Getting rid of the excess space. And now it’s ready to handle the Lord of the
Rings data. And more appropriate colors. So we have nine members of the fellowship. Making sure the centers are ending up with
the right vertical location. This is looking squished. So I used the same information and pulled
the two halves apart. And now the chords are looking rather unnatural. I decided to dive into learning SVG paths. That took the longest. Sort of figuring out how to make the paths
look more natural. That’s how the new D3 loom came into existence,
mutated from the D3 chord diagram. And many people have done wonderful work that
you can use. Even if you think you are creating something
new, you don’t always have to start from scratch. Pick the thing that lies closest to the design
or idea and start with that. It’s out there already, maybe.>>But sometimes we dream up visuals that
are unique enough there’s no basis to move off of. For that same movie month, I wanted to look
at top summer blockbusters in the last two decades and reimagine them as flowers. So each of the colors is associated with
a genre. And the size and number of the petals are based on
their IMDb ratings. There are some really, really beautiful flowers
in here, I think, like “The Dark Knight Rises” and “Slumdog Millionaire.” But my absolute favorite is the 1997 “Batman
and Robin,” this tiny thing which I think is super cute. And I have gotten questions, how did you make
this? How long did it take? It’s really quite simple. Just takes a good grasp of SVG paths and the
cubic Bezier curve command. We start out with the starting point, in my
case, zero, zero. And draw a line between the starting point
and the purple end point. And then we take the two anchor points, the blue and the green, and nudge them out until we get the curve
that we want. And then I drew some more lines, made
the curve on the other side, rotated the petals out, and added the colors with some motion blur
and that’s it. That’s all that it took. So the lesson here really was when we’re creating
things, really understand the tools that we’re using. That’s how we can go beyond the prescribed
examples. In particular, our favorite tool is SVG paths,
because with that under our belt, we can make any shape that we can imagine, for truly unique
results.>>Now that you have seen two examples of
adjusting paths, what about their positions? Well, going back to the Olympic feathers,
all of the circles and slices depended on each other. But they were very structured. They all followed the same concept. And at first I tried to calculate all of the
rotations of the circles and slices in JavaScript. But after having written like 30 lines of
code and still not achieving something I knew I could do in R in two lines, I just pulled
all of these preparations into R as well, even though they were visual variables: they
have nothing to do with the data, only with how things are laid out on the screen. So, for example, I looked at the rotation
each of the circles would need to have so that eventually the center would end up at the bottom. I precalculated the slices they would need
to have based on their predecessors. The only variable left in JavaScript, to keep it
economic, was the scale from the center outward, so I could scale based on the screen size. And even the medal offset is something I precalculated
in R. And you can do this as well with networks that are static and fixed. Download the final X and Y locations and the
next time place them immediately, saving your viewers from having to run and wait on a heavy
force algorithm. Even though they have nothing to do with the
data, it’s perfectly fine to precalculate visual variables and attach these to your
data set. That’s more often the case for fixed data
sets than you may think. Sometimes it’s way easier to calculate outside
of JavaScript. Or it can save you a lot of browser calculations,
making it faster to load. And as a bonus, it will make your JavaScript
file a lot simpler as well.>>So far, we have talked about the initial
80%, the data visualization, the ideation, the visualization itself. But I like to think the last 20% is important
as well. So when I started thinking about the story
for Hamilton, I wanted to reach a wider audience than usual. And that meant that I wanted to make sure
they were engaged enough that they would keep scrolling down the screen. So the first thing I did was have the dots
fly into the center to form the Hamilton logo. And as the user scrolls down, the dots fly
apart and dance in the background and come back together so I can tell the reader, hey,
each of these dots are actually the lyrics. And as they go into the first section of analysis,
each of these sections actually corresponds with a song. So then I highlight the correct song to tell
them what the song is for each section. The next thing I do, or one of the other
small things I do, is if the user decides they want to click on a song. That worked. Cool. I didn’t expect the sound to work. You can see I put a progress bar on the right-hand
visualization. And you can’t see much else. Okay. But it gives users the context of where the
music is relative to the music itself. Another small example is for our March data
sketches where I used animations to explain how to read parts of the project. But I don’t actually trigger these animations
until the user has scrolled into that section. So that they can always start from the beginning
of the explanation, no matter how far or how fast they’re navigating. And it’s this small attention to detail and
attempts at delight that really make a piece for me, because it tells the reader that we
really care about their experience.>>And some more examples of what you can
do with delight: while on a flight back to Amsterdam I was without WiFi. So I couldn’t do anything essential. And therefore, I decided to animate the legend
I had for my visualization about fantasy books just for fun. And other things, adding animated GIFs of
the most memorable moments of Dragon Ball Z. That took like two hours, adding the GIFs. Or adding the hover for the music visual. Or turning the top ten songs into tiny vinyls. Or having annotations about weird events in
the history of the Olympic games. Like Henry Pearce having to stop for ducks
in the rowing event, but still managing to win gold. So apart from getting the data on the screen in the right
manner and making it insightful, it is the other things, animations, annotations, weird legends,
GIFs and more, that can make it truly unique and special and more of a delight to investigate. So take some time to think about these aspects
as well.>>And now we get to what I call the soft
stuff, or what I call the best stuff. When we started talking about data sketches,
we didn’t expect the reception we have had. We thought if we had fun, maybe learned some
things, and if our friends enjoyed the project, that would be really cool. But we have gotten the most amazing responses
on both our visualizations and the value of our writeups. And we have gotten to meet incredible people
and talk to them, which we would never have had the opportunity to do otherwise. And we have gained an amazing friendship with
each other that we didn’t expect at the beginning.>>We just wanted to have fun.>>But when we stepped back and really thought
about the transferable lessons, we agreed that the most important was, if you’re about to take
on an ambitious project, make sure to bring somebody along for the ride with you. And make sure that, if you’re not too responsible
like me, that the person that you bring along is very responsible so that you can keep each
other accountable and relatively on track for your project. Make sure that it’s someone that you really
respect. And hopefully that respect is mutual. And most importantly, that it’s someone that
you trust or can grow to trust, because that’s absolutely crucial to receiving and giving
feedback. And finally, if you’re about to do something
ambitious, like make a visualization from scratch every single month, know that it will
be hard. There have been months where we have been
creatively drained and didn’t know how to go on. But remember, you learn as you struggle. And it’s absolutely amazing the amount that
we’ve learned both technically and personally. And it’s been absolutely worth the time.>>So over the last ten months we have learned
to find data in the weirdest places. That it’s not blasphemy to precalculate visual
variables. That sketching can help weed out errors, but
you can also sketch with code. And SVG paths are amazing, and math too. But we knew that. And small things can add a sense of delight
for your audience. And we didn’t set out to learn, but we
set out to have fun. We have succeeded. There are times where we were coding into
the night and would have rather been watching a TV show. But it’s opened up opportunities that we weren’t
looking for but have gotten. Two or three more months? I can assure you we cannot keep on creating
visualizations at the same breakneck pace in our own time. But we want to share data sketches. And we have had great reactions, especially
about the write-ups. So we will make a publication on Medium. And anyone who wants to share his or her writings
can do that here. You can do one a month, collaborate with others
on the topic, but it’s fine if it’s a standalone project. The main point is how your final visualization
is a product of iterations, mistakes, and improvements. So please let us know if you ever have anything
to contribute. Meanwhile, we hope you will join us in our final
two months of data mining, sketching, finding weird, fun, and overly elaborate visualizations. Thank you.>>Thank you. [ Applause ]
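
The proxy check Nadieh describes for the Olympic data, comparing the number of gold medals per edition against the number of events listed by a second source, can be sketched roughly as follows. This is only an illustration in Python (the talk used R); the sample rows, the reference counts, and the function name are all made up for the example.

```python
from collections import Counter

# Hypothetical medal rows: (edition year, event, medal).
medals = [
    (1896, "100m", "gold"),
    (1896, "100m", "silver"),
    (1896, "marathon", "gold"),
    (1900, "100m", "gold"),
    (1904, "100m", "gold"),
    (1904, "dressage", "gold"),
    (1904, "dressage", "gold"),  # e.g. a horse listed next to its rider
]

# Hypothetical reference counts of events per edition from a second source.
events_per_edition = {1896: 2, 1900: 3, 1904: 2}


def gold_count_gaps(medals, events_per_edition):
    """Return {year: golds found - events expected} for editions that disagree."""
    golds = Counter(year for year, _event, medal in medals if medal == "gold")
    return {
        year: golds[year] - expected
        for year, expected in events_per_edition.items()
        if golds[year] != expected
    }


gaps = gold_count_gaps(medals, events_per_edition)
print(gaps)
```

A negative gap hints at missing medals and a positive one at duplicates or extra rows. Team events and ties can legitimately produce more than one gold per event, so a flagged edition is a prompt to look closer, not proof of an error.
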

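As a rough illustration of the petal technique Shirley describes (start at zero, zero, draw to an end point, nudge two anchor points outward, mirror the curve, then rotate copies around the center), here is a small Python sketch that builds the SVG markup as a string. The anchor point positions, function names, and styling are assumptions chosen for the example, not the values used in the actual film flowers, and the colors and motion blur are left out.

```python
def petal_path(length, width):
    """Closed petal from (0,0) to the tip (0,-length) and back.

    Each side is one cubic Bezier curve ("C" command) whose two anchor
    points are nudged sideways by `width`.
    """
    one_third = -length / 3
    two_thirds = -2 * length / 3
    right = f"C {width:g},{one_third:g} {width:g},{two_thirds:g} 0,{-length:g}"
    left = f"C {-width:g},{two_thirds:g} {-width:g},{one_third:g} 0,0"
    return f"M 0,0 {right} {left} Z"


def flower_svg(num_petals, length, width, size=200):
    """Rotate copies of one petal around the center of a square SVG."""
    step = 360 / num_petals
    d = petal_path(length, width)
    petals = "".join(
        f'<path d="{d}" transform="rotate({i * step:g})" '
        f'fill="none" stroke="black"/>'
        for i in range(num_petals)
    )
    half = size / 2
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<g transform="translate({half:g},{half:g})">{petals}</g></svg>'
    )


path = petal_path(90, 30)
svg = flower_svg(6, 90, 30)
print(path)
```

Writing `svg` to a file with an `.svg` extension and opening it in a browser shows a six-petal outline; changing `width` moves the anchor points and makes the petals rounder or slimmer.
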