One of the biggest challenges for nontechnical and business users in producing social data visualization is deciding which visual should be used to represent the data accurately. Maybe it’s not so much the accuracy as it is the clarity. But why do it at all? As humans, we tend to think visually (at least a good number of us do). Sometimes concepts or ideas can best be described with pictures rather than just words or verbal discussions. Discussing the differences between 10 different data points is interesting, but if we can convey that thought with a simple picture, why not do that? The idea is that there might be a huge difference between point 1 and point 6 but not so much between point 1 and point 3. In this case, how much easier would it be to just see those differences?

Visualizing the information we’ve gathered allows the consumer of that information to potentially see things in our results that might have otherwise gone unnoticed. Any representation of raw data can convey information, but without a visualization, we could miss out on trends, behavior patterns, or dependencies. Visualizations give us answers faster. Looking at a graph and identifying a trend can take but an instant. However, imagine how much time it would take to scan rows of numbers and pick out that same trend.

A data analytics project isn’t complete until the results we’ve collected are packaged or presented in a way that truly helps the receiver of those insights make decisions or take action based on our work. That means we, as analysts, not only need to know what the data means but also need to be able to represent it in a way that can best convey the information we inferred from it so that our users can derive conclusions, which in turn can drive business outcomes.

The options for data visualization seem to be growing almost by the day. New technologies and techniques can turn number-intensive reports into bright, colorful 3D interactive graphics. But we must be careful. In our zeal to create the prettiest of charts or sexiest rendering, we may lose the message we are trying to convey. Don’t get us wrong: Compelling graphics are a wonderful tool in expressing findings, but overuse (or overindulgence) may push the audience from the realm of understanding into the world of confusion. The choice of how to deliver results will depend on clients’ needs.

Visualization should help tell the story, not drown it out. In this chapter, we discuss some of the different types of visualizations to consider when presenting the results of an analysis. There isn’t a single “right” answer to the question of what kind of a graph provides the clearest insight to a user. There are some best practices when it comes to selecting color or limiting information on a chart to maximize the impact of a message, but it’s ultimately about the insights or additional information that was discovered. Many analysts opt for the pretty charts or the snappy presentation, but we must always keep in mind that it’s the results that count, often more than the presentation itself.

Common Visualizations

In his Harvard Business Review article titled “The Three Elements of Successful Data Visualizations,” Jim Stikeleather [2] outlines three areas of concentration that visual designers should consider when creating graphic visualizations. These considerations are

■ The design should understand the audience.
■ It should set up a clear framework.
■ And, probably most important, it should tell a story.

What these considerations really boil down to is this: clarity. Effective visualizations should not confuse the audience but should present a story (or an insight) in a clear, simple-to-understand way that helps the audience understand the conclusions that were drawn from the data. Or better yet, these visualizations should enable them to understand the data in such a way that they can discover new insights or relationships on their own.

Let’s look at some of the simpler types of graphics first, many of which we’ve used throughout this text. The graphs we create should aim to simplify the data in a visually appealing way. The main challenge that many people have with graphs is choosing which chart type to use. We all want a visually appealing chart to present to a customer, but remember, if a chart is visually appealing, we must be careful that we don’t spend more time talking about how “cool” the chart looks than about the message it is trying to convey.

Let’s look at a few chart types.

Pie Charts

Pie charts are best used to illustrate the breakdown of a single dimension as it relates to the whole. Basically, when we want to look at the value of a specific dimension in relation to other values in that same dimension, we could use a pie chart to easily visualize it. Pie charts help us see, with a quick glance, which attribute in a series of data is dominant or how any individual attribute or set of attributes relates to the others. Consider the graph in Figure 13.1 that depicts the number of social media posts over a 24-hour period by five specific users.

In other words, it is best to use pie charts when we want to show differences in a specific group based on one variable. In the example in Figure 13.1, we collected the number of times these users used the word “cloud” in their social media postings. It is important to remember that pie charts should be used only with a category or dimension that combines to make up a whole. In this case, we collected 151 tweets, and clearly user 4 was the dominant communicator with 60% of them.

What makes a pie chart useful is the quick visual comparisons that can be done. Again, without the percentages in the graph, we could quickly see that user 4 is more verbal than all of the other users combined. Or we can see that together percentages for users 5 and 1 are close to the amount of conversation initiated by user 2.
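The part-to-whole arithmetic behind a pie chart can be sketched in a few lines of Python. This is only an illustration: the per-user counts below are hypothetical, chosen so that they add up to the 151 tweets and the roughly 60% share for user 4 mentioned above, and the `shares` helper is our own name, not part of any tool.

```python
# Hypothetical per-user post counts. Only the total (151) and user 4's
# roughly 60% share come from the text; the other values are assumed.
posts = {"user 1": 10, "user 2": 24, "user 3": 12, "user 4": 91, "user 5": 14}


def shares(counts):
    """Return each category's percentage of the whole."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


pie = shares(posts)

# List the slices from largest to smallest, as a reader's eye would rank them.
for name, pct in sorted(pie.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

Because every slice is a share of the same total, the percentages always sum to 100, which is why a pie chart suits only categories that combine to make up a whole.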

Bar Charts

Bar charts, like pie charts, are useful for comparing classes or groups of data. In bar charts, a group can have a single category of data, or it can be broken down further into multiple categories for greater depth of analysis. A bar chart is built in such a way that the heights of the different bars are proportional to the size of the category they represent. Since the x-axis (the horizontal axis) represents the categories that were measured or being represented, it tends to have no scale. The y-axis (the vertical axis) does have a scale, and this indicates the units of measurement. Figure 13.2 looks at the same set of data we used previously.

With the bar chart, like the pie chart, we are able to easily see that the most prolific social media participant was user 4. What’s slightly more difficult is understanding that together all of the other users’ posts combined don’t equal or compare to those made by user 4. That fact was much easier to see in Figure 13.1. On the other hand, a bar chart does allow us to easily see the user with the smallest number of posts or the user with the biggest number of posts a bit quicker. It must be noted that sometimes the pie slices, because they represent percentages and not actual numbers, make that a bit more difficult to discern. It is a bit easier to compare two bars (consider user 2 and user 3 in Figure 13.2) versus the corresponding pie slices in Figure 13.1. A comparison can be made, but it takes the audience a bit of time to see the difference in a pie chart versus the bar chart.

One danger of using bar charts is the comparison between different graphs or charts. While we didn’t have to scale the data in Figure 13.2, sometimes the data points might be so varied that they need to be scaled to be represented on a graph. The bottom line: Watch out for inconsistent scales across multiple graphs. If two charts are put side-by-side, ensure that they are using the same scale. If they don’t have the same scale, be aware of the differences and how they might trick the eye.
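One way to guard against mismatched scales is to compute a single axis limit from all of the data sets before drawing any of the charts. The following is a minimal sketch under our own assumptions: the function name `shared_axis_limit`, the 5% headroom, and both sets of counts are hypothetical.

```python
def shared_axis_limit(*series, headroom=0.05):
    """Return one y-axis upper limit that fits every series passed in."""
    largest = max(max(values) for values in series)
    # Leave a little room above the tallest bar so it does not touch the frame.
    return largest * (1 + headroom)


# Hypothetical counts for two bar charts meant to sit side by side.
chart_a = [10, 24, 12, 91, 14]
chart_b = [300, 150, 420, 95, 60]

limit = shared_axis_limit(chart_a, chart_b)
print(f"Use 0 to {limit:.0f} as the y-axis range on both charts")
```

Applying the same limit to both charts means a bar of a given height represents the same value wherever it appears.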

A perfect example of a scaling problem was discussed by Naomi Robbins in an online article in Forbes Magazine [3]. Consider the graphic (which we re-created) in Figure 13.3 showing the relative number of medals won, by country, in the Summer Olympics. While the graphic is interesting, it can be quite misleading. For example, if we look at Germany with 500 medals (the data reports 499, but it’s close enough), we might assume that each graphic of a medal is equivalent to 250 medals awarded. That makes sense. But if that were the case, then shouldn’t Russia show 4 medals? It would appear that Russia was awarded around 1,250 and the USA number of 1,975 really should be 1,500 (250 × 6). Clearly, the scale doesn’t work for this graph, and while it was probably trying to show the relative number of medals by country, in the long run, it would probably cause more discussion and confusion when the audience tries to rationalize the numbers.
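The scale check described above is simple arithmetic, and it can be sketched as follows. The 250-medals-per-icon scale is the one inferred from Germany's figure in the text; the `icons_for` helper and the choice to round to the nearest whole icon are our own assumptions.

```python
MEDALS_PER_ICON = 250  # scale inferred from Germany's roughly 500 medals


def icons_for(medals, per_icon=MEDALS_PER_ICON):
    """Number of icons a consistent scale would draw, to the nearest icon."""
    return round(medals / per_icon)


# Medal totals quoted in the text.
for country, medals in {"Germany": 499, "USA": 1975}.items():
    print(f"{country}: {medals} medals -> {icons_for(medals)} icons")
```

Running the same calculation for every country before drawing a pictogram makes it obvious when the number of icons shown does not match the number the scale calls for.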

Line Charts

Line charts are similar to bar charts and at times can seem interchangeable; however, a line chart works best for continuous data, whereas bar and column charts work best for data that is categorized. Think of continuous data of the same dimension that is changing over time (the number of posts made over the past 30 days, the number of mentions of a product in a 24-hour period, and so on). Of course, we can use bar charts to show the change of values for a particular entity over time as well; that may come down to style, but generally speaking, a line chart is much more useful in discerning trends and patterns in data.

Let’s look at an example of the number of mentions of a particular product over a 24-hour period. The data we use is from Table 13.1, which lists the hour (0–23 hours) and the number of mentions made in that hour. A quick look at the table reveals nothing out of the ordinary. At first glance, this looks like a US-based audience (perhaps the Northeast) because social media posts are made throughout the day and a noticeable drop to zero occurs around 2 a.m. to 6 a.m. (when we assume users are sleeping), but no real trends stand out.

A line chart, as shown in Figure 13.4, instantly shows an interesting trend.

It appears that around 13:00 hours and then again at 21:00 hours, discussion about the product or service is at its peak. Clearly, any kind of marketing plan or advertising should take place around these times. But these kinds of trends, while possible to see in Table 13.1, aren’t as obvious as when shown as a line chart.
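The peaks that a line chart makes visible can also be picked out programmatically. The sketch below uses hypothetical hourly counts (Table 13.1 is not reproduced here), shaped to resemble the pattern described: no mentions between 2 a.m. and 6 a.m. and high points at 13:00 and 21:00. The `local_peaks` helper is our own illustration.

```python
# Hypothetical mentions per hour, index 0 = midnight to 1 a.m.
mentions = [3, 1, 0, 0, 0, 0, 0, 4, 6, 8, 9, 12,
            15, 22, 14, 11, 10, 12, 14, 17, 19, 24, 13, 6]


def local_peaks(values):
    """Return the indexes whose value is strictly greater than both neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]


for hour in local_peaks(mentions):
    print(f"Peak at {hour:02d}:00 with {mentions[hour]} mentions")
```

A simple neighbour comparison like this is sensitive to noise, so on real data we would likely smooth the counts first or require a minimum height before calling something a peak.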

Watching this data over time, perhaps over several days, could show a repeating, and hopefully predictable, pattern that could be invaluable to those wishing to engage with potential customers or prospects.

Scatter Plots

While line charts provide a way to map independent and dependent variables that are both quantitative (that is, measurements), a scatter plot can be useful to depict a trend or the direction of the data. When both variables are quantitative on a graph, we can interpret a line that spans that data as a slope (or prediction) of future data or trends. Scatter plots are similar to line charts in that they start with mapping quantitative data points.

In the case of Figure 13.5, we can see that the general trend for tweets made over a 24-hour period is on the rise, or generally increasing over time. Obviously, there are peaks and valleys in the data, but the amount of chatter, or conversation, is increasing.
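The line that spans a scatter plot is usually a least-squares fit. As a minimal sketch, assuming ordinary least squares is an acceptable way to draw that line, the slope and intercept can be computed directly. The hourly tweet counts are hypothetical, and `trend_line` is our own helper name.

```python
def trend_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical tweets per hour over 24 hours, with peaks and valleys.
hours = list(range(24))
tweets = [5, 3, 6, 4, 8, 7, 9, 6, 11, 10, 9, 13,
          12, 15, 11, 14, 16, 13, 18, 17, 15, 20, 19, 22]

slope, intercept = trend_line(hours, tweets)
direction = "rising" if slope > 0 else "flat or falling"
print(f"Trend is {direction}: about {slope:.2f} more tweets per hour")
```

A positive slope corresponds to the generally increasing chatter described above, even though individual hours rise and fall around the line.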

Common Pitfalls

When creating these graphs, we should consider a few things in an effort to keep messages clear and allow the audience to focus on the story or message being delivered, not on the charts and all the pretty colors.

Information Overload

One of the most common issues we’ve seen with charts or presentations that attempt to show the results of a study is that they often contain too much information. Consider a simple chart (such as that in Figure 13.2) where the amount of information is kept to a minimum: the label for the data points along the x-axis and the values on the y-axis. There isn’t much more needed. Often analysts like to augment their graphs with notes that indicate a peak or valley in the data and perhaps why it might have occurred. Is this information really necessary? Isn’t that what a picture is for, to visually show the peak or valley?

Other times we see a pie chart (similar to Figure 13.1) with two or three other graphs that attempt to show a similar concept or provide an alternative way to represent the same concept. If one graph doesn’t describe the concept well enough, we need to ask, “Is that graph really providing value?”

While redundancy is one issue, the side effect of adding too much information to a graph is that in order to fit the additional information, font sizes tend to get smaller, approaching unreadable.

The Unintended Consequences of Using 3D

When creating graphs to depict our data, often we feel that they are sort of dull or uninteresting (how interesting can you make a bar graph that depicts users and number of message postings, anyway?). But we all want our results to look visually appealing with the thought that the presentation could keep our audience’s attention. Often we’re tempted to take a standard graph and turn it into a three-dimensional rendering in an attempt to spice up the message. But this can have at least two unintended consequences.

The representation of the results may be so compelling that the audience misses the message and focuses on the pretty picture. Nothing could be worse! Not only do they not receive the intended message, but in the long run, if this metric is utilized later (say as a descriptive statistic), they may just not understand how it’s used. In essence, they may have focused more on the delivery rather than the message.

One of the considerations we mentioned earlier is that the goal in presenting results is to do it in such a way as to avoid confusion. Nothing can derail a discussion about the meaning of an insight or a metric more than a discussion about the validity of the data. Now let’s look at the charts in Figure 13.6.

In this figure, the same data is plotted in two dimensions (2D) versus three dimensions (3D). While the chart on the right (the 3D rendering) does look a little prettier, the audience may start to question the values on the chart. Look at user 7. In the 2D graph, the value looks to be 800, yet on the 3D graph, the value appears to be below the 800 line. Upon realizing this, viewers may turn the conversation from the value of seeing user 7 as the most prolific to “why is the value not represented correctly?” Inadvertently, this graph has now raised some doubt in the eyes of the audience to the validity of the data being presented.

Using Too Much Color

Another consideration is color. Often analysts go overboard with the use of colors (and font types for that matter) in their graphs. As with many aspects of visual perception, humans do not all perceive color in the same way. Said another way, every user’s perception of an object is influenced by the context (or color) in which it is presented. This doesn’t mean we should stay away from color, but it does mean that color should be used sparingly.

Consider the bar chart from Figure 13.2. If each of the bars were drawn in a different color (say blue, red, green, gray, and purple), does that provide any additional value to the chart? Or does it raise the question “What do the colors represent?” and simply add a layer of confusion to the message? Then again, we have to consider that an audience member who sees a bar in a bar chart as red may assume that’s a problem or an area to concentrate on (we tend to think of red as indicative of a problem area). Don’t use color to decorate the graph. Prettying up a graph might serve a purpose in attracting attention, but from an analytics perspective, it can only distract from what’s important—the data and the insights we are attempting to draw from the data.

Alternatively, if our graph were drawn in black and white and one of the bars was coded in a color (say red), it would stand out and perhaps draw attention to that specific point, which may be the intention. In that case, the use of color adds value to the graph in calling out a specific area that merits discussion.

Visually Representing Unstructured Data

Probably the largest problem in social media analytics is figuring out how to compute and then show relationships in the data and how to visually represent topics of conversation. One of the more frequently used techniques is that of word frequency.

The frequency with which particular words are used in a set of messages can potentially tell us something meaningful about that set of text. Of course, if all of the messages were produced by a single individual, the frequencies may tell us something about the author because the choice of words an individual uses is often not random but purposeful. In our case, we tend to look at the messages from a wide variety of people in the hopes of detecting common signals or messages in that body of data. A depiction of a word frequency report may be useful if we want to determine whether the most frequently used words of a given text represent a potentially meaningful pattern that may have been overlooked in a casual glance at the data.

However, when we look more closely at the graph, two things stand out:

■ There are far too many words to show along the horizontal axis. After about 10 values, the points get placed too close together, and therefore, it becomes difficult to read. This example shows just the top 60 words in a shortened version of what we’ve used previously; typically, we like to look at the top 100 to 150 words.

■ The scale quickly becomes confusing when there are words that have a very high frequency versus those that are average to low. Remember, average to low in the top 60 words is still significant. The point is that it’s difficult to discern the subtle differences between word usage.

For these reasons, we like to use the word cloud version of this analysis (see Figure 13.8).

The word cloud quickly shows the relationship between word counts by making the more frequently used words larger. Since these words are larger and more prominently displayed, they quickly catch the audience’s eyes. But more than that, the audience can draw their own conclusions about the relative use of one word versus another. This approach becomes even more useful when we look at the frequency of word phrases (two or more words used together).
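The core of a word cloud is a mapping from word counts to font sizes. The sketch below assumes a simple linear scale between a smallest and a largest size; real word cloud tools may scale differently, and the counts, the size limits, and the `font_sizes` helper are all hypothetical.

```python
def font_sizes(counts, smallest=10, largest=48):
    """Map each word's count onto a font size between smallest and largest."""
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        word: smallest + (count - low) * (largest - smallest) / span
        for word, count in counts.items()
    }


# Hypothetical word counts from a set of posts.
counts = {"cloud": 40, "data": 25, "storage": 10}

for word, size in font_sizes(counts).items():
    print(f"{word}: {size:.0f}pt")
```

The most frequent word receives the largest size and the least frequent the smallest, so relative usage can be read from the picture without a numeric axis.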

Another consideration when trying to show the relationship between words used is to remove the clutter in the data. This clutter is often referred to as stop words. For the English language, this includes words such as a, the, is, and so on. Imagine creating a word cloud of frequently used words. If we included these in our analysis, more often than not the word cloud would be dominated by the word the with a count so high that it would literally drown out any of the other words.
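Removing stop words before counting is straightforward to sketch. The example below uses a deliberately tiny stop word list and three hypothetical messages; a real analysis would use a much fuller list, and the `word_counts` helper is our own name.

```python
import re
from collections import Counter

# A tiny illustrative stop word list, far shorter than a real one.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}


def word_counts(messages, stop_words=STOP_WORDS):
    """Count words across all messages, skipping stop words."""
    words = []
    for message in messages:
        words.extend(re.findall(r"[a-z']+", message.lower()))
    return Counter(word for word in words if word not in stop_words)


# Hypothetical messages.
messages = [
    "The cloud is the future of storage",
    "Moving to the cloud is a big step",
    "Is the cloud secure?",
]

counts = word_counts(messages)
print(counts.most_common(3))
```

Without the filter, the word the would top the list with four occurrences; with it, the subject of the conversation rises to the top.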

Another consideration (and temptation) with word clouds is trying to make them look visually appealing. Often this has the reverse effect. Remember, early in this chapter we said that the goal of analysts is to report the findings and provide facts. When we introduce fancy charts, say in 3D or with fancy fonts, we distract the audience and risk having them miss the finer points of the insights.

For example, Figure 13.9 shows the same data used in a word cloud that is generated by an online tool.

We really like this website and often recommend it to users, but with the wrong options selected, the message of the word cloud could get lost in admiration of the graphic that was produced. Remember, the point of creating the graphic is about facts and insights, not sizzle.

From what we can tell, Murphy’s Law (which in its most famous form states: “If anything can go wrong, it will”) was supposedly first coined at Edwards Air Force Base in the United States around 1949. While many seem to claim credit for the concept, it appears that it was named after Capt. Edward A. Murphy, an engineer working on a project for the Air Force to see how much sudden deceleration a pilot could withstand in a crash. As the story goes, one day, after finding that a piece of equipment was wired incorrectly, Captain Murphy cursed the technician responsible and said, “If there is any way to do it wrong, he’ll find it.” The contractor’s project manager kept a list of “laws” and upon hearing this statement added this to his list, which he simply called Murphy’s Law. Sometime afterward, an Air Force doctor, Dr. John Paul Stapp, who participated in the deceleration project, rode a sled on a track to a stop at a rate of over 40 Gs (or about 392 meters per second squared). When he gave a press conference, he noted that their good safety record on the project was due to a firm belief in Murphy’s Law and in the necessity to try to circumvent it. The rest, as they say, is history.

And so it is with analytics and text analytics in particular. Regardless of the amount of planning or forethought put into a project, something always goes wrong. Sometimes it’s something small, perhaps an oversight; other times it could be as large as making some incorrect assumptions and having to start from the beginning. No matter, the objective isn’t to remove all possibilities of things going wrong, but rather simply to be prepared for them.

Recap: The Social Analytics Process

Up to this point in this book, we have discussed the various steps we need to take on the journey to analyze social media content. We’ve tried to focus on the “what’s possible” rather than the “how to” because tools and data sources are constantly changing and evolving. But more importantly, we’ve tried to take a logical step-by-step discussion of what needs to be done in performing an analysis, from posing a question, to gathering and cleaning data, and then on to doing the analysis. Consider the diagram in Figure 12.1. Many of the steps in this diagram have been discussed in previous posts and should be familiar to you. The thing to consider is that an analysis isn’t a single, well-defined set of steps; it tends to be quite iterative in nature. That is, we try something, perhaps searching for a set of data; if it appears we have the correct data sources, we go on to the next step to collect data points. If we then determine we don’t have enough data or the correct data, we go back and look for new sources rather than continue with faulty information. Over time, as we iterate through the problem, the answers begin to emerge.

While Figure 12.1 represents our view of an iterative approach to an analysis, it’s impossible to diagram every step of a process, especially as we make subtle changes to tools or search terms. However, we believe this represents a respectable view of the process and therefore the points where things could go wrong.

Previously, we described the process as the five Ws—Who, What, Where, When, and Why. Having a crisp description of what we are looking for may seem obvious to most (step 1 in Figure 12.1), but sometimes we can be stubborn. Often we don’t think that the question we are asking is the right one, or perhaps we need to modify it. We see this later in the examples, but for now, we express the first step as “defining the problem” or perhaps simply “posing a question to be answered.” For example, “What do teenagers think of the latest movie trailer for an upcoming new release?” or “How is our brand perceived in the European marketplace?”


Obviously, when we know the question we want to answer, our next step (labeled step 2 in Figure 12.1) is to find the data that will help us uncover that answer. This shouldn’t be a long, tedious step, but it’s important, if nothing else, as a litmus test regarding the analysis. By this, we mean that if we check for data around this topic and find enough of it to produce a meaningful result, we can (and should) proceed with an analysis. How do we know that we have enough data? The real answer is: We don’t. At least not until we’ve done an analysis to see if there is a sufficient number of results to warrant a conclusion. But again, a quick scan of social media sites or search on Twitter can give us a gut feel that we do or do not have enough material to work with. If we don’t, perhaps, as the diagram shows, we need to modify the question we are trying to answer. When we believe there is sufficient data for our analysis, we then need to go out and collect as much as we can from a variety of social media venues (step 3). Although we may ensure that we include a particular set of sites, if we use a search engine or a message board aggregator (we tend to use Boardreader because it’s tightly integrated with IBM’s Social Media Analytics [SMA] product), we can’t be quite sure what sites or content will be returned. This isn’t a bad thing; the problem is that there are so many social sites that could possibly contain content relevant to our question. The work done in step 2 was just a cursory check; now we put in a set of keywords and ask for all the matches.

Developing a data model (step 4) is a process or method that we use to organize data elements and standardize how the individual data elements relate to each other. This step is important because we want to run a computer program over the data; we need a way to tell the computer which words or themes are important and if certain words relate to the topic we are exploring. For example, if we were looking at job titles or descriptions, the word architect may come up. If we were specifically interested in the computer field and information technology professionals, we would want to qualify the word architect with some technical words such as network so that we could find network architect or terms such as system or computer to uncover discussion around system architect or computer architect. This is what a data model does for us: it defines our “universe” and creates relationships between words and phrases so that we can make sense of the data during our analysis.
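A very small piece of such a data model can be sketched as a pattern that accepts the word architect only when a technical qualifier precedes it. This is our own illustration, not how any particular analytics product represents its model; the qualifier list comes from the example above, while the `it_architect_mentions` helper and the sample text are hypothetical.

```python
import re

# Qualifiers taken from the example in the text.
QUALIFIERS = ("network", "system", "computer")

# Match "architect" only when one of the qualifiers comes directly before it.
PATTERN = re.compile(
    r"\b(" + "|".join(QUALIFIERS) + r")\s+architect\b",
    re.IGNORECASE,
)


def it_architect_mentions(text):
    """Return the qualified architect phrases found in the text, lowercased."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]


sample = "We are hiring a Network Architect, a system architect, and a landscape architect."
print(it_architect_mentions(sample))
```

The landscape architect is ignored because nothing in the model ties that phrase to information technology, which is the kind of relationship a fuller data model would encode across many words and phrases.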

In the analysis of our data, it’s handy to have several tools available at our disposal to gain a different perspective on discussions taking place around the topic. We like to call this “our bag of tricks.” In some cases, it may make sense to look at a word cloud of conversation, and in others, it may make sense to watch a word cloud evolve over time (so computing a time series word cloud may make sense). When we talk about step 5a, augmenting the tools, that’s really about configuring them to perform at peak for a particular task. For example, thinking about a word cloud again, if we took a large amount of data around computer professionals, say the IT architect example from the previous step, and built a word cloud, no doubt the largest word in the cloud would be architect. (Remember, in word clouds, the more frequently a word is used, the larger it becomes in relation to the other words.) So here, we might create our initial cloud, see that architect is overwhelming the cloud, and simply change the configuration to eliminate that word. Think about it: we know the data collected has to do with IT architects, so why do we need to show it in our cloud? By eliminating it, we allow other words to stand out and perhaps reveal some insights.

This analysis is also about tool usage. Some tools may do a great job at determining sentiment, whereas others may do a better job at breaking down text into grammatical form (noun, verb, adjective, and so on), which enables us to better understand the meanings and use of various words or phrases. As we said, it is difficult to enumerate each and every step to take on an analytical journey. It is very much an iterative approach (perhaps we should call it “exploratory analysis”) as there is no prescribed way of doing things. But with that said, let’s take a look at some of these steps and see where things can (and undoubtedly, will) go wrong.

Finding the Right Data

Torture the data, and it will confess to anything. —Ronald Coase, Economics, Nobel Prize Laureate [2]

Data is the fundamental building block upon which we build our analysis. As we discussed previously, data is created in a wide variety of blogs, wikis, and microblog sites across the Internet. Every website we visit, every post we make in social media, every picture or video we upload: all of it generates data. Because there is so much data available to us across the Internet, sometimes finding the “right” data can be challenging. Usually the data we find or gather for an analysis is directly related to the topic we are investigating; sometimes, however, the data isn’t what it seems.

In 2012, there was a scandal involving Chinese factory workers and the conditions under which they were expected to work. Several US-based firms were “called out” in the press for using these companies in the production of their products. The question we were trying to answer was: “Is IBM being associated with potential ‘bad press’ in any of the social media venues?” And if so, where? Also, what is the perception of IBM in relation to its competitors?

A quick manual survey of various wiki and blog sites showed there was indeed conversation happening around this particular event and factory. With that in mind, our plan was to mine the social media space for any occurrence of the particular factory that included mentions of IBM and then build a more comprehensive model to understand the conversations. This model would consist of key terminologies that may (or may not) reflect the public’s perception of the factory workers’ concerns. Later revisions would then take this analysis and segment it by various competitors to see how IBM was viewed amongst its competitors.

Upon collecting all of the available data and subjecting it to our analytic model, we began to build a set of insights. One of the first metrics we produced was a count of the number of mentions. This is the number of times various companies, including IBM, were mentioned in all of the social media content that we were able to collect. We also began to summarize the locations where these conversations were taking place. This is an important piece of information not only for obtaining unsolicited opinions but also for marketing and relationship building. There may be a time when we may want to participate in the conversation.

We plotted this information on a bar graph using two bars for each company (Figure 12.2). The first bar in each grouping (on the left) shows data that we first collected (labeled “no city specified”). At first glance, this looks like a major problem for IBM. The graph is showing that there are a large number of references to IBM in the context of this company and the working conditions in various Chinese factories.

A closer look at our results showed that this particular company operated many different factories across China and Asia in general. Our first interpretation was that people were associating IBM with this event. However, we came to realize that the poor working conditions were limited to a factory in a single city. Once we realized that, we limited our model to that particular city and then any mentions of IBM and its competitors. The result was the second bar (the one on the right in each grouping) in Figure 12.2 (labeled “in the context of a city”). As it turned out, there was no association of IBM and the factory in question.
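The effect of tightening the scope can be sketched with a simple co-occurrence count. Everything in this example is hypothetical: the posts, the placeholder names FactoryCo, CityX, CityY, and CompetitorA, and the `count_mentions` helper. It uses plain substring matching, which a real model would replace with something more careful.

```python
def count_mentions(posts, companies, required_terms=()):
    """Count posts mentioning each company, among posts containing all required terms."""
    counts = {company: 0 for company in companies}
    for post in posts:
        text = post.lower()
        if not all(term.lower() in text for term in required_terms):
            continue  # post is outside the scope of the analysis
        for company in companies:
            if company.lower() in text:
                counts[company] += 1
    return counts


# Hypothetical posts with placeholder factory, city, and competitor names.
posts = [
    "FactoryCo builds servers for IBM at its CityY plant",
    "Workers at the FactoryCo plant in CityX protest conditions; CompetitorA is a customer",
    "IBM and CompetitorA both source parts from FactoryCo",
]
companies = ["IBM", "CompetitorA"]

broad = count_mentions(posts, companies, required_terms=["FactoryCo"])
scoped = count_mentions(posts, companies, required_terms=["FactoryCo", "CityX"])
print("No city specified:", broad)
print("In the context of a city:", scoped)
```

Comparing the two sets of counts side by side mirrors the two bars per company: a company can appear often alongside the manufacturer in general yet not at all in the context of the city where the problem occurred.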

The reason we received these types of results wasn’t so much that our data was flawed, but that for this particular analysis, it just wasn’t scoped properly. The moral of this story: double-check your findings early and often. Had we continued to build out a full analysis based on this data, it would have been like attempting to ask the crowd at a baseball game what they thought about a movie premiere; that is, data about a topic from a completely unrelated source.

Communicating Clearly

Speak clearly, if you speak at all; carve every word before you let it fall.—Oliver Wendell Holmes, Sr. [3]

What makes social media analytics, or any kind of text analytics for that matter, difficult is that most of the data we’re looking to analyze is unstructured or, maybe better said, not consistent. The problem is that we humans have different ways of saying the same thing. What we say or how we say it may make sense to us, but often it is unclear to others. The English language is funny; many people claim it’s one of the hardest languages to master. Consider the case of words that are classified as homonyms (words that have the same spelling and pronunciation but have different meanings).

We’ve seen a number of these issues when we start a new analysis. Let’s take our analysis of financial institutions as an example. We were looking at how the public views various banking establishments and what their views on savings were.

One of the banks we tried to gather social media discussion on was Citibank. It has a very active Twitter account, and a quick search for viable information showed a fairly high level of activity. In our collection of data, we uncovered other banks that were also generating a fair amount of social media mentions. In Table 12.1, these banks are labeled Bank-1, Bank-2, and Bank-3.

What’s interesting about this bank is that it has a number of Twitter handles for specific geographies—for example, @citibankau (Australia), @citibankIN (Citibank Indonesia), and @citibankeurope. Consequently, our assumption was that the data we were collecting was for US banks (which was the scope of our experiment). What we forgot about was the fact that Citibank also sponsors a baseball stadium for the New York Mets in New York City. So as we pulled information from Twitter, we used the search term citi (as well as a few others like @citibank, and so on). When we did the first pass at our analysis, we had some preliminary statistics, as shown in Table 12.1.

One of the topics we included in our analysis was the public’s perception of certificate of deposit (CD) rates and offerings from various banks. In this instance, when we were gathering data, we included reference to home (mortgage) loans, refinancing, lines of credit, and CDs.

Unfortunately, the term CD is another homonym:
CD, as in certificate of deposit
CD, as in compact disc

In Table 12.1, notice the large amount of discussion around bank products (CDs were grouped under Products). We had to rework our data model to qualify the word CD with the word bank. This showed a bit of improvement, and things got a little better, but a quick look at Figure 12.3 shows the kinds of tweets we were still collecting. Clearly, we still had more work to do. The good thing was that the incorrect (and the correct) use of the word CD was applied to all the banks. That just means we had less confidence in discussion around that topic.
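
The qualification step described above can be pictured as a filter that keeps a mention of CD only when a banking term appears in the same tweet. This is a minimal sketch rather than the model we actually used; the term list and the function name are assumptions made for illustration.

```python
import re

# Hypothetical context terms; a real model would need a much richer list.
BANK_TERMS = {"bank", "rate", "rates", "deposit", "apy", "savings", "interest"}

def mentions_bank_cd(text):
    """Return True if the text mentions CD alongside at least one banking term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_cd = bool(words & {"cd", "cds"})
    return has_cd and bool(words & BANK_TERMS)

tweets = [
    "Best CD rates I have seen at my bank this year",
    "Just bought the new CD from my favorite band",
    "Left my CD in the car outside the bank",
]
kept = [t for t in tweets if mentions_bank_cd(t)]
```

The third tweet still passes the filter even though it is about a compact disc, which is the kind of leftover noise a simple co-occurrence rule cannot remove.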

Our bigger problem was with the references to baseball around Citibank. The odd part of this was the more than 3,000 uncategorized statements for Citibank. We expected a number of comments or tweets to be uncategorizable, but this number was roughly 100 times larger than for the other banks. A closer inspection showed a large number of references to baseball and sports. This result wasn’t too unusual because we knew that many large corporations (banks included) sponsor sporting events or teams. But then we uncovered tweets such as shown in Figure 12.4.

While this was potentially an important message if we were looking only at Citibank (and the positive sentiment surrounding its backing of a baseball team), in this case, such tweets were not at all relevant to our experiment.

We had two choices at this point. Referring back to Figure 12.1, we could:
■ Rework the question we were trying to answer (step 1)
■ Look to eliminate discussion about Citi Field (step 4)

Choosing Filter Words Carefully

Part of IBM’s Social Media Analytics offering was not only the tools to perform the computations and analysis of unstructured data, but also a service to allow different groups inside IBM to pull data off the Twitter stream for a variety of applications. Because this was a “self-service” application, our group didn’t vet any of the search terms used by the various groups. In one case, one of the teams wanted to track the number of times a specific set of users had their tweets “favorited” or, in Facebook terms, “liked.” Unfortunately, this team didn’t understand the concept of how to collect data with our tools. Instead of detecting a like, they simply watched the Twitter stream for any use of the word like! Given the hundreds of millions of messages tweeted in a given day, you can imagine how many would use the word like in their tweets. As a result, we were delivering approximately 1 million tweets per hour to their application, 99% of which were irrelevant due to the incorrect collection rules. If left unattended, the analysis of that data would result in a useless set of insights or results.

Understanding That Sometimes Less Is More

Because we are programmatically pulling data into our analytics software, we sometimes have to be careful not only about the mislabeling of data but also, as in this case, about the duplication of data.

In one case, we were looking at a number of issues across the information technology industry, and we wanted to see where IBM ranked in the discussion along with several leaders in this space. When we pulled our data and started to group the number of mentions by company, we produced a graph that looked something like Figure 12.5.

Our analytics just didn’t look right as we continued our analysis. Upon careful inspection, we discovered that many of the items in our data collection were duplicated entries. For example, the mentions of Apple numbered around 347 specific references; Microsoft, about 95; and so on.

How could this be?

Consider a URL to a specific story on the website. (Note that we are not pointing out any flaws on this site at all; this is just a random example illustrating how two new entries can appear to be two separate entities.) If we look at a news story about the prediction of the next financial crisis, the Web URL is:

So if we pull data from this site (assuming we used the keywords hedgefund and crisis), it would be mapped to that URL as the source. But say another web page linked to it and used this URL: In this case, both would return the same exact content, but because the URLs are different (notice the missing &snaid=&step=start at the end of the second URL), they appear to come from different sites.

We implemented a duplicate-detection algorithm in our process, and as a result, the number of unique mentions of Apple went from 347 to about 81 (just as Microsoft went from 95 to 34). The number of items for the other entities also decreased significantly, forcing us to reevaluate our data sources and keywords to search for more data. Note that our process of eliminating data is different from data deduplication. Data deduplication (sometimes called single-instance storage) is a technique that is often used to reduce storage needs by eliminating redundant data and the repeated storing of the same object. In many cases, only a single unique instance of an item is retained in an archive; any redundant use of the same item is simply replaced with a pointer to the single copy.

Think of email as an example. How many times has someone sent a large file to a large distribution list, only to have numerous people “reply with history” and forward a copy of the large file. In the end, everyone’s email inbox gets flooded with numerous copies of the same large file. By using a data deduplication technique, only one instance of that large file would be stored, and all subsequent uses would simply point back to the original. In our case, we didn’t want that. If the item of text (or discussion) was the same, we simply wanted to remove the redundant entry from our analysis. If someone made reference to the URL pointing to the content, that was fine. But if someone cut and pasted the content from one URL to a unique URL, we would eliminate the second source.
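
The idea of dropping an entry when its text repeats, regardless of the URL it came from, can be sketched as follows. This is an illustration and not the algorithm we ran; the (url, text) layout, the normalization rule, and the example URLs are all assumptions.

```python
import hashlib

def normalize(text):
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(text.lower().split())

def drop_duplicate_content(items):
    """Keep the first item for each distinct body of text, regardless of URL.

    items is a list of (url, text) pairs (a hypothetical layout for this sketch).
    """
    seen = set()
    unique = []
    for url, text in items:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique

items = [
    ("http://example.com/story?id=1&step=start", "Hedgefund managers warn of crisis"),
    ("http://example.com/story?id=1", "Hedgefund  managers warn of CRISIS"),
    ("http://example.org/other", "A different story entirely"),
]
unique = drop_duplicate_content(items)
```

Hashing the normalized text only catches exact or near-exact copies; content that has been lightly edited would need a fuzzier comparison.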

Customizing and Modifying Tools

In many cases, we’ve found that most of the tools that we use in our work need some kind of additional configuration. Mostly, we find this when we attempt to look at things such as sentiment, where we are trying to understand if people are speaking positively or negatively about a topic. The reason for this is that most tools use a simple dictionary lookup of words to calculate if a statement is expressing a positive or negative thought. To illustrate this, consider the following tweet:

Customer Service for xyz company is the worst I’ve ever seen.

Most applications will scan over the words used in a set of text and simply add up all the positive words and all the negative words. They then subtract the count of negative words from the count of positive words to produce a score:

Positive Score – Negative Score = Sentiment Score

If that score is positive (that is, there were more positive words than negative), the overall sentiment is said to be positive. If the count of negative words is higher than the count of positive words, subtracting the negative word count from the positive word count will produce a negative number, indicating an overall negative sentiment. In the preceding example, if we were to use this scoring method, we could derive:
■ A count of zero for the positive terms (there were no clearly positive words in the text)
■ A count of one for the negative terms (the word worst)
And then using the preceding formula, we would derive an overall sentiment score of:
0 (the positive score) – 1 (the negative score) = –1
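
The word-counting approach can be sketched in a few lines. The dictionaries here are tiny and invented for illustration; commercial tools ship far larger ones, and their exact tokenization rules will differ.

```python
import re

# Tiny illustrative dictionaries, not those of any real product.
POSITIVE = {"good", "great", "fantastic", "excellent", "decent"}
NEGATIVE = {"worst", "bad", "terrible", "awful"}

def sentiment_score(text):
    """Return (count of positive words) minus (count of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(1 for w in words if w in POSITIVE)
    negative = sum(1 for w in words if w in NEGATIVE)
    return positive - negative

score = sentiment_score("Customer Service for xyz company is the worst I've ever seen.")
```

For the example tweet, the only dictionary hit is worst, so the score comes out negative.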

So in this case, we would obtain a score of –1, indicating that the text is basically presenting a negative sentiment. There can be many variations on this algorithm, which tends to produce several shades of gray in the answer. For example, in the tweet shown here, the word service could be considered a positive term, although perhaps not as positive as words such as good, fantastic, or excellent. So the system may choose to score that word as slightly positive and assign it a score of +0.25 rather than simply considering it as being equal to other words that have a stronger positive connotation.

As an example, consider these two tweets:
Tweet 1: I thought the movie was decent.
Tweet 2: That was a great movie!!!

With both of those tweets, if we were to use the algorithm described here (simply counting the number of positive and negative words), both of these tweets would have a positive sentiment score of +1.0. Tweet 1 would be positive because the word decent is considered to have a positive connotation, and no negative words are used. In the case of Tweet 2, no negative words are used, and the word great would be viewed as positive, which would generate a score of +1.0.

However, that may not really be representative of the true intent. Tweet 1, while representing a positive sentiment, is less enthusiastic. In this case, it would be good to give the word decent (along with phrases like not bad, passable, or acceptable) scores that differed from words like excellent, thrilling, or stellar.

In our case, we gave the word decent a positive sentiment of +0.25. So it’s considered positive in sentiment, but only one-quarter that of other words, such as great, which would get a full score of +1.0. So in these examples, Tweet 1 would be of positive sentiment with a score of +0.25, and Tweet 2 would have a positive score of +1.0, indicating the greater sense of positive connotation.
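
Replacing the plain word counts with per-word weights gives the graded scores described above. A minimal sketch: the weights for decent and great are the ones quoted in the text, while the other entries are illustrative guesses.

```python
import re

# 'decent' and 'great' use the values quoted in the text;
# the remaining weights are assumptions for illustration.
WEIGHTS = {
    "great": 1.0,
    "excellent": 1.0,
    "decent": 0.25,
    "acceptable": 0.25,
    "worst": -1.0,
}

def weighted_sentiment(text):
    """Sum the weight of every dictionary word found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(WEIGHTS.get(w, 0.0) for w in words)

tweet_1 = weighted_sentiment("I thought the movie was decent.")
tweet_2 = weighted_sentiment("That was a great movie!!!")
```

With these weights, the two tweets no longer tie: the first scores a quarter of the second.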

This is where our customizations come in (Step 5a in the flowchart from Figure 12.1). We may need to add (or subtract) a number of words from a dictionary based on their use. Consider the commonly used words shown in Table 12.2. How should each word be treated? Should it be treated as implying a positive sentiment? Or should it be treated as connoting a negative sentiment? The choice can seem quite arbitrary, but as long as we’re consistent in our dictionary augmentations, this can add tremendous value to our analytics.

Most of the time, when we look at microblogging content (Facebook, Twitter, LinkedIn, and so on), we find that most of the content is fairly neutral—that is, neither positive nor negative in score. Think of a Facebook post that says:
Attending a session at #IBMAmplify on the use of Social Media Analytics

There may be some interesting information to be gained from this posting, but there is no particular feeling being expressed. We tend to find that a large amount of microblog content is like this (which is fine). One of the issues we run into when doing a sentiment analysis is considering the use of words such as disaster, a word that clearly indicates some kind of negative sentiment. Consider the following LinkedIn update, for example:

Spending my day reading about disaster recovery products on the market today.

Using the algorithm we outlined previously, the sentiment score of this post would be computed at –1.0 (since disaster would be found in the negative dictionary, and there are no other words found in the positive dictionary). However, when the word recovery is used in conjunction with disaster, our sentiment score should be computed as zero (or the expression of no particular feeling) because the phrase disaster recovery really isn’t positive or negative. We’ve seen this situation countless times, where the negative connotation of the word disaster essentially cancels out the positive sentiment expressed around it.
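
One simple way to apply this kind of customization is to blank out known neutral phrases before counting words. The sketch below assumes a hand-maintained phrase list; the list, the dictionaries, and the function name are ours and do not reflect how any particular product is configured.

```python
import re

NEGATIVE = {"disaster", "worst"}
POSITIVE = {"great", "excellent"}
# Multiword phrases to treat as neutral; an assumption for this sketch.
NEUTRAL_PHRASES = ["disaster recovery"]

def sentiment_with_phrases(text):
    """Count positive minus negative words after removing neutral phrases."""
    lowered = text.lower()
    for phrase in NEUTRAL_PHRASES:
        lowered = lowered.replace(phrase, " ")
    words = re.findall(r"[a-z]+", lowered)
    positive = sum(1 for w in words if w in POSITIVE)
    negative = sum(1 for w in words if w in NEGATIVE)
    return positive - negative

post = "Spending my day reading about disaster recovery products on the market today."
score = sentiment_with_phrases(post)
```

The word disaster on its own still counts as negative; only the full phrase is neutralized.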

Using the Right Tool for the Right Job

As we discussed previously, in our group we maintain a number of different tools to help us with all of our analytics projects. Some of these tools are “homegrown,” or tools and utilities that our team wrote for a specific task. Many of these tools are written in Java or R and utilize many of IBM’s big data tools such as Hadoop or DB2.

We also employ a number of IBM commercial products such as IBM’s Social Media Analytics, SPSS, IBM Content Analytics, and various Watson services now available on Bluemix. Many of these tools specialize in a particular facet of the overall process. For example, SMA does an excellent job of deriving sentiment and summarizing data, whereas SPSS allows for a greater flexibility and deeper statistical analysis. What we’ve learned over time is to take the best parts of each of these tools to derive a more complete answer to the question we are trying to answer. We’ve often found ourselves sticking firmly to a specific tool rather than being open to a variety of results.

Analyzing Consumer Reaction During Hurricane Sandy

In 2012, Hurricane Sandy struck a wide area across the Atlantic Ocean from Haiti and Jamaica and then north into the eastern coast of the United States and Canada. According to many records, more than 200 deaths were attributed to the storm, with approximately 146 of those in the United States and around 98 in the Caribbean. Sandy has been labeled as one of the deadliest and most destructive hurricanes from the 2012 Atlantic hurricane season, and recent estimates indicated that it ranks second in terms of dollar value in destruction in US history. Damage estimates have placed the cost to repair and replace damages at upwards of $68 billion US.

One of the engagements we worked on was the public’s perception of various home improvement and appliance stores within the United States during this period. Specifically, we were looking at the public’s reaction to updates, news about needed supplies, and the various responses to emergencies from these companies.

In this case, our main source of data was the public microblogs (Facebook and Twitter). Almost immediately the public gravitated toward a number of hashtags, and our first pass was to focus on them as a way to gather data. Our filters included the following (there were many more in the end; this is just a sampling):


This example raises an important point about both gathering data and promoting trackable events. By using a hashtag, individuals are telling others that the comment they are making or information they are providing is directly related to a specific event. This point is important because it can really help alleviate any problems with data validity. Since it’s already tagged by topic, we can be somewhat assured that it is part of the dataset we might be interested in. So, in establishing your own social media campaign, having a unique hashtag makes the job of doing analytics around the campaign that much easier. Please note that if you were to use a common hashtag (say #ff, Follow Friday) because your event happened on a Friday, be prepared to do a lot of data cleansing and filtering, since much of the data will be unrelated to your event. From an analyst’s perspective, you would be increasing the amount of noise in your dataset and drowning out the signal (the relevant information). We always want to keep the signal-to-noise ratio as high as possible, or maybe better said, we have to have more data about our topic (the signal) than about the other nonrelated issues (the noise).
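
The effect of hashtag choice on noise can be illustrated with a toy dataset. In this sketch the posts, their on-topic labels, and the #ourstorm2012 tag are all invented; it reports the share of collected posts that are on topic, a simple stand-in for the signal-to-noise ratio.

```python
import re

def hashtags(text):
    """Return the set of lowercase hashtags found in a post."""
    return {tag.lower() for tag in re.findall(r"#\w+", text)}

def collect(posts, tag):
    """Keep the (text, on_topic) pairs whose text carries the given hashtag."""
    return [(text, on_topic) for text, on_topic in posts if tag in hashtags(text)]

def signal_share(collected):
    """Fraction of collected posts that are actually about the event."""
    return sum(on_topic for _, on_topic in collected) / len(collected)

# Hand-labeled toy posts: (text, is the post about our event?)
posts = [
    ("Store on Main St has generators in stock #ourstorm2012", True),
    ("Power is out across the county #ourstorm2012 #ff", True),
    ("Great people to follow this week #ff", False),
    ("More #ff recommendations for music fans", False),
]

unique_tag = collect(posts, "#ourstorm2012")
common_tag = collect(posts, "#ff")
```

Collecting on the unique tag returns only on-topic posts, whereas collecting on #ff pulls in three posts of which only one is relevant.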

A number of interesting positive as well as negative statements were made about a number of the organizations. Specifically, they were being criticized for not donating plastic bags for some of the relief efforts or closing stores early before the storms hit (claiming they cared more about their employees than the general population that was “in need”). On the positive side, a number of comments were made on the helpfulness of the employees and the fact that some stores brought in extra inventory (such as generators or pumps) early in anticipation of the storm. Overall, there seemed to be an equal amount of positive and negative sentiment.

What we really wanted to understand, however, was what topics or issues the general public was raising during this catastrophe. Using SMA’s Evolving Topics tool, we were able to look at terms or phrases that were used over the span of time that we were investigating in an attempt to uncover frequently mentioned topics (see Figure 12.6). The Evolving Topics algorithm identifies word phrases that occur multiple times within the same document and also across multiple documents, so our hope was that this would uncover some interesting insights.

While we appreciate how SMA can do an analysis across its dataset, we were having trouble uncovering the main points of the conversations that were happening. Although we had a significant amount of time invested in the creation of our SMA models, we still weren’t getting the results we needed to answer our question. Looking back to Figure 12.1, we had all of the data we were ever going to get, so it wasn’t a matter of looking for more or different content. It really came down to how we did the analytics and what tools could “tease” those insights out of our data (steps 5a or 5b).

IBM Watson Content Analytics (then called simply IBM Content Analytics, or ICA) is a tool that breaks down content into facets (see an example of an ICA analysis in Figure 12.7). The Content Analytics engine then provides data about the frequency and correlation information for keywords of the specific facets. Facets can be nouns, noun phrases, verbs, adjectives, and so on. What we are showing in Figure 12.7 is from a different analysis, but we wanted to illustrate the power of ICA and its ability to segment text into its grammatical components. From this analysis, we can begin to derive the topics of conversations and the frequency with which they are happening.

Think of what a noun is. It’s the subject of a sentence or a comment. The verbs describe the intensity of the topic or how people are feeling. Exporting our data from SMA and then re-importing into IBM Watson Content Analytics allowed us to create a list of top topics on the minds of people during this event (see Figure 12.8).

This list was exactly what we were looking for: the topics that people were discussing during the storm as they related to various home improvement centers. We were then able to generate increased sets of metrics and insights based on these themes. For example, when the discussion of generators arose, almost 50% of the discussion was around the need and the urgency of the need for them, while 32% complained of out-of-stock issues. This seems to be an interesting (and measurable) metric that store managers may want to have in the future (hopefully, a future catastrophe isn’t as severe).

So while this example might not be a case of something going wrong, it could have been. Things could have gone wrong had we attempted to use just a single tool in our quest to answer the questions posed. Sometimes we need to take a step backward (or perhaps sideways) to make progress forward; such is the life of an analyst.

Through this analysis, we were able to uncover a number of issues raised by consumers about the various home improvement stores in the area. We were able to report that many consumers were actively looking for batteries and generators during the storm; these items were clearly understocked by many of the stores. What was actually an unanticipated finding was a backlash aimed at the employees of some of the stores. It seemed that many of them abandoned the stores as the storms hit, leaving consumers with no place to turn for those batteries or generators. Those comments were teased from the dataset based on the keywords that arose from the evolving topic graph in Figure 12.6.

An enterprise social network (ESN) is an internal, private social network used to assist communication within a business [2]. As the number of companies investing in ESNs grows, employees are discovering more and more ways to conduct business on the ESNs. IDC expects the worldwide enterprise social networks market revenue to grow from $1.46 billion in 2014 to $3.5 billion by 2019, representing a compound annual growth rate (CAGR) of 19.1% [3].
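
The growth rate quoted above can be checked with the standard compound annual growth rate formula. A minimal sketch, assuming the 2014 and 2019 figures are five years apart and both expressed in billions of US dollars; the function name is ours.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1

rate = cagr(1.46, 3.5, 2019 - 2014)
print(f"{rate:.1%}")
```

The result rounds to about 19.1%, which agrees with the figure cited.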

We broadly divide this domain into internal and external. A majority of our discussion so far has focused on insights that can be gleaned from content available in external public social media (such as Twitter). In this chapter, we focus on social media content that is generated inside a company’s firewall. This, we believe, is a very rich source of information, and a variety of business insights can be generated from it. Here, we discuss the concepts and a few specific examples regarding how companies can take advantage of this ever-growing collection of social data inside their social networks.

Employee privacy is an important consideration when looking at analytics based on enterprise social networks. We want to utilize people’s individual posts to help create and leverage organizational knowledge, but we need to do this without compromising the privacy rights of individual employees. We describe how this was addressed for the Personal Social Dashboard use case in IBM.

Most of the content in this chapter is based on our personal experience with an IBM project called Project Breadcrumb, and its primary use case is the IBM Personal Social Dashboard (or PSD). This project is a collaborative effort involving many people in IBM over the years. We want to specifically highlight the contributions of Marie Wallace (social media strategist), David Robinson (big data architect), Jason Plurad (lead developer), Shiri Kremer-Davidson (chief data scientist), Aroop Pandya (data architect), Santosh Borse (data acquisition), and Hardik Dave (initial CIO project lead). Marie Wallace has written and spoken about this project in a variety of external venues (including TED Talks). Most of her writing on this project can be found on her website at [4].

Social Is Much More Than Just Collaboration

Many articles have been written about the benefits that companies are deriving from their investments in ESNs [5]. IBM first deployed an ESN in 2009. ESNs were initially deployed to facilitate collaboration among employees. In our experience, they have turned out to do much more for the companies. Figure 11.1 illustrates some of the significant advantages afforded by ESNs, beyond other traditional means of collaboration and communication as described further in the section that follows.

Transparency of Communication

When a CEO or a senior leader writes a blog in the ESN about the latest quarterly results and encourages employees to openly respond with their comments, there is a great deal of transparency in communication. This is not possible using traditional means of communication.

Frictionless Redistribution of Knowledge

When a piece of new information, technique, or finding is shared in an ESN, it reaches many members of the network without any special extra effort on the part of the originator. I (Avinash) recently encountered an IT problem with my Lotus Notes mail system. I worked with Help Desk and got my issue resolved. I shared this experience on my wall, and this information was available to members of my network immediately. If those people further shared this information with their networks, this timesaving tip would be available to hundreds of employees within a short span of time and without any barriers.

Deconstructing Knowledge Creation

Discussion forums are a common feature of ESNs. These forums are created to elicit comments and ideas from different people on a given topic. Many times the forums are started with a description of a problem, and during the course of the discussion, after several members contribute, a solution emerges. This process not only creates knowledge that the company can leverage in the future but also distinctly identifies knowledge creators.

Serendipitous Discovery and Innovation

By far, the biggest benefit of ESNs in my experience is accidental discovery of useful nuggets of information. In October 2014, I was in the process of preparing a presentation for a talk that I was going to give at IBM’s Insight 2014 conference in Las Vegas. As part of our ESN’s features, I get a newsfeed of posts from the people I follow. One day I came across a post that referenced a study by MIT and Deloitte directly related to the subject of my talk. I was able to quickly take this information into account and enhance my presentation.

Enterprise Social Network Is the Memory of the Organization

Employee-to-employee interactions in an ESN are obviously quite valuable. When I have a question about a new product or process, somebody else generally responds with an answer. In many ways, then, this transaction is complete.

Figure 11.2 illustrates a subset of actual transactions that can take place in an ESN. You can see that, over time, various transactions are initiated and completed in the network. As time goes on, the network becomes the memory of the organization. Even though this may not have been the intent or the goal, the ESN ends up leaving breadcrumbs to indicate what people are doing and how they are behaving in different business situations. We contend that this is a very critical but underutilized resource in companies. This information can be mined for valuable insights with the help of analytics.

Understanding the Enterprise Graph

We have now established the value of the information left behind by social interactions in an ESN. The storage technology that is optimal for this type of information is called graph storage. A specific instance of this type of graph for a specific company is called the Enterprise Graph. In an ideal implementation, the Enterprise Graph combines transactional, social, and business data and also provides a knowledge base to perform analysis such as influence, social proximity, reputation, retention, performance, expertise, and more.

The basic components of an Enterprise Graph, shown in Figure 11.3, are as follows:
■ Data Sources—This is any source of data or metadata that we are interested in including in the Enterprise Graph. This includes social interaction data from an enterprise social network. For example, consider profile status updates and comments from everybody in a person’s social network, including likes and tags.

■ Data Services—Data Services enable business applications to contribute data to the ESN via data APIs. For example, an application that compensates employees with a reward or recognition will utilize services (APIs) to contribute to profile status updates with the information about awards.

■ Graph Store—This is the physical storage utilized to store all of the interactional data that we are interested in analyzing. Graph APIs make interactional data easily consumable by analytics and business applications. For example, for an organization of 100 people, the graph contains a node for each person in the organization and an edge for each transaction between two nodes. As a result, the graph contains information about who did what with whom and when (in the ESN).

■ Analytics—Analytics algorithms generate graph-centric insights such as influence, expertise, and reputation. For example, the algorithms can compute a rank-ordered list of people with the maximum number of likes on their profiles.

■ Analytics Services—Analytics services make graph analytics easily consumable by business applications via analytics APIs. For example, an Expertise Locator application can query the graph using analytics services to identify a topic expert who is also very active in the ESN.
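
The node-and-edge structure and the likes ranking mentioned in the list above can be sketched with plain Python data structures. This is an illustration only, not the storage or API used in any product; the people, actions, and dates are invented.

```python
from collections import Counter

# Each edge records who did what with whom and when.
# All names, actions, and dates are invented for illustration.
edges = [
    ("alice", "like", "bob", "2015-03-01"),
    ("carol", "like", "bob", "2015-03-02"),
    ("bob", "comment", "alice", "2015-03-02"),
    ("alice", "like", "carol", "2015-03-03"),
]

def rank_by_likes(edges):
    """Return (person, likes received) pairs, most-liked first."""
    received = Counter(
        target for _, action, target, _ in edges if action == "like"
    )
    return received.most_common()

ranking = rank_by_likes(edges)
```

A production system would hold these edges in a dedicated graph store and expose the ranking through an analytics service, but the underlying question (count edges of a given type per node) is the same.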

We can implement a number of useful applications that leverage the information from the graph. In the following sections, we describe one such application called the Personal Social Dashboard.

Personal Social Dashboard: Details of Implementation

The Personal Social Dashboard (see Figure 11.4) is based on a project in IBM. Internally, IBM uses an ESN based on IBM Connections, which includes a variety of features such as profiles, communities, forums, blogs, wikis, files, and bookmarks. The ESN is available to all IBMers, and there are varying levels of adoption and use of Connections across business units and across geographies. Two groups of stakeholders are served by this application: individual employees and management.

The main objective for the individual employees is to help them understand their social impact and how to effectively activate their network for maximum value. For management, the objective is to help managers and executives understand how teams collaborate and to help them increase collective value.

Before we could process metadata about IBM employees’ social activities, we needed to follow an internal privacy evaluation and approval process. We were granted the approval based on the following agreements:

1. We will share the individual scorecard only with the individual who is signed into the system.

2. The overall engagement score or individual component scores will not be made available to anybody else.

3. We can share aggregate reports utilizing overall scores and component scores for groups, divisions, or countries to management teams; however, all such reports will be anonymized.

Some benefits that can be derived from an implementation of an application like the Personal Social Dashboard include the following:

■ The scorecard evokes a sense of competition, and people tend to want to improve their scores, which improves adoption of the enterprise social network and thus improves social collaboration.

■ The component scores give employees hints about what activities will contribute to their scores (activity versus reaction). This will enable employees to focus more on creating content that is valued by others and thus improve their individual eminence, which in turn adds value to the business unit.

■ The Personal Social Dashboard also provides a very detailed (anonymized) view of average scores by country and by business unit. This has already assisted managers and executives in understanding how ESNs are being utilized in their organizations and countries.
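To make the idea of an anonymized aggregate report concrete, here is a minimal sketch in Python. It is not the dashboard's actual implementation: the function name, the record layout, and especially the `min_group_size` suppression rule are assumptions introduced for illustration, since the text does not describe how the reports were anonymized.

```python
from collections import defaultdict

def average_by_group(records, min_group_size=5):
    """Average score per group from (group, score) pairs.

    Individual scores never leave this function; only group averages do.
    Groups smaller than min_group_size are dropped (a hypothetical rule)
    so that a tiny group cannot reveal one person's score.
    """
    scores_by_group = defaultdict(list)
    for group, score in records:
        scores_by_group[group].append(score)
    return {
        group: sum(scores) / len(scores)
        for group, scores in scores_by_group.items()
        if len(scores) >= min_group_size
    }

# Toy data, invented for this example.
records = [("CA", 60), ("CA", 80), ("CA", 70), ("US", 90)]
report = average_by_group(records, min_group_size=3)
print(report)  # the single-person "US" group is suppressed
```

The same function works for business units or divisions; only the grouping key changes.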

Key Performance Indicators (KPIs)

Based on extensive research conducted by the IBM Research team in Haifa, the IBM project team settled on four KPIs to represent or measure employee engagement [8]. The four KPIs represent four social behaviors:

■ Activity Scorecard—Measures the person’s periodic social activity effort; computed based on number of activities.

■ Reaction Scorecard—Measures how the person’s content is perceived by others; focuses on quality of activity based on reactions to it.

■ Eminence Scorecard—Measures how the person is perceived by others; focuses on number of people interacting with the person or his content.

■ Network Scorecard—Measures the person’s network; initially focuses on size—both network and followers.

The main page of the dashboard shows a composite score as well as the score of the four individual subcomponents (see Figure 11.5). To illustrate the capabilities of this application, we utilize Avinash’s scores and scorecard. The scorecard shows an overall score of 81. There is also a trend graph that plots Avinash’s score over the past six months. The bottom half of the screen shows the four subcomponent scores and a comparison of Avinash’s scores against an average of the organizational unit that he belongs to. For instance, his activity score of 76 is higher than the average activity score for his organization, which is 18. Similarly, we are able to compare Avinash’s other component scores against his organization’s average score.

How Does This Score Compare Against Others?

Figure 11.6 shows that Avinash’s overall score gives him a rank of 410 within IBM, and there are 84 other people who have the same score as he does. Within his organization (T&O), he has a rank of 88, and there are 20 other people in the organization who have the same score.
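A rank with ties, as described above, can be sketched in a few lines of Python. The text does not say how the dashboard breaks ties, so this sketch assumes standard competition ranking (everyone with the same score shares the same rank); the names and scores are invented.

```python
def rank_and_ties(scores, person):
    """Return (rank, number of OTHER people with the same score).

    Assumes competition ranking: rank = 1 + number of people
    with a strictly higher score.
    """
    my_score = scores[person]
    rank = 1 + sum(1 for s in scores.values() if s > my_score)
    ties = sum(1 for s in scores.values() if s == my_score) - 1
    return rank, ties

# Toy data, invented for this example.
scores = {"ana": 90, "ben": 81, "cy": 81, "dee": 75}
print(rank_and_ties(scores, "ben"))  # (2, 1)
```

Running the same function over a subset of the dictionary (for example, only the people in one organization) gives the within-organization rank.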

Activity Scorecard

The discussion in this and the following three sections considers the reader as an employee.

Are you, as an employee, actively engaging with people on a regular basis? This measure is calculated by analyzing your various activities in relation to other employees, placing emphasis on more involved contributions (creating content, commenting, and so on) and less emphasis on the lighter-weight ones (reading, liking, and so on). This is illustrated in Figure 11.7.

This scorecard also shows the counts of the following six additional factors that collectively influence the overall Activity score:

■ Containers You Created—Are you creating collaboration spaces where people can engage? Are you taking an active leadership in providing an environment for sharing? This drill-down provides the number of blogs, communities, forums, or wikis you have created.

■ Content You Created—Have you been making an effort to provide content to the social network? This drill-down provides a count of what you have created (files, posts, status updates, and the like).

■ Content You Shared—Have you been sharing interesting content with your colleagues? This drill-down provides a count of content you have shared (files, posts, status updates, and so on).

■ Content You Edited—Are you contributing to the content created by your colleagues? This drill-down provides a count of content you have edited (pages, files, and so on).

■ Content You Liked—Are you generous with your positive feedback? This drill-down provides a count of the content you have liked.

■ Content You Tagged—Do you like to help the system to better qualify and organize the content? Or do you give your colleagues a thumbs-up by specifically tagging them with an expertise or skill? This drill-down provides a count of the content you have tagged.
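The weighting idea behind the Activity score (more credit for involved contributions, less for lightweight ones) can be sketched as a weighted sum. The weights and activity names below are hypothetical; the text does not publish the actual weights or how the raw value is scaled to the score shown on the dashboard.

```python
# Hypothetical weights: involved contributions count more than light ones.
ACTIVITY_WEIGHTS = {
    "created_container": 5.0,
    "created_content": 4.0,
    "edited": 3.0,
    "shared": 2.0,
    "tagged": 1.0,
    "liked": 0.5,
}

def raw_activity_score(counts):
    """Weighted sum of activity counts; unknown activity types count as 0."""
    return sum(ACTIVITY_WEIGHTS.get(kind, 0.0) * n for kind, n in counts.items())

# Toy counts, invented for this example.
counts = {"created_content": 3, "liked": 10, "tagged": 2}
print(raw_activity_score(counts))  # 3*4.0 + 10*0.5 + 2*1.0 = 19.0
```

With these assumed weights, three pieces of new content outweigh ten likes, which matches the stated emphasis on creating content over lighter-weight activity.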

Reaction Scorecard

Does your content generate a reaction from your fellow employees? This measure (see Figure 11.8) is calculated by analyzing your colleagues’ activities on your content, placing more emphasis on engaged contributions (comments, likes) and less emphasis on the more passive ones (reads, follows).

It also shows the counts of the following six additional factors that collectively influence the overall Reaction score:

■ Commented on your content—Do people like to give you feedback on your content and share their opinions? This drill-down provides the number of comments that your content has received.

■ File downloads—Do people download and read the content in the files that you uploaded? This drill-down provides the number of file downloads you have received.

■ Liked your content—Do people like your content? This drill-down provides the number of likes your content has received.

■ Tagged your content—Do people tag your content? This drill-down provides the number of tags that people have applied to your content.

■ Shared your content—Do people like to share your content with their colleagues? This drill-down provides the number of times people have shared your content.

■ Followed your content—Do people like to keep track of your content by following it so they get updates when it is modified? This drill-down provides the number of times that people have followed your content.

Eminence Scorecard

Do people value your opinion? Do they listen when you talk? This measure is calculated by analyzing how your colleagues interact with you (as an individual) and your content (as a reflection of their opinions). This is illustrated in Figure 11.9.

It places more emphasis on actions directed at you (tagging your profile, mentioning you in status updates, or following you) and less emphasis on those directed at your content (comments, likes, and so on). It also shows the counts of the following four additional factors that collectively influence the overall Eminence score:

■ People interacted with you—Do people like to regularly engage with you directly? This drill-down provides the number of people who have interacted with you directly (posted to your board, mentioned you in a status update, tagged you, shared something with you).

■ People interacted with your content—Do people like to regularly engage with your content? This drill-down provides the number of people who commented on your content (shared their opinion).

■ People who value your content—Do people value your content? This drill-down provides the number of people who liked your content.

■ People connected with you—Do people like to be part of your network? This drill-down provides the number of people who followed or befriended you.

Network Scorecard

The Network scorecard focuses on the network size and diversity (see Figure 11.10).

It also shows the counts of the following three additional factors that collectively influence the overall Network score:

■ People you are following—How many people do you follow to keep up with what they are doing and saying? This drill-down provides the number of people that you are following.

■ People who are following you—How many people like to follow what you are doing and saying? Are you speaking into the void? This drill-down provides a count of the people following you.

■ Friends—Do you have a rich network of colleagues that you are connected to? This drill-down provides a count of the people you are friends with.

Assessing Business Benefits from Social Graph Data

An analytics team in IBM decided to study the impact of social behavior on business outcomes. Over the first half of 2014, the team analyzed IBM’s Enterprise Graph (year 2013 data), IBM patents (about 4,500 patents), and customer advocacy data (for about 4,000 people). They concluded that there is a statistically significant positive correlation between high social engagement scores, as captured in the social graph, and the number of patents. In addition, there is a statistically significant positive correlation between high social engagement scores and the likelihood of getting selected as a customer advocate. Additional insights can be gathered by studying the scores of groups of people. Figure 11.11 shows the hypothetical score distribution of a possible department or division in an organization.
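For readers who want to see what checking for a positive correlation looks like in practice, here is a minimal Python sketch. The text does not say which statistical method the team used, so the Pearson coefficient and its t statistic are an assumption, and the five data points are invented purely to exercise the code.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Toy data, invented for this example (not IBM's data).
engagement_scores = [20, 35, 50, 65, 80]
patent_counts = [0, 1, 1, 3, 4]

r = pearson_r(engagement_scores, patent_counts)
t = t_statistic(r, len(engagement_scores))
print(f"r = {r:.3f}, t = {t:.2f}")
```

A value of r close to +1 means the two measures rise together. Whether it is statistically significant depends on comparing t with the t distribution for the sample size, and a significant correlation still does not show that one causes the other.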

An analysis of these scores can yield different results and insights. One such insight is that a large number of people seem to have a decent Network score, but only a small number of people are doing well in terms of other scores, as illustrated in Figure 11.12. One interpretation is that people focus on building their networks but forget to engage. By the way, this is a common pattern that we observe when we study different groups.

What’s Next for the Enterprise Graph?

The Personal Social Dashboard has been used extensively to provide platform-level metrics such as the number of social activities of certain types. The team is looking to move beyond this to some metrics related to business outcomes. Can we show that increased social behavior of certain types can result in positive business outcomes at the department, business unit, or company level?

■ Sales outcomes—If we include sales-related data such as number of leads, number of opportunities, number of proposals completed, and amount of revenue generated, can we attempt to find social behaviors that contribute to positive sales performance?

■ Employee performance outcomes—If we include data such as promotions, awards, successful project completions, and recognitions, can we attempt to find social behaviors that improve employee performance outcomes?

■ APIs—Application Programming Interfaces enable other business applications to provide new data that can be added to the Enterprise Graph and to consume and enhance (or report) graph analytics.

Discovering themes and patterns by mining social media content can be a very exciting endeavor. Sometimes the questions to be answered are too complex to be looked at in real time, or they need to establish a relationship between two or more entities, or the analysis involves multiple different phases. We call this a deep analysis. In this chapter, we delve into some use cases that require a more complex type of process or analysis, and we explore what we can uncover with a more sophisticated method of analysis.

Responding to Leads Identified in Social Media

Many services and products in the marketplace try to facilitate discovering leads in social media. In the following sections, we discuss an experimental project that we are working on in IBM. For the purposes of this discussion, we break it up into three phases:

■ Identifying leads
■ Qualifying/classifying leads
■ Suggested action

Identifying Leads

Our Social Listening team, led by Mila Gessner, began working with an account team within IBM. The goal of this exercise was to see if we could analyze conversations taking place in social media venues, around the topics that were of interest to us or some pilot set of customers, to identify sales leads for our customer-facing marketing representatives. Team members identified 10 accounts across a variety of industries. They also identified lines of business within these companies to focus on. A list of key IBM competitors relevant to these accounts, industries, and lines of business was also identified. Using the methods already described in this post, the team completed the data identification phase and determined the type of data, the sources of data, and the duration of data to be included in the analysis. After several iterations, team members were able to settle on a set of keywords to use to search for content for this analysis on a daily basis. Figure 10.1 shows the process that was used in this phase.

The next step was to use an IBM software package called IBM Social Media Analytics (SMA) to develop a model to perform social media data mining analysis of all the content being pulled in through the data identification phase. The sales team and social analysis team worked together iteratively to settle on the keywords that were to be used for the topic areas that we were going to focus on. During the first couple of iterations of the model-building phase, the teams realized that the list of keywords was really missing the mark. A separate effort was undertaken to identify the appropriate keywords for this project. While this was going on, the teams also developed a model that took in the content being provided by the initial keywords filter and classified it into a broad category called “Leads.” These are considered potential sales or service opportunities based on the model. The goal was to share this with the sales and sales support teams for further steps of lead qualification and follow-through.

At the end of this phase, the teams had a strong list of keywords covering the following areas:
■ List of selected accounts—These are the names and aliases of the accounts that the account teams selected for this pilot project.
■ List of industries—This is a validated list of industries that cover the accounts we are interested in doing the listening for, such as Retail, Banking, and Education.
■ Subject categories—These are the specific areas of business solutions that we are interested in targeting such as Finance, Operations, Information Technology, and Security.

Figure 10.2 illustrates the process steps utilized in this phase.
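The keyword lists above suggest a simple first-pass filter, sketched here in Python. This is not the SMA model, whose internals the text does not describe; the rule (a post must mention both a selected account and a subject category) and every keyword in the sets are assumptions made for illustration.

```python
import re

# Hypothetical keyword lists standing in for the real ones.
ACCOUNT_ALIASES = {"acme bank", "acme financial"}
SUBJECT_KEYWORDS = {"security", "finance", "operations"}

def normalize(text):
    """Lowercase the text and reduce it to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def is_candidate_lead(text):
    """True if the post mentions a selected account AND a subject category."""
    padded = f" {normalize(text)} "  # padding gives whole-word matching
    mentions_account = any(f" {alias} " in padded for alias in ACCOUNT_ALIASES)
    mentions_subject = any(f" {kw} " in padded for kw in SUBJECT_KEYWORDS)
    return mentions_account and mentions_subject

posts = [
    "Acme Bank is rethinking its mobile security strategy",
    "Great weather for the company picnic today",
]
leads = [p for p in posts if is_candidate_lead(p)]
print(leads)
```

A filter this simple will over-match, which is consistent with the need for the qualification step described in the next section.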

Qualifying/Classifying Leads

As discussed in the previous section, the output of the social listening team was a set of leads that could be analyzed on a daily basis. The team receiving the insights would look at the leads to identify which could be considered as real leads to be pursued.

One challenge encountered immediately was that there were too many potential leads but the actual number of leads to be pursued was very small. The busy salesforce was spending too much time reading through a large number of social media updates in an attempt to identify which ones required action. In essence, the team had to wade through a number of results to find the most valuable items, sort of like looking for a needle in a pile of needles. One of the major problems was that of duplication; let us illustrate this with an example.

Suppose that the CIO of a bank gives an interview in a trade journal. In this interview, the CIO specifies that his company is about to revamp the technology infrastructure for supporting mobile access. This sounds like a clear opportunity that is worth exploring. The Social Listening Model that we had developed was designed to capture such cases and flag the story for further review and analysis by the sales team. However, what happened in this particular case (and in many other cases like these) is that the article was widely referenced in a variety of other journals, news sources, and social media venues. So, instead of having one lead about this opportunity, our system flagged over 100 references.

The team wrote a utility program to take the raw leads as input and weed out the duplicates. Team members first filtered out articles that had identical titles, which brought the number of posts on this topic down to 70. They then noticed that the same article would sometimes appear under a slightly different title, so the team broadened the definition of duplicates to include those articles whose titles differed in only one, two, or three words. This process also involved active collaboration between the analyst and business teams to ensure that we were filtering enough but not too much! This phase is typical in social media analysis projects. Even though the software makes a lot of advances in facilitating analysis of text, in certain phases manual intervention is critical to the success of the program. In our experience, this filtering/cleanup is highly specific to each deep analysis project and could require a lot of investment in terms of people time. For the sake of this discussion, we have highlighted this simple example of deduplication. This process is illustrated in Figure 10.3.
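The deduplication rule described above can be sketched in Python as follows. This is not the team's utility program, which the text does not show. The way "differed in only one, two, or three words" is measured here (words present in one title but missing from the other, ignoring case and punctuation) is one reasonable reading, and the titles are invented.

```python
import re
from collections import Counter

def title_words(title):
    """Bag of lowercase words in a title, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", title.lower()))

def word_difference(title_a, title_b):
    """Number of words one title has that the other lacks (larger direction)."""
    a, b = title_words(title_a), title_words(title_b)
    return max(sum((a - b).values()), sum((b - a).values()))

def deduplicate(titles, max_diff=3):
    """Keep a title only if it differs from every kept title by > max_diff words."""
    kept = []
    for title in titles:
        if all(word_difference(title, k) > max_diff for k in kept):
            kept.append(title)
    return kept

# Toy titles, invented for this example.
titles = [
    "Bank CIO plans mobile revamp",
    "Bank CIO plans mobile revamp",          # identical title
    "Bank CIO announces big mobile revamp",  # near-duplicate title
    "City opens new flood relief center",    # unrelated story
]
print(deduplicate(titles))
```

Note that with `max_diff=3`, any two titles of three words or fewer count as duplicates of each other. This is the kind of over-filtering that has to be checked by hand, as the paragraph above points out.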

The team would look at these leads and determine which leads to follow up on. After several months of using this approach, the team developed some logic and informal business rules for determining which leads have high potential. As of this writing, the team is looking at further automating these basic “rules of thumb.”

Suggested Action

The analytics and sales teams worked together through many situations to identify what the next step was going to be once a lead was identified. This part of the project was very subjective, but we are able to specify some suggested actions to give you an idea of how to incorporate this in your specific project.

The main suggested action in each of these potential leads was to follow through in an attempt to determine if we can nurture the lead into a sale. This follow-through can take on several different forms.

In this first example, let’s assume a user expresses an interest in a specific IBM product or service via Twitter. One suggestion is to respond back to the tweet with pointers to a specific white paper or customer testimonial providing more details about the product or service that the user showed an interest in. However, we can also perform some additional research on the Twitter handle to understand what sort of profile information we can gather about the user from the Twitter site. If, for example, we learn that this user is from the San Francisco Bay Area, we could send a personal invitation to an IBM event happening in the Bay Area of California, or wherever the user may be from. The point is, we can begin to establish an interest in providing more information and continuing the dialogue, which could perhaps lead to additional sales.

In another type of lead, we could uncover that a company has announced plans to investigate other markets or expand its product/service lines. We may have learned this through a news item or through a comment posted by someone placed high in the company (such as the CIO) in social media channels or a blog. We could take this information and pass the lead on to the appropriate sales team in that region or city to explore potential sales opportunities further, perhaps getting the jump on some of our competitors.

In another example, we could stumble upon a person who shows interest in learning about some new technology—say, cognitive computing facilities being showcased by the IBM Watson division in the health-care industry. We may have learned this from a comment posted on a blog by an IBM subject matter expert. In this case, we would want to create a follow-up by posting a schedule of training events or conferences on the topic of cognitive computing given there seemed to be heightened interest. The hope is that when the person actually enrolls for a session, we can learn more about her specific situation to see if there are any potential sales leads to be explored.

Finally, let’s look at a specific example that we encountered on this project. In the preceding section on qualifying leads, we gave an example of a bank. In this particular instance, the CIO of the bank had announced that it was in the process of revamping its entire mobile platform and was looking for solutions in the area of fraud prevention. When this lead was discovered, IBM had recently acquired a company specializing in fraud detection in financial institutions. A number of steps were taken to follow up. First, an invitation was sent to this company for a conference where IBM was showcasing these services. Second, a specific meeting was arranged between the management team of this bank and the appropriate IBM services team, along with the subject matter expert in fraud prevention. As a result of these two simple follow-ups, which originated from a discovery made in social media, our IBM sales teams were able to close on a deal with the bank and our new fraud detection acquisition.

Support for Deep Analysis in Analytics Software

A lot of new development is happening within deep analysis in analytics software. In the following sections, we share a couple of examples of capabilities in IBM Social Media Analytics that can assist analysts in their deep analysis projects. We illustrate these capabilities in the context of another real project that we encountered.

Topic Evolution

According to the SMA product documentation, Evolving Topics is defined as follows:

Evolving Topics is a unique algorithm that will analyze social media content to discover threads of conversation emerging in social media. This is different than a general word cloud where the focus is on overall re-occurrence in social media. When you are going to analyze social media content, you specify the time frame you want to assess evolving topics. For example, discover topics that are emerging in social media over the last 30 days. This feature will help analysts identify topics that they may not have known to include in the model.

In 2013, for example, there was a massive disruption of activities due to floods. A social media analytics model was developed for understanding the issues and relief plans. When the team prepared the model, team members used some commonsense knowledge to decide on what keywords to search for. For the keywords they selected, a large amount of social media text was pulled into the model for analysis. The Topic Evolution view from the SMA package showed a number of terms that were part of the conversations that were happening around the floods in Calgary, Alberta, Canada. The analyst team did further research on many of the new and unexpected terms and were able to enhance the model with richer and more relevant keywords such as yycflood and wearecalgary.

Figure 10.4 shows the Topic Evolution view for the Calgary Floods project.

In our experience, this feature has served us well. For many listening projects, we do not typically get a well-thought-out list of keywords. The listening team starts out with a beginning list of keywords and then uses the Topic Evolution function iteratively to enhance the list of keywords. The typical process followed in “keywords” development using Topic Evolution is illustrated in Figure 10.5.
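To illustrate the iterative keyword loop, here is a rough Python sketch of how new terms might be surfaced. It is not the Evolving Topics algorithm from SMA, which the text does not describe. It simply compares how many posts mention a term in a recent window against an earlier baseline, and the function names, the `min_count` threshold, and the sample posts are all assumptions.

```python
import re
from collections import Counter

def term_counts(posts):
    """Count, for each term, the number of posts that mention it."""
    counts = Counter()
    for post in posts:
        counts.update(set(re.findall(r"[a-z0-9]+", post.lower())))
    return counts

def emerging_terms(baseline_posts, recent_posts, known_keywords, min_count=2):
    """Terms not yet in the keyword list, ranked by growth over the baseline."""
    baseline = term_counts(baseline_posts)
    recent = term_counts(recent_posts)
    candidates = []
    for term, n in recent.items():
        if term in known_keywords or n < min_count:
            continue
        # Add-one smoothing so brand-new terms do not divide by zero.
        growth = n / (baseline[term] + 1)
        candidates.append((growth, term))
    return [term for _, term in sorted(candidates, reverse=True)]

# Toy posts, invented for this example.
baseline_posts = ["flood warning for calgary", "calgary river levels rising"]
recent_posts = [
    "yycflood volunteers needed in calgary",
    "flood cleanup yycflood",
    "calgary flood donations",
]
new_terms = emerging_terms(baseline_posts, recent_posts, {"flood", "calgary"})
print(new_terms)  # ['yycflood']
```

An analyst would review the returned terms, add the relevant ones to the keyword list, pull content again, and repeat.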

Affinity Analysis in Reporting

In IBM Social Media Analytics, the reporting environment offers an Affinity measure for many reports. The Affinity measure analyzes how closely two dimensions (or attributes of a dimension) are related to each other. This helps analysts gain insight about possible strengths, weaknesses, opportunities, or threat areas based on the affinity between the dimensions or attributes.

The Affinities measure is based on a statistical method that is known as the chi-square test of independence. This method estimates how often two dimensions should occur together if they were independent (for example, products and product features). It compares the estimate with the actual count and identifies whether the difference is statistically significant (either higher or lower than expected). For the Calgary Floods project, the final model included a two-dimensional affinity matrix (see Figure 10.6).

The vertical dimension shows a variety of geographical regions or parts affected by the flood. The horizontal dimension shows a number of issues or problem areas. The color of the cell corresponding to each combination of the Region and Issue indicates how strong the relationship is between the two dimensions. The darker the color, the more statistically significant is the relationship. For example, the matrix shows that the issue about Donations is quite strong in the Mission region. Similarly, the issues about insurance are quite strong in the region called Elbow Park. A table such as this can be used to prioritize actions in response to the issues when cities are faced with limited resources and limited time.

Figure 10.7 shows an expanded image of the top-left portion of this affinity matrix.

The color of the cell indicates the level of affinity. The color red indicates a high affinity, orange is indicative of a medium affinity, and yellow indicates low affinity (gray is representative of no affinity or relationship). The number in each cell shows how many times the two dimensions occur together in the given dataset.

This example shows that there is a strong correlation between driving issues and the region BonnyBrook; this is implied by the strong red color of the cell. The same chart implies that the driving related concerns are not that relevant for the Victoria Park region. You might notice that the cell associated with downtown and clean up has a big number (435); however, based on statistical weight, the affinity between these two dimensions is considered to be low.
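The reason a large count can still mean low affinity becomes clearer with a small worked example of the chi-square test of independence. The Python sketch below handles a single region/issue pair as a 2x2 table. It is not SMA's implementation, and all counts are invented. The only outside fact used is the standard critical value of 3.841 for one degree of freedom at the 5% level.

```python
CRITICAL_VALUE_5_PERCENT = 3.841  # chi-square, 1 degree of freedom

def chi_square_2x2(both, region_total, issue_total, n):
    """Chi-square statistic for one region/issue pair.

    both: posts mentioning the region AND the issue
    region_total / issue_total: posts mentioning each one at all
    n: all posts in the dataset
    Returns (chi2, expected number of co-occurrences under independence).
    """
    observed = [
        [both, region_total - both],
        [issue_total - both, n - region_total - issue_total + both],
    ]
    row_totals = [region_total, n - region_total]
    col_totals = [issue_total, n - issue_total]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, region_total * issue_total / n

# Toy counts, invented for this example.
# Big count, but close to what independence predicts (300 expected).
big_chi2, big_expected = chi_square_2x2(310, 3000, 1000, 10000)
# Small count, but four times what independence predicts (10 expected).
small_chi2, small_expected = chi_square_2x2(40, 100, 1000, 10000)

print(big_chi2 > CRITICAL_VALUE_5_PERCENT)    # False: low affinity
print(small_chi2 > CRITICAL_VALUE_5_PERCENT)  # True: high affinity
```

In the first case, a frequently mentioned region and a frequently mentioned issue co-occur often simply because both are common. In the second, the pair co-occurs far more often than chance would suggest, which is what a strongly colored cell signals.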

Affinity analysis can be a very effective tool in understanding the relationship between different dimensions of a model built for a particular use case. It becomes a critical tool for assisting with prioritization of the follow-up actions resulting from this phase of the analysis.